Panic in Multicore Land
MOBE2001 writes "There is widespread disagreement among experts on how best to design and program multicore processors, according to the EE Times. Some, like senior AMD fellow Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others disagree on the grounds that heterogeneous processors would be too hard to program. The only emerging consensus seems to be that multicore computing is facing a major crisis. In a recent EE Times article titled 'Multicore puts screws to parallel-programming models', AMD's Chuck Moore is reported to have said that 'the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.'"
I think "panic" is a bit of an over-reaction. I use a multicore CPU. I write software that runs on it. I'm not panicking.
AMD's Chuck Moore presumably has a lot of self-interest in pushing heterogeneous cores. They are combining ATI and AMD cores on a single die and selling the benefits in a range of environments, including scientific computing.
So take it all with a grain of salt.
--Q
Well, the most recent research into how the cortex works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.
Recall that the human brain receives a series of pattern streams from each of the senses. These pattern streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortex, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortex layer passes its assessment of the outline it saw to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimately to the IT cortex layer.
The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.
The takeaway is this: each cortex layer does not just do more of the same thing. Instead, it refines the output of the level below it. This type of hierarchical processing is how multicore processors should be built.
Can I have... errr... Two floating point, one generic math with extra cache and two RISC's.
If you mod this up, your slashdot background will turn into a beautiful sunset!
It is portable, scalable, standardized and supports many languages.
nemesis. Home of an experimental fe code.
That's why it's so important that languages begin to adopt threading primitives and immutable data structures. Java does a good job. Newer languages, like Clojure, are built from the ground up with concurrency in mind.
This article is referring to AMD's Charles R. "Chuck" Moore, who worked on the POWER4 and PowerPC 601, not the language and chip designer Charles H. "Chuck" Moore who invented Forth, ColorForth, et al. and was interviewed on slashdot.
o/~ Join us now and share the software
...functional programming languages? Or flow programming?
Sounds wasteful, I know (data replication everywhere). But there is a reason for that. The process becomes resilient to unexpected changes (corruption). The bus is the enzymes, the cpu is the cell and thread of execution is, well, the DNA. The replication and communication process is autonomous.
- these are not the droids you are looking for -
What Mr Moore is saying does have a grain of truth: generic will be beaten by specific in key functions. The Amiga proved that in 1985, being able to deliver a better graphical solution than workstations costing tens of thousands more. The key now is to figure out which specifics you can use without driving up the cost or compromising the design ideal of a general purpose computer.
Karma Whoring for Fun and Profit.
or go take threads 101. don't punish developers who know what they are doing just because the ruby/rails/java/python fad language crowd doesn't understand how their language bastardizes pthreads.
I've been doing some scientific computing on the Cell lately, and heterogeneous cores don't make life very easy. At least with the Cell.
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.
The idea of having to use Microsoft APIs to program future computers because the vendors only document how to get DirectX to work doesn't exactly thrill me. I think panic is perhaps too strong a word, but sheesh...
1. Change operating systems to be able to use all the available CPU power even when running single threaded applications.
2. Change programming languages to make multicore programming easier.
3. Both 1 and 2.
What the end user should be able to dictate however is how many cores should be in use. It's not for the programmer of the application to dictate how processing of any data should occur.
As I demonstrated in my thesis a parallel application can be shown to have certain critical and less critical parts. An optimal processing platform matches those requirements. The remainder of the platform will remain idle and burn away power for nothing. One should wonder what is better: a 2 GHz processor or 2x 1 GHz processors. My opinion is that, if it has no impact on performance, the latter is better.
There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).
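The mis-scheduling risk described above can be made concrete with a toy cost model. This is only a sketch: the core names, task names, and run-time figures are invented for illustration and do not come from any measurement or real scheduler.

```python
# Hypothetical run times (arbitrary units) of each task on each core type.
# All names and numbers are made up for illustration.
RUN_TIME = {
    "video": {"video_core": 2, "comm_core": 20},
    "comm": {"video_core": 15, "comm_core": 1},
}


def makespan(assignment, queue_order):
    """Return the time at which the last task finishes.

    assignment maps task -> core. Tasks sharing a core run one after
    another in queue_order, so a slow task delays everything behind it.
    """
    busy_until = {}
    finish = 0
    for task in queue_order:
        core = assignment[task]
        start = busy_until.get(core, 0)
        end = start + RUN_TIME[task][core]
        busy_until[core] = end
        finish = max(finish, end)
    return finish


order = ["video", "comm"]
good = makespan({"video": "video_core", "comm": "comm_core"}, order)
bad = makespan({"video": "comm_core", "comm": "comm_core"}, order)
print(good, bad)
```

In this model the bad assignment hurts twice: the video task runs slowly on the wrong core, and the communications task has to queue behind it. On a symmetric platform every row of the table would hold identical values, so no assignment could be wrong in this sense.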
nosig today
When we wrote the OpenAMQ messaging software in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project.
Today, we're making a new design - ØMQ, aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.
From one of the ØMQ whitepapers:
We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.
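One generic way to reduce the cost of inter-thread synchronization is to hand work across threads in batches, so that a lock is taken once per batch rather than once per message. The sketch below illustrates that idea only; it is an assumption made for illustration, not a description of how ØMQ is actually implemented, and the function names are hypothetical.

```python
import queue
import threading


def producer(q, messages, batch_size):
    """Hand messages to the consumer in batches, one queue operation per batch."""
    for i in range(0, len(messages), batch_size):
        q.put(messages[i:i + batch_size])
    q.put(None)  # sentinel: no more data


def consumer(q, received, stats):
    """Drain batches until the sentinel arrives, counting queue operations."""
    while True:
        batch = q.get()
        if batch is None:
            break
        stats["gets"] += 1
        received.extend(batch)


def run(messages, batch_size):
    """Run one producer and one consumer thread; return messages and get count."""
    q = queue.Queue()
    received, stats = [], {"gets": 0}
    t_cons = threading.Thread(target=consumer, args=(q, received, stats))
    t_prod = threading.Thread(target=producer, args=(q, messages, batch_size))
    t_cons.start()
    t_prod.start()
    t_prod.join()
    t_cons.join()
    return received, stats["gets"]


msgs = [b"x" * 500 for _ in range(1000)]  # 500-byte messages, as in the post
out_unbatched, gets_unbatched = run(msgs, batch_size=1)
out_batched, gets_batched = run(msgs, batch_size=100)
print(gets_unbatched, gets_batched)
```

Both runs deliver the same messages in the same order, but the batched run performs far fewer synchronized queue operations. The trade-off is latency: a message may wait for its batch to fill.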
Just build both and let the market decide.
rooooar
If you have 80 or more cores, I'd rather have 20 of them support specialty functions and be able to do them very fast (it would have to be a few (1-3) orders of magnitude faster than the general counterpart) and the rest do general processing. This of course needs the support of operating systems, but that isn't very hard to get. With 80 cores caching and threading models have to be rethought, especially caching - the operating system has to be more involved in caching than it currently is, because otherwise cache coherency won't be able to be done.
This also means that programs will need to be written not just by using threads, "which makes it okay for multi-core", but with cpu cache issues and locality in mind. I think VMs like JVM, Parrot and .NET will be much more popular, as it is possible for them to take care of a lot of these issues, which is impossible or only possible in a limited way for languages like C and friends with static source code inspection.
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
There is this view held by some (of which some are posting here) that somehow CPUs are primitive brains and that improving them will eventually result in a non-primitive brain. Hello, there is nothing remotely human about what my computer has done for me lately. Computers and humans *do* very different things, and *are* very different things.
I beg that the distinction between acquiring hints from brain structure vs creating brain structure not be blurred, and that no moderator marks "brains are like this so chips should be like that" type posts as informative or insightful.
No one at Intel has their chipset blueprints confused with an x-ray of Einstein's brain.
This trend with multiple cores on the CPU is only an intermediate phase, because it oversaturates the memory bus, which is easy to remedy by putting the cores on the memory chips, of which there are a number comparable to the number of cores.
In other words, the CPUs will disappear, and there will be lots of smaller core/memory chips, connected in a network. And they will be cheaper as well, because they do not need so high a yield.
Kim0
I have a 4-core workstation and ALREADY I get crap usage rates out of it.
Flick the CPU monitor to aggregate usage rate mode, and I rarely clear 35% usage, and I've never seen it higher than about 55% (and even that for only a second or two once an hour). A normal PC, even fairly heavily loaded up with apps, just can't use the extra power.
And since cores aren't going to get much faster, there's no real chance of getting big wins there either.
Unless you have a specialized workload (heavy number crunching, kernel compilation, etc) there's going to simply be no point having more parallelism.
So as far as I can tell, for general loads it seems to be inevitable that if we want more straight line speed, we'll need to start making hardware more attuned for specific tasks.
So in my 16-core workstation of the future, if my Photoshop needs to apply some relatively intensive transform that has to be applied linearly, it can run off to the vector core, while I'm playing Supreme Commander on one generic core (the game) two GPU cores (the two screens) and three integer-heavy cores (for the 3 enemy AIs), and the generic System Reserved Core (for interrupts, and low-level IO stuff) hums away underneath with no pressure.
Heterogeneity also has economics on its side.
There's very little point having specialized cores when you've only got two.
Once there's no longer scarcity in quantity, you can achieve higher productivity by specialization.
Really, any specialized core whose usage rate you can keep higher than the overall system usage rate is a net win in productivity for the overall computer. And over time, anything that increases productivity wins.
Take some advice from Mother Nature: as far as I know, our brain works like a heterogeneous multicore processor. We don't have multiple generic mini-brains in our head, we have one brain with highly specialized brain areas for different tasks. Seems to be the right concept for a computer processor.
Strange, it seems to me that Google would have some ideas about how to utilize massively parallel processing, as would the supercomputing crowd.
Is the issue here how to scale supercomputing concepts down to desktop applications? Well, for starters, you can dedicate a couple of cores to run all of the background processes (on the order of 70) that my IT department insists must reside on my system, so that I might get at least one which can work on the application(s) at hand.
Looking back at history, we see that as clock speeds and memory capacity increased, software writing became simplified by the use of higher level languages whose output, while not as optimal as machine code programming, ran at a similar speed to previous generation hardware using well optimised machine code. And so, the "problem" of writing for faster machines was solved.
For the multicore problem, I propose a similar strategy. Simply write a natural language programming interface which uses n-x cores to interpret and compile the code into a mish mash of bloated machine code, which then runs on the remaining x number of cores. Of course, several remaining cores would be needed to run this bloated mess at speeds comparable to 486's - but at least the new hardware could be widely sold, thus supporting industry!
It's not like the users really need faster software, they just need a reason to upgrade to better hardware, right?? right??
I went to a presentation by Sun last Friday (by Don Kretsch and Liang Chen), on "High Powered Computing". Sun's idea of HPC is, logically, multicored/cluster solutions. They talked about some of their abstraction ideas on how to take advantage of a bunch of cores. Some interesting stuff, but it was still pretty similar to the traditional single core approach, only branching for some stuff, like loops. I'm not sure if any of their abstraction ideas were radical enough to get excited about, but it was still interesting to see. Task specific hardware and low level programming seems like the best approach to me. Like graphics cards in games. Once we're comfortable with that, then maybe build up some APIs. Sun's presentation convinced me that it's the biggest challenge of modern computing.
Perhaps panic is a little strong. At the same time, programming languages such as Occam, which are built from the ground up for concurrency, seem very provocative now. Perhaps Occam's syntax could be modified to a Python-type syntax for more popularity.
[Although, personally, I prefer Occam's syntax over that of C's.]
http://en.wikipedia.org/wiki/Occam_programming_language
I think that a thread-aware programming language would be good in our multi-core world.
https://www.youtube.com/c/BrendaEM
I'm curious how having specialized multi-core processors is different from having a single-core processor with specialized subunits. I.e., a single core x86 chip has a section of it devoted to implementing MMX, SSE, etc. Isn't having many specialized cores just a sophisticated way of re-stating that you have a really big single-core processor, in some sense?
You see? You see? Your stupid minds! Stupid! Stupid!
Call me what you will, but personally I *still* prefer the performance of a super fast single core (~3.5ghz+) over this over-hyped multi-core phenomenon. I've yet to see any *major* differences between two machines I have that are the same clock speed, one single core, one dual. The difference I do experience is similar to what I'd expect from a .5ghz jump. In other words, the architecture *does* need to change if they have any desire to have any significant performance increases.
The way I see it, to get max. performance out of these chips, you need a deeper understanding of them, i.e. it requires higher skills, i.e. better quality jobs, better money, the works. Consider the fact that a lot of programmers have a really hard time dealing with concurrency at a thread level; these coming chips will only make it harder.
I don't think most concurrency problems can be automated away; it's the concepts and design of the concurrent algorithms that are hard, not so much the implementation (although that is where the bugs bite you when the stars are just right (wrong?)).
I'm rambling a bit I see, but I'm looking forward to interesting times ahead.
See, the thing to do with all these cores is run a physics simulation. Physics can be easily distributed to multiple cores by the principle of locality. Then insert into your physics simulation a CPU -- something simple like a 68k perhaps. Once you have the CPU simulation going, adjust the laws of physics in your simulation (increase the speed of light to 100c, etc) so that you can overclock your simulated 68k to 100Ghz. Your single-threaded app will scream on that.
P.S.: I know why this is impossible, so please don't flame me.
I have seen the future, and it is inconvenient.
Genuine question that I don't know the answer to:
How are heterogeneous CPU cores different conceptually to a modern PC system with say:
2 x General purpose cores (in the CPU)
100 x Vector cores (in the GPU)
n x Vector cores (in a physics offload PCI card)
How is moving the vector (or whatever) cores onto the CPU die different to the above setup, apart from allowing for faster interconnects?
Ugg is smart.
Ugg can program a CPU.
Two Uggs can program two CPUs.
Two Uggs working on the same task program two CPUs.
Uggs' program has a race condition.
Ugg1 thinks, it's Ugg2's fault.
Ugg2 thinks, it's Ugg1's fault.
Ugg1 hits Ugg2 on the head with a rock.
Ugg2 hits Ugg1 on the head with an axe.
Ugg1 is half as smart as he was before working with Ugg2.
Ugg2 is half as smart as he was before working with Ugg1.
Both Uggs now write broken code.
Uggs' program is now slow, wrong half the time, and crashes on that race condition once in a while.
Ugg does not like parallel computing.
Ugg will bang two rocks together really fast.
Ugg will reach 4GHz.
Ugg will teach everyone how to reach 4GHz.
Contrary to the popular belief, there indeed is no God.
Some, like senior AMD fellow, Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks
And let's give the cores names like Paula, Agnus, Denise...
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
The height of optimism: posting proof in the form of a 70-odd page thesis on Slashdot. ;-)
I don't think we'll be Slashdotting your server any time soon, CBravo
Antiquis temporibus, nati tibi similes in rupibus ventosissimis exponebantur ad necem.
I have read many times that some algorithms are difficult or impossible to multi-thread. I envisage the next logical step is a two socket motherboard, where one socket could be used for an 8+ core cpu running at a low clock rate (e.g. 2-3GHz) and another socket for a single core running at the greatest frequency achievable by the manufacturing process (e.g. x2 to x4 the clock speed of the multi-core) with whatever cache size compromises are required.
This helps get around the yield issues of getting all cores to work at a very high frequency and the related thermal issues. This could be a boon to general purpose computers that have a mix of hard to multi-thread and easy to multi-thread programs - assuming the OS could be intelligent about which cores the tasks are scheduled on. The cores could or could not have the same instruction sets, but having the same instruction sets would be the easy first step.
Doesn't the OMG have anything to help with this ... suggested patterns ... specs etc ?
For servers the real problem is I/O. Disks are slow, network bandwidth is limited (if you solve that then memory bandwidth is limited ;) ).
For most typical workloads most servers don't have enough I/O to keep 80 cores busy.
If there's enough I/O there's no problem keeping all 80 cores busy.
Imagine a slashdotted webserver with a database backend. If you have enough bandwidth and disk I/O, you'll have enough concurrent connections that those 80 cores will be more than busy enough.
If you still have spare cores and mem, you can run a few virtual machines.
As for desktops - you could just use Firefox without noscript; after a few days the machine will be using all 80 CPUs and memory just to show flash ads and other junk.
The experts in these articles keep forecasting processors with powers-of-2 cores (32, 64, 128). Is there a reason that the number of cores can't be some value in between, like 6?
And is the doubling time really 18 months? Aren't we due for the Intel Core 4 Quad already? If the doubling is slower, then I'd like to see the in-between core counts come sooner rather than wait for the next power of 2.
I am TheRaven on Soylent News
As long as there are only a few different "Heterogeneous" configurations then it shouldn't be that bad. We essentially already have that with GPU's and to a lesser extent Physics Accelerators. What will really be a nightmare is if they start making the cores modular and we start getting hundreds of different configurations. I can just see the game requirements on the side of a box now - must have a CPU with 2 general purpose cores, 4 vector cores, 2 VC1 cores and 3 vertex units.
Oh, and anyone who thinks software development cycles are long and expensive now, just wait until the code needs to be written for and tested against every possible combination :-)
I'm interested in Cell programming. I do scientific computing (CFD) and I have a code that is highly parallelizable, in C++, and I've often thought about possibly porting it to the PS3 for kicks after this semester. But what you say is kind of discouraging. Would you recommend even trying?
Also, can you point out any good references you used to learn? Beyond a few intro docs from IBM, I'm pretty clueless. I'd appreciate it, thanks.
Tom's Hardware has a great web page just for cpu doubters. It allows you to choose the two cpus to compare, the task you are wondering about and then you get an exhaustive list of how fast several dozen processors would be, with your chosen two in red.
I have been using the charts to compare various CPUs that PCClub.com offers across their various families of computers. The Q6600 looks to be, as they themselves said when I visited the store, the sweet spot as of March, 2008.
I think you should realize that quad+ cores are not going to offer as visible a performance increase as you are used to. In other words, unless you dig out three or four stopwatches and run your test tasks on both single and multi-cpu setups, you aren't going to see the differences.
For what it is worth, I think your mistake is when you say you are looking for "*major*" differences. These are not at all necessary for the user experience to be improved. Try typing on a 110cps teletype into a mainframe to see what I mean -- plenty of raw power goes wanting because the PBKAMainframe. Multiple cores reverse the situation -- average core speed is often lower, but the average task no longer pulls down the whole system.
Consider the following trivial test I did at the store. Open a cmd.exe window, change to root directory, type "DIR /S" and press enter. On my 3.2GHz HT Pentium, I get 100% cpu on at least one of the cores (I can't test my own system at the moment, sorry). So, my fan kicks in, my ears get deafened and I don't like it. On the Q6600, two of the cpus get zero load change, and two get a 35-50% increase. So, thermally, there is little to no change (.LT. 25% increase in overall cpu usage) -- half or less of my system -- and so the fan may not even kick in (I didn't hear it in the store, anyway), I don't get my ears blasted, and I am noticeably happier (because I am a simpleton who only types DIR /S all day long).
I come here for the love
Unless I'm missing a major class of computational problem, we have only two types of problem in computing at the moment.
1) We have very computationally intensive tasks, which are inevitably trivial to parallelise. Examples include calculating graphics, running sims, and compiling. These tasks all involve many, many small tasks that can be completed independently. For example, we can compile individual source files independently of each other, we can calculate the dot product of a vector with each element calculated in parallel, etc.
2) And then we have the rest. Messy, real world types of problems that are sequential by their very nature. I've tried, but I just can't find any real-world example of this type of problem that is very computationally intensive. Maybe somebody else can think of one? Video compression? But even there, we insert full frames every second or so, to limit propagation errors... Databases? I know that transactions impose an order, but even there the limit seems to be only on accessing one record/table simultaneously by multiple clients.
When I read discussions on massively parallel computing, people always talk about how hard it is. And it's true that parallelising problems of type 2 is difficult. But I just can't think of any real world use case, which leads me to conclude that we don't actually have a problem at all. In other words, Much Ado About Nothing, move along, nothing to see here...
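The dot product mentioned under type 1 is a convenient illustration of why such tasks split so easily: each slice of the vectors can be processed with no knowledge of the others, and only the final sum needs combining. A minimal sketch follows, assuming a hypothetical helper named parallel_dot. It uses a thread pool to show the structure; in CPython the global interpreter lock means threads will not speed up pure-Python arithmetic, so a process pool would be the usual substitute for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(args):
    """Dot product of one slice; it needs nothing from any other slice."""
    xs, ys = args
    return sum(x * y for x, y in zip(xs, ys))


def parallel_dot(a, b, workers=4):
    """Split the vectors into about `workers` slices and sum the partial results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    step = max(1, -(-len(a) // workers))  # ceiling division
    chunks = [(a[i:i + step], b[i:i + step]) for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_dot, chunks))


a = list(range(1000))
b = [2] * 1000
print(parallel_dot(a, b))
```

Nothing here requires locks or ordering between workers, which is what makes this class of problem easy. The type 2 problems are the ones where one slice's result feeds the next.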
Both show how many active processes on a box? Computers don't run a single thing any more, they are already federations of dozens of concurrently running programs, both in Unix and in Windows. Multicore makes the whole desktop feel crisper and faster.
This is my sig.
"to make effective use of multicore hardware today, you need a PhD in computer science."
BAH!
I don't have a PhD and yet I can program multi core. Threading, message passing, heterogeneous or homogeneous. What is really required is thought. Now I realize a good part of the population is opposed to thought... but sometimes you just gotta bite the bullet.
The most basic skill in computer science is breaking a problem down into smaller pieces.
If you have multiple processors then you simply break your problem down into pieces for each core. Granted some problems don't decompose well, but many do.
These guys are freaking out over nothing. As long as the cores are not all tied to a specific process (which would be STUPID), then the current computing models will work fine.
In other words, do you run one program on your system? No, on a slow day I have about 150 concurrent processes on my desktop. On my web servers and database servers, I have a lot of processes competing for CPU. The only thing that will have to happen is a modification to the linux process and scheduling code to accommodate many more processors than the SMP code currently does. Everyone focuses on one application, but everyone runs a multi-user multi-processing system these days. Multiple cores removes CPU contention! We *already* have systems that inherently use multi-core CPUs.
It looks more like they're worried about some fictional single application benchmark where they can measure throughput. That ship has sailed. As it is, processors are as fast as they can practically get (or need to get with RAM and I/O speeds) without a breakthrough or two. (That's why the computer sales slump) There is little Intel or AMD can do to speed up the processing of a single threaded application. So, how do you compete on the practical speed of multi-core CPUs? That's what they are really worried about.
Ok, but how are your examples different to a crypto-offload board (e.g. SSL accelerator, that's just really a single core on a PCI board), specialist sound-card with DSP processor etc (same again)?
I would certainly listen to what Chuck Moore has to say on the topic of CPU trends. For one thing, his name is a combination of Chuck Norris and Gordon Moore. How can you be any more of an expert than that? I expect his company to put his ideas into practice soon. Expect to see the AMD "Roundhouse" architecture take the computing world by storm.
Dewey, you fool! Your decimal system has played right into my hands!
I've got 106 processes running on my XP laptop and I'm not even doing work right now. (At which point add another good dozen++ processes.)
And lots of these processes are already multi-threaded. (Including most of the tools and frameworks I use and some of the code I'm writing.)
So even though some of this sounds theoretical, I don't think I even need any kind of software upgrade to benefit from having an 80 core processor today just for scheduling processes. (Though, the memory bandwidth issues others have pointed out would need some attention.)
Cheers,
Richard
Not my server, of course ;-)
nosig today
so max crunching will occur on a whole beastload of weak processors.. if we can use them in a respectable fashion.
oh, most software doesnt run well as a single thread, otherwise it wouldnt take so bloody long for the address bar to keep up with my typing when I get stuck at some god forsaken web page that I really want to get off my screen because I mistyped microsoft or google. watching the letters slowly come up one at a time is horrible on a core2duo when IE is the only app open. apps need some massive re-engineering.
I'm not the writer of the original post, but I have already used a Cell for CFD programming, and I confirm that it's a pain. If your code depends heavily on memory access (as many CFD codes do), you will face a huge number of problems:
- the SPEs have a very limited memory space, so you'll have to constantly move data between SPE and PPE.
- synchronizing SPEs is sometimes hard.
- moving chunks of memory means aligning your data correctly (believe me, it's not that simple).
- and don't forget that your code and data share the same memory space: if your code is large, then only a small amount of data will fit in memory (-> even more communications).
If the code complexity (operations per chunk of data that fits in SPE memory) is less than O(n^2), the speedup will be very poor (sometimes less than 1...) because moving data is very expensive, and it's worse when using double precision.
I may be wrong, but it seems that CFD codes are not the best ones to port to Cell. Give it a try: programming on Cell is very interesting and sometimes fun.
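The point about speedups below 1 can be illustrated with a rough back-of-the-envelope model. This is a deliberately simplified sketch, not a Cell performance model: it assumes compute divides evenly across SPEs and that transfers are not overlapped with computation, and the timing figures are invented for illustration.

```python
def estimated_speedup(compute_time, transfer_time, n_spes):
    """Toy model: compute is split evenly across SPEs, transfers are not overlapped.

    compute_time:  time to process the data on a single core
    transfer_time: total time spent moving data to and from the SPEs
    """
    parallel_time = compute_time / n_spes + transfer_time
    return compute_time / parallel_time


# Compute-heavy kernel: transfers are a small fraction of the work.
heavy = estimated_speedup(compute_time=100.0, transfer_time=5.0, n_spes=6)
# Memory-bound kernel: transfers cost more than the work they offload.
memory_bound = estimated_speedup(compute_time=10.0, transfer_time=12.0, n_spes=6)
print(round(heavy, 2), round(memory_bound, 2))
```

When each chunk carries few operations per byte moved, the transfer term dominates and the result drops below 1, meaning the offloaded version is slower than not offloading at all. Real code can hide part of the transfer cost by overlapping transfers with computation, which this model ignores.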
a multiplicity of cores optimized for various tasks
I guess I'm okay with it as long as they include the morality core. Too much trouble, otherwise.
That software runs fine as a single thread because if it didn't then you wouldn't be running it. There are lots of things that computers could do that 99% of people don't do because they run too slow. Some of those things that we don't do are limited by CPU and are highly parallelizable.
We computer geeks seem to have found lots of fun and useful things to do with 2 GHz processors, 2 GB of RAM, 160 GB hard drives, and 1.5 Mbps networking. Nobody needed those capabilities before, but they seem essential now. If you multiply that processor by 80 then we will write software so cool and useful that 99% of people will need it.
Been there, done that, already. The 8087 and its 80x87 follow-on co-processors were exactly that. Specialized processors for specific tasks. Guess what? We managed to use them just fine a mere 27 years ago. DSP's have come along since and been used as well. Graphic card GPU's are specialized co-processors for graphic intensive functions, and we talk to them just fine. They're already on the chipsets, and soon to be on the processor dies. I don't think this is anything new, or anything that programming can't handle.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Heterogeneous/distributed microkernels. Each process runs on a core that supports its instruction set, and communicates with other processes using lightweight messages. Could see QNX suddenly becoming much more important.
The problem is when you have a single CPU-intensive task, and you want to split that over multiple processors. That, in general, is a difficult problem. Various solutions, such as functional programming, threads with spawns and waits, etc. have been proposed, but none are as easy as just using a simple procedural language.
Yes, the reason is that we are trying to fit a square peg into a round hole. Multithreading was not originally intended to be the basis of a parallel programming model but as a mechanism to run sequential (not parallel) programs concurrently. What is needed is an inherently parallel model where parallelism is implicit and sequential order is explicit. Read "Parallel Programming, Math and the Curse of the Algorithm" and "Nightmare on Core Street" for more.
I am surprised no one has mentioned this. When we have multi-CPU and multi-GPU systems it should be somewhat trivial for motherboard makers to add another mouse and keyboard port. Then with the right OS, multiple users can use the same machine.
I regret that I only have one mod point to give per post.
IO bandwidth breakup -- that might be easy, just switch all the (sometimes heterogeneous) chips over to all-interval math based instruction sets, then allow the embarrassingly parallel nature of intervals to divide and conquer your workload. Obviously if only 4 chips out of 80 are working at 100%, then comes the harder part: analyze the specific situation and break up the instructions "where appropriate". If you can't break it up any farther quick and easy, that's ok. If designing parallel algorithms was easy, we'd be done with this already.
I'm not convinced that this can't be tackled in hardware (probably because I don't know anything about hardware). Stick with me for a minute though.
I imagine a single CPU core as having a vertical pipeline for calculation. What we're trying to do is take (let's say) four of these vertical pipelines and figure how best to use them, simultaneously in software. Except for the embarrassingly parallel problems, this is quite hard.
Can't we do _something_ in hardware to sort of stack those vertical pipelines? A single branch of execution (I'm not sure if that makes sense) would travel all the way up through all four cores before completing.
The bottom line is that I don't think it's feasible to expect developers to write maximally efficient code for computers that have a dozen cores. It makes more sense for the hardware to make those dozen 2GHz cores appear to the OS as one 24GHz core. I realize it would only operate close to 2GHz for single operations, but would scale up toward 24GHz when multi-tasking.
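To put rough numbers on the 2GHz-versus-24GHz intuition, here is a minimal sketch using Amdahl's law (my choice of model, not something the post names). It assumes the workload splits cleanly into a serial part that runs on one core and a parallel part that scales perfectly across all cores; the function name and the sample fractions are purely illustrative.

```python
def effective_ghz(core_ghz, cores, parallel_fraction):
    """Apparent single-core clock under Amdahl's law.

    Assumes the serial part runs on one core and the parallel part
    scales perfectly across all cores (an idealisation).
    """
    serial_fraction = 1.0 - parallel_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / cores)
    return core_ghz * speedup


if __name__ == "__main__":
    # A dozen 2 GHz cores, with increasingly parallel workloads.
    for p in (0.0, 0.5, 0.9, 1.0):
        print(f"parallel fraction {p:.1f}: ~{effective_ghz(2.0, 12, p):.1f} GHz equivalent")
```

Under these assumptions a fully serial job sees 2 GHz, a fully parallel one sees 24 GHz, and a job that is 90% parallel lands at a bit over 11 GHz, so the "one big core" illusion only holds for work that was parallel to begin with.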
Excuse my ignorance of the subject matter. My intent is only to contribute something to the conversation.
Kevin
Most of the bad ideas in multiprocessing have already been explored in supercomputers. From the nCube to the Connection Machine to the BBN Butterfly, we have a good idea of what doesn't work.
We know three things that work: clusters, symmetrical shared memory multiprocessors, and highly parallel graphics-type engines. Everything else that's been tried, from hypercubes to perfect shuffle machines, has been a dud.
Clusters make the computing world go round. All big server farms are clusters of relatively independent machines communicating over I/O channels. "Web services" are provided by clusters. So we know that works. The job of the server designer is to make clusters cheaper, smaller, and less power-hungry. There's a ready market for hardware that does that. With Google building data centers in former aluminum smelter locations just to get cheap power, there's no question that this is a very real problem.
Machines within a cluster can be symmetrical multiprocessors. That works fine. Asymmetrical multiprocessors are usually a pain. There's a long history of that idea in large computers, and they've consistently been disappointing. In clusters, each CPU has plenty of memory and its own private disks. Intercommunication is limited and slow, yet not usually the bottleneck. There are faster interconnection schemes, like Infiniband, but most clusters stick with some form of Ethernet.
It's worth noting that while the Cell gets attention, the XBox 360 is more successful, and it's quite conventional. It's a 3-CPU shared memory symmetrical multiprocessor (PowerPC), with a conventional GPU (ATI) on the back end. On the Cell, brilliant people (I know some of them) struggle to cram problems into the architecture. On the XBox, game designers develop games. Weird architecture hurts your time to market.
The real choke points today are CPU-to-memory and memory-to-disk. We may see more memory move to the CPU chip, in the form of larger caches. We may see machines with a modest number of cores and main memory on the CPU chip. This is easy to design and improves memory access times. When a gigabyte or so can be crammed onto the CPU chip, this will look like a good option for desktop machines. One big chip will be the whole computer. That's the low-end PC of the near future.
As non-volatile memory ("flash" and its friends) becomes cheaper, we'll face a new architectural challenge. To date, such memory has usually been treated as a fast disk. But this is suboptimal. Flash is near random access, but we're not using it that way. We need a new way to talk to flash memory, something that has file-system-like protection, doesn't require OS intervention, and has finer granularity than disk blocks. An interesting concept would be a flash memory/CPU combo optimized for running SQL-type databases.
It has been quite obvious to several people in the usenet news:comp.arch newsgroup that the future should give us chips that contain multiple cores with different capabilities:
:-(
As long as all these cores share the same basic architecture (i.e. x86, Power, ARM), it would be possible to allow all general-purpose code to run on any core, while some tasks would be able to ask for a core with special capabilities, or the OS could simply detect (by trapping) that a given task was using a non-uniform resource like vector fp, mark it for the scheduler, and restart it on a core with the required resource.
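The trap-and-reschedule idea can be sketched in a few lines. This is a toy model under loud assumptions, not how any real OS scheduler works: the `Core`, `Task` and `UnsupportedInstruction` names are hypothetical, the exception stands in for the hardware trap, and a "capability" is just a string such as "vector_fp".

```python
class UnsupportedInstruction(Exception):
    """Stands in for the hardware trap raised on a missing feature."""

    def __init__(self, feature):
        super().__init__(feature)
        self.feature = feature


class Core:
    def __init__(self, name, capabilities):
        self.name = name
        self.capabilities = set(capabilities)


class Task:
    def __init__(self, name, needs):
        self.name = name
        self.needs = set(needs)   # features the code really uses
        self.required = set()     # what the scheduler has learned so far


def run_on(core, task):
    """Pretend to execute the task; trap on the first missing feature."""
    missing = task.needs - core.capabilities
    if missing:
        raise UnsupportedInstruction(sorted(missing)[0])
    return core.name


def schedule(task, cores):
    """Start on any acceptable core; on a trap, mark the task and restart it."""
    while True:
        candidates = [c for c in cores if task.required <= c.capabilities]
        if not candidates:
            raise RuntimeError(f"no core can run {task.name}")
        try:
            return run_on(candidates[0], task)
        except UnsupportedInstruction as trap:
            task.required.add(trap.feature)   # remembered for next time


if __name__ == "__main__":
    cores = [Core("inorder0", {"int"}), Core("ooo0", {"int", "vector_fp"})]
    print(schedule(Task("irq_handler", {"int"}), cores))
    print(schedule(Task("specfp", {"int", "vector_fp"}), cores))
```

In this sketch the general-purpose task stays on the first (simple) core, while the vector-fp task traps once, gets marked, and is restarted on the core that has the feature.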
An OS interrupt handler could run better on a short pipeline in-order core, a graphics driver could use something like Larrabee, while SPECfp (or anything else that needs maximum performance from a single thread) would run best on an Out-of-Order core like the current Core 2.
The first requirement is that Intel/AMD must develop the capability to test & verify multiple different cores on the same chip, the second that Microsoft must improve their OS scheduler to the point where it actually understands NUMA principles not just for memory but also cpu cores. (I have no doubt at all that Linux and *BSD will have such a scheduler available well before the time you & I can buy a computer with such a cpu in it!)
So why do I believe that such cpus are inevitable?
Power efficiency!
A relatively simple in-order core like the one that Intel just announced as Atom delivers maybe an order of magnitude better performance/watt than a high-end Core 2 Duo. With 16 or 80 or 256 cores on a single chip, this will become really crucial.
Terje
PS As other posters have noted, keeping tomorrow's multi-core chips fed will require a lot of bandwidth; this is neither free nor low-power.
"almost all programming can be viewed as an exercise in caching"
I know everyone wants to develop a new language or model that will make parallel programming easy and cheap. Maybe they will succeed where everyone else has failed. Adding increased support for development and debug tools to the silicon isn't sexy, but it will have a real impact when it comes to making parallel software development cheaper and quicker.
The common way we use threads today is broken. It's far too easy to deadlock them, for instance. The coming explosion of cores, heterogeneous or homogeneous, gives us the opportunity to learn that there are other concurrency models.
See "The Problem with Threads" in IEEE Computer, May 2006 for a primer.
Then go crack out a PS-300 (homogeneous example) manual if yours has not yet crumbled to dust. Or an Amiga (heterogeneous example) manual, if you must. Those two machines got it right (mostly). The PS-300 was too easy to break via injudicious use of a clock data source, but demonstrated the rendezvous model quite well.
One of the great things about Linux/Unix is that it is really easy to write quick, simple scripts to accomplish little tasks as they occur. Whether written for SH/BASH/etc., for PHP, Python, Perl or what have you, a few lines of code can provide a time or labor saving solution to a sysadmin or skilled end-user. This is one of the things that has encouraged me to convert more and more machines from Windows Server to Linux. These scripts, however, are almost impossible to easily and quickly write in a way that leverages multiple cores. Some interpreters do not support multi-threading and others have funky threading implementations that do not seem to be of much use aside from handling asynchronous IO.
Though I studied programming years ago (late '80s and early '90s), I am far from a skilled programmer. I do, however, have enough of a grasp of the subject to be able to create purpose-specific scripts to make my life as an admin easier or to solve situationally unique problems. Since they often are used to automate repetitive tasks, they tend to have a good degree of parallelism by nature.
I have spent significant time Googling and reading online docs, and I have not found a reasonably performant threading implementation that even remotely maintains the ease of coding that non-threaded scripts have. While I know that most of this discussion is focused on the software created by developers for distribution, I have a suspicion that having a multitasking script interpreter that is as easy for admins to use as what we have now would greatly improve server performance.
After all, if there are a few poky script interpreters hogging a few cores, even the best optimized daemons will not be able to work to their potential.
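For what it's worth, here is one way a short admin-style script can fan independent jobs out across cores. It is a minimal sketch, assuming a recent Python 3 and using the standard `concurrent.futures` and `subprocess` modules; the job itself (upper-casing a string in a child process) is a made-up placeholder for whatever per-file or per-host command the script would really run.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(arg):
    """One independent unit of work, run as a separate child process.

    The child here is a placeholder; a real script would call rsync,
    gzip, a log parser, and so on.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_all(args, workers=4):
    """Run the jobs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, args))


if __name__ == "__main__":
    print(run_all(["alpha", "beta", "gamma"]))
```

The threads only wait on their child processes, so the OS is free to place each child on its own core. For CPU-heavy work done in Python itself, `ProcessPoolExecutor` from the same module is the usual substitute.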
WARNING: Smoking this sig may cause lowered IQ, insanity or short term memory loss. It is also really bad for your monit
Compatibility, flexibility, ease of use, no problem.
Back in 2000 I realized that 50 million transistors' worth of 4004s (the first microprocessor ever created) would outperform a P4 with the same transistor count, made in the same fab and running at the same clock rate. It would be over 10x faster, by my calculation. But how would you use such a device?
I had been working with a 100 PC cluster of P4 based systems to do H.264 HDTV compression in realtime. I spread the compression function across the cluster, using each system to work on a small part of the problem and flowing the data across the CPUs.
Based on this I wanted to build an array of processors on one chip, but I am not a silicon person, just software, drivers and some basic electronics. So I looked at various FPGA cores, ARM, MIPS, etc. Then I went to a talk given by Chuck Moore, author of the language FORTH. He had been building his own CPUs for many years using his own custom tools.
I worked with Chuck Moore for about a year in 2001/2002 on creating a massively multicore processor based on Chuck's stack processor.
The idea was, instead of having 1, 2 or 4 large processors, to have 49 (7 x 7) small, light but fast processors in one chip. This would be for tackling a different set of problems than your classic CPUs. It wouldn't be for running an OS or word processing, but for multimedia, cryptography, and other mathematical problems.
The idea was to flow data across the array of processors.
Each processor would run at 6 GHz, with 64K words of RAM each.
21-bit-wide words and bus (based on the F21 processor);
this allows for 4x 5-bit instructions on a stack processor that only has 32 instructions.
Since it's a stack processor, it runs more efficiently. So in 16K transistors, 4000 gates,
the F21 at 500 MHz performed about the same as a 500 MHz 486 with JPEG compress and decompress.
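The "four 5-bit instructions per word" arithmetic works because 5 bits give exactly 32 opcodes and 4 x 5 = 20 bits fit inside a 21-bit word. A minimal sketch of that packing follows; the slot order and the unused top bit are my assumptions for illustration, not the actual F21 encoding.

```python
OPCODE_BITS = 5
SLOTS = 4
OPCODE_MASK = (1 << OPCODE_BITS) - 1   # 0b11111: room for 32 distinct opcodes


def pack(opcodes):
    """Pack four 5-bit opcodes into one word, first opcode in the high bits."""
    if len(opcodes) != SLOTS:
        raise ValueError("expected exactly four opcodes")
    word = 0
    for op in opcodes:
        if not 0 <= op <= OPCODE_MASK:
            raise ValueError("opcode does not fit in 5 bits")
        word = (word << OPCODE_BITS) | op
    return word   # uses 20 bits; bit 20 is left unused in this sketch


def unpack(word):
    """Recover the four opcodes in execution order."""
    return [(word >> (OPCODE_BITS * i)) & OPCODE_MASK
            for i in reversed(range(SLOTS))]


if __name__ == "__main__":
    w = pack([1, 2, 3, 4])
    print(w, unpack(w))
```

One memory fetch therefore delivers four instructions, which is part of why such a small core can keep itself busy.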
With the parallel core design, instead of a common bus or network between the processors there would only be 4 connections into and out of each processor. These would be 4 registers that are shared with its 4 neighboring processors, which are laid out in a grid. So each processor would have a north, south, east and west register.
Data would be processed in what's called a systolic array, where each core would pick up some data, perform operations on it and pass it along to the next core.
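To make the systolic idea concrete, here is a minimal software simulation of a single row of cores. It is only a sketch: each "core" is a plain Python function, the shared east/west register is a list slot, and `None` marks an empty register (so `None` cannot be used as data). None of this reflects the real chip's instruction set.

```python
def systolic_run(stages, inputs):
    """Push inputs through a chain of cores, one hop per clock tick.

    stages: one function per core, applied west to east.
    inputs: values fed into the westmost core, one per tick.
    """
    n = len(stages)
    east = [None] * n        # east[i]: what core i handed on last tick
    outputs = []
    # Extra empty ticks at the end drain values still in flight.
    for tick_input in list(inputs) + [None] * n:
        # Each core reads its west neighbour's register from the previous tick.
        west = [tick_input] + east[:-1]
        east = [None if v is None else stage(v)
                for stage, v in zip(stages, west)]
        if east[-1] is not None:
            outputs.append(east[-1])
    return outputs


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    print(systolic_run(stages, [1, 2, 3]))
```

Once the pipeline has filled, every core is working on a different data item during the same tick, which is where the throughput comes from.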
The chips with a 7x7 grid of processors would expose the 28 (4x7) bus lines off the edge processors, so that these could be tiled into a much larger grid of processors.
Each chip could perform around 117 Billion instructions per second at 1 Watt of power.
Unfortunately I was unable to raise money, partly because I couldn't get any commitment from Chuck.
Below are some links and other misc information on this project. Sorry it's not better organized.
This was my project.
---------
http://www.enumera.com/chip/
http://www.enumera.com/doc/Enumeradraft061003.htm
http://www.enumera.com/doc/analysis_of_Music_Copyright.html
http://www.enumera.com/doc/emtalk.ppt
--------
This was Jeff Fox's independent web site; he worked on the F21 with Chuck.
http://www.ultratechnology.com/ml0.htm
http://www.ultratechnology.com/f21.html#f21
http://www.ultratechnology.com/store.htm#stamp
http://www.ultratechnology.com/cowboys.html#cm
------
http://www.colorforth.com/ 25x Multicomputer Chip
Chuck's site. 25x has been pulled down, but it's accessible on archive.org.
http://web.archive.org/web/*/www.colorfo
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
The ideal CPU would be designed for Linux.
It would have several dozen really small CPUs that are Linux commands/daemons literally burned into the chip.
So, much of the operating system would be hardware-based (maybe EEPROM microcode) that does much of the "core guts" of the Linux kernel (or maybe GCC libraries?).
The rest of the chip could be two or four X86 type multi-purpose CPUs.
A Microsoft CPU chip could use the same idea, but who would want it? (Winmodems, etc already have)
- I live the greatest adventure anyone could possibly desire. - Tosk the Hunted
Multicore machines could solve a big problem with microkernel architectures -- high context switch costs. If you lock down the microkernel to one of 8 cores -- let it monopolize the core -- then there is no context switch cost! You could then use a microkernel to implement Capability security architectures, which can provide mathematically provable security!
http://video.google.com/videoplay?docid=1762847950860111011
"However, what multiple cores might do is enable previously impractical tasks to be done on modest PCs. Things like NP problems, optimizations, simulations."
:p
Like hell. NP stands for nondeterministic polynomial time, not "non-polynomial", but for the hard problems in it every known exact algorithm takes exponential time. With 2^n growth, a problem with 2 items takes 4 units of effort, 4 items takes 16, and 8 items takes 256. Want to solve a problem like the travelling salesman problem? It's trivial if you visit one or two cities. However, were you to want to visit the 30,000 or so cities in the US, you're looking at something like 30,000 to the power of 30,000 things to examine (type it into a bc in a terminal -- you might want to time how long it takes to print a number that large). Having 1,000 CPU cores in a box is not going to get you there meaningfully sooner than having 2 -- you need an exponential speedup, which means either a new algorithm, or quantum computing. That, or patience to see if the universe ends in heat death or a big crunch before you get your answer.
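The claim that more cores barely move the needle on exponential problems is easy to check numerically. This is a minimal sketch, assuming brute-force work that grows as 2^n, perfect scaling across cores, and an arbitrary budget of a billion operations per core; the function name and the budget are illustrative.

```python
def largest_n(budget_ops, cores, base=2):
    """Largest problem size n with base**n <= budget_ops * cores.

    Assumes the work is base**n operations and that it parallelises
    perfectly, which is the most generous case for multicore.
    """
    total = budget_ops * cores
    n = 0
    while base ** (n + 1) <= total:
        n += 1
    return n


if __name__ == "__main__":
    for cores in (1, 2, 1000):
        print(f"{cores} core(s): n = {largest_n(10**9, cores)}")
```

Under these assumptions one core handles n = 29, and a thousand cores handle n = 39: a 1000x hardware increase buys about ten more items, because each extra item doubles the work.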
Optimized scheduling and goods flow (with more than 2 restrictions) is NP as well. You can approximate NP solutions with heuristics and clever algorithms, even doing some fancy work with stats and running approximations in parallel to get arbitrarily close to a solution in some cases, but you're still not solving the NP problem.
Simulations and particle physics could be done in parallel, potentially, but there are limits there as well. If you have a scene with 32 items, you do still need to synchronize their interactions (you can only split the parallelism so far). The reason we're seeing multiple cores is because it's getting harder to make a single CPU a significant amount faster. Multiple cores just means we spend less time faking multiple cores, and won't solve problems that require more than a linear speedup to become computable in reasonable or real-time.
Multicore is not a panacea. The trouble people whine about is because multithreaded programming is hard in a lot of languages and environments due to side effects. If you can convince people to switch over en masse to Scheme, Haskell, SML, Prolog, you might solve this problem -- or at least make it less of a big deal. I doubt that's going to happen (but I'd love to be wrong).
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Cell was a fairly radical design departure. If IBM continues to refine Cell, and as more experience is gained, the challenge will likely diminish.
For one thing, IBM will likely add double precision floating point support.
The reason why x86 never died the thousand deaths predicted by the RISC camp is that heat never much mattered.
The Cell does sound pretty good, but for now I'll stick to Intel. You see, if you were to tell me about your heterosexual experiences with the Cell, I'd buy into it in a New York minute. That's where Intel is winning hands down.
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
The issue of the lack of progress in creating tools to simplify multithreaded programming has been a topic of discussion for well over a decade. Most programmers just don't make much use of multithreading. They take advantage of multithreading because their Web server and database support it and the Web server runs each request in a separate thread. Even then, some activity is complex and is usually not further parallelized. Operating systems programmers and some realtime programmers tend to be good at multithreading and parallel programming, but this is a small minority of programmers. Heck, look at Rails, one of the most popular Web frameworks - it isn't thread safe!
Look at most people's screens. Even if they have multiple programs running, they tend to have the one they are working on full screen. Studies have shown that people who multitask are less efficient than people who do one job at a time. Perhaps we are not educated to look at problems as solvable in a parallel fashion or perhaps there is some other human based problem. Maybe like many other skills, being able to think and program in a multithreaded fashion is a talent that only a small fraction of the population has.
This "panic" isn't going away and there is NO quick fix on the programming horizon. The hardware designers can stuff more cores in the box, but programmers won't keep up. What can consume the extra CPU power are things like speech recognition, handwriting and gesture recognition, and rich media. Each of these can run in its own 1-4 cores and help us serial humans interact with those powerful computers more easily.
80 cores on a chip? So what. That's just an exercise in integration. As the number of available transistors continues to increase, so it will be easier to shove more simple cores on the die.
If you want to see real innovation in this field look at Sun's Niagara (UltraSPARC T1 and T2) and ROCK. They are a bit more clever.
When I first learned how to write a server I learned how to split off threads for each client. Trust me... if you have a degree in computer science you know how to do this. Many OSes can schedule threads across every processor, utilizing them all. The only people who are nervous are those guys with legacy software who didn't have the foresight to write their code using well-known techniques.
This is a good thing; hopefully much-needed new development systems will be the fallout from this 'panic'.
The Kruger Dunning explains most post on
Very interesting, I appreciate the feedback. It's a finite element method, you can break the work into chunks for 90% of the code until you need to solve the system of equations at the end of the iteration. And even in the solver, to a point you can parallelize if you are careful. I'm not sure about the memory requirements, to be honest with you it's still in 2D, I'm working on breaking into 3D right now. (then adding in all the fun stuff like reacting flows, etc.)
What resources did you use to learn Cell programming?
Thanks again for your insights.
Whoever runs the fastest.
Now your problem is to parallelize the linear system solver. Normally this task takes up to 90% of total execution time, so it's a good candidate for running on SPEs. For the other 10%, leave it on the PPE. And don't forget: adding "fun" stuff increases the code size, which means less space for data on SPE.
Resources are available on IBM's developerWorks site (http://www.ibm.com/developerworks/power/cell/docs_articles.html; see also the forums, where some interesting issues are discussed) and at the Barcelona Supercomputing Center (http://www.bsc.es/plantillaH.php?cat_id=326).
Do appreciate it.
I'm bogged down in school right now (working on my PhD... CFD, heat transfer, etc.) but hoping this summer/fall to do something a little more "fun". Have to do some research to see if this is it.
Thanks again.
I work with Cavium Networks Octeon processors. These are 16-core MIPS beasts that are capable of running different OSes/applications on different cores. You can run Linux on a few cores, your TCP/IP stack on another four cores, a crypto engine on another core, etc.
I support the Center for Consumer Freedom
-- Did you try Tao3D? http://tao3d.sourceforge.net