Panic in Multicore Land

Panic? by jaavaaguru · 2008-03-10 23:34 · Score: 4, Insightful

I think "panic" is a bit of an over-reaction. I use a multicore CPU. I write software that runs on it. I'm not panicking.

--
Follow me

Re:Panic? by shitzu · 2008-03-10 23:42 · Score: 4, Insightful

Still, the fact remains that the x86 processors (due to the OS-s that run on them, actually) have not gone much faster in the last 5-7 years. The only thing that has shown serious progress is power consumption and heat dissipation. I mean - the speed the user experiences has not improved much.
Re:Panic? by leenks · 2008-03-10 23:57 · Score: 5, Insightful

How is an 80-core cpu a cut down version of a dual-CPU box? This is the kind of technology the authors are discussing, not your Core2 duo MacBook...
Re:Panic? by Cutie+Pi · 2008-03-11 00:03 · Score: 5, Informative

Yeah, but if you extrapolate to where things are going, we're going to have CPUs with dozens if not hundreds of cores on them. (See Intel's 80 core technology demo as an example of where their research is going). Can you write or use general purpose software that takes advantage of that many cores? Right now I expect there is a bit of panic because it's relatively easy to build these behemoths, but not so easy to use them efficiently. Outside of some specialized disciplines like computational science and finance (that have already been taking advantage of parallel computing for years), there won't be a big demand for uber-multicore CPUs if the programming models don't drastically improve. And those innovations need to happen now to be ready in time for CPUs of 5 years from now. Since no real breakthroughs have come however, the chip companies are smart to be rethinking their strategies.
Re:Panic? by Chrisq · 2008-03-11 00:04 · Score: 4, Insightful

Yes panic is strong, but the issue is not with multi-tasking operating systems assigning processes to different processors for execution. That works very well. The problem is when you have a single CPU-intensive task, and you want to split that over multiple processors. That, in general, is a difficult problem. Various solutions, such as functional programming, threads with spawns and waits, etc. have been proposed, but none are as easy as just using a simple procedural language.
Re:Panic? by ObsessiveMathsFreak · 2008-03-11 00:47 · Score: 4, Insightful

That works very well. The problem is when you have a single CPU-intensive task, and you want to split that over multiple processors. That, in general, is a difficult problem.

It is in general, an impossible problem.

Most existing code is imperative. Most programmers write in imperative programming languages. Object orientation does not change this. Imperative code is not suited for multiple CPU implementation. Stapling things together with threads and messaging does not change this.

You could say that we should move to other programming "paradigms". However in my opinion, the reason we use imperative programs so such is because most of the tasks we want accomplished are inherently imperative in nature. Outside of intensive numerical work, most tasks people want done on a computer are done sequentially. The availability of multiple cores is not going to change the need for these tasks to be done in that way.

However, what multiple cores might do is enable previously impractical tasks to be done on modest PCs. Things like NP problems, optimizations, simulations. Of course these things are already being done, but not on the same scale as things like, say, spreadsheets, video/sound/picture editing, gaming, blogging, etc. I'm talking about relatively ordinary people being able to do things that now require supercomputers, experimenting and creating on their own laptops. Multi core programs can be written to make this feasible.

Considering I'm beginning to sound like an evangelist, I'll stop now. Safe money says PCs stay at 8 CPUs or below for the next 15 years.

--
May the Maths Be with you!
Re:Panic? by divisionbyzero · 2008-03-11 00:47 · Score: 5, Funny

Developers aren't panicking. Their kernels are! Ha! Oh, that was a good one. Where's my coffee?
Re:Panic? by Saurian_Overlord · 2008-03-11 00:51 · Score: 5, Insightful

"...the speed the user experiences has not improved much [in the last 5-7 years]."

This may almost be true if you stay on the cutting edge, but not even close for the average user (or the power-user on a budget, like myself). 5 years ago I was running a 1.2 GHz Duron. Today I have a 2.3 GHz Athlon 64 in my notebook (which is a little over a year old, I think), and an Athlon 64 X2 5600+ (that's a dual-core 2.8 GHz, for those who don't know) in my desktop. I'd be lying if I said I didn't notice much difference between the three.
Re:Panic? by johannesg · 2008-03-11 01:56 · Score: 4, Insightful

Let's not be too harsh on ourselves. In most systems today, the bottleneck is the hard disk, not the CPU. No amount of threading will rescue you if your memory has been swapped out.

I write large and complex engineering applications. I have a few threads around, mostly for the purpose of doing calculation and dealing with slow devices. But I'm not going to add in more threads just because there are more cores for me to use. I'll add threads when performance issues requires that I add threads, and not before.

Most software today runs fine as a single thread anyway. The specialized software that requires maximum CPU performance (and is not already bottle-necked by HD or GPU access) will be harder to write, but for everything else the current model is just fine.

If anything, Intel should worry about 99% of all people simply not needing 80 cores to begin with...
Re:Panic? by Penguin+Follower · 2008-03-11 01:56 · Score: 4, Informative

Unless you're speaking of AMD SMP systems, the Intel systems up until recently share the FSB among all the CPUs. So from the Intel side of things, SMP vs multi-core is nearly the same (save for L2 cache sharing and whatnot). The only notable exception, on the Intel side, that I have noticed is that the recent Xeon systems (within like the last two years) seem to be using two "northbridges". For example, my "quad-core" Mac Pro tower that I bought in April of 2007. It has two dual-core Xeons and the motherboard has two northbridges (though Intel doesn't refer to their chipsets that way last I checked. They like to talk about "hubs".).
Re:Panic? by coats · 2008-03-11 02:57 · Score: 5, Informative
Ditto. And the principles are pretty generic; they haven't changed since a decade before a seminar I gave six years ago at EPA's Office of Research and Development .
And frankly, it helps a lot to write code that is microprocessor-friendly to begin with:
- Algorithms are important; that's where the biggest wins usually are.
- Memory is much slower than the processors, and is organized hierarchically.
- ALU's are superscalar and pipelined;
- Current processors can have as many as 100 instructions simultaneously executing in different stages of execution, so avoid data dependencies that break the pipeline.
- Parallel "gotchas" are at the bottom of this list...
If the node-code is bad enough, it can make any parallelism look good to the user. But writing good node-code is hard;-( As a reviewer, I have recommended rejection for a parallel-processing paper that claimed 80% parallel efficiency on 16 processors for the author's air-quality model. But I knew of a well-coded equivalent model that outperformed the paper's 16-processor model-result on a single processor -- and still got 75% efficiency on 16 processors (better than 10x the paper-author's timing).
fwiw.
--
"My opinions are my own, and I've got *lots* of them!"
Re:Panic? by Alsee · 2008-03-11 03:16 · Score: 5, Insightful

spreadsheets, video/sound/picture editing, gaming, blogging

Odd selection of examples. The processing of cells can almost trivially be allocated across 80 cores. Media work can almost trivially be split into chunks across 80 cores. Games usually relatively easy to split, either by splitiing the graphics into chunks or parallelizable physics or other parallelizable simulation aspects.

Oh, and blogging.
My optical mouse has enough processing horsepower inside for blogging.

OPTICAL MOUSE CIRCUITRY:
Has the user pressed a key?
No.
Has the user pressed a key?
No.
Has the user pressed a key?
No.
(repeat 1000 times)
Has the user pressed a key?
No.
Has the user pressed a key?
No.
Has the user pressed a key?
Yes.
OOOO! YES!
QUICK QUICK QUICK! HURRY HURRY HURRY! PROCESS A KEYPRESS! YIPEE!

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

Should Mimick The Brain by curmudgeon99 · 2008-03-10 23:39 · Score: 5, Interesting

Well, the most recent research into how the cortext works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.

Recall that the human brain receives a series of pattern streams from each of the senses. These patterns streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortext, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortext layer passes its assessment of what it saw the outline of to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimate to the IT cortex layer.

The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.

The takeaway is this: each cortex does not just do more of the same thing. Instead, it does a refinement of the level below it. This type of hierarchical processing is how multicore processors should be built.

Re:Should Mimick The Brain by El_Muerte_TDS · 2008-03-11 00:00 · Score: 4, Funny

If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.

I think it's pretty obvious there are serious design flaws in the human brain. And I'm not only talking about stability, but also reliability and accuracy.
Just look at the world.

Not *that* Chuck Moore by Hobart · 2008-03-10 23:42 · Score: 4, Informative

This article is referring to AMD's Charles R. "Chuck" Moore, who worked on the POWER4 and PowerPC 601, not the language and chip designer Charles H. "Chuck" Moore who invented Forth, ColorForth, et al. and was interviewed on slashdot.

--
o/~ Join us now and share the software ...

Re:Not *that* Chuck Moore by Hal_Porter · 2008-03-10 23:52 · Score: 5, Funny

Those +1 Informative links go to wikipedia, an online encyclopedia.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

Re:Self Interest by The_Angry_Canadian · 2008-03-10 23:47 · Score: 5, Informative

The article covers many point of views. Not only the one from Chuck Moore.

The future is here by downix · 2008-03-10 23:48 · Score: 5, Insightful

What Mr Moore is saying does have a grain of truth, that generic will be beaten by specific in key functions. The Amiga proved that in 1985, being able to deliver a better graphical solution than workstations costing tens of thousands more. The key now is to figure out which specifics you can use without driving up the cost nor without compromizing the design ideal of a general purpose computer.

--
Karma Whoring for Fun and Profit.

My heterogeneous experience with Cell processor by DoofusOfDeath · 2008-03-11 00:01 · Score: 5, Interesting

I've been doing some scientific computing on the Cell lately, and heterogeneous cores don't make life very easy. At least with the Cell.

The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.

The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.

So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.

Re:My heterogeneous experience with Cell processor by nycguy · 2008-03-11 00:16 · Score: 5, Interesting

I agree. While a proper library/framework can help abstract the difficulties associated with a heterogeneous/asymetric architecture away, it's just easier to program for a homogeneous environment. This same principle applies all the way down to having general-purpose registers in a RISC chip as opposed to special-purpose registers in a CISC chip--the latter may let you do a few specialized things better, but the former is more accomodating for a wide range of tasks.
And while the Cell architecture is a fairly stationary target because it was incorporated into a commercial gaming console, if these types of architectures were to find their way into general purpose computing, it would be a real nightmare, since every year or so a new variant of the architecture would come out that would introduce a faster interconnect here, more cache memory there, etc., so that one might have to reorganize the division of labor in one's application to take advantage (again a properly parameterized library/framework can handle this sometimes, but only post facto--after the variation in features is known, not before the new features have even been introduced).
Re:My heterogeneous experience with Cell processor by epine · 2008-03-11 01:01 · Score: 5, Interesting

So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version.
Your points are valid as things stand, but isn't it a bit premature to make this judgment? Cell was a fairly radical design departure. If IBM continues to refine Cell, and as more experience is gained, the challenge will likely diminish.

For one thing, IBM will likely add double precision floating point support. But note that SIMD in general poses problems in the traditional handling of floating point exceptions, so it still won't be quite the same as double precision on the PPU.

The local-memory SPE design alleviates a lot of pressure on the memory coherence front. Enforcing coherence in silicon generates a lot of heat, and heat determines your ultimate performance envelop.

For decades, programmers have been fortunate in making our own lives simpler by foisting tough problems onto the silicon. It wasn't a problem until the hardware ran into the thermal wall. No more free lunch. Someone has to pay on one side or the other. IBM recognized this new reality when they designed Cell.

The reason why x86 never died the thousand deaths predicted by the RISC camp is that heat never much mattered. Not enough registers? Just add OOO. Generates a bit more heat to track all the instructions in flight, but no real loss in performance. Bizarre instruction encoding? Just add big complicated decoders and pre-decoding caches. Generates more heat, but again performance can be maintained.

Probably with a software architecture combining the hairy parts of the Postgres query execution planner with the recent improvements in the FreeBSD affinity-centric ULE scheduler, you could make the nastier aspects of SPE coordination disappear. It might help if the SPUs had 512KB instead of 256KB to alleviate code pressure on data space.

I think the big problem is the culture of software development. Most code functions the same way most programmers begin their careers: just dive into the code, specify requirements later. What I mean here is that programs don't typically announce the structure of the full computation ahead of time. Usually the code goes to the CPU "do this, now do that, now do this again, etc." I imagine the modern graphics pipelines spell out longer sequences of operations ahead of time, by necessity, but I've never looked into this.

Database programmers wanting good performance from SQL *are* forced to spell things out more fully in advance of firing off the computation. It doesn't go nearly far enough. Instead of figuring out the best SQL statement, the programmer should send a list of *all* logically equivalent queries and just let the database execute the one it finds least troublesome. Problem: sometimes the database engine doesn't know that you have written the query to do things the hard way to avoid hitting a contentious resource that would greatly impact the performance limiting path.

These are all problems in the area of making OSes and applications more introspective, so that resource scheduling can be better automated behind the scenes, by all those extra cores with nothing better to do.

Instead, we make the architecture homogeneous, so that resource planning makes no real difference, and we can thereby sidestep the introspection problem altogether.

I've always wondered why no-one has ever designed a file system where all the unused space is used to duplicate other disk sectors/blocks, to create the option of vastly faster seek plans. Probably because it would take a full-time SPU to constantly recompute the seek plan as old requests are completed and new requests enter the queue. Plus if two supposedly identical copies managed to diverge, it would be a nightmare to debug, because the copy you get back would non-deterministic. Hybrid MRAM/Flash/spindle storage systems could get very interesting.

I guess I've been looking forward to the end of artificial scaling for a long time (clock freq. as the

Well, I'm panicked... by argent · 2008-03-11 00:02 · Score: 4, Interesting

The idea of having to use Microsoft APIs to program future computers because the vendors only document how to get DirectX to work doesn't exactly thrill me. I think panic is perhaps too strong a word, but sheesh...

he is right, but it depends on the application by CBravo · 2008-03-11 00:13 · Score: 5, Interesting

As I demonstrated in my thesis a parallel application can be shown to have certain critical and less critical parts. An optimal processing platform matches those requirements. The remainder of the platform will remain idle and burn away power for nothing. One should wonder what is better: a 2 GHz processor or 2x 1 GHz processors. My opinion is that, if it has no impact on performance, the latter is better.

There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).

--
nosig today

Multithreading is not easy but it's doable by pieterh · 2008-03-11 00:20 · Score: 5, Interesting

It's been clear for many years that individual core speeds had peaked, and that the future was going to be many cores and that high-performance software would need to be multithreaded in order to take advantage of this.

When we wrote the OpenAMQ messaging software in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project.

Today, we're making a new design - ØMQ, aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.

From one of the ØMQ whitepapers:

Inter-thread synchronisation is slow. If the code is local to a thread (and doesn't use slow devices like network or persistent storage), execution time of most functions is tens of nanoseconds. However, when inter-thread synchronisation - even a non-blocking synchronisation - kicks in, execution time grows by hundreds of nanoseconds, or even surpasses one microsecond. All kind of time-expensive hardware-level stuff has to be done... synchronisation of CPU caches, memory barriers etc.

The best of the breed solution would run in a single thread and omit any inter-thread synchronisation altogether. It seems simple enough to implement except that single-threaded solution wouldn't be able to use more than one CPU core, i.e. it won't scale on multicore boxes.

A good multi-core solution would be to run as many instances of ØMQ as there are cores on the host and treat them as separate network nodes in the same way as two instances running on two separate boxes would be treated and use local sockets to pass messages between the instances.

This design is basically correct, however, the sockets are not the best way to pass message within a single box. Firstly, they are slow when compared to simple inter-thread communication mechanisms and secondly, data passed via a socket to a different process has to be physically copied, rather than passed by reference.

Therefore, ØMQ allows you to create a fixed number of threads at the startup to handle the work. The "fixed" part is deliberate and integral part of the design. There are a fixed number of cores on any box and there's no point in having more threads than there are cores on the box. In fact, more threads than cores can be harmful to performance as they can introduce excessive OS context switching.

We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.

--
My blog

Multicores, but not on a chip by Kim0 · 2008-03-11 00:24 · Score: 5, Interesting

This trend with multiple cores on the CPU is only an intermediate phase,
because it over saturates the memory bus, which is easy to remedy by
putting the cores on the memory chips, of which there are a number
comparable to the number of cores.

In other words, the CPUs will disappear, and there will be lots of smaller
core/memory chips, connected in a network. And they will be cheaper as well,
because they do not need so high a yeld.

Kim0

Re:Multicores, but not on a chip by jo42 · 2008-03-11 01:44 · Score: 4, Informative

lots of smaller core/memory chips, connected in a network You mean like the Transputer back in the '80s?

Re:Let's see the menu by imikem · 2008-03-11 00:34 · Score: 5, Funny

Would you like fries with that?

--
Perscriptio in manibus tabellariorum est.

Re:Languages by chudnall · 2008-03-11 00:35 · Score: 5, Informative

*cough*Erlang*cough*

I think the wailing we're about to hear is the sound of thousands of imperative-language programmers being dragged, kicking and screaming, into functional programming land. Even the functional languages not specifically designed for concurrency do it much more naturally than their imperative counterparts.

--
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.

Re:Languages by Westley · 2008-03-11 00:47 · Score: 5, Informative

Java doesn't do a good job. It does a "better than abysmal" job in that it has some idea of threading with synchronized/volatile, and it has a well-defined memory model. (That's not to say there aren't flaws, however. Allowing synchronization on any reference was a mistake, IMO.)

What it *doesn't* do is make it easy to write verifiably immutable types, and code in a functional way where appropriate. As another respondent has mentioned, functional languages have great advantages when it comes to concurrency. However, I think the languages of the future will be a hybrid - making imperative-style code easy where that's appropriate, and functional-style code easy where that's appropriate.

C# 3 goes some of the way towards this, but leaves something to be desired when it comes to assistance with immutability. It also doesn't help that that .NET 2.0 memory model is poorly documented (the most reliable resources are blog posts, bizarrely enough - note that the .NET 2.0 model is significantly stronger than the ECMA CLI model).

APIs are important too - the ParallelExtensions framework should help .NET programmers significantly when it arrives, assuming it actually gets used. Of course, for other platforms there are other APIs - I'd expect them to keep leapfrogging each other in terms of capability.

I don't think C# 3 (or even 4) is going to be the last word in bringing understandable and reliable concurrency, but I think it points to a potential way forward.

The trouble is that concurrency is hard, unless you live in a completely side-effect free world. We can make it simpler to some extent by providing better primitives. We can encourage side-effect free programming in frameworks, and provide language smarts to help too. I'd be surprised if we ever manage to make it genuinely easy though.

Current state of software development by Alex+Belits · 2008-03-11 00:55 · Score: 5, Funny

Ugg is smart.
Ugg can program a CPU.
Two Uggs can program two CPUs.
Two Uggs working on the same task program two CPUs.
Uggs' program has a race condition.
Ugg1 thinks, it's Ugg2's fault.
Ugg2 thinks, it's Ugg1's fault.
Ugg1 hits Ugg2 on the head with a rock.
Ugg2 hits Ugg1 on the head with an axe.
Ugg1 is half as smart as he was before working with Ugg2.
Ugg2 is half as smart as he was before working with Ugg1.
Both Uggs now write broken code.
Uggs' program is now slow, wrong half the time, and crashes on that race condition once in a while.
Ugg does not like parallel computing.
Ugg will bang two rocks together really fast.
Ugg will reach 4GHz.
Ugg will teach everyone how to reach 4GHz.

--
Contrary to the popular belief, there indeed is no God.

+1 Optimistic by Sapphon · 2008-03-11 01:03 · Score: 4, Funny

The height of optimism: posting proof in the form of a 70-odd page thesis on a Slashdot.
I don't think we'll be Slashdotting your server any time soon, CBravo ;-)

--
Antiquis temporibus, nati tibi similes in rupibus ventosissimis exponebantur ad necem.

No problems for servers by TheLink · 2008-03-11 01:25 · Score: 5, Insightful

For servers the real problem is I/O. Disks are slow, network bandwidth is limited (if you solve that then memory bandwidth is limited ;) ).

For most typical workloads most servers don't have enough I/O to keep 80 cores busy.

If there's enough I/O there's no problem keeping all 80 cores busy.

Imagine a slashdotted webserver with a database backend. If you have enough bandwidth and disk I/O, you'll have enough concurrent connections that those 80 cores will be more than busy enough ;).

If you still have spare cores and mem, you can run a few virtual machines.

As for desktops - you could just use Firefox without noscript, after a few days the machine will be using all 80 CPUs and memory just to show flash ads and other junk ;).

--

Too many replies beneath your current threshold

32 of 367 comments (clear)