Panic in Multicore Land

Should Mimick The Brain by curmudgeon99 · 2008-03-10 23:39 · Score: 5, Interesting

Well, the most recent research into how the cortext works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.

Recall that the human brain receives a series of pattern streams from each of the senses. These patterns streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortext, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortext layer passes its assessment of what it saw the outline of to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimate to the IT cortex layer.

The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.

The takeaway is this: each cortex does not just do more of the same thing. Instead, it does a refinement of the level below it. This type of hierarchical processing is how multicore processors should be built.

Re:Should Mimick The Brain by radtea · 2008-03-11 01:41 · Score: 2, Interesting

Models from nature are rarely the best way to go. Heavier than air flight only got off the ground when people stopped looking to birds and bats for inspiration. Wheeled vehicles have no resemblance to horses. Interestingly, we are still trying to understand the nuanced details of the flight of birds based on the aerodynamics we learned building highly un-bird-like flying machines.

So while there's nothing wrong in looking at our radically imperfect understanding of the brain, which is in no better state than pre-flight understanding of bird aerodynamics, it is optimistic to expect that it will provide much guidance in building programmes for multi-core processors, or for building those processors themselves. Neural networks, the most famously brain-like system architecture, are famously hard to "programme" (train) and essentially impossible to debug (interpret).

The article suggests that heterogeneous multi-core architectures may be best represented to the programmer as a set of heterogeneous APIs, much as graphics-specific APIs are now. While this is vaguely consistent with the idea that "different parts of the brain do different things", I don't think the brain analogy brings anything useful to the table, and past experience should make us very wary of trying to draw any deeper inferences from it. Aeroplanes do look vaguely like birds, but that doesn't mean we should dispense with vertical stabilisers...

One could equally well argue that neurons in the brain are fairly homogeneous, and each core could be considered a neuron. We know that different parts of the brain are remarkably adaptable. Stroke patients often regain function due to other parts of the brain taking over from the bits that were destroyed. So on this analogy, homogeneous processors that could be adapted to multiple tasks is the way to go.

Demonstrating fairly conclusively that the brain analogy is pretty much useless, as it can be manipulated to appear to support whichever side of the debate you've already decided is the right one.

--
Blasphemy is a human right. Blasphemophobia kills.

Let's see the menu by Tribbin · 2008-03-10 23:39 · Score: 3, Interesting

Can I have... errr... Two floating point, one generic math with extra cache and two RISC's.

--
If you mod this up, your slashdot background will turn into a beautiful sunset!

My heterogeneous experience with Cell processor by DoofusOfDeath · 2008-03-11 00:01 · Score: 5, Interesting

I've been doing some scientific computing on the Cell lately, and heterogeneous cores don't make life very easy. At least with the Cell.

The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.

The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.

So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.

Re:My heterogeneous experience with Cell processor by nycguy · 2008-03-11 00:16 · Score: 5, Interesting

I agree. While a proper library/framework can help abstract the difficulties associated with a heterogeneous/asymetric architecture away, it's just easier to program for a homogeneous environment. This same principle applies all the way down to having general-purpose registers in a RISC chip as opposed to special-purpose registers in a CISC chip--the latter may let you do a few specialized things better, but the former is more accomodating for a wide range of tasks.
And while the Cell architecture is a fairly stationary target because it was incorporated into a commercial gaming console, if these types of architectures were to find their way into general purpose computing, it would be a real nightmare, since every year or so a new variant of the architecture would come out that would introduce a faster interconnect here, more cache memory there, etc., so that one might have to reorganize the division of labor in one's application to take advantage (again a properly parameterized library/framework can handle this sometimes, but only post facto--after the variation in features is known, not before the new features have even been introduced).
Re:My heterogeneous experience with Cell processor by epine · 2008-03-11 01:01 · Score: 5, Interesting

So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version.
Your points are valid as things stand, but isn't it a bit premature to make this judgment? Cell was a fairly radical design departure. If IBM continues to refine Cell, and as more experience is gained, the challenge will likely diminish.

For one thing, IBM will likely add double precision floating point support. But note that SIMD in general poses problems in the traditional handling of floating point exceptions, so it still won't be quite the same as double precision on the PPU.

The local-memory SPE design alleviates a lot of pressure on the memory coherence front. Enforcing coherence in silicon generates a lot of heat, and heat determines your ultimate performance envelop.

For decades, programmers have been fortunate in making our own lives simpler by foisting tough problems onto the silicon. It wasn't a problem until the hardware ran into the thermal wall. No more free lunch. Someone has to pay on one side or the other. IBM recognized this new reality when they designed Cell.

The reason why x86 never died the thousand deaths predicted by the RISC camp is that heat never much mattered. Not enough registers? Just add OOO. Generates a bit more heat to track all the instructions in flight, but no real loss in performance. Bizarre instruction encoding? Just add big complicated decoders and pre-decoding caches. Generates more heat, but again performance can be maintained.

Probably with a software architecture combining the hairy parts of the Postgres query execution planner with the recent improvements in the FreeBSD affinity-centric ULE scheduler, you could make the nastier aspects of SPE coordination disappear. It might help if the SPUs had 512KB instead of 256KB to alleviate code pressure on data space.

I think the big problem is the culture of software development. Most code functions the same way most programmers begin their careers: just dive into the code, specify requirements later. What I mean here is that programs don't typically announce the structure of the full computation ahead of time. Usually the code goes to the CPU "do this, now do that, now do this again, etc." I imagine the modern graphics pipelines spell out longer sequences of operations ahead of time, by necessity, but I've never looked into this.

Database programmers wanting good performance from SQL *are* forced to spell things out more fully in advance of firing off the computation. It doesn't go nearly far enough. Instead of figuring out the best SQL statement, the programmer should send a list of *all* logically equivalent queries and just let the database execute the one it finds least troublesome. Problem: sometimes the database engine doesn't know that you have written the query to do things the hard way to avoid hitting a contentious resource that would greatly impact the performance limiting path.

These are all problems in the area of making OSes and applications more introspective, so that resource scheduling can be better automated behind the scenes, by all those extra cores with nothing better to do.

Instead, we make the architecture homogeneous, so that resource planning makes no real difference, and we can thereby sidestep the introspection problem altogether.

I've always wondered why no-one has ever designed a file system where all the unused space is used to duplicate other disk sectors/blocks, to create the option of vastly faster seek plans. Probably because it would take a full-time SPU to constantly recompute the seek plan as old requests are completed and new requests enter the queue. Plus if two supposedly identical copies managed to diverge, it would be a nightmare to debug, because the copy you get back would non-deterministic. Hybrid MRAM/Flash/spindle storage systems could get very interesting.

I guess I've been looking forward to the end of artificial scaling for a long time (clock freq. as the
Re:My heterogeneous experience with Cell processor by Anonymous Coward · 2008-03-11 01:22 · Score: 0, Interesting

Double precison has been greatly improved in the last variants of the Cell SPU, not the ones in the PS3 though. The enhanced DP processors are only found in the recent IBM (and perhaps Mercury) blades, which are expensive, but the only difference with single precision is longer latency which leads to about half the flops (only 2 values per register instead of 4).

Next year we might even get the same processor with 32 SPU, still hard to program but it means 8MB total of data on the chip, which opens some opportunities (unfortunately the memory size per SPU seems to be set in stone at 256kB).

I'm interested in programming the Cell for doing some signal processing, most of which will be single precision FFT, an application where it seems to rock. I think that the data flow between SPU is relatively easy to organize for my purpose. OTOH, it seems nobody wants to sell bare Cell chips, which is sad, since I would love to try to interface it to high speed (1-2Gsamples/s) ADCs.
Re:My heterogeneous experience with Cell processor by Anonymous Coward · 2008-03-11 01:32 · Score: 1, Interesting

The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.

A lot of this is just due to the lack of a good platform, there is nothing that prevents demand paging of data SPEs need and the C++ features are just due to the current implementation of the runtime. I will agree that the Cell is aimed a little bit too much at the video and game markets in its current implementation though, think of it as a first step, if they can make is successful and some better platforms and tools materialize then imagine having 64 SPEs with different groups of specialized functions, perhaps some aimed at linear algebraic functions, some at more analytical, etc.. some kind of parallel multiprocessing is the future, that much is a given, it's just a matter of figuring out the right model.
Re:My heterogeneous experience with Cell processor by Karellen · 2008-03-11 02:12 · Score: 2, Interesting

"you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL)"

OK, I have to ask - why on earth can't you use C++ exceptions on them?

After all, what is an exception? It's basically syntactic sugar around setjmp()/longjmp(), but with a bit more code to make sure the stack unwinds properly and destructors are called, instead of longjmp() being a plain non-local goto.

What else is there that makes C++ exceptions unimplemenatable?

--
Why doesn't the gene pool have a life guard?

Well, I'm panicked... by argent · 2008-03-11 00:02 · Score: 4, Interesting

The idea of having to use Microsoft APIs to program future computers because the vendors only document how to get DirectX to work doesn't exactly thrill me. I think panic is perhaps too strong a word, but sheesh...

Re:Panic? by Anonymous Coward · 2008-03-11 00:03 · Score: 0, Interesting

> the speed the user experiences has not improved much.

User experience is not a useful metric for performance, unless you consider media encoding , decoding and rendering. 10 years ago I was running a P166, what kind of framerates would I get with a modern game using a software renderer? What kind of framerates would I get for decoding a HD video stream?

Do you seriously think a 12 year old P166 will provide a comparative user experience to a modern 8 core 3GHz machine? You're putting it down to "the OS's that run on them", which is interesting since user mode x86 emulation with QEMU runs W2K faster on my laptop than on the hardware I ran it on back in 1999.

he is right, but it depends on the application by CBravo · 2008-03-11 00:13 · Score: 5, Interesting

As I demonstrated in my thesis a parallel application can be shown to have certain critical and less critical parts. An optimal processing platform matches those requirements. The remainder of the platform will remain idle and burn away power for nothing. One should wonder what is better: a 2 GHz processor or 2x 1 GHz processors. My opinion is that, if it has no impact on performance, the latter is better.

There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).

--
nosig today

Multithreading is not easy but it's doable by pieterh · 2008-03-11 00:20 · Score: 5, Interesting

It's been clear for many years that individual core speeds had peaked, and that the future was going to be many cores and that high-performance software would need to be multithreaded in order to take advantage of this.

When we wrote the OpenAMQ messaging software in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project.

Today, we're making a new design - ØMQ, aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.

From one of the ØMQ whitepapers:

Inter-thread synchronisation is slow. If the code is local to a thread (and doesn't use slow devices like network or persistent storage), execution time of most functions is tens of nanoseconds. However, when inter-thread synchronisation - even a non-blocking synchronisation - kicks in, execution time grows by hundreds of nanoseconds, or even surpasses one microsecond. All kind of time-expensive hardware-level stuff has to be done... synchronisation of CPU caches, memory barriers etc.

The best of the breed solution would run in a single thread and omit any inter-thread synchronisation altogether. It seems simple enough to implement except that single-threaded solution wouldn't be able to use more than one CPU core, i.e. it won't scale on multicore boxes.

A good multi-core solution would be to run as many instances of ØMQ as there are cores on the host and treat them as separate network nodes in the same way as two instances running on two separate boxes would be treated and use local sockets to pass messages between the instances.

This design is basically correct, however, the sockets are not the best way to pass message within a single box. Firstly, they are slow when compared to simple inter-thread communication mechanisms and secondly, data passed via a socket to a different process has to be physically copied, rather than passed by reference.

Therefore, ØMQ allows you to create a fixed number of threads at the startup to handle the work. The "fixed" part is deliberate and integral part of the design. There are a fixed number of cores on any box and there's no point in having more threads than there are cores on the box. In fact, more threads than cores can be harmful to performance as they can introduce excessive OS context switching.

We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.

--
My blog

Re:Multithreading is not easy but it's doable by pieterh · 2008-03-11 02:45 · Score: 3, Interesting

It's unfair to compare blob messaging with a protocol that has to process XML, but let's look. I'm using http://www.ejabberd.im/benchmark as a basis:

- eJabberd latency is in the 10-50msec range. 0MQ gets latencies of around 25 microseconds.
- eJabberd supports more than 10k users. 0MQ will support more than 10k users.
- eJabberd scales transparently thanks to Erlang. 0MQ squeezes so much out of one box that scaling is less important.
- eJabberd has high-availability thanks to Erlang 0MQ will have to build its own HA model (as OpenAMQ did).
- eJabberd can process (unknown?) messages per second. 0MQ can handle 100k per second on one core.

Sorry if I got some things wrong, ideally we'd run side-by-side tests to get figures that we can properly compare.

Note that protocols like AMQP can be elegantly scaled at the semantic level, by building federations that route messages usefully between centers of activity. This cannot be done in the language or framework, it is dependent on the protocol semantics. This is how very large deployments of OpenAMQ work. I guess the same as SMTP networks.

0MQ will, BTW, speak XMPP one day. It's more a framework for arbitrary messaging engines and clients, than a specific protocol implementation.

I've seen Erlang used for AMQP as well - RabbitMQ - and by all accounts it's an impressive language for this kind of work.

--
My blog

Heterogenous is a natural thing to do by A+beautiful+mind · 2008-03-11 00:21 · Score: 3, Interesting

If you have 80 or more cores, I'd rather have 20 of them support specialty functions and be able to do them very fast (it would have to be a few (1-3) orders of magnitude faster than the general counterpart) and the rest do general processing. This of course needs the support of operating systems, but that isn't very hard to get. With 80 cores caching and threading models have to be rethought, especially caching - the operating system has to be more involved in caching than it currently is, because otherwise cache coherency won't be able to be done.

This also means that programs will need to be written not just by using threads, "which makes it okay for multi-core", but with cpu cache issues and locality in mind. I think VMs like JVM, Parrot and .NET will be much more popular as it is possible for them to take care a lot of these issues, which isn't or only possible in a limited way for languages like C and friends with static source code inspection.

--
It takes a man to suffer ignorance and smile
Be yourself no matter what they say

Multicores, but not on a chip by Kim0 · 2008-03-11 00:24 · Score: 5, Interesting

This trend with multiple cores on the CPU is only an intermediate phase,
because it over saturates the memory bus, which is easy to remedy by
putting the cores on the memory chips, of which there are a number
comparable to the number of cores.

In other words, the CPUs will disappear, and there will be lots of smaller
core/memory chips, connected in a network. And they will be cheaper as well,
because they do not need so high a yeld.

Kim0

Help me understand the distinction by Junior+J.+Junior+III · 2008-03-11 00:35 · Score: 2, Interesting

I'm curious how having specialized multi-core processors is different from having a single-core processor with specialized subunits. Ie, a single core x86 chip has a section of it devoted to implementing MMC, SSE, etc. Isn't having many specialized cores just a sophisticated way of re-stating that you have a really big single-core processor, in some sense?

--
You see? You see? Your stupid minds! Stupid! Stupid!

Re:Languages by TheRaven64 · 2008-03-11 01:12 · Score: 3, Interesting

For good parallel programming you just need to enforce one constraint:

Every object (in the general sense, not necessarily the OO sense) may be either aliased or mutable, but not both.

Erlang does this by making sure no objects are mutable. This route favours the compiler writer (since it's easy) and not the programmer. I am a huge fan of the CSP model for large projects, but I'd rather keep something closer to the OO model in the local scope and use something like CSP in the global scope (which is exactly what I am doing with my current research).

--
I am TheRaven on Soylent News

One Fast Core, Multiple Commodity ones by Brit_in_the_USA · 2008-03-11 01:21 · Score: 2, Interesting

I have read many times that some algorithms are difficult or impossible to multi-thread. I envisage the next logical step is a two socket motherboard, where one socket could be used for a 8+ core cpu running at low clock rate (e.g. 2-3Ghz) and another socket for a single core running at the greatest frequency achievable to the manufacturing process (e.g. x2 to x4 the clock speed of the multi-core) with whatever cache size compromises are required.

This help get around yield issues of getting all cores to work at a very high frequency and the related thermal issues . This could be a boon to general purpose computer that have a mix of hard to multi-thread and easy to multi-thread programs - assuming the OS could be intelligent on which cores the tasks are scheduled on. The cores could or could not have the same instruction sets, but having the same instruction sets would be the easy first step.

Re:Panic? by nekokoneko · 2008-03-11 02:39 · Score: 2, Interesting

Do notice that in 5 years we have barely increased the clock frequency of the CPUs Do notice that multi-cores don't increase the overall clock frequency,

Clock frequency is not an indicative of CPU performance. For example, the Core 2 chips, despite generally operating at a lower frequency than the Pentium 4's outperform them significantly.

just divide the work up among a set of lower clock frequency cores - yet most programs don't take advantage of that. ;-)

If I'm not mistaken, even if a specific program was not designed to use several cores, the OS can still run different programs in each core, improving the overall user performance. Correct me if I'm wrong.

Do notice that despite clock frequencies going from 33 mhz to 2.3 GHz, the user's perceived performance of the computer has either stayed the same (most likely) or diminished over that same time period. Do notice that programs are more bloated than ever, and programmers are lazier than ever.

Your second point in the blockquote corroborates the first one: the problem isn't that the CPU isn't getting faster, we're just throwing bigger and more bloated stuff at it.

Re:Panic? by TemporalBeing · 2008-03-11 03:13 · Score: 2, Interesting

Clock frequency is not an indicative of CPU performance. For example, the Core 2 chips, despite generally operating at a lower frequency than the Pentium 4's outperform them significantly.

Each core would perform nearly the same as a similarly clocked P4, of course, optimizations in the instructions have changed since then too. But they would still perform similarly. Of course, comparing a P4 to a Core2 is like comparing Apples to Oranges as there are architecture changes across the whole chip that would change that (like the move away from P4's netburst architecture). So there are reasons other than clock frequency for that performance difference.

If I'm not mistaken, even if a specific program was not designed to use several cores, the OS can still run different programs in each core, improving the overall user performance. Correct me if I'm wrong.

That only works across all the different programs. An OS cannot break a single program into multiple threads/processes for the program - the program has to be coded to do so.

Your second point in the blockquote corroborates the first one: the problem isn't that the CPU isn't getting faster, we're just throwing bigger and more bloated stuff at it.

It's both issues. Programmers have gotten lazier and since roughly 2000 (at least from my perspective, likely before that) have come to rely on the ever increasing sizes of hard drives, RAM, and Clock Frequency. The prime directive in the Java community is if you don't like the performance, toss more hardware at it (e.g. Processors); however, that doesn't work if your 1.6 GHz chip single core processors goes to a 1.8 GHz dual core consisting of two 1.1 GHz cores that roughly equate to a 1.8 GHz single core processor in performance. They only equate because the OS can move processes and threads between them, but a program that is designed for a single process cannot take advantage of the second core, and thus effectively runs at the 1.1 GHz instead of the full 1.8 GHz. Programs that are designed to be multi-threaded (or multi-processed) would feel the full benefit of the second core.

This also goes to the bloat - as programmers have typically stopped optimizing code. Thus there are more lines of code in delivered software - often having more and more abstraction layers in them, which doesn't help either. So the overall effect is that the software takes longer to do the same function.

In the end, despite the increase in processing power, the programs run as slow or slower than before. Numerous reasons for it. The GP of my original post in this thread is still correct.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)

Hetereogeneous is the key word! by Terje+Mathisen · 2008-03-11 04:47 · Score: 2, Interesting

It has been quite obvious to several people in the usenet news:comp.arch newsgroup that the future should give us chips that contain multiple cores with different capabilites:

As long as all these cores share the same basic architecture (i.e. x86, Power, ARM), it would be possible to allow all general-purpose code to run on any core, while some tasks would be able to ask for a core with special capabilites, or the OS could simply detect (by trapping) that a given task was using a non-uniform resource like vector fp, mark it for the scheduler, and restart it on a core with the required resource.

An OS interrupt handler could run better on a short pipeline in-order core, a graphics driver could use something like Larrabee, while SPECfp (or anything else that needs maximum performance from a single thread would run best on an Out-of-Order core like the current Core 2.

The first requirement is that Intel/AMD must develop the capability to test & verify multiple different cores on the same chip, the second that Microsoft must improve their OS scheduler to the point where it actually understands NUMA principles not just for memory but also cpu cores. (I have no doubt at all that Linux and *BSD will have such a scheduler available well before the time your & I can buy a computer with such a cpu in it!)

So why do I believe that such cpus are inevitable?

Power efficiency!

A relatively simple in-order core like the one that Intel just announced as Atom delivers maybe an order of magnitude better performance/watt than a high-end Core 2 Duo. With 16 or 80 or 256 cores on a single chip, this will become really crucial.

Terje

PS As other posters have noted, keeping tomorrow's multi-core chips fed will require a lot of bandwith, this is neither free nor low-power. :-(

--
"almost all programming can be viewed as an exercise in caching"

Re:Specialisation is inevitable by adamkennedy · 2008-03-11 05:37 · Score: 2, Interesting

Two is totally doable. I can fill two (or the equivalent of two) of my four cores.

Trouble is, filling four cores is quite a bit more iffy.

Re:Panic? by carlmenezes · 2008-03-11 07:08 · Score: 2, Interesting

I'd like to ask a few related questions from a developer's point of view :

1) Is there a programming language that tries to make programming for multiple cores easier?
2) Is programming for parallel cores the same as parallel programming?
3) Is anybody aware of anything in this direction on the C++ front that does not rely on OS APIs?

--
Find a job you like and you will never work a day in your life.

24 of 367 comments (clear)