Panic in Multicore Land
MOBE2001 writes "There is widespread disagreement among experts on how best to design and program multicore processors, according to the EE Times. Some, like senior AMD fellow, Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others disagree on the ground that heterogeneous processors would be too hard to program. The only emerging consensus seems to be that multicore computing is facing a major crisis. In a recent EE Times article titled 'Multicore puts screws to parallel-programming models', AMD's Chuck Moore is reported to have said that 'the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.'"
I think "panic" is a bit of an over-reaction. I use a multicore CPU. I write software that runs on it. I'm not panicking.
Follow me
AMD's Chuck Moore presumably has a lot of self interest in pushing heterogeneous cores. They are combining ATI+AMD cores on a single die and selling the benefits in a range of environments including scientific computing etc.
So take it all with a grain of salt
--Q
Well, the most recent research into how the cortext works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.
Recall that the human brain receives a series of pattern streams from each of the senses. These patterns streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortext, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortext layer passes its assessment of what it saw the outline of to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimate to the IT cortex layer.
The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.
The takeaway is this: each cortex does not just do more of the same thing. Instead, it does a refinement of the level below it. This type of hierarchical processing is how multicore processors should be built.
Can I have... errr... Two floating point, one generic math with extra cache and two RISC's.
If you mod this up, your slashdot background will turn into a beautiful sunset!
This article is referring to AMD's Charles R. "Chuck" Moore, who worked on the POWER4 and PowerPC 601, not the language and chip designer Charles H. "Chuck" Moore who invented Forth, ColorForth, et al. and was interviewed on slashdot.
o/~ Join us now and share the software
What Mr Moore is saying does have a grain of truth, that generic will be beaten by specific in key functions. The Amiga proved that in 1985, being able to deliver a better graphical solution than workstations costing tens of thousands more. The key now is to figure out which specifics you can use without driving up the cost nor without compromizing the design ideal of a general purpose computer.
Karma Whoring for Fun and Profit.
I've been doing some scientific computing on the Cell lately, and heterogeneous cores don't make life very easy. At least with the Cell.
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.
The idea of having to use Microsoft APIs to program future computers because the vendors only document how to get DirectX to work doesn't exactly thrill me. I think panic is perhaps too strong a word, but sheesh...
As I demonstrated in my thesis a parallel application can be shown to have certain critical and less critical parts. An optimal processing platform matches those requirements. The remainder of the platform will remain idle and burn away power for nothing. One should wonder what is better: a 2 GHz processor or 2x 1 GHz processors. My opinion is that, if it has no impact on performance, the latter is better.
There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).
nosig today
When we wrote the OpenAMQ messaging software in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project.
Today, we're making a new design - ØMQ, aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.
From one of the ØMQ whitepapers:
We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.
My blog
If you have 80 or more cores, I'd rather have 20 of them support specialty functions and be able to do them very fast (it would have to be a few (1-3) orders of magnitude faster than the general counterpart) and the rest do general processing. This of course needs the support of operating systems, but that isn't very hard to get. With 80 cores caching and threading models have to be rethought, especially caching - the operating system has to be more involved in caching than it currently is, because otherwise cache coherency won't be able to be done.
.NET will be much more popular as it is possible for them to take care a lot of these issues, which isn't or only possible in a limited way for languages like C and friends with static source code inspection.
This also means that programs will need to be written not just by using threads, "which makes it okay for multi-core", but with cpu cache issues and locality in mind. I think VMs like JVM, Parrot and
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
This trend with multiple cores on the CPU is only an intermediate phase,
because it over saturates the memory bus, which is easy to remedy by
putting the cores on the memory chips, of which there are a number
comparable to the number of cores.
In other words, the CPUs will disappear, and there will be lots of smaller
core/memory chips, connected in a network. And they will be cheaper as well,
because they do not need so high a yeld.
Kim0
Perhaps, panic is a little strong. At the same time, programing languages such as Occam, that are built from the ground up seem very provocative now. Perhaps Occam's syntax could modified to a Python-type syntax for a more popularity.
[Although, personally, I prefer Occam's syntax over that of C's.]
http://en.wikipedia.org/wiki/Occam_programming_language
I think that a tread aware programming language would be good in our multi-core world.
https://www.youtube.com/c/BrendaEM
*cough*Erlang*cough*
I think the wailing we're about to hear is the sound of thousands of imperative-language programmers being dragged, kicking and screaming, into functional programming land. Even the functional languages not specifically designed for concurrency do it much more naturally than their imperative counterparts.
Disclaimer: Evolution comes with NO WARRANTY, except for the IMPLIED WARRANTY of FITNESS FOR A PARTICULAR PURPOSE.
Java doesn't do a good job. It does a "better than abysmal" job in that it has some idea of threading with synchronized/volatile, and it has a well-defined memory model. (That's not to say there aren't flaws, however. Allowing synchronization on any reference was a mistake, IMO.)
.NET 2.0 memory model is poorly documented (the most reliable resources are blog posts, bizarrely enough - note that the .NET 2.0 model is significantly stronger than the ECMA CLI model).
.NET programmers significantly when it arrives, assuming it actually gets used. Of course, for other platforms there are other APIs - I'd expect them to keep leapfrogging each other in terms of capability.
What it *doesn't* do is make it easy to write verifiably immutable types, and code in a functional way where appropriate. As another respondent has mentioned, functional languages have great advantages when it comes to concurrency. However, I think the languages of the future will be a hybrid - making imperative-style code easy where that's appropriate, and functional-style code easy where that's appropriate.
C# 3 goes some of the way towards this, but leaves something to be desired when it comes to assistance with immutability. It also doesn't help that that
APIs are important too - the ParallelExtensions framework should help
I don't think C# 3 (or even 4) is going to be the last word in bringing understandable and reliable concurrency, but I think it points to a potential way forward.
The trouble is that concurrency is hard, unless you live in a completely side-effect free world. We can make it simpler to some extent by providing better primitives. We can encourage side-effect free programming in frameworks, and provide language smarts to help too. I'd be surprised if we ever manage to make it genuinely easy though.
Ugg is smart.
Ugg can program a CPU.
Two Uggs can program two CPUs.
Two Uggs working on the same task program two CPUs.
Uggs' program has a race condition.
Ugg1 thinks, it's Ugg2's fault.
Ugg2 thinks, it's Ugg1's fault.
Ugg1 hits Ugg2 on the head with a rock.
Ugg2 hits Ugg1 on the head with an axe.
Ugg1 is half as smart as he was before working with Ugg2.
Ugg2 is half as smart as he was before working with Ugg1.
Both Uggs now write broken code.
Uggs' program is now slow, wrong half the time, and crashes on that race condition once in a while.
Ugg does not like parallel computing.
Ugg will bang two rocks together really fast.
Ugg will reach 4GHz.
Ugg will teach everyone how to reach 4GHz.
Contrary to the popular belief, there indeed is no God.
The height of optimism: posting proof in the form of a 70-odd page thesis on a Slashdot. ;-)
I don't think we'll be Slashdotting your server any time soon, CBravo
Antiquis temporibus, nati tibi similes in rupibus ventosissimis exponebantur ad necem.
Every object (in the general sense, not necessarily the OO sense) may be either aliased or mutable, but not both.
Erlang does this by making sure no objects are mutable. This route favours the compiler writer (since it's easy) and not the programmer. I am a huge fan of the CSP model for large projects, but I'd rather keep something closer to the OO model in the local scope and use something like CSP in the global scope (which is exactly what I am doing with my current research).
I am TheRaven on Soylent News
For servers the real problem is I/O. Disks are slow, network bandwidth is limited (if you solve that then memory bandwidth is limited ;) ).
;).
;).
For most typical workloads most servers don't have enough I/O to keep 80 cores busy.
If there's enough I/O there's no problem keeping all 80 cores busy.
Imagine a slashdotted webserver with a database backend. If you have enough bandwidth and disk I/O, you'll have enough concurrent connections that those 80 cores will be more than busy enough
If you still have spare cores and mem, you can run a few virtual machines.
As for desktops - you could just use Firefox without noscript, after a few days the machine will be using all 80 CPUs and memory just to show flash ads and other junk
Back in 2000 I realized that 50 Million transistors of 4004 the first processor ever created, would out perform a P4 with the same transistor count done in the same fab running at the same clock rates. it would be over 10x faster I work out. But how to use such a device?
I had been working with a 100 PC cluster of P4 based systems to do H.264 HDTV compression in realtime. I spread the compression function across the cluster using each system to work on a small part of the problem and flow the data across the CPU's.
Based on this I wanted to build an array of processors on one chip, but I am not a silicon person, just software, driver and some basic electronics. So I looked at various FPGA cores, Arm, MIPS, etc. Then I went to a talk giving by Chuck Moore, author of the language FORTH. He had been building his own CPU's for many years using his own custom tools.
I worked with Chuck Moore for about a year in 2001/2002 on creating a massive multi core processor based on Chucks stack processor.
The Idea was instead of having 1,2 or 4 large processor to have 49 (7 * 7) small light but fast processors in one chip. This would be for tacking a different set of problems then your classic cpus'. It wouldn't be for running and OS or word processing, but for Multimedia, and cryptography, and other mathematic problems.
The idea was to flow data across the array of processors.
Each processor would run at 6Ghz, with 64K word of Ram each.
21 Bit wide words and bus (based off of F21 processor)
this allows for 4x 5bit instructions on a stack processor that only has 32 instructions.
Since it's a stack processor they run more efficiently. So in 16K transistors, 4000 gates,
the F21 at 500 Mhz performed about the same as a 500Mhz 486 with JPEG compress and decompress.
With the parallel core design instead of a common bus or network between the processors there would only be 4 connections into and out of each processor. These would be 4 registers that are shared with it's 4 neighboring processors that are laid out in a grid. So each chip would have a north, south, east and west register.
Data would be processed in whats called a systolic array, where each core would pick up some data, perform operations on it and pass it along to the next core.
The chips with a 7x7 grid of processors would expose the 28(4x7) bus lines off the edge processors, so that these could be tiled into a much larger grid of processors.
Each chip could perform around 117 Billion instructions per second at 1 Watt of power.
Unfortunately I was unable to raise money, partly because I couldn't' get any commitment from Chuck.
below is some links and other misc information on this project. Sorry it's not better organized.
This was my project.
---------
http://www.enumera.com/chip/
http://www.enumera.com/doc/Enumeradraft061003.htm
http://www.enumera.com/doc/analysis_of_Music_Copyright.html
http://www.enumera.com/doc/emtalk.ppt
--------
This was Jeff foxes independent web site, he work on the F21 with Chuck.
http://www.ultratechnology.com/ml0.htm
http://www.ultratechnology.com/f21.html#f21
http://www.ultratechnology.com/store.htm#stamp
http://www.ultratechnology.com/cowboys.html#cm
------
http://www.colorforth.com/ 25x Multicomputer Chip
Chucks site. 25x has been pulled down, but it's accessible on archive.org.
http://web.archive.org/web/*/www.colorfo
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso