IBM's Chief Architect Says Software is at Dead End
j2xs writes "In an InformationWeek article entitled 'Where's the Software to Catch Up to Multicore Computing?' the Chief Architect at IBM gives some fairly compelling reasons why your favorite software will soon be rendered deadly slow because of new hardware architectures. Software, she says, just doesn't understand how to do work in parallel to take advantage of 16, 64, 128 cores on new processors. Intel just stated in an SD Times article that 100% of its server processors will be multicore by end of 2007. We will never, ever return to single processor computers. Architect Catherine Crawford goes on to discuss some of the ways developers can harness the 'tiny supercomputers' we'll all have soon, and some of the applications we can apply this brute force to."
owww... my head...
There are a couple of serious problems with this statement. The most important one is that the article doesn't say that existing software will get slower. And there's a reason for that: Existing software will continue to run on the individual processor cores, something it has done for a long time. Old software may not get any faster due to a change in focus toward parallelism vs. increased core speed, but it's not going to suddenly come to a screeching halt any more than my DOS programs from 15 years ago are.
Secondly, multicore systems are not a problem. Software (especially server software!) has been written around multi-processing capabilities for a long time now. Chucking more cores into a single chip won't change that situation. So my J2EE server will happily scale on IBM's latest multicore Xenon PowerPC 64 processor.
Finally, what the article is really talking about is the difficulty of programming for the Cell architecture. Cell is, in effect, an entire supercomputer architecture shrunk to a single microprocessor. It has one PowerPC core that can do some heavy lifting, but its design counts on the programmers to code in 90%+ SIMD instructions to get the absolute fastest performance. By that, I mean that you need to write software that does the same transformation simultaneously across reasonably large datasets. (A simplification, but close enough for purposes of discussion.) What this means is that the Cell processor is the ultimate in Digital Signal Processors, achieving incredible throughput as long as the dataset is conducive to SIMD processing.
The "problem" the article refers to is that most programs are not targeted toward massive SIMD architectures. Which means that Cell is just a pretty piece of silicon to most customers. Articles like this are trying to change that by convincing customers that they'd be better served by targeting Cell rather than a more general purpose architecture.
With that out of the way, here's my opinion: The Cell Broadband Architecture is a specialized microprocessor that is perpendicular to the general market's needs. It has a lot of potential uses in more specialized applications (many of which are mentioned in the article), but I don't think that companies are ready to throw away their investment in Java,
Javascript + Nintendo DSi = DSiCade
I see no reason why we would ever need more than 640 cores per processor in the future.
Small potatoes make the steak look bigger.
But the developers do? When these processors become prevalent, people will design their software to utilise the parallel processing capability. What am I missing here?
Simon
What the author fails to take into account is that multi-core allows each program to effectively use a separate core to do its work, regardless of how it is programmed. All it takes is for the OS to be smart enough to assign each program to a free core, if available. The programs don't have to be specifically written to be multi-core aware as long as the OS is smart enough to send processes to the idle cores. The programs that need more power than one core can deliver will usually have the multi-core support built in, as many games are starting to do now that the technology is taking off.
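The point about the OS doing the work can be sketched in a few lines. This is a minimal illustration, not a benchmark: `SINGLE_THREADED_PROGRAM` and `run_independent_programs` are hypothetical names made up for this example, and how the processes are actually placed on cores is entirely up to the operating system's scheduler.

```python
import subprocess
import sys

# A deliberately single-threaded "program": it knows nothing about cores.
SINGLE_THREADED_PROGRAM = "print(sum(i * i for i in range(200_000)))"

def run_independent_programs(count):
    """Start `count` separate OS processes and collect their output."""
    # Popen returns immediately, so all processes exist at the same time and
    # the OS scheduler is free to place each one on whichever core is idle.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", SINGLE_THREADED_PROGRAM],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]
    # communicate() waits for a process to exit and returns (stdout, stderr).
    return [int(p.communicate()[0]) for p in procs]
```

None of the child programs contains any threading code; whatever parallelism occurs comes from running several of them at once.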
:)
Notice I took the high ground and didn't make the obligatory windows virus scan jokes...
today is spelling optional day.
If you look at single-thread performance on Intel and AMD's dual/quad core chips, they meet or beat the best that single-core has to offer. I don't see why a multi-core system in the future will run single-thread apps any slower than right now. If anything I'd expect single-thread performance to increase incrementally as Intel and AMD are able to increase clock speeds.
Has Netcraft confirmed this yet?
An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
Herb Sutter wrote about this topic two years ago. A great read for anyone who is interested.
http://www.gotw.ca/publications/concurrency-ddj.htm
Enjoy,
It's just the normal noises in here.
IMHO, multi-cores are good for multitasking, which does not cover the whole problem of parallelism. Software (at least, in principle) _is_ ready: pure functional languages, for example, are perfectly suited for parallel processing; it is the lack of the CPUs with architectures that support internal concurrency (using a single core - as opposed to those providing support for multi-threading using multiple cores) that is the problem...
Don't be silly, it's not that simple: sure you can spread processes and threads across several cores, as opposed to using just one cpu to do it all, but what distributed computing is about is arranging the code in a single thread to take advantage of the presence of several cores. It's called parallelizing code, and it's an extremely tough branch of computer science.
Of course OSes can do load balancing on several cores with several processes, that's trivial... What's not is real parallel code.
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Concurrency is a hard problem, and unexpected interactions between asynchronous events in concurrent environments have been a periodic bugbear for almost as long as computers have been interactive.
It's what made the Amiga look less reliable than its competitors... if you only ran one native program at a time it was a lot more stable than MacOS or MS-DOS, because the OS provided a much richer set of services so applications didn't have to replicate them... but most people took advantage of the multitasking and when something crashed in the background the lack of memory protection meant the whole thing went down, and non-native software that wasn't written with multitasking in mind could produce the most entertaining crashes.
These days we all have good protected mode multitasking operating systems, but we don't have good easy ways to distribute an application across multiple cores. Until we do, most applications are going to be written to run single-threaded and depend on the OS to use the other cores to speed up the rest of the system, both at the application level and doing things like running graphics libraries on another core.
Until we have so many cores that the OS can't make effective use of them I don't think there's even going to be much of an attempt to make use of them by most developers. And then we're going to go through a painful period like we went through before Microsoft discovered multitasking.
The argument that software will get slower assumes that most consumer software will continue to have additional CPU requirements without being coded for multi-core applications. This doesn't make sense. The average consumer uses an Office product, e-mail, and a browser. None of these use anywhere close to 100% of the CPU for very long even on a Pentium 3, let alone on a 2GHz+ core in a multi-core processor.
Workstation computing will suffer some until software vendors catch up, but this is already happening (e.g. most CAD, Animation, Video Processing are starting to come out with multi-core optimized software). Sure, some apps will continue to be single-threaded, but eventually, who would buy them? Software vendors aren't dumb.
Games will probably speed up significantly as well. Imagine the possibilities of having a game engine where each AI character utilizes 100% of a single core. Game designers aren't going to sit around designing games that run on single core engines; they always push the boundaries and will continue to do so.
Crack - Free with every butt and set of boobs
A New Kind of Science. Converting a range of standard CS algorithms into Cellular Automata networks is the very solution our brains use; a combination of message passing and feedback loops. If we want our computers to scale in parallel, we might want to look at how biology has solved the problem. A lot of people laughed at Wolfram when he initially published that book. I think he yet might have the last laugh.
Being a simplified example, of course, it is possible to parallelize the instructions above. It will require more memory, but we can do the following instead: If the first addition is run on a separate core from the second, then we'll get a net increase in performance even though we used more computational cycles to compute the results. There's just one catch. CPU designers have known about these micro-optimizations for decades, and have been designing microprocessors with an ability called "superscalar execution" for almost as long. Superscalar designs make use of CPU processing units not currently utilized by other instructions in order to offer a simplistic form of multithreaded execution on a single core. In this way, the CPU can chew on a larger number of instructions in parallel than it could if it fully serialized the execution. Which makes full multithreading mostly redundant for these situations.
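The instructions this comment refers to are not included in this copy of the thread, so what follows is a hypothetical illustration and not the commenter's own example. It only shows the difference in data dependencies that a superscalar core can exploit; the function names are invented for the sketch, and Python itself does not expose instruction-level parallelism.

```python
def dependent_chain(a, b, c, d):
    """Three additions where each one needs the previous result."""
    t1 = a + b
    t2 = t1 + c   # cannot start until t1 is known
    t3 = t2 + d   # cannot start until t2 is known
    return t3

def independent_pairs(a, b, c, d):
    """Same sum, but the first two additions do not depend on each other."""
    left = a + b    # independent of `right`
    right = c + d   # independent of `left`, so hardware may overlap the two
    return left + right
```

Both functions return the same value. In compiled code the second form gives the hardware two additions it may issue in the same cycle, at the cost of holding an extra intermediate value.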
Besides this, is there a solution to this in the form of new programming languages?
Erlang and Limbo have concurrency primitives built-in. Both used CSP as a launching point. Both give the programmer easy-to-use, lightweight processes and message passing. Processes share nothing.
However, neither has built-in support for multiple cores or multiple CPUs at the moment. It's just not a priority for the teams behind them. You can cheat such a setup with Erlang, however, as you can spawn processes on remote machines or remote Erlang instances. If you had two Erlang instances on the same machine, each would run on its own core, so all you'd need to do is spawn a process on each and then message pass between the two.
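For readers who have not seen the share-nothing, message-passing style, here is a rough sketch of the idea. It is not Erlang or Limbo and makes no claim about how either is implemented: it uses Python threads and `queue.Queue` as stand-ins for lightweight processes and mailboxes, and the names `worker` and `double_all` are invented for this example.

```python
import queue
import threading

def worker(inbox, outbox):
    """A lightweight 'process': it touches no shared state, only its mailboxes."""
    while True:
        message = inbox.get()
        if message is None:        # sentinel: the sender asks us to stop
            break
        outbox.put(message * 2)    # reply with a message, never a shared variable

def double_all(values):
    """Send each value to the worker and collect the replies in order."""
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)                # tell the worker there is nothing more to do
    t.join()
    return [outbox.get() for _ in values]
```

Because the only communication is through the two queues, the worker could in principle be moved to another core or another machine without changing the calling code, which is the property the comment relies on.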
I wonder if I use bold in my signature, people will notice my posts.
Forgive my ignorance, but can't the OS just make each new app run on its own core? That would probably give us some overall apparent-speed-of-computer increases, without having to completely modify all existing stuff.
stuff |
If you read the article, you will see that Mrs. Crawford does not even come close to saying that "Software is at Dead End". She says software needs to catch up with the hardware.
Computers have more and more processors (and different kinds of processors, like GPUs), and currently most software isn't designed for that kind of environment. IBM has developed some clever ways to program these types of systems in a "general purpose" way.
That's the worst summary of a headline that I've ever read.
Concurrency is easy.
Give it up. Protected titles ceased to be protected decades ago when industry decided that it didn't need to be regulated. Architect means nothing these days, nor does engineer or doctor. We can throw around RA and PE and MD all we want, but the common words will always be crapped on. Interestingly, accountant and lawyer seem pretty safe - I guess we just need to choose fields that nobody wants to be associated with if we want to keep our monikers pristine.
Overzeetop, PE
Is it just my observation, or are there way too many stupid people in the world?
The example that you gave above, where you manually converted an algorithm from sequential to potentially parallel processing, could easily be handled by a compiler. If your brain can handle the optimization, so can a compiler given enough time. When writing in a higher level language (i.e. not Assembly or Machine Code), like you used in your example, you should be able to expect the compiler to handle those optimizations. Yes, I realize all of this is in theory, but eventually reality has to catch up with theory if we expect to improve.
was an interesting article, particularly the part about the hybrid "roadrunner" architecture.
However what is more relevant to today's non-supercomputing needs is SMP scalability.
One of the challenges with SMP scalability is cache coherency; synchronizing the caches on the processors is a costly operation (this is necessary to ensure that each processor has the same view of certain memory at the same time), normally (always?) done with a cache invalidation.
So the more invalidations you do, the more often the processor has to fetch memory from main memory, and the less it's using its cache. Processing slows down dramatically.
I've tried to design the qore programming language http://qore.sourceforge.net/ to be scalable on SMP systems. The new version (released today) has some interesting optimizations that have resulted in a large performance boost on SMP machines - the optimizations involve reducing the number of cache invalidations to the minimum (more than just reducing locking, although that is a part of it too - even an atomic update - for example on intel an assembly lock and increment - involves a cache invalidation and therefore is an expensive operation on SMP platforms). There is more work to be done, but in simple benchmarks of affected code paths the performance increase was between 2 and 3 times as fast with the optimizations on the same qore code.
Anyway it would be interesting to know if other high-level programming languages have also taken the same approach (or will do so); as we go forward, it's clear that SMP scalability will be an important topic for the future...
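The general pattern behind "fewer synchronized updates" can be sketched as follows. This is not qore's implementation, which the comment does not show; it is a generic illustration with invented function names, and Python will not reproduce the cache effects described above. It only shows the structural difference between updating one shared location on every step and publishing a private result once.

```python
import threading

def count_shared(n_threads, n_items):
    """Every increment is a synchronized update to one shared counter."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(n_items):
            with lock:              # one synchronized update per item
                total += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def count_local(n_threads, n_items):
    """Each thread counts privately and publishes its result once."""
    partials = [0] * n_threads

    def work(slot):
        local = 0
        for _ in range(n_items):
            local += 1              # private variable, no synchronization
        partials[slot] = local      # a single shared write per thread

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Both functions return the same total. On real SMP hardware, in a language that compiles to native code, the second shape is the one that keeps each processor working out of its own cache for most of the run.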
First issue... who says "most" programs CAN be recompiled? The first gen dual cores were basically duplicates of full processors, but as multicore becomes more popular, the cores will be more efficient and may start leaving out 100% compatibility in favor of sending threads to the better processor... that could save millions of gates per chip by tailoring some cores for FPU and some for SSE3 etc. This means in the future multicore processors won't automatically handle the old code more efficiently. In response to your "multiple computer" comment, that's what happens with code that doesn't play nice NOW.. in the future, it may not be possible to have ALL the features of a full processor on ALL the cores.
In many companies they don't have access to code... sometimes the key parts are 20 + years old and the source physically lost.. very common in business/manufacturing. Sometimes it's not "profitable".. witness how long Adobe is taking to get a version for Intel macs... Sure it's JUST a recompile, but they don't WANT TO do it.. and normal users are legally not allowed.
The problem is not NEW programming languages, it's that much low-level stuff needs to be at least looked at and tested even if it's simply recompiled... that takes TIME and MONEY! If it's copyrighted software, there's nobody but the publisher that can legally do that! That means new versions with upgrade costs (and profit scalping). Like you said, forcing people to recompile usually makes them want to rewrite parts as well from being lost, misunderstood, or inefficient. That's a great time to bring in a new language to simplify things on one base of code and tools... On the other hand it's a great time to push Linux and OSS!!! After all, the code is open so there's nothing preventing somebody from doing the simple work of recompiling and testing on their own. (it still costs TIME, which isn't free, but at least it CAN be done).
"We will never, ever return to single processor computers"
Does anyone think that's anything other than a stupid thing to say?
I mean, maybe we never will, and maybe it's really unlikely that we will anytime soon. But it seems that anytime there's a real revolutionary (rather than evolutionary) jump in processors, we may well go back to a single "core." For example, if they invented a fully optical processor that was insanely faster than anything in silicon, but they were very expensive to produce per core, and the price scaled linearly with the number of cores... sounds like we'd have single core computers around again for a while. And what about quantum computers? I don't even know what a "core" would be for a quantum computer, but are they by nature going to have a design that works on multiple problems simultaneously without being able to use that capacity to work on an individual problem faster? Even if that is the case, does the author know that, or are they just ignoring any possibility of non-silicon architectures?
Even within silicon, is it out of the realm of conceivability that someone will develop a radical new architecture that can use more transistors to make a single core faster such that it's competitive with using the same transistor count for multiple cores?
Considering how computers have spent a good 40 years continuously changing more quickly than any other technology in history, I'd be a bit more reserved in making sweeping generalizations about all possible future developments that might occur in the next forever.
Still, computer scientists seem to be in rough agreement that current software development models mostly don't produce programs that are multi-threaded enough to take optimal advantage of the current trend toward increased cores. Maybe it just sounds too boring when worded that way.
Can anyone tell me how to set my sig on Slashdot?
Most apps get slow for these reasons:
1. Disk is slow
2. Network is slow
3. Junkware hogging CPU
4. Some prima donna process decided against my will that it wants to run a scan, Java RTE update, registry cleaning, etc., using up disk head movements, RAM, and CPU.
CPU is usually not the bottleneck except when other crap makes it the bottleneck.
Table-ized A.I.
I believe a big part of our problem is our piss-poor set of programming languages and their support for concurrency. C/C++ threads packages and Java's low level synchronization primitives make developing parallel/concurrent programs much more difficult than it should be. (Ada95/Ada05 gets it better, at least by raising the level of abstraction and supporting one approach to unifying concurrency synchronization, concurrency avoidance, and object-oriented programming.)
Additionally, there's the related problem of understanding concurrency. In the '80s and '90s in particular, there were a lot of fundamental research results in reasoning about concurrent systems. Nancy Lynch's work at MIT (http://theory.csail.mit.edu/tds/lynch-pubs.html) comes to my mind. I'm always dismayed at how little both new CS grads and practicing programmers know about distributed systems, and how poor their ability is collectively to reason about concurrency. It seems like most of the time when I say "race condition" or "deadlock", eyes glaze over and I have to go back and explain 'concurrency 101' to folks who I think should know this.
Wasn't it Jim Gray (I sure hope he shows up safe and sound!) who coined the terms "Heisenbugs" and "Bohrbugs" to help describe concurrency and faults? (Wikipedia attributes this to Bruce Lindsay, http://en.wikipedia.org/wiki/Heisenbug) Not only is developing concurrent programs hard, debugging them is -really hard-, and our tools (starting with programming languages and emphasizing development tools/checkers), should be focused on substantially reducing or eliminating the need for debugging, or development effort will continue to grow.
Until we have more powerful tools -and training- (both academic and industrial) in using those tools, the Sapir-Whorf hypothesis (http://en.wikipedia.org/wiki/Sapir-Whorf_hypothesis) will apply: The lack of a language (programming language as well as 'spoken language') to talk about concurrency will make it nearly impossible for most programmers to develop concurrent programs. This applies to both MIMD and SIMD kinds of parallelism.
dave
I've met some of the architects of the Cell processor, and they have a "build it and they will come" attitude. They've designed the computer; it's up to others to make it useful. This is probably not going to fly.
The Cell is a non-shared memory multiprocessor with quite limited memory per processor. There's only 256K per processor, which takes us back to before the 640K IBM PC. There are DMA channels to a bigger memory, but no caching. Architecturally, it's very retro; it's very similar to the NCube of the mid-1980s. It's not even superscalar. Cell processors are dumb RISC engines, like the old low-end MIPS machines. They clock fast, but not much gets done per clock.
Yes, you get lots of CPUs, but that may not help. On a server, what are you going to run in a Cell? Not your Java or Perl or Python server app; there's not enough memory. No way will an instance of Apache fit. You could put a copy of the TCP/IP stack in a Cell, but that's not where the CPU time goes in a web server. One IBM document suggests putting "XML acceleration" (i.e. XML parsing) in the server, but that's an answer looking for a problem. It might be useful for streaming video or audio; that's a pipelined process. If you need to compress or decompress or transcode or decrypt, the Cell might be useful. But for most web services, those jobs are done once, not during playout. Even MPEG4 compression might be too much for a Cell; you need at least two frames of storage, and it doesn't have enough memory for that.
Now if they had, say, 16MB per CPU, it might be different.
The track record of non-shared memory supercomputers is terrible. There's a long history of dead ends, from the ILLIAC IV to the BBN Butterfly to the NCube to the Connection Machine. They're easy to design and build, but just not that useful for general purpose computing. Some volumetric simulation problems, like weather prediction, structural analysis, and fluid dynamics can be crammed into those machines, so there are jobs for them, but the applications are limited.
Shared-memory microprocessors look much more promising as general purpose computers. Having eight or sixteen CPUs in a shared-memory multicore configuration is quite useful. That's how SGI servers worked, and they had a good track record. Scaling up today's multicore shared-memory CPUs is repeating that idea, but smaller and cheaper.
At some point, you have to go to non-shared memory, but that doesn't have to happen until you hit maybe 16 CPUs sharing a few gigabytes of memory, which is about when the cache interconnects start to choke and speed of light lag to the far side of the RAM starts to hurt. That might even be pushed harder; there's been talk of 80 CPUs in a shared memory configuration. That's optimistic. But we know 16 will work; SGI had that years ago.
Then you go to a cluster on a chip, which is also well understood.
That's the near future. Not the Cell.
Berkeley tech report (inc. Patterson as author)
Brief summary (I heard the same talk when he spoke at PARC): computational problems are divisible into one of thirteen categories that range from matrix multiplication to finite state automata. Most existing research (academia and industry) into parallelism tends to focus on about seven of those categories that are most easily parallelized - think supercomputer cluster. Most apps that you or I use fall into the graph traversal or finite-state categories (think compilers, apps with an event loop, etc.), into which there is essentially no research. Patterson even suspects that finite state machines are inherently serial and CANNOT be parallelized.
So ... the apps that we already use can't really get faster on parallel cores without major, fundamental advances in computer science that don't seem to be approaching. Which means we'll be using our current apps for a LONG time.
Additional note: IBM (and other chip manufacturers) have a vested interest in telling everyone that parallelism is the future. They can't make faster chips anymore, they can only compete on sheer number of cores.
A witty [sig] proves nothing. --Voltaire
Perl 6 (as it is designed) introduces a new concept of "junctions", which are a bit like arrays, but can be used in clever ways. One useful way is:
if ($fruit eq ("apple"|"orange"|"pear")) {
    print "sweet!";
}
But another intriguing way of using the junction will be parallel loops -
for ("apple"|"orange"|"pear") {
    eat($_);
}
... and instead of doing three consecutive eat's -
eat("apple");
eat("orange");
eat("pear");
the interpreter will run 3 threads, each with the eat() function.
As with the whole of Perl 6, this design is not finalized. Maybe it won't be like that at all. And of course threading is all non-safe and stuff.
But having threads in a vanilla for loop, instead of setting up threads with clever functions, modules, etc. is something new. If it happens and unsuspecting programmers just use it, hey - that'd be something special.
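For comparison, this is roughly what the "clever functions, modules, etc." route looks like today in another language. It is a sketch in Python using the standard `concurrent.futures` module, not a statement about what Perl 6 will do; `eat` here is a hypothetical stand-in for the function in the comment, and `eat_all` is an invented helper.

```python
from concurrent.futures import ThreadPoolExecutor

def eat(fruit):
    """Hypothetical stand-in for the eat() function in the comment."""
    return f"ate {fruit}"

def eat_all(fruits):
    """Run eat() on every fruit, one worker thread per item."""
    # max_workers must be at least 1, even for an empty list.
    with ThreadPoolExecutor(max_workers=max(1, len(fruits))) as pool:
        # map() hands each item to a worker and yields results in input order.
        return list(pool.map(eat, fruits))
```

Calling `eat_all(["apple", "orange", "pear"])` does the same work as the three consecutive calls, but the programmer has to ask for the threads explicitly, which is exactly the ceremony a parallel for loop would remove.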
This comment does not exactly apply to the question put forth about performance of existing apps under multiple cores. However, given my experience with artificial neural networks and related work, I think one could fairly easily argue:
1) The number of cores is going to increase
2) The current concept of an artificial neuron having some sort of value, with weights attributed to it, is too simple for how our human brains really work. It will likely need to be replaced with a more complex model of values and algorithms, and the work on each such unit requires a mini-process or, in this case, "a core"
I expect that there will be an increased number of cores, probably with an increase similar to hard disc, processor, or memory increases of the past (a 10MB hard disc increasing to 500GB today), so that we will have thousands or even hundreds of thousands of cores.
As we learn more about how the brain works I believe that 2) will be accepted as true at some point.
So I expect that more and more new software will attempt to be more intuitive, as more and more people begin to agree that the software we have now in general is crap, in that it doesn't help the layman as much as it could to do their jobs.
This intuitiveness will likely be in the form of artificial Neural Nets, paving the way for computing systems to begin to act like the science fiction computer systems we think of in "the future".
Just my two-cents guess...
And Forth's manual stack-loading is practically 1:1 to the underlying OS too, why don't you use that? Garbage collection has nothing to do with the underlying OS, but we keep it around.
Mapping 1:1 to the underlying OS is not the be-all and end-all of linguistic constructs. Consider Actors model languages, or dataflow-model languages - or the native rendezvous concepts from Ada. I'm not saying that any of these are ideal approaches (I hate Ada, for example) - I'm just saying that Algol-descended languages were designed to model procedure and formulas... so modelling concurrency doesn't come naturally to them.
...is the only way we are going to take advantage of multi-core cpu's and continue to improve our software. Only through purely functional code can you make guarantees about what can be executed simultaneously and let the machine sort it all out. I'm learning Haskell for this very reason.
Some tasks can't be done in parallel and this is the Achilles heel of massively parallel architectures. See for instance http://en.wikipedia.org/wiki/Amdahl's_law. No parallel hardware and no parallel algorithms - no matter how clever - will help you if you have a task of sequential nature (and they will only help somewhat, no matter how massively parallel the hardware is, if your task is partly sequential).
If one truckdriver can drive a truck 50 miles in one hour, how far can two drivers then drive the truck in the same amount of time?
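Amdahl's law, as linked above, puts a number on this. If a fraction p of the work can be spread over n cores and the rest is strictly sequential, the overall speedup is 1 / ((1 - p) + p / n). The small function below is only a worked version of that formula; the name `amdahl_speedup` and its parameter names are invented for this sketch.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup predicted by Amdahl's law.

    parallel_fraction: share of the work (0.0 to 1.0) that can run in parallel.
    cores: number of cores available for the parallel part (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

With 90% of the work parallel and 16 cores, the formula gives 1 / (0.1 + 0.9 / 16) = 6.4, well short of 16. The truck is the case where the parallel fraction is zero: the result is 1.0 however many drivers are added.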