IBM's Chief Architect Says Software is at Dead End
j2xs writes "In an InformationWeek article entitled 'Where's the Software to Catch Up to Multicore Computing?' the Chief Architect at IBM gives some fairly compelling reasons why your favorite software will soon be rendered deadly slow because of new hardware architectures. Software, she says, just doesn't understand how to do work in parallel to take advantage of 16, 64, 128 cores on new processors. Intel just stated in an SD Times article that 100% of its server processors will be multicore by end of 2007. We will never, ever return to single processor computers. Architect Catherine Crawford goes on to discuss some of the ways developers can harness the 'tiny supercomputers' we'll all have soon, and some of the applications we can apply this brute force to."
Concurrency is a hard problem, and unexpected interactions between asynchronous events in concurrent environments has been a periodic bugbear for almost as long as computers have been interactive.
It's what made the Amiga look less reliable than its competitors... if you only ran one native program at a time it was a lot more stable than MacOS or MS-DOS, because the OS provided a much richer set of services so applications didn't have to replicate them... but most people took advantage of the multitasking and when something crashed in the background the lack of memory protection meant the whole thing went down, and non-native software that wasn't written with multitasking in mind could produce the most entertaining crashes.
These days we all have good protected mode multitasking operating systems, but we don't have good easy ways to distribute an application across multiple cores. Until we do, most applications are going to be written to run single-threaded and depend on the OS to use the other cores to speed up the rest of the system, both at the application level and doing things like running graphics libraries on another core.
Until we have so many cores that the OS can't make effective use of them I don't think there's even going to be much of an attempt to make use of them for more developers. And then we're going to go through a painful period like we went through before Microsoft discovered multitasking.
was an interesting article, particularly the part about the hybrid "roadrunner" architecture.
However what is more relevant to today's non-supercomputing needs is SMP scalability.
One of the challenges with SMP scalability is cache coherency; synchronizing the caches on the processors is a costly operation (this is necessary to ensure that each processor has the same view of certain memory at the same time), normally (always?) done with a cache invalidation.
So the more invalidations you do, the more often the processor has to fetch memory from main memory, and the less it's using its cache. Processing slows down dramatically.
I've tried to design the qore programming language http://qore.sourceforge.net/ to be scalable on SMP systems. The new version (released today) has some interesting optimizations that have resulting in a large performance boost on SMP machines - the optimizations involve reducing the number of cache invalidations to the minimum (more than just reducing locking, although that is a part of it too - even an atomic update - for example on intel an assembly lock and increment - involves a cache invalidation and therefore is an expensive operation on SMP platforms). There is more work to be done, but in simple benchmarks of affected code paths the performance increase was between 2 and 3 times as fast with the optimizations on the same qore code.
Anyway it would be interesting to know if other high-level programming languages have also taken the same approach (or will do so); as we go forward, it's clear that SMP scalability will be an important topic for the future...
This is an often-repeated misconception. Cell abandons the practice of having different fp, integer, and vector registers... all registers are 128-bit and any instruction can be issued on any of them, and those instructions are generated by a C++ compiler. So saying that programmers code in these SIMD instructions is like saying that "x86's design counts on programmers to shuffle values between the fp stack, integer and vector registers, and code in separate fp, integer, and vector instructions to get the absolute fastest performance".
The reality is that Cell was targeted more at solving the memory problem than just doing SIMD stream processing. Engineers looked around and decided a 32kb L1 cache was silly... not having a cache-snooping DMA engine (or prefetch engine) would be silly. Putting nine cores on a bus with 7 GB/s bandwidth would be silly. Not being able to overlap memory latency with execution is silly. To solve all these problems, you give up having a single coherent address space.
But there is even more power in Java, .NET investments now... It is completely within the realm of possibility to write a runtime that executes your Java thread on SPU, or JITs the .NET to SPU code. It's a nice benefit that these are already handle-based rather than pointer based languages, so the memory-mapping is a task of the runtime and transparent to the code. And IBM is working hard on native C++ code generation that is agnostic of the address space problem.