Ask Donald Becker
This is a "needs no introduction" introduction, because Donald Becker is one of the people who has been most influential in making GNU/Linux a usable operating system, and is also one of the "fathers" of Beowulf and commodity supercomputing clusters in general. Usual Slashdot interview rules apply, plus a special one for this interview only: "What if we made a Beowulf cluster of these?" is not an appropriate question.
Thanks for all the drivers. There are a *lot* of people (including me, with two cards that use your drivers) that really appreciate what you've done.
May we never see th
Programming MPI (i.e. message-passing) is slow, difficult and error-prone. But I'd say making the hardware and especially the operating system for a single system image computer with thousands of processors is even more difficult. Or hey, why stop at thousands of processors? IBM is designing their Blue Gene computer, with 1 million processors. How do you make a single kernel scale on a system like that?
The traditional approach is to use fine grained locking in the kernel, but this tends to lead to unmaintainable code and low performance on lower end systems. For an example of this see Solaris, or most other big iron unix kernels.
Another approach is the OS cluster idea championed by Larry McVoy (the Bitkeeper guy). The idea is that you run many kernels on the same computer, one kernel takes care of something like 4-8 cpu:s. And then they cooperate somehow so they can give the impression of SSI.
A third approach seems to be the K42 exokernel project by IBM. They claim very good scalability without complicated lock hierarchies. The basic design idea seems to be to avoid global data whenever possible. Perhaps someone more knowledgeable might shed more light on this...
But anyway, until someone comes up with a kernel that scales to zillions of cpu:s, message passing is about the only way to go. Libraries the give you the illusion of using threads but are actually using message passing underneath might ease the pain somewhat, but for some reason they have not become popular. Perhaps there is too much overhead. And some people claim that giving the programmer the illusion that all memory access is equal speed leads to slow code. The same argument also applies to NUMA systems.
And on the system administration side of things, projects like mosix and bproc already today give you the impression of a single system image. Of course your application still has to use message passing, but administration and maintenance of a cluster is greatly simplified.