My first reaction was "Wow, more CPU power!" Then I actually thought about what we could do with that beast.
I'm sure these cards would be useful in niche markets, like imaging/multimedia, where custom software could offload some huge operation to the card while the main CPU deals with the user interface. But that's not my field, so I have no idea of the feasibility.
Many people mentioned 'Beowulf'! Now, a Beowulf is a scientific cluster, and I happen to know a fair bit about the subject, since I work for a research center.
Most scientific applications need lots of CPU power, but also lots of memory bandwidth: for example, simulating the flow of air around an airplane wing, with a dataset of 5 GB...
So from the start, the CPUs' data caches are nearly useless: since we cycle through huge amounts of data, the CPU constantly reads from and writes to memory. The net result is that a standard PC can't keep more than two CPUs fed with data before the system bus becomes a bottleneck. Since the mPOWER card has a standard PC bus, only two of its four CPUs would actually be used.
Next, the memory. 512 MB isn't actually a lot for scientific clusters. That's what you usually have for each CPU, and here it is shared by four. It's a bit tight, but let's live with it.
Finally, the benefit of this kind of card would be to cram a PC box with a number of them, saving money by not needing additional hard drives, cases, keyboards, cheap graphics adapters, etc.
The typical PCI bus (64-bit, 66 MHz) has a bandwidth of about 4 Gbps (64 bits x 66 MHz works out to roughly 4.2 Gbps at peak). It is a bus, so only one device can use it at a time (half-duplex). The usual clustering interconnect (Myrinet or SCI) offers 1 Gbps full-duplex, so let's say 2 Gbps to compare with the PCI bus.
Let's also say that the host CPU in a multi-mPOWER setup isn't doing any actual work, to leave the bus free for the mPOWER cards.
This means you can put only two mPOWER cards in a single system before each card gets less interconnect bandwidth than a standard dual-CPU machine with an SCI or Myrinet adapter. And that's even before any disk or network access, which would cause additional traffic on the PCI bus and reduce the available bandwidth further. That's not much of a win.
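To make the arithmetic above concrete, here is a rough sketch in Python. The figures are the rounded ones from this comment (about 4 Gbps for the shared PCI bus, 2 Gbps for a Myrinet/SCI link counted in both directions), and the function names are made up for illustration. It also assumes the bus is split evenly between cards with no host, disk, or network traffic, which is optimistic.

```python
# Back-of-the-envelope figures from the comment above; all are approximations.
PCI_BUS_GBPS = 4.0        # 64-bit/66 MHz PCI: shared by every card, half-duplex
INTERCONNECT_GBPS = 2.0   # Myrinet/SCI: 1 Gbps each way, counted as 2 Gbps


def per_card_share(num_cards: int, bus_gbps: float = PCI_BUS_GBPS) -> float:
    """Bandwidth each card gets if the shared bus is split evenly (assumption)."""
    if num_cards < 1:
        raise ValueError("need at least one card")
    return bus_gbps / num_cards


def max_cards_matching_interconnect(
    bus_gbps: float = PCI_BUS_GBPS, link_gbps: float = INTERCONNECT_GBPS
) -> int:
    """Largest card count whose even share is still >= one dedicated link."""
    return int(bus_gbps // link_gbps)


if __name__ == "__main__":
    for n in range(1, 9):
        share = per_card_share(n)
        verdict = "ok" if share >= INTERCONNECT_GBPS else "worse than a cluster node"
        print(f"{n} card(s): {share:.2f} Gbps each ({verdict})")
```

With these numbers the break-even point is two cards; a box stuffed with eight would leave each card about a quarter of what a dedicated cluster link provides.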
Of course, not all applications need gobs of memory. distributed.net-like applications, where the dataset is tiny, could make use of all eight cards in one system. I just think that those applications are the minority in scientific computing.
PCI beats Ethernet any day, but most scientific clusters use Myrinet or SCI, which are about 1 Gbps full-duplex, while PCI is about 4 Gbps half-duplex with a 64-bit/66 MHz bus. This means that as soon as you have two of those babies in your computer, your PCI bus is actually slower than if you had a switched Myrinet or SCI interconnect.
The other problem is that the bus on the card is too slow to handle four CPUs. Our experience is that anything over two CPUs in a single machine will cause bottlenecks. Except on SGIs with ccNUMA, of course, which can handle eight CPUs per machine easily.
Memory is also a bit tight: we usually need about 512 MB per CPU, and this thing has 512 MB for all four CPUs.
Well, that's my NSHO and experience.