Developing a New Beowulf Architecture?
"Both gigabit Ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking: they are a SERIAL medium. Even if you use the fastest technology possible, you are still sending bits one at a time down a single pipe. It's like having a single-lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth; it might work, but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems: parallel networking. This isn't as silly as it sounds, because we use this solution at work to link two switches: two 100Mb network connections are concatenated together to form a single 200Mb link. What I propose goes further.
The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. So far as the system is concerned, each node has a single network card, but the interface is where I propose the change. Every network card includes one or more shift registers which take the parallel information off the PCI bus and convert it to a serial bit stream so that it can be sent along the network cable; when data is received, the hardware operates in reverse, converting serial to parallel. The new cards replace these shift registers with thirty-two (or maybe sixteen) bit latches, and the network connector at the back of the card has (say) forty pins. This would allow the use of thirty-two pins for data and eight for handshaking, and if the new eighty-core IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards that allows you to connect a flat-screen monitor without going through the D-to-A converters. Each node has its own cable connecting into the network switch which (as the connections are now thirty-two bits wide) would be a 32 x n switch, where 'n' is the number of nodes in the cluster.
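To make the shift-register versus latch comparison concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: both links are assumed to run at the same clock rate, and handshaking, framing and cable skew are ignored. The function names are hypothetical and not taken from any real NIC design.

```python
WORD_BITS = 32  # data width proposed in the post (sixteen is the alternative)

def serialize(word):
    """Model the transmit shift register: emit one bit per clock, MSB first."""
    return [(word >> i) & 1 for i in range(WORD_BITS - 1, -1, -1)]

def deserialize(bits):
    """Model the receive shift register: rebuild the word one bit at a time."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word

def transfer_cycles(num_words, lanes):
    """Clocks needed to move num_words words over `lanes` data lines.

    Assumption: one bit per lane per clock, no protocol overhead.
    """
    total_bits = num_words * WORD_BITS
    return -(-total_bits // lanes)  # ceiling division

word = 0xDEADBEEF
assert deserialize(serialize(word)) == word  # serial round trip is lossless

print(transfer_cycles(1000, 1))   # serial link: 32000 clocks
print(transfer_cycles(1000, 32))  # 32-bit parallel link: 1000 clocks
```

At equal clock rates the parallel link needs one thirty-second of the clocks, which is the whole argument of the proposal. Whether a forty-pin cable could actually be clocked as fast as a serial link is a separate question that this sketch does not address.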
Assuming that the idea can fly we would need to develop the following:
1) The new network cards. This isn't as difficult as it seems, as a lot of the work has already been done by every network card vendor. With modern ASICs, the task of appearing to the system as a NIC whilst presenting the data to the port thirty-two bits at a time could be dealt with by a single chip. All it needs is someone to design the chip. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions, including giving it a MAC address so that a TCP/IP stack could be implemented.
2) The network switch. A network switch handling data thirty-two bits at a time is not a trivial item, but I am sure that it can be done. A number of IC manufacturers have crosspoint switches as part of their catalogue, and all that needs to be done is to expand the process further. Given the nature of the task, it might be possible to carry out the switching using a hardware-only solution, which would reduce latency even further.
3) The software. Assuming that the new cards appear on the PCI bus as an ordinary NIC, drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. Drivers should include all the required software so that the NIC can work with the kernel, but Windows drivers as well would be nice.
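The crosspoint switch from item 2 can be modelled in a few lines. This is a toy sketch, not a hardware design: the class name, the port count of eight and the "refuse if the output is busy" rule are all assumptions made for illustration.

```python
class CrosspointSwitch:
    """Toy n-port crosspoint: each output is driven by at most one input."""

    def __init__(self, ports):
        self.ports = ports
        self.route = {}  # output port -> input port currently connected

    def connect(self, src, dst):
        """Close the crosspoint src -> dst; return False if dst is in use."""
        for port in (src, dst):
            if not 0 <= port < self.ports:
                raise ValueError(f"port {port} out of range")
        if dst in self.route:
            return False  # output busy; the sender would have to wait
        self.route[dst] = src
        return True

    def release(self, dst):
        """Open whatever crosspoint is currently driving dst."""
        self.route.pop(dst, None)

    def clock(self, inputs):
        """Move one 32-bit word across every closed crosspoint in one step."""
        return {dst: inputs[src] & 0xFFFFFFFF for dst, src in self.route.items()}

switch = CrosspointSwitch(ports=8)
switch.connect(0, 5)
switch.connect(3, 1)
words = [0x11111111 * (i + 1) for i in range(8)]  # one word waiting per port
print(switch.clock(words))
```

Independent pairs of nodes transfer in the same clock step, which is where the latency benefit of a hardware-only switch would come from. Contention for a single output still has to be arbitrated somewhere.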
One final thought: this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running MySQL? Well, you now have a nice fast communication medium.
So, there you have it. Assuming this idea works, we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea; I want everyone to take the ideas presented here and play around with them, and if a university student is looking for his (or her) final year project they are welcome to give this a try. Should any of you have comments regarding this idea, then post away. I should however point out that I'm a great fan of practical criticism: feel free to say that the idea sucks, but if you do, say WHY it sucks and HOW it can be improved."
You might want to look into a custom solution using USB 2.0 or FireWire. These can theoretically get you 300+ megabytes per second (the limit of the PCI bus). It won't be an easy solution to pull off, but it is definitely doable.
With 8 nodes, I don't think you even need a switch. Just put a couple of Intel 4-port 100 Mbit cards in each node and link every node directly to every other node.
That should give you lots more bandwidth, and it eliminates an expensive switch and a few nanoseconds of latency.
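The port and cable counts for a switchless full mesh are easy to check. A rough sketch, assuming one dedicated link per pair of nodes and 4-port cards as suggested above (the function names are made up for the example):

```python
def full_mesh(nodes):
    """Return (ports needed per node, total point-to-point cables)."""
    ports_per_node = nodes - 1          # one link to every other node
    cables = nodes * (nodes - 1) // 2   # each cable is shared by two nodes
    return ports_per_node, cables

def cards_needed(nodes, ports_per_card=4):
    """Multi-port cards required in each node (ceiling division)."""
    return -(-(nodes - 1) // ports_per_card)

for n in (4, 8, 16):
    ports, cables = full_mesh(n)
    print(n, "nodes:", ports, "ports each,", cables, "cables,",
          cards_needed(n), "cards per node")
```

For 8 nodes this gives 7 ports per node, 28 cables and two 4-port cards per node, so the suggestion works out. At 16 nodes it is already 120 cables and four cards per node, which is where a switch starts to look cheap.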
Reinard
I don't know if this is a practical possibility, but IIRC Linux 2.4.x can load-balance a single network connection over several physical NICs. Could this not be a "quick and dirty" fix for your problem? It could be a starting point.
Try NetBSD... safe, straightforward, useful.
This is why a beowulf cluster is not always the answer.
Depending on what you are trying to solve, the problem may need to be split up differently. The algorithm to solve the problem and the system you are using need to match well.
Beowulf is great for CPU-intensive tasks with low network usage. Other forms of clustering are good for problems that use shared memory (but this starts to nail the network). Some tasks split up so that a simple queuing system is all that is needed to do the work: all you need is a manager job that determines who does what and whether it was done.
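The "simple queuing system" case can be sketched as follows. This is a single-machine stand-in: threads play the role of cluster nodes, and `run_jobs` and its parameters are hypothetical names invented for the example, not part of any clustering package.

```python
import queue
import threading

def run_jobs(jobs, work, workers=4):
    """Hand out jobs from a shared queue and record which ones finished.

    jobs: dict of job_id -> payload; work: function applied to each payload.
    Returns (results by job_id, set of job_ids that never completed).
    """
    todo = queue.Queue()
    done = {}
    lock = threading.Lock()
    for job_id, payload in jobs.items():
        todo.put((job_id, payload))

    def worker():
        while True:
            try:
                job_id, payload = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to hand out
            result = work(payload)
            with lock:
                done[job_id] = result  # the manager's "it was done" record

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    missing = set(jobs) - set(done)  # jobs to re-queue in a real system
    return done, missing

results, missing = run_jobs({i: i for i in range(10)}, lambda x: x * x)
print(results[3], missing)
```

Workers only touch the network (here, the queue) to fetch a job and report a result, which is why this kind of workload copes with a slow interconnect.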
-Tim
-I just work here... how am I supposed to know?
Write a driver that passes data between machines via the SCSI interface. Put each host controller in the chain on its own ID, tie the networking part of the kernel into the SCSI part of the kernel, wave your magic wand and - **Presto!** Fast, parallel communications (with a lot of the headaches of the communication protocol taken care of by the SCSI command set, which easily allows for "concurrent" connections between multiple "devices").
To scale, put multiple controllers in a high-bandwidth machine, moving data between chains. With 8 machines, there'd be no need because they could all fit on one.