Slashdot Mirror


Developing a New Beowulf Architecture?

Peter Gant asks: "By day I'm the sys admin on a mixed Windows / Linux environment, but in my spare time I work on a small Beowulf system comprising eight nodes and a small rack. It's only small, but it's mine. The system is put together using 100Mb network cards and a switch, and this is the part that's been bugging me: I have a system where the slowest CPU runs at 600MHz but the network links run at (assuming a single start and stop bit) ten million bytes a second. There are of course better ways of doing this. I could rip out all of the 100Mb cards and fit gigabit ethernet running over copper or fiber, but most switches only have a single gigabit port and multiple-port gigabit switches are damned expensive. There's also the possibility of using Myrinet, but unless I mortgage the house and sell my girlfriend into slavery this isn't a realistic option." It gets more detailed in the article. If you are interested in Beowulf discussions, maybe this question will provide some grist for the grey matter.

"Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe. it's like having a single lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth, it might work but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems, parallel networking. This isn't as silly as it sounds because we use this solution at work to link two switches, two 100Mb network connections are concatenated together to form a single 200Mb link, but what I propose goes further.

The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. So far as the system is concerned, each node has a single network card, but the interface is where I propose the change. Every network card includes one or more shift registers which take the parallel information off the PCI bus and convert it to a serial bit stream so that it can be sent along the network cable; when data is received, the hardware operates in reverse, converting serial to parallel. The new cards replace these shift registers with thirty two (or maybe sixteen) bit latches, and the network connector at the back of the card has (say) forty pins. This would allow the use of thirty two pins for data and eight for handshaking, and if the new eighty-core IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards that allows you to connect a flat screen monitor without going through the D to A converters. Each node has its own cable connecting into the network switch which (as the connections are now thirty two bits wide) would be a 32 x n switch, where 'n' would be the number of nodes in the cluster.
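
The difference between the shift-register path and the proposed latch path can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the tick counts assume that the serial and parallel links run at the same clock rate, which the proposal does not establish.

```python
def serialize_word(word, width=32):
    """Model a shift register: emit one bit per clock tick, LSB first."""
    bits = []
    for _ in range(width):
        bits.append(word & 1)
        word >>= 1
    return bits


def deserialize_bits(bits):
    """Model the receiving shift register: rebuild the word from LSB-first bits."""
    word = 0
    for position, bit in enumerate(bits):
        word |= bit << position
    return word


def clocks_to_send(num_words, width=32, lanes=1):
    """Clock ticks needed to move num_words words of `width` bits over `lanes` data wires.

    lanes=1 models a serial link; lanes=32 models the proposed 32-bit latch.
    Assumes (hypothetically) the same clock rate for both cases.
    """
    ticks_per_word = -(-width // lanes)  # ceiling division
    return num_words * ticks_per_word
```

Under that equal-clock assumption, `clocks_to_send(1000, lanes=1)` and `clocks_to_send(1000, lanes=32)` differ by a factor of thirty two, which is the gain the proposal is counting on.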

Assuming that the idea can fly we would need to develop the following:

1) The new network cards. This isn't as difficult as it seems as a lot of the work has already been done by every network card vendor. With modern ASICs the task of appearing to the system as a NIC whilst presenting the data to the port thirty two bits at a time could be dealt with by a single chip. All it needs is someone to design the chip. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions including giving it a MAC address so that a TCP/IP stack could be implemented.

2) The network switch. A network switch handling data thirty two bits at a time is not a trivial item but I am sure that it can be done. A number of IC manufacturers have crosspoint switches as part of their catalogue and all that needs to be done is to expand the process further. Given the nature of the task it might be possible to carry out the switching using a hardware only solution which would reduce latency even further.
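
As a rough software model of the crosspoint idea in item 2, the sketch below treats the switch as a table of input-to-output connections that passes 32-bit words straight through. The class and method names are hypothetical, and real crosspoint hardware would also need arbitration and the handshaking lines mentioned earlier, none of which is modelled here.

```python
class CrosspointSwitch:
    """Toy model of an n-port crosspoint switch carrying 32-bit words."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.routes = {}  # input port -> output port

    def connect(self, src, dst):
        """Close the crosspoint between src and dst if dst is not already in use."""
        for port in (src, dst):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} out of range")
        if any(s != src and d == dst for s, d in self.routes.items()):
            raise RuntimeError(f"output port {dst} is busy")
        self.routes[src] = dst

    def disconnect(self, src):
        """Open the crosspoint for src, if one is closed."""
        self.routes.pop(src, None)

    def forward(self, src, word):
        """Pass one 32-bit word from src to its connected output port."""
        dst = self.routes[src]
        return dst, word & 0xFFFFFFFF  # only 32 data lines exist
```

For the eight-node cluster described in the question, `CrosspointSwitch(8)` would stand in for the 32 x n switch with n = 8.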

3) The software. Assuming that the new cards appear on the PCI bus as an ordinary NIC, drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. Drivers should include all the required software so that the NIC can work with the kernel, but Windows drivers as well would be nice.

One final thought: this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running MySQL? Well, you now have a nice fast communication medium.

So, there you have it. Assuming this idea works, we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea; I want everyone to take the ideas presented here and play around with them, and if a university student is looking for his (or her) final year project, they are welcome to give this a try. Should any of you have comments regarding this idea, then post away. I should however point out that I'm a great fan of practical criticism: feel free to say that the idea sucks, but if you do, say WHY it sucks and HOW it can be improved."

2 of 86 comments

  1. Well, what are you using it for? by 3-State Bit · Score: 5, Interesting

    Usually one runs highly parallelizable things on clusters like this, which means that the computation can be split across nodes easily, without having to constantly share much data between nodes. If you're not highly parallelized, then 12.5 megabytes a second (because that's what 100bT is) is going to slow you down less than having a slow front-side bus. (100 MHz? -- the point is, if /that/ is what limits your computation, versus your processor speeds, because you aren't parallelized, then maybe a cluster isn't your best bet.)

    Consider:
    If your nodes need to share more than 12.5 megs of data a second, then you might as well be running 100 megahertz processors.

    Of course, I could just be talking out my ass.
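
The two bandwidth figures quoted in this thread (ten million bytes a second in the question, 12.5 megabytes a second in the comment above) come from dividing the same 100 Mb/s by a different number of bits per byte. A small sketch of that arithmetic follows; the function and constant names are made up for illustration, and neither figure accounts for Ethernet framing or protocol overhead.

```python
LINK_BITS_PER_SECOND = 100_000_000  # 100 Mb/s link, as described in the question


def payload_bytes_per_second(link_bps, bits_on_wire_per_byte):
    """Bytes delivered per second if each byte costs this many bits on the wire."""
    return link_bps / bits_on_wire_per_byte


# 8 bits per byte, no extra framing bits: 12.5 million bytes per second
raw = payload_bytes_per_second(LINK_BITS_PER_SECOND, 8)

# 8 data bits plus one start and one stop bit, as the question assumes:
# 10 million bytes per second
framed = payload_bytes_per_second(LINK_BITS_PER_SECOND, 10)
```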

  2. Working on a similar problem by Outland Traveller · Score: 3, Interesting

    I've been experimenting with Gigabit Ethernet lately.

    The good news is that it's less expensive than you think. Decent cards are only marginally more expensive than good 100bT cards, and Netgear now makes a reasonably priced 8-port gigabit switch. It doesn't support jumbo frames but it's quite usable for small networks.

    The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols. NFS in particular sees only modest gains, even when using NFSv3, increasing the block sizes, and tuning the kernel buffers/TCP options. I'm still showing bandwidth bottlenecks on the network when I should be seeing bandwidth bottlenecks on the disk array.

    It would appear that something isn't scaling. Given that network benchmarking tools do show gigabit ethernet performing at a reasonable speed, it would appear that most "legacy" protocols are not architected to take advantage of it.
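
One way to see why block size matters on a faster link is a simple request/response model. The sketch below is not a diagnosis of the setup described above: it assumes a hypothetical client that keeps only one request outstanding at a time, and the round-trip times passed in are made-up inputs rather than measurements. Real NFS clients may issue several requests in parallel.

```python
def stop_and_wait_throughput(block_bytes, round_trip_microseconds):
    """Bytes per second if one block moves per round trip, one request at a time."""
    return block_bytes * 1_000_000 / round_trip_microseconds


def blocks_per_second_to_fill(link_bps, block_bytes):
    """How many blocks per second are needed to saturate the link's raw bit rate."""
    return link_bps / 8 / block_bytes


# Hypothetical inputs: 8 KiB blocks on a gigabit link with a 1000 microsecond round trip
needed = blocks_per_second_to_fill(1_000_000_000, 8192)
achieved = stop_and_wait_throughput(8192, 1000)
```

In this model, throughput is set by block size and round-trip time alone, so raising the link speed without changing either would leave it unchanged.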