Slashdot Mirror


Developing a New Beowulf Architecture?

Peter Gant asks: "By day I'm the sys admin on a mixed Windows / Linux environment but in my spare time I work on a small Beowulf system comprising eight nodes and a small rack; it's only small but it's mine. The system is put together using 100Mb network cards and a switch and this is the part that's been bugging me: I have a system where the slowest CPU runs at 600MHz but the network links run at (assuming a single start and stop bit) ten million bytes a second. There are of course better ways of doing this. I could rip out all of the 100Mb cards and fit gigabit ethernet running over copper or fiber but most switches only have a single gigabit port and multiple-port gigabit switches are damned expensive. There's also the possibility of using Myrinet but unless I mortgage the house and sell my girlfriend into slavery this isn't a realistic option." It gets more detailed in the article. If you are interested in Beowulf discussions, maybe this question will provide some grist for the grey matter.

"Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe. it's like having a single lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth, it might work but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems, parallel networking. This isn't as silly as it sounds because we use this solution at work to link two switches, two 100Mb network connections are concatenated together to form a single 200Mb link, but what I propose goes further.

The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. So far as the system is concerned each node has a single network card but the interface is where I propose the change. Every network card includes one or more shift registers which take the parallel information off the PCI bus and convert it to a serial bit stream so that it can be sent along the network cable and when data is received the hardware operates in reverse converting serial to parallel. The new cards replace these shift registers with thirty two (or maybe sixteen) bit latches and the network connector at the back of the card has (say) forty pins. This would allow the use of thirty two pins for data and eight for handshaking and if the new eighty-core IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards that allow you to connect a flat screen monitor without going through the D to A convertors. Each node has its own cable connecting into the network switch which (as the connections are now thirty two bits wide) would be a 32 x n switch where 'n' would be the number of nodes in the cluster.
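As a rough illustration of the difference between the shift-register and latch approaches described above, here is a minimal Python sketch. The function names are hypothetical, and it only counts clock ticks per 32-bit word; it ignores handshaking, encoding, and the clock rates that real serial and parallel links can actually reach.

```python
def serialize(word, width=32):
    """Shift a word out one bit per clock tick, least significant bit first."""
    return [(word >> i) & 1 for i in range(width)]


def deserialize(bits):
    """Rebuild the word on the receiving side by shifting the bits back in."""
    word = 0
    for i, bit in enumerate(bits):
        word |= bit << i
    return word


def ticks_per_word(width, lanes):
    """Clock ticks needed to move one word over `lanes` data lines."""
    return -(-width // lanes)  # ceiling division


word = 0xDEADBEEF
assert deserialize(serialize(word)) == word

for lanes in (1, 16, 32):
    print(lanes, "data line(s):", ticks_per_word(32, lanes), "tick(s) per word")
```

At the same clock rate the wide link needs fewer ticks per word; whether the same clock rate is achievable over a wide cable is a separate question.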

Assuming that the idea can fly we would need to develop the following:

1) The new network cards. This isn't as difficult as it seems as a lot of the work has already been done by every network card vendor. With modern ASICs the task of appearing to the system as a NIC whilst presenting the data to the port thirty two bits at a time could be dealt with by a single chip. All it needs is someone to design the chip. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions including giving it a MAC address so that a TCP/IP stack could be implemented.

2) The network switch. A network switch handling data thirty two bits at a time is not a trivial item but I am sure that it can be done. A number of IC manufacturers have crosspoint switches as part of their catalogue and all that needs to be done is to expand the process further. Given the nature of the task it might be possible to carry out the switching using a hardware only solution which would reduce latency even further.

3) The software. Assuming that the new cards appear on the PCI bus as an ordinary NIC then drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. Drivers should include all the required software so that the NIC can work with the kernel but Windows drivers as well would be nice.

One final thought, this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running MySQL? Well, you now have a nice fast communication medium.

So, there you have it. Assuming this idea works then we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea, I want everyone to take the ideas presented here, play around with them, and if a university student is looking for his (or her) final year project they are welcome to give this a try. Should any of you have comments regarding this idea then post away. I should however point out that I'm a great fan of practical criticism, feel free to say that the idea sucks but if you do say WHY it sucks and HOW it can be improved."

8 of 86 comments

  1. Oh my word! by Anonymous Coward · · Score: 5, Funny

    Imagine a Beowulf cluster of these!

  2. Mmmm... by darkov · · Score: 5, Funny

    sell my girlfriend into slavery

    I'll give you ten bucks ... if she's cute ... and a goer.

  3. Load balancing by OrangeSpyderMan · · Score: 4, Insightful

    I don't know if this is a practical possibility, but IIRC Linux 2.4.X can load balance a single network connection over several physical NICs - could this not be a "quick and dirty" fix for your problem? This could be a starting point.
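One way such load balancing could work is round-robin striping, where successive packets go out over successive NICs. The following is a minimal Python sketch of that idea only; the function names and the `eth0`/`eth1` labels are illustrative assumptions, and it does not model how any particular kernel driver behaves.

```python
from itertools import cycle


def stripe_round_robin(packets, nics):
    """Assign each packet to the next NIC in turn; returns {nic: [packets]}."""
    queues = {nic: [] for nic in nics}
    for packet, nic in zip(packets, cycle(nics)):
        queues[nic].append(packet)
    return queues


def aggregate_mbit(nics, link_mbit=100):
    """Ideal combined bandwidth, ignoring reordering and protocol overhead."""
    return len(nics) * link_mbit


nics = ["eth0", "eth1"]  # hypothetical interface names
print(stripe_round_robin(range(5), nics))
print(aggregate_mbit(nics), "Mb/s ideal aggregate")
```

The aggregate figure is a best case: packets arriving out of order on the far side can eat into it.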

    --
    Try NetBSD... safe,straightforward,useful.
  4. Serial is faster by darkov · · Score: 4, Informative

    Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe

    As it happens, parallel interconnection's days are numbered because they are fundamentally limited as transmission speed increases. As the speed goes up you increasingly have problems with things like interactions between data lines and having the data arrive at the same time on each line. So, ironically, fewer lines means you can go faster and provide more bandwidth.
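The skew argument can be made concrete with a toy model. This Python sketch assumes (and it is only an assumption for illustration) that every bit period on a parallel link must cover a fixed inter-line skew plus a sampling window, while a serial link only needs the sampling window. The numbers are hypothetical, and crosstalk is not modelled at all.

```python
def parallel_bandwidth(lines, skew_s, window_s):
    """Bits per second over `lines` wires when each bit period absorbs the skew."""
    return lines / (skew_s + window_s)


def serial_bandwidth(window_s):
    """Bits per second over one wire with no inter-line skew to wait for."""
    return 1.0 / window_s


SKEW = 1e-9  # hypothetical 1 ns of skew between the 32 data lines

# As the sampling window shrinks, the parallel link saturates near 32 / SKEW
# while the serial link keeps scaling.
for window in (1e-8, 1e-9, 1e-10, 1e-11):
    print(window, parallel_bandwidth(32, SKEW, window), serial_bandwidth(window))
```

In this model the wide link wins while the sampling window is large compared with the skew, and loses once the window becomes much smaller than the skew.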

  5. Well, what are you using it for? by 3-State+Bit · · Score: 5, Interesting

    Usually one runs highly parallelizable things on clusters like this. Which means that the computation can be split into nodes easily, without having to constantly share much data between nodes. If you're not highly paralleled, then 12.5 megabytes a second (because that's what 100BaseT is) is going to slow you down less than having a slow front-side bus. (100 MHz? -- the point is, if /that/ is what limits your computation, versus your processor speeds, because you aren't parallelized, then maybe a cluster isn't your best bet.)

    Consider:
    If your nodes need to share more than 12.5 megs of data a second, then you might as well be running 100 megahertz processors.
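The bandwidth figures in this thread are easy to check. This short Python sketch does the arithmetic: 100 Mb/s at 8 bits per byte, the same link at 10 bits per byte (the submitter's start-and-stop-bit assumption), and how many bytes the link can deliver per clock cycle of a 600MHz CPU. The function names are made up for illustration, and Ethernet framing overhead is ignored.

```python
def bytes_per_second(link_mbit, bits_per_byte=8):
    """Raw link rate in bytes per second."""
    return link_mbit * 1_000_000 / bits_per_byte


def bytes_per_cpu_cycle(link_mbit, cpu_mhz):
    """Bytes the link can deliver during one CPU clock cycle."""
    return bytes_per_second(link_mbit) / (cpu_mhz * 1_000_000)


print(bytes_per_second(100))          # 8 bits per byte
print(bytes_per_second(100, 10))      # 10 bits per byte, as in the submission
print(bytes_per_cpu_cycle(100, 600))  # link bytes per cycle of a 600MHz CPU
```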

    Of course, I could just be talking out my ass.

  6. current limitations? by call+-151 · · Score: 4, Informative
    One thing that comes up a lot with Beowulf clusters that people don't always realize is deciding what the bottleneck is. There are basically three possible limitations:
    • Processor-bound clusters- these have adequate network and storage support and are held back by the number of CPUs.
    • Network-bound clusters- these have inadequate network capability and much of the time the processors are waiting for information over the network.
    • Storage-bound clusters- these can have adequate processor-to-processor network capabilities, but share a slow hard drive, so time is spent waiting for IO.

    Of course, the same cluster can be bound in different ways depending upon the applications that are being run. It is important to realize what the limitations are for your desired tasks and focus your improvements there. I have seen several clusters where they spent an ungodly amount of money on Myrinet and a massive amount of time getting it working when they were running easily-parallelizable tasks that were really bound just by the number of CPUs.
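A crude way to apply the three categories above is to time a representative job and see where the wall-clock time goes. The Python sketch below assumes you already have per-job timings from your own profiling; the function name and the sample numbers are hypothetical.

```python
def classify_bottleneck(cpu_s, network_wait_s, storage_wait_s):
    """Label a job by whichever activity consumed the most wall-clock time."""
    timings = {
        "processor-bound": cpu_s,
        "network-bound": network_wait_s,
        "storage-bound": storage_wait_s,
    }
    return max(timings, key=timings.get)


# Hypothetical timings (seconds) for two different applications on one cluster.
print(classify_bottleneck(cpu_s=90, network_wait_s=5, storage_wait_s=5))
print(classify_bottleneck(cpu_s=20, network_wait_s=70, storage_wait_s=10))
```

The same cluster can come out differently for each application, which is the point made above.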
    --
    It's psychosomatic. You need a lobotomy. I'll get a saw.
  7. Architecture matching Algorithm by tolldog · · Score: 5, Insightful

    This is why a beowulf cluster is not always the answer.

    Depending on what you are trying to solve, the problem may need to be split up differently. The algorithm to solve the problem and the system you are using need to match well.

    Beowulf is great for CPU-intensive tasks with low network usage. Other forms of clustering are good for problems that use shared memory (but this starts to nail the network). Some tasks split up so that just a simple queuing system is all that is needed to do the work; all that you need to be able to do is have a manager job determine who does what and if it was done.
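The simple queuing approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads on one machine stand in for cluster nodes, and `run_jobs` is a hypothetical name. A manager fills a queue, workers pull jobs until it is empty, and the manager records what was done.

```python
import queue
import threading


def run_jobs(jobs, worker_count=4):
    """Hand out (job_id, func, arg) jobs from a queue and collect the results."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job_id, func, arg = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            value = func(arg)
            with lock:
                results[job_id] = value  # record that the job was done

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


jobs = [(i, lambda x: x * x, i) for i in range(8)]
print(run_jobs(jobs))
```

Because the jobs never talk to each other, the only traffic is handing out work and collecting answers, which is why this style of problem puts so little load on the network.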

    -Tim

    --
    -I just work here... how am I supposed to know?
  8. DUDE! by floydigus · · Score: 4, Funny

    Imagine one node of this thing on its own!

    --

    All things in moderation; including moderation