Developing a New Beowulf Architecture?

← Back to Stories (view on slashdot.org)

Developing a New Beowulf Architecture?

Posted by Cliff on Thursday November 14, 2002 @09:24AM from the finding-the-bottleneck dept.

Peter Gant asks: "By day I'm the sys admin on a mixed Windows / Linux environment but in my spare time I work on a small Beowolf system comprising eight nodes and a small rack, it's only small but it's mine . The system is put together using 100Mb network cards and a switch and this is that part that's been bugging me, I have a system where the slowest CPU runs at 600MHz but the network links run at (assuming a single start and stop bit) ten million bytes a second. There are of course better ways of doing this. I could rip out all of the 100Mb cards and fit gigabit ethernet running over copper or fiber but most switches only have a single gigabit port and multiple-port gigabit switches are damned expensive. There's also the possibility of using Myrinet but unless I mortgage the house and sell my girlfriend into slavery this isn't a realistic option." It gets more detailed in the article. If you are interested in Beowulf discussions, maybe this question will provide some grist for the grey matter.

"Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe. it's like having a single lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth, it might work but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems, parallel networking. This isn't as silly as it sounds because we use this solution at work to link two switches, two 100Mb network connections are concatenated together to form a single 200Mb link, but what I propose goes further.

The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. So far as the system is concerned each node has a single network card but the interface is where I propose the change. Every network card includes one or more shift registers which take the parallel information off the PCI bus and convert it to a serial bit stream so that it can be sent along the network cable and when data is received the hardware operates in reverse converting serial to parallel. The new cards replace these shift registers with thirty two (or maybe sixteen) bit latches and the network connector at the back of the card has (say) forty pins. This would allow the use of thirty two pins for data and eight for handshaking and if the new eighty-core IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards that allow you to connect a flat screen monitor without going through the D to A convertors. Each node has its own cable connecting into the network switch which (as the connections are now thirty two bits wide) would be a 32 x n switch where 'n' would be the number of nodes in the cluster.

Assuming that the idea can fly we would need to develop the following:

1) The new network cards. This isn't as difficult as it seems as a lot of the work has already been done by every network card vendor. With modern ASICs the task of appearing to the system as a NIC whilst presenting the data to the port thirty two bits at a time could be dealt with by a single chip. All it needs is someone to design the chip. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions including giving it a MAC address so that a TCP/IP stack could be implemented.

2) The network switch. A network switch handling data thirty two bits at a time is not a trivial item but I am sure that it can be done. A number of IC manufacturers have crosspoint switches as part of their catalogue and all that needs to be done is to expand the process further. Given the nature of the task it might be possible to carry out the switching using a hardware only solution which would reduce latency even further.

3) The software. Assuming that the new cards appear on the PCI bus as an ordinary NIC then drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. Drivers should include all the required software so that the NIC can work with the kernel but windows drivers as well would be nice.

One final thought, this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running My SQL ? Well, you now have a nice fast communication medium.

So, there you have it. Assuming this idea works then we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea, I want everyone to take the ideas presented here, play around with them, and if a university student is looking for his (or her) final year project they are welcome to give this a try. Should any of you have comments regarding this idea then post away. I should however point out that I'm a great fan of practical criticism, feel free to say that the idea sucks but if you do say WHY it sucks and HOW it can be improved."

13 of 86 comments (clear)

Min score:

Reason:

Sort:

Oh my word! by Anonymous Coward · 2002-11-14 09:26 · Score: 5, Funny

Imagine a Beowulf cluster of these!
Mmmm... by darkov · 2002-11-14 09:29 · Score: 5, Funny

sell my girlfriend into slavery

I'll give you ten bucks ... if she's cute ... and a goer.

--
Reliable, Great Value Hosting: $7.95/mo 2.4G/120G
Firewire or USB2.0 by Speedy8 · 2002-11-14 09:33 · Score: 3, Insightful

You might want to look into a custom solution using USB2.0 or Firewire. These can theoretically get you 300+ Megabytes per second (Limit of the PCI bus). It won't be an easy solution to pull off but it is definantly doable.
1. Re:Firewire or USB2.0 by Ashran · 2002-11-14 10:02 · Score: 3, Informative
  
  USB2: 480 MBit/s -> 60 Megabyte/sec
  Firewire: 400 MBit/s -> 50 Megabytes/sec
  
  which is still a lot faster than 100mbit cards
  
  --
  
  Before you email me, remember: "There is no god!"
Load balancing by OrangeSpyderMan · 2002-11-14 09:35 · Score: 4, Insightful

I don't know is this is a pratical possibility, but IIRC Linux 2.4.X can load balance a single network connection over several physical NICs - could this not be a "quick and dirty" for your problem? This could be a starting point..

--
Try NetBSD... safe,straightforward,useful.
Serial is faster by darkov · 2002-11-14 09:37 · Score: 4, Informative

Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe

As it happens, parrallel interconnection's days are numbered becuase they are fundamentally limited as tranmission speed increases. As the speed goes up you increasingly have problems with things like interactions between data lines and having the data arrive at the same time on each line. So, ironically, less lines means you can go faster and provide more bandwidth.

--
Reliable, Great Value Hosting: $7.95/mo 2.4G/120G
Well, what are you using it for? by 3-State+Bit · 2002-11-14 09:47 · Score: 5, Interesting

Usually one runs highly parallelizable things on clusters like this. Which means that the computation can be split into nodes easily, without having to constantly share much data between nodes. If you're not highly paralleled, then 12.5 megabytes a second (because that's what 100bt is) is going to slow you down less than having a slow front-side bus. (100 mhz? -- the point is, if /that/ is what limits your computation, versus your processor speeds, because you aren't parallelized, then maybe a cluster isn't your best bet.)

Consider:
If your nodes need to share more than 12.5 megs of data second, then you might as well be running 100 megahertz processors.

Of course, I could just be talking out my ass.
Working on a similar problem by Outland+Traveller · 2002-11-14 09:49 · Score: 3, Interesting

I've been experimenting with Gigabit Ethernet lately.

The good news is that it's less expensive than you think. Decent cards are only marginally more expensive than good 100bT cards, and netgear now makes a reasonably prices 8port gigibit switch. It doesn't support jumbo frames but it's quite usuable for small networks.

The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols. NFS in particular sees only modest gains, even when using nfsv3 and increasing the block sizes and tuning the kernel buffers/TCP options. I'm still showing bandwidth bottlenecks on the network when I should be seeing bandwidth bottlenecks on the disk array.

It would appear that something isn't scaling. Given that network benchmarking tools do show gigabit ethernet performing at a reasonable speed, it would appear that most "legacy" protocols are not architected to take advantage of it.
current limitations? by call+-151 · 2002-11-14 09:51 · Score: 4, Informative
One thing that comes up a lot with Beowulf clusters that people don't always realize is deciding what the bottleneck is. There are basically three possible limitations:
- Processor-bound clusters- these have adequate network and storage support and are held back by the number of CPUs.
- Network-bound clusters- these have inadequate network capability and much of the time the processors are waiting for information over the network.
- Storage-bound clusters- these can have adequate processor-to-processor network capabilities, but share a slow hard drive, so time is spent waiting for IO.
Of course, the same cluster can be bound in different ways depending upon the applications that are being run. It is important to realize what the limitations are for your desired tasks and focus your improvements there. I have seen several clusters where they spent an ungodly amount of money on Mirrinet and a massive amount of time getting it working when they were running easily-parallelizable tasks that were really bound just by the number of CPUs.
--
It's psychosomatic. You need a lobotomy. I'll get a saw.
Architecture matching Algorithm by tolldog · 2002-11-14 09:57 · Score: 5, Insightful

This is why a beowulf cluster is not always the answer.

Depending on what you are trying to solve, the problem may need to be split up differently. The algorithm to solve the problem and the system you are using need to match well.

Beowulf is great for high cpu intensive tasks with low network useage. Other forms of clustering are good for problems that use shared memory (but this starts to nail the network). Some tasks split up so that just a simple queuing system is all that is needed to do the work, all that you need to be able to do is have a manager job determine who does what and if it was done.

-Tim

--
-I just work here... how am I supposed to know?
Use and existing solution - SCSI by QuietRiot · 2002-11-14 10:24 · Score: 3, Insightful

Write a driver that passes data between machines via the SCSI interface. Put each host controller in the chain on it's own ID, tie the networking part of the kernel into the SCSI part of the kernel, wave your magic wand and - **Presto!** Fast, parallel communications (with a lot of the headaches of the communication protocol taken care of by the SCSI command set -- allows for "concurrent" connections between multiple "devices" easily).

To scale, put multiple controllers in a high bandwith machine, moving data between chains. With 8 machines, there'd be no need because they could all fit on one.
This is how it was done ! by PaulBu · 2002-11-14 10:24 · Score: 3, Informative

Actually I had the pleasure of meeting fathers
of the original Beowulf (Don Becker and Tom Sterling)
and their story goes that it was originally built
exactly like this: a bunch of 4-port cards
connected in hypercube configuration. By the
way, for this case you can scale it to 2^4=16
nodes with 3 hops worst case latency.

The reason for this was that Don at the time was
writing a linux driver for that particular card
and needed some justification for that
activity... :)

Problem with this approach is that you do message routing in software running on your computational
nodes, not too efficient compared to dedicated
hardware on a switch. Thus, switches were used
ever since...

Paul B.
DUDE! by floydigus · 2002-11-14 10:27 · Score: 4, Funny

Imagine one node of this thing on it's own!

--
All things in moderation; including moderation