Cluster Interconnect Review
deadline writes to tell us that Cluster Monkeys has an interesting review of cluster interconnects. From the article: "An often asked question from both 'clusters newbies' and experienced cluster users is, 'what kind of interconnects are available?' The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance of the codes and the scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice and ultimately the choice depends upon on your code, requirements, and budget."
Wow. No comments (except for the one idiot who's already been modded to -1), and the site is already slashdotted?
I'm not sure whether I should be annoyed, or amused.
Topher
Apparently one of the cluster interconnects on this site failed.
Proof by very large bribes. QED.
Interesting article, but I'm not sure how many Slashdotters can fit a cluster powerful enough to saturate a GigE interconnect in their mother's basement.
"Oh boy"
I bet they use the $32 interconnect for their server
because some people practice what they preach
Storm
Where I work, I deal with 30-40GBps average read/write total throughput on our distributed filesystem using GigE and Cisco 6509s.
I have trouble imagining an application that could eat up more than that. It's bananas.
"every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine of the data is destined for that NIC"
Uhm.. no. That is only the case if the NIC has been set into promiscuous mode, or if it has joined too many multicast groups.
"...because the packets are not TCP, they are not routable across Ethernet"
Uhm. If they were IP they would still be routable. I suspect he meant "not IP".
I also get irritated by the spelling out of "Sixty-four-bit PCI"
But the article still has a lot of good reviews and a load of links to other sites with interesting info.
Imagine I have a simulation I'm running that takes a trace stream as input. The traces are between 5-50GB each. I want to run a parameter sweep of my simulation. So I submit 50 jobs to the cluster. If I have just GB ethernet I will completely saturate it whenever the jobs are trying to access the trace stream. Solution? Use bittorrent to copy the files to the local nodes and then run locally. This only requires 100GB of data from the file server and only requires it once. So filling up a GB switch is easy in my experience.
It's "real programmer" jargon for programs that do a lot of number crunching, like physics or weather simulations.
Mea navis aericumbens anguillis abundat
"codes" used like that is a term for parallel software that's particularly prevalent amongst the number-crunching crowd.
But he's probably talking about some kind of application that's intended for local-area application only and wants to avoid the overhead of TCP, UDP, and IP addressing, header-bit-twiddling, flow control, slow-start, kernel implementations optimized for wide-area general-purpose Internet networks, etc., and rolls its own protocols that assume a simpler problem definition, much different response times, and probably just pastes some simple packet counters over Layer 2 Ethernet, probably with jumbo frames.
If you've implemented your ugly hackery properly, you still _could_ bridge it over wide areas using standard routers even though it doesn't have an IP layer. That doesn't mean it would work well - TCP's flow control mechanisms were designed (particularly during Van Jacobson's re-engineering) to deal with real-world problems including buffer allocation at destination machines and intermediate routers and congestion behaviour in wide-area networks with lots of contending systems, which a LAN-cluster protocol might not handle because it "knows" problems like that won't happen. Timing mechanisms are especially dodgy - they might have enough buffering to allow for LAN latencies in the small microseconds, but not enough to support Wide Area Network latencies that include speed-of-light delay (~1ms per 100 miles one-way in fiber) or insertion delay (spooling the packet onto a transmission line takes time, e.g. a 1500-byte packet onto a 1.5 Mbps T1 line takes about 8ms, and jumbo frames obviously take longer).
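To put rough numbers on the two delays mentioned above, here is a quick back-of-the-envelope sketch. The function names are made up for illustration, and the defaults are just the rule-of-thumb figures from this comment (about 1ms per 100 miles in fiber, a T1 rounded to 1.5 Mbps), not measured values.

```python
def propagation_delay_ms(miles, ms_per_100_miles=1.0):
    """One-way speed-of-light delay in fiber, using the ~1ms/100 miles rule of thumb."""
    return miles / 100.0 * ms_per_100_miles


def insertion_delay_ms(frame_bytes, link_bps):
    """Time to clock one frame onto the wire: bits divided by line rate."""
    return frame_bytes * 8 / link_bps * 1000.0


T1_BPS = 1.5e6  # assumption: T1 rounded to 1.5 Mbps, as in the comment

for size in (1500, 9000):  # standard frame vs. a typical jumbo frame
    print(f"{size}-byte frame on T1: {insertion_delay_ms(size, T1_BPS):.1f} ms")

print(f"1000 miles one-way: {propagation_delay_ms(1000):.1f} ms")
```

A protocol whose timers and buffers were sized for LAN round trips in the microseconds has to absorb both terms once it is bridged over a slow wide-area link.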
Bill Stewart
s/codes/nodes/
It goes without saying, this article must have been written by imbeciles as opposed to "real" monkeys; it is just atrocious.
5 mins of my life well spent. What a bargain.
/. is good for you.
Heh, alright. As an entirely different type of "real programmer", it just sounds ignorant to me. "Codes" is usually a word used by people who don't understand that code is code is code.
though I would argue that there was too much time spent discussing GigE, and not enough on the performance and scaling issues seen with the more exotic cards.
Not a technical issue, but a little note about the Infiniband cards reading, "Unlike the alternatives, just try to get information on pricing one of these without leaving all of your contact information for a salesman to use now and in perpetuity." I've been through this recently, and have considered (given the similar performance), purchasing Myrinet because they post their prices out in the open, so that you can make some informed decisions before calling their salescritter.
More technically, some analysis of the stability and maturity of the software stack would be nice. We owned Dolphinics SCI cards once (2001 ish), and while blazingly fast when they worked, on our Opterons the MPI system would mysteriously shut down. They were also very closed and proprietary about their software at that point, so we went round and round over the early 2.4 drivers. Myri, while more expensive, was also more stable.
Finally, to simply geek out for a moment, I saw numbers for the Quadrics cards once. PNNL built their Itanium-2 cluster with multiple Quadrics cards per machine to get the bandwidth high enough for their chemistry apps. Light on details, but found at http://www.emsl.pnl.gov/using-emsl/tour/lab.php?facility=msc&lab=vr1119/ How to tie together 980 dual-proc Itanium-2 systems.
the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
I don't know if I'd call it "interesting." More like the third seal of the apocalypse has just been broken.
We found that for disk intensive parallel computing that gigabit ethernet can be almost as fast as very, very expensive networking equipment. Of course our throughput requirements are small compared to many other types of applications. So try it cheap before you invest in the expensive networking equipment. You can always use the cheap stuff for login type work and distributed shells if you opt for the expensive equipment later.
It's not necessarily a sign of ignorance. Some of the best programmers I've known said "codes" instead of "code." These guys were from India, where everybody that speaks English says "codes." It's a cultural difference.
... codes ... ... code ... ... codes ...
Every other week (yes I'm exaggerating), we'd have a conversation like this:
Coworker(C):
Me(M): Dude, you just said "codes" again.
C: Sorry.
C:
[ less than a minute passes ]
C:
M: *looks at him funny*
C: Sorry. I can't help it. I learned it as codes. Where I'm from everybody says codes.
What sort of fun projects could a home experimenter with a pile of hardware dive into? It sounds like all these machines are used for a fairly narrow set of scientific applications. Anything a non-academic would find interesting?
Well, you obviously aren't pushing 30 gigabytes (your capital B did mean bytes, right?) per second down gig-e links unless you are running several hundred in parallel.
note: I'm known as plugwash most places but I screwed up registering that here somehow in the past and now can't register
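For anyone who wants to check the "several hundred" figure, the arithmetic is sketched below. `links_needed` is a made-up helper name, and the `efficiency` parameter is an assumption added to account for protocol overhead; at 1.0 it is the ideal raw line rate.

```python
import math


def links_needed(total_bytes_per_s, link_bits_per_s=1e9, efficiency=1.0):
    """Minimum number of links to carry an aggregate rate given in bytes per second.

    efficiency is the assumed usable fraction of each link's raw bit rate.
    """
    return math.ceil(total_bytes_per_s * 8 / (link_bits_per_s * efficiency))


print(links_needed(30e9))       # 30 GB/s (bytes) over ideal GigE links -> 240
print(links_needed(40e9))       # 40 GB/s -> 320
print(links_needed(30e9 / 8))   # if the B had meant bits: 30 Gbit/s -> 30
```

So 30-40 GB/s really does mean a few hundred GigE ports running flat out, while 30-40 Gbit/s would only need a few dozen.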
Lately, the big contenders are:
-Ethernet
-Infiniband
-Myrinet
I haven't heard much about SCI or Quadrics lately, and just these three have been tossed around a lot lately. Points on each:
-Ethernet is cheap and frequently adequate. Low throughput and high latency, but it's OK. 10GbE is starting to proliferate to eliminate the throughput shortcomings, and RDMA is starting to help latency for particular applications. Note that although the overwhelming majority of Ethernet clusters use the IP stack to communicate, that is not exclusively true. There are MPI implementations available that sit directly on the Ethernet layer. This bypasses the OS IP stack, which can be very slow, and reduces overhead per message. Increasing the MTU also helps throughput efficiency. But for now only 1 Gigabit Ethernet is remotely affordable at any scale (primarily due to current 10GbE switch densities/prices; the adapters are no more expensive than Myrinet/Infiniband).
-Myrinet. With their PCI-E cards they achieve about 2 GBytes/sec bidirectional throughput, very nearly demonstrating full saturation of their 10GBit fabric. They also are among the lowest latency, sitting right at about 2.5 microseconds node-to-node as a PingPong minimum. Currently the highest single-link throughput technology realistically available to a customer (Infiniband SDR doesn't quite achieve it, falling about 200 or so MByte/s short, but DDR will overtake it once it is realistically available). Very focused on HPC and, until recently, also the only popular high-speed cluster interconnect that was very mature, easy to set up and maintain, and efficient. Now they are starting to embrace more interoperability with 10GbE, probably in response to the rise of Infiniband.
-Infiniband. Until very recently immature: huge memory consumption for large MPI jobs, a software stack that is highly complex and not easily maintainable, and the prominent vendor of chips (Mellanox) didn't achieve good latency. With Mellanox chips you are lucky to get into the 4 microsecond range or so. With Pathscale's alternative implementation (particularly on HTX), the lowest-latency interconnect becomes possible (I have done runs with 1.5 microsecond end-to-end latency even with a switch involved). The maximum throughput is on the order of 1.7-1.8 GByte/s and, more importantly, it is one of the faster technologies in ramping up to that. No technology achieves its peak throughput until about 4 MB message sizes, and Pathscale IB is a remarkably good performer down to 16k-32k message sizes. Additionally, IB has a broader focus and some interesting efforts. They make efforts to not only be a good HPC interconnect, but also to be a good SAN architecture that in many ways significantly outshines Fibre Channel. The OpenIB efforts are interesting as well. The huge downside is that, for whatever reason, no Infiniband provider has been able to demonstrate good IP performance over their technology. This is particularly an issue because almost all methods of sharing storage from hosts are IP-based. SRP is OK for the little flexibility that strategy gives by being Fibre Channel-like, but NFS, SMB, and image access like NBD and iSCSI all perform very poorly on Infiniband compared to Myrinet. iSER promises to alleviate that, but for the moment you are restricted to performance on the order of 2.4 Gigabit/s for IP transactions. Myrinet has been able to deliver 6-7 Gigabit/s for the same measurements. You could overcome this by sharing storage enclosures and using something like Lustre, GFS, or GPFS to communicate more directly with the storage over SRP, but generally speaking some applications demand flexibility not achievable without IP performance.
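On the Ethernet point about MTU: a rough sketch of why a bigger MTU helps is below. The overhead constants are assumptions for plain TCP over IPv4 on untagged Ethernet (no TCP or IP options, no VLAN tag), and the function names are made up for illustration.

```python
# Per-frame bytes on the wire that are not part of the MTU:
# preamble + SFD (8), Ethernet header (14), FCS (4), inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12
# Assumed IPv4 + TCP headers without options, carried inside the MTU.
IP_TCP_HEADERS = 20 + 20


def payload_efficiency(mtu):
    """Fraction of wire time that carries application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(mtu, link_bps=1e9):
    """Full-size frames per second needed to keep the link saturated."""
    return link_bps / 8 / (mtu + ETH_OVERHEAD)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.1%} payload, "
          f"{frames_per_second(mtu):,.0f} frames/s at 1 Gbit/s")
```

Under these assumptions the wire-level gain from jumbo frames is only a few percent; the bigger win is usually the several-fold drop in frames per second, which means correspondingly less per-packet work for the host.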
And at the end of the day, I come home and run my home network on 100MBit ethernet, sigh. It is enough to run a diskless MythFrontend for HD content at least.
XML is like violence. If it doesn't solve the problem, use more.
For those not aware of how ethernet is limited latency wise regardless of what is done, I will explain a tad.
Ethernet is well architected for large (enterprise-wide) deployments, with the packet routing (not IP routing) done on the switches. Meaning a computer sending a packet asks its switch to get it to 0A:0B:0C:01:02:03, having no idea where it will go. The switch only knows its immediate neighbors, and will check/populate its MAC address table to figure out the next entity to hand off to. This means switches have to be really powerful because they are responsible for a lot of heavy lifting for all the relatively dumb nodes. This is not TCP, it is not IP, it is the raw reality of Ethernet networking. Aside from spanning tree (which is maintained only to keep a network from getting screwed over by incorrect connections, not for performance), no single entity in the network has a map of how things look beyond its immediate neighbors.
IB, Myrinet, etc, are source routed. Every node has a full network map of every switch and system in the fabric. The task of computing communication pathways is distributed rather than concentrated (fits well with the whole point of clusters). node1 doesn't blindly say to the switch, 'send this to node636', it says to switch 'send this to port 5, and the next switch, put it out port 2, and the next switch, do port 9 and then it should be where it needs to be'.
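A toy model of the source-routing idea described above, for the curious. The topology, port numbers, and names (`UPLINK`, `SWITCH_PORTS`, `compute_route`, `forward`) are all invented for illustration; this is not how any real fabric encodes its routes, just the shape of the idea: the sender works out the whole path from its own map, and each switch only follows the next port number in the header.

```python
from collections import deque

# Hypothetical fabric map that every node is assumed to hold.
UPLINK = {"node1": "sw1", "node636": "sw3"}   # node -> switch it plugs into
SWITCH_PORTS = {                               # switch -> {output port: neighbour}
    "sw1": {1: "node1", 5: "sw2"},
    "sw2": {3: "sw1", 2: "sw3"},
    "sw3": {4: "sw2", 9: "node636"},
}


def compute_route(src, dst):
    """Sender side: breadth-first search over the map, returning one output port per switch."""
    start = UPLINK[src]
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        switch, ports = queue.popleft()
        for port, neighbour in SWITCH_PORTS[switch].items():
            if neighbour == dst:
                return ports + [port]
            if neighbour in SWITCH_PORTS and neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, ports + [port]))
    raise ValueError(f"no path from {src} to {dst}")


def forward(src, route):
    """Switch side: no address lookup, just follow the next port in the header."""
    device = UPLINK[src]
    for port in route:
        device = SWITCH_PORTS[device][port]
    return device


route = compute_route("node1", "node636")
print(route)                    # [5, 2, 9]
print(forward("node1", route))  # node636
```

The route that comes out matches the 'port 5, then port 2, then port 9' example: all the thinking happens in `compute_route` on the sending node, and `forward` does nothing but index into a port list.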
There are more complicated issues there, but the lion's share of the inherent strength of non-Ethernet interconnects is this.
One, the price of all this stuff is exorbitant, and most home applications could barely benefit from going from 100 MBit to Gigabit. Realistically, getting 1.5 microsecond latency and the ability to transfer GigaBYTES per second has no home use right now. Really exorbitant high-definition streams top out at about 20 MBit/second for 1920x1080 MPEG-2, and of course no game demands that much throughput. Hard drives for home use can only theoretically dump out 300 MB/s or so anyway (SATA II), and realistically, except for cache operations, you almost never achieve it.
Going to gigabit Ethernet makes diskless systems close to theoretically working as fast as UDMA 66 drives, which allows for fun home projects working more smoothly. Latency for network operations is already similar to drive seek times, so going to insanely low latency won't help too much either.
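A quick unit-conversion sketch to line those numbers up. It uses raw nominal rates only and ignores protocol overhead; the helper names are made up, and the 66 MB/s figure is the nominal UDMA 66 interface rate, not a measured drive speed.

```python
def mbit_to_mbyte_per_s(mbit_per_s):
    """Convert a raw line rate in Mbit/s to MB/s (8 bits per byte, no overhead)."""
    return mbit_per_s / 8.0


def streams_per_link(link_mbit, stream_mbit=20):
    """How many streams of a given bit rate fit in the raw link rate."""
    return int(link_mbit // stream_mbit)


UDMA66_MB_S = 66.0   # nominal interface rate
SATA2_MB_S = 300.0   # nominal interface rate quoted above

for name, mbit in (("100 MBit Ethernet", 100), ("Gigabit Ethernet", 1000)):
    print(f"{name}: {mbit_to_mbyte_per_s(mbit):.1f} MB/s raw, "
          f"room for {streams_per_link(mbit)} HD streams at 20 MBit/s")

print(f"UDMA 66: {UDMA66_MB_S:.0f} MB/s, SATA II: {SATA2_MB_S:.0f} MB/s (nominal)")
```

On raw numbers, 100 MBit already carries several HD streams, and gigabit is in the same ballpark as a UDMA 66 interface, which is why there is little left at home for the exotic fabrics to speed up.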
Systems that benefit from this have to have large (many-drive) storage architectures to pull throughput from and large numbers of systems to have enough computational data to make the interconnect fabrics worth while. Before you begin to ever approach a system that large, your power/cooling bill would be insane.
If you are into this stuff for its own sake, you can learn most of the principles involved with good old Ethernet, and fill the gaps with Google research. It is undeniable that you learn more hands-on, but if you ever really need to use it at a company or something and you have your bases covered, chances are you'd outshine most other candidates, who aren't even aware of the technology.
In supercomputing, a "code" is countable and roughly correlates to application, or program, or algorithm. It gets interesting when doing analysis and benchmarks and coupled problems where you might say you joined or linked two codes into the same program or distributed application. For example, Joe's atmospheric code might be coupled with Sue's oceanographic code to provide a better climate simulation.
One reason it is a countable noun is that these things are almost proper nouns. We refer to somebody's code as a well known artifact in the community, about which papers are published, etc. They are also thought of as quite monolithic, because message-passing parallel programming is so fraught with peril when you are trying to produce scientifically meaningful, numerical results. You don't just toss portions of code around like a substance. We don't go into museums and point at all the paint on the wall that happens to be surrounded by frames. We talk about paintings. Same thing.
I hesitate to tell you what a kernel is in this area. It has nothing to do with operating systems...
Cluster systems are configured by connecting multiple systems with a communications medium, referred to as an interconnect. OpenVMS Cluster systems communicate with each other using the most appropriate interconnect available. In the event of interconnect failure, OpenVMS Cluster software automatically uses an alternate interconnect whenever possible. OpenVMS Cluster software supports any combination of the following interconnects:
CI (computer interconnect) (Alpha and VAX)
DSSI (Digital Storage Systems Interconnect) (Alpha and VAX)
SCSI (Small Computer Storage Interconnect) (storage only, Alpha and limited support for I64)
FDDI (Fiber Distributed Data Interface) (Alpha and VAX)
Ethernet (10/100, Gigabit) (I64, Alpha and VAX)
Asynchronous transfer mode (ATM) (emulated LAN configurations only, Alpha only)
Memory Channel (Version 7.1 and higher only, Alpha only)
Fibre Channel (storage only, Version 7.2-1 and higher only, I64 and Alpha only)
Brilliant post - really enjoyed it. Thanks.
How many beans make five, anyhow ?
..can be gotten from the results of the 2005 HPC Challenge - real world results, no marketing blurb.
Try here if you want an idea .. my complaint was that the entire site was very linux centric .. there's some pretty good ideas going into Solaris and with their big push in amd64 and i386, it can be a more affordable and stable platform to work with .. in fact if you want you can reference the Infiniband source here or here for example ..
Code = programming in general
Code = one application or one logical part of an application
Codes = multiple applications or multiple logical parts of an application
Code is code is code when you are talking about programming in general. All of us are programmers and we write code.
"Code" can refer to a specific entity. Sometimes you'll hear it as "codebase" or "source" or "sourcebase". An example of a specific set of code is the Linux kernel. Another example of a specific set of code is Firefox.
Codes refers to multiple sets of specific entities. If I install all the source for my Linux distribution, I have many sets of code on my machine. I have the codes for the kernel, Firefox, and whatever else.
In the parallel community, your "application" may consist of multiple, separate yet dependent, executables that must all be run together to make up the single job. So, you have multiple sets of code that make the application... thus... multiple codes.
Example: John, check out the atmospheric code (referring to the source of a single executable) to make sure it's using the correct algorithms. David, check out the ocean code (again, the source of a single executable) to make sure it's using the correct algorithm. Ok, now let's run these codes (talking about both executables) together to form an ocean/atmospheric model.
Right, but the more common usage doesn't reflect this because it uses "code" as a mass noun. You don't have a singular "code" or plural "codes"; you have some code, a piece of code, the Apache code, or all of the code that was ever written in C. That's my point. It's not a difference in scale, it's a difference in usage entirely.
Brilliant post - really enjoyed it. Thanks.
Would have been a lot funnier had it been attributed to the original author in my opinion.
More or less took it at face value. Your post raises two questions.
How about sharing the answers - I'm curious
How many beans make five, anyhow ?