These words make no sense. This machine uses a Clos topology, without source-routing and using rather small 24-port crossbars I might add. Nothing new there, has been done for 20 years. It's full bisection on paper, but Head of Line (HoL) blocking statistically reduces it to ~60% efficiency. And without adaptive routing, no way around it.
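For anyone wondering where a figure like ~60% comes from, here is a rough sketch, not a model of this particular machine. It assumes a single saturated crossbar with FIFO input queues and uniform random destinations, which is the textbook head-of-line blocking setup; a multi-stage Clos with static routing is more complicated. The function name `hol_throughput` and its parameters are made up for the illustration.

```python
import random

def hol_throughput(ports=24, slots=5000, seed=0):
    """Fraction of output capacity used by a saturated FIFO input-queued crossbar."""
    rng = random.Random(seed)
    # Destination of the packet sitting at the head of each input queue.
    heads = [rng.randrange(ports) for _ in range(ports)]
    delivered = 0
    for _ in range(slots):
        # Group the inputs by the output their head packet wants.
        contenders = {}
        for inp, out in enumerate(heads):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves one input; the losers stay blocked,
        # and so does everything queued behind them.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            heads[winner] = rng.randrange(ports)  # next packet moves to the head
            delivered += 1
    return delivered / (ports * slots)

print(f"throughput per port: {hol_throughput():.2f}")
```

For this idealized model the known large-switch limit is 2 - sqrt(2), about 0.586, so a 24-port crossbar should land near 60% under these assumptions.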
This machine has no new concepts or new hardware, boring.
There is a large BG/L at Livermore because Livermore bought it through a competitive procurement process, just like other national labs and private companies. There was definitely no public funding used in the development of the BlueGene product line. However, IBM does use government money for research-oriented development, like X10 through HPCS funding.
If you believe Tata bought this cluster with a business plan to recover its cost in X years, you are mistaken. This machine is (advertisement + ego pumping + tax write off). Ego pumping is very important in India these days, they want to change their image, be respected...
That's a common mistake. The current implementation of the Myrinet standard (yes, it's an ANSI standard) is 2.5 Gb/s signal rate with an 8b/10b encoding, yielding 2 Gb/s data rate. You can aggregate links like on the E card (2X, 4 Gb/s). The basic Infiniband signal rate is also 2.5 Gb/s, data rate is 2 Gb/s. With 4X links, the signal rate is 10 Gb/s and the data rate is 8 Gb/s. With 12X, data rate is 24 Gb/s (30 Gb/s is the signal rate).
The next Myrinet product (Myrinet 10G) is using the same physical level as 10Gig Ethernet, with a 12.5 Gb/s signal rate and a true 10 Gb/s data rate. So actually, 10 Gigabit Ethernet has more bandwidth than 4X Infiniband.
BTW, there are no 12X Infiniband NICs, 12X is only used for inter-switch links.
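The signal-rate versus data-rate arithmetic above can be checked in a few lines. This is only a sketch: the helper name `data_rate_gbps` is made up, and it assumes the 12.5 Gb/s to 10 Gb/s figure for the 10G physical layer follows the same 8-in-10 coding ratio as the others.

```python
def data_rate_gbps(lane_signal_gbps, lanes=1, payload_bits=8, coded_bits=10):
    """Usable data rate in Gb/s after line coding (8b/10b by default)."""
    return lane_signal_gbps * lanes * payload_bits / coded_bits

# (signal rate per lane in Gb/s, number of lanes), figures taken from the text above
links = {
    "Myrinet, one link": (2.5, 1),
    "Myrinet E card, 2 links": (2.5, 2),
    "Infiniband 1X": (2.5, 1),
    "Infiniband 4X": (2.5, 4),
    "Infiniband 12X": (2.5, 12),
    "10G physical layer (aggregate)": (12.5, 1),
}

for name, (signal, lanes) in links.items():
    total_signal = signal * lanes
    print(f"{name}: {total_signal:g} Gb/s signal, "
          f"{data_rate_gbps(signal, lanes):g} Gb/s data")
```

The marketing number is the signal rate; the number that matters for moving bytes is the data rate.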
> Have you looked at the cost figures!?! $6M (VT) v.s. $86M (MN)! Ridiculous!
The price of this machine is $20M. I don't know where they got this figure of $86M, but they should not invent numbers if they don't know.
If you want to speak about something ridiculous, speak about the $6M price tag for the VT cluster. This machine is a big PR operation, that's it. The VT guys asked us to sell them an interconnect for $300 a node, all included. This was below production cost, so we refused. Others did not refuse to lose money to get visibility...
> Apart from very non-embarrassingly-parallel problems, the 5-6 XServe-cluster super-cluster would be much more efficient
What do you mean by more efficient ? The upgraded VT cluster gets 60% efficiency on Linpack, MareNostrum gets 65% efficiency on a larger machine.
BTW, MareNostrum is supposed to be 2200 nodes. They used only 1782 nodes for the Top500 run, i.e. about 81% of the final machine.
> this MN looks too an expensive toy compared to a whole bunch of VT-clones
Price is not a reliable comparison point. It's hard to know the real price tag of a machine, with or without services and other extras, and the money/PR tradeoff messes up everything.
> Funny that I see 800+ MB/s on many motherboards
The on-paper peak of 4x Infiniband is 1 GB/s, which is 8 Gb/s.
> On PCI-Express it's nearly double the bandwidth
Using 2 links or bidirectional ? The one-way maximum bandwidth of one 4x IB link is 1 GB/s. The only way to get over that is to use two links.
> The PCI-X bus is limiting the InfiniBand interconnect
The on-paper maximum bandwidth on PCI-X is 1 GB/s. It matches the one-way bandwidth of IB 4x. However, PCI-X is a bus, so bidirectional traffic is also limited to 1 GB/s. In short, if you have bidirectional traffic or multiple links, PCI-X is a bottleneck. Otherwise it's not.
PCI-Express is not a bus, it supports 1 GB/s in each direction in the 4X flavor.
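To make the bus versus point-to-point distinction concrete, here is a small sketch using the on-paper numbers above (1 GB/s for PCI-X, 1 GB/s per direction for PCI-Express 4X, 1 GB/s one-way for IB 4x). The function names are made up, and real hardware delivers less than these peaks.

```python
def through_shared_bus(offered_each_way, bidirectional, bus_total):
    """A shared bus (PCI-X style) has one budget for all traffic, both directions."""
    offered = offered_each_way * (2 if bidirectional else 1)
    return min(offered, bus_total)

def through_full_duplex(offered_each_way, bidirectional, per_direction):
    """A point-to-point link (PCI-Express style) has a separate budget per direction."""
    one_way = min(offered_each_way, per_direction)
    return one_way * (2 if bidirectional else 1)

IB_4X = 1.0  # GB/s one-way, on paper

for bidir in (False, True):
    label = "bidirectional" if bidir else "one-way"
    pcix = through_shared_bus(IB_4X, bidir, bus_total=1.0)
    pcie = through_full_duplex(IB_4X, bidir, per_direction=1.0)
    print(f"{label}: PCI-X {pcix:g} GB/s, PCI-Express 4X {pcie:g} GB/s")
```

One-way traffic fits in both; only the bidirectional case shows PCI-X as the bottleneck.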
> You obviously have no clue as to what Infiniband is or is capable of. First off 4x Infiniband is 10 times faster then Ethernet at 10 gigabits/sec.
4X Infiniband is 10 Gb/s signal rate but actually 8 Gb/s data rate (8b/10b encoding). This is one of many facts that the IB marketing dept. keeps forgetting (I keep telling them, but they won't listen for some reason).
> GigE and TCP are quite inefficient when compared to Infiniband
TCP over Infiniband is just as inefficient, it has nothing to do with GigE. People use IP over GigE because it's convenient, but you can use GigE without IP if you talk directly to the hardware. Some have tried and are still trying http://www.disi.unige.it/project/gamma/, but the main problem is the lack of hardware documentation from GigE vendors and the short life span of GigE chips.
> I even read that a 1024 node cluster using GbE was just as fast as a 256 node cluster using IB
First, it's interesting to note that there are not many 256-node clusters in production with IB at the moment, and even fewer with 1024 nodes. Second, just as fast doing what ? A pointless benchmark specially tuned for Infiniband, the kind IB supporters like to publish, or real-world applications ? Yes, high speed interconnects make a difference, but GigE is just fine for a lot of the HPC applications I have seen so far.
> So before you start talking out of your ass do some research like I did.
Don't believe everything you read, and don't drink the Kool-Aid that fast. Look at the Top500 just to see what machines are out there, not for the ranking (Linpack is useless). You will see that there are quite a lot of GigE clusters and not that many IB ones. It's a matter of economics: if IB makes sense, people will buy it. These days, they buy much more GigE (or other) than IB.
> 3. Myrinet (Roughly the same price as IB, but closed standard) sub 10 microsec
Myrinet is not a closed standard. It's an ANSI-VITA standard (26-1998). The specs are available for free (http://www.myri.com/open-specs/) and anybody can build and sell Myrinet switches, if they have the technology.
Furthermore, the latency is sub 4 microsec. Come to SuperComputing next month and you will see.
> The only thing I'm aware of with respect to high-performance interconnect solution for blade servers available today is to get IBM blades with Myrinet daughter boards and an optical passthrough module. Ultimately, it can really reduce cabling for things like ethernet, kvm, etc etc, but those myrinet cables are still going to be a tad unwieldy (80+ wires to the cabinet, even if they are fiber cables).
The next Myrinet switch, based on 32-port crossbars, will use cables composed of 4 optical fibers for switch-to-switch connectivity. So a Myrinet switch module for the Blade Center will be connected to a big switch (256 ports) with only 4 of those 4-fiber cables. No more need for an optical passthrough module...
On the Fermilab web page, they say they compared the latest Topspin product against their 2-year-old Myrinet B cards. Not exactly apples to apples. It's funny, they ended up using Gigabit Ethernet in point-to-point, it was much more cost effective.
There are a lot of things on paper with IB. The last 2 times I used tiny demo IB clusters that various vendors were evaluating, I saw 7.5us at the MPI level, but I am very biased too.
On Myrinet, 6.3 us is with GM, 3.5 us is with MX (my baby). Same hardware, different firmware. You will hear about it when it's released :-)
> VT uses infiniband which is faster and lower latency than Myranet or the other common cluster interconnects
A few things:
This is Myrinet, not Myranet.
Infiniband does not have lower latency than Myrinet, at least not at the MPI level. Using MX, I get 3.5us with Pallas with E cards, 4us with D cards, and there is no trick like polling only a few sources, or caching the memory registration.
MX is not completely finished, but I will release a beta version this week so you can reproduce the numbers.
> That's why all the high performance computing guys are using Infiniband.
Who else ? Can you cite 3 other clusters in the Top 500 using Infiniband ? I can't.
I cruised by the Cisco booth at SuperComputing and a fellow there told me that the VT cluster did not even use the Infiniband NIC for the HPL run, it used the GigE NIC and the IB was used only as a backbone between the Cisco switches. That would explain the disappointing HPL efficiency (58 %). Regarding the price, VT was smart to go to the vendors ready to not make money, even possibly lose a little, in order to gain visibility. Apple and Mellanox were perfect for that: Apple buying an advertisement campaign and Mellanox shipping stuff they don't sell otherwise.
So, to come back to your first claim, what really makes all this possible is the 4 Flops per cycle on the G5.
How does Myrinet "eat CPU when sending data" ?!? It uses DMA to read/write data from/to host memory. The only thing the CPU does is write 48 bytes by PIO to post the send, whatever the size of the message.
So either this is a flame bait or you really have no idea what you are talking about. I think I do as I write code for Myrinet firmware.
It's not the first time that these folks in KY work around the definition of the acronym "Flop". A Flop is a floating point operation on 64 bits, not 32 bits. All entries in the Top500 use results from 64-bit HPL, nobody else in the world is running HPL on 32 bits. So claiming the moon on 32 bits is easy, useless for the sake of comparison and almost unethical. I cannot believe that Dr Dietz does not know the difference by now.
The same machine would yield average results on 64 bits. Difficult to draw attention without headline numbers...
I work a lot with it, as do ~3000 customers, almost half of them in industry (not academic or government).
You found bugs ? Care to share them ? Hardware failed ? Did you get it replaced ?
Can you give me the tech support ticket numbers so I can see if your complaints are reasonable (and have been addressed) or are just plain FUD ?
140C is quite hot... :-)
Patrick