Ethernet: The Occasional Outsider
coondoggie writes to mention an article at NetworkWorld about the outsider status of Ethernet in some high-speed data centers. From the article: "The latency of store-and-forward Ethernet technology is imperceptible for most LAN users -- in the low 100-millisec range. But in data centers, where CPUs may be sharing data in memory across different connected machines, the smallest hiccups can fail a process or botch data results. 'When you get into application-layer clustering, milliseconds of latency can have an impact on performance,' Garrison says. This forced many data center network designers to look beyond Ethernet for connectivity options."
Long Live the Token Ring!
One Ring to rule them all
I, for one, welcome our new non-ethernet overlords.
In our Data Center, we have a great big vat of steaming salt water and we drop one end of the cat5 cables from each server into the vat....those packets that can't figure out where they're going just drop to the bottom and die ...we have to drain this packet-goo out once a month. (but we do recycle it...we press it into CDs and sell them on Ebay)
(Seriously, haven't people heard of cut-through switches, which just look at the first part of the header and switch based on that? Store-and-forward switches are so "1990s".)
TDz.
The NSA's network sniffer, recently discovered at an AT&T broadband center, can only sniff up to 622MB. Sounds to me like if you use an InfiniBand switch, that would effectively make the output of the NSA's network sniffers complete gibberish.
SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
I don't think I need to read any more. Well, I did verify that the number really appears in the article.
This author does not understand the subject material.
(I suppose you could deliberately overload a switch enough to get this number, maybe, but that would be silly, and your switch would need 1.25 Mbytes of packet cache.)
Ultra-low latency networking is a minor interest of mine, but one I've never had the chance to really pursue. Can anyone familiar with the landscape recommend some low-cost options for experimenting with this stuff? Or maybe just let me down gently. "No, Sammy, there are no low-cost options. And there's no Santa Claus."
"(By comparison, latency in standard Ethernet gear is measured in milliseconds, or one-millionth of a second, rather than nanoseconds, which are one-billionth of a second)"
That would be one-thousandth, not one-millionth (a.k.a. a microsecond). Not a good start...
I just blame it on the ether-bunny.
The original post makes some comments that ... the smallest hiccups can fail a process or botch data results.
sharing memory
Sounds like bad design, or a known design trade off.
Quite reasonable on a slow link: until I know better, assume the data I have is correct; if it isn't, throw it out and start over. That's not wildly different from branch prediction or other approaches to this type of problem.
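A minimal sketch of that speculate-then-validate idea, assuming each piece of shared data carries a version number. Versioned, speculate and the fetch_latest callable are hypothetical names for illustration. A real system would overlap the slow remote fetch with the computation; here the two steps run one after the other to keep the logic visible.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Versioned:
    version: int   # hypothetical version stamp carried with the data
    value: float


def speculate(local: Versioned,
              fetch_latest: Callable[[], Versioned],
              compute: Callable[[float], float]) -> Tuple[float, bool]:
    """Compute on the local (possibly stale) copy, then validate it.

    Returns (result, redone). fetch_latest stands in for the slow remote read.
    """
    guess = compute(local.value)       # start work immediately, no waiting
    latest = fetch_latest()            # the authoritative copy arrives later
    if latest.version == local.version:
        return guess, False            # guess was right: no time lost
    return compute(latest.value), True # guess was wrong: discard and redo
```

When the data rarely changes, almost every call takes the cheap path and the network round trip stops being on the critical path.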
'When you get into application-layer clustering, milliseconds of latency can have an impact on performance,'
Faster is faster, not really a shocking concept.
The latency of store-and-forward Ethernet technology is imperceptible for most LAN users -- in the low 100-millisec range.
I don't know what sort of switches you use, but on my home LAN, with two hops (including one over a wireless bridge) through only slightly-above-lowest-end D-Link hardware, I consistently get under 1 ms.
When you get into application-layer clustering, milliseconds of latency can have an impact on performance
Again, I get less than 1ms, singular.
Now, I can appreciate that any latency slows down clustering, but the ranges given just don't make sense. Change that to "microseconds", and it would make more sense. But Ethernet can handle single-digit-ms latencies without breaking a sweat.
Oh, well. People tell me I'm just slow.
That just sounds daft. Given what a bottleneck hard drives are for CPUs, it's hardly a great shock that when you have to wait for your data over Ethernet you're going to see problems.
Maybe I should RTFA...
Ethernet's strength is its flexibility, not its speed per se. It can handle changing network environments where hardware or software is added and removed continually, and you never know quite where the bandwidth is most needed. You just plug it all in, and Ethernet does a decent job of negotiating who gets to use the bandwidth.
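On classic shared-medium (half-duplex) Ethernet, that "negotiation" is CSMA/CD with truncated binary exponential backoff. A sketch of the backoff delay, assuming 10 Mb/s timing (a 512-bit slot time); switched full-duplex links don't use this mechanism at all, and the function name is made up for illustration.

```python
import random

SLOT_TIME_US = 51.2   # 512 bit times at 10 Mb/s
BACKOFF_LIMIT = 10    # the random range stops growing after 10 collisions
ATTEMPT_LIMIT = 16    # the frame is dropped after 16 collisions


def backoff_delay_us(collisions: int, rng: random.Random) -> float:
    """Delay before retransmitting after the n-th collision of one frame."""
    if collisions >= ATTEMPT_LIMIT:
        raise RuntimeError("excessive collisions: frame dropped")
    k = min(collisions, BACKOFF_LIMIT)
    slots = rng.randint(0, 2 ** k - 1)   # uniform pick in [0, 2^k - 1]
    return slots * SLOT_TIME_US


rng = random.Random(1)
delays = [backoff_delay_us(n, rng) for n in range(1, 6)]
```

The range doubles with each collision, so contending stations spread out quickly without any central arbiter, which is the "plug it all in" property described above.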
But it's never been a really high speed protocol. It's easy to beat, speed-wise, as long as you know what the network use looks like ahead of time.
Which of course is a killer for most general use, but for specialty use that's not so much of a problem.
'Sensible' is a curse word.
The article's a bit lacking on details, but... Isn't store and forward unnecessary? It's definitely possible to get it down to a much lower latency than is stated in the article if you don't use it.
Actually, even with Gigabit Ethernet available, HPTC and other network-intensive data center operations have moved to Fibre Channel and things like:
Infiniband http://en.wikipedia.org/wiki/Infiniband
and Myrinet http://en.wikipedia.org/wiki/Myrinet
HP HPTC site: http://h20311.www2.hp.com/HPC/cache/276360-0-0-0-121.html
-What's the speed of dark?
Nice troll.
There are only TWO reasons to use Store & Forward.
#1. You're running different speeds on the same switch (why?).
#2. You really want to cut down on broadcast storms (just fix the real problem, okay?)
Other than that, go for the speed! Full duplex!
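A rough sketch of how long a switch must wait before it can start forwarding, ignoring lookup time and queueing. Store-and-forward has to receive the whole frame; for cut-through the sketch assumes the switch waits only for the 14-byte Ethernet header (real switches vary, and some wait for the first 64 bytes).

```python
def serialization_us(nbytes: int, rate_bps: float) -> float:
    """Time to clock nbytes onto (or off) a link, in microseconds."""
    return nbytes * 8 / rate_bps * 1e6


def switch_wait_us(frame_bytes: int, rate_bps: float, mode: str,
                   cut_through_bytes: int = 14) -> float:
    """Receive time before forwarding can begin, per hop.

    Assumption: cut-through starts once the 14-byte header (destination
    MAC, source MAC, EtherType) has arrived. Preamble, lookup and
    queueing delays are ignored.
    """
    if mode == "store-and-forward":
        return serialization_us(frame_bytes, rate_bps)   # whole frame first
    if mode == "cut-through":
        return serialization_us(min(cut_through_bytes, frame_bytes), rate_bps)
    raise ValueError(f"unknown mode: {mode}")


for rate in (100e6, 1e9):
    for mode in ("store-and-forward", "cut-through"):
        print(f"{rate / 1e6:6.0f} Mb/s {mode:>17}: "
              f"{switch_wait_us(1518, rate, mode):8.3f} usec per hop")
```

For a full-size 1518-byte frame this works out to about 121 usec per hop at 100 Mb/s and about 12 usec at 1 Gb/s for store-and-forward, versus roughly 0.1 usec at 1 Gb/s for cut-through. Either way it is orders of magnitude below 100 ms.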
What about a co-axial bus connection? VAMPIRE TAPS!!! WEEEEEE. Come on, someone else has to remember those. It was like playing "Operation" except if you "touched the sides" and screwed the tap too far in, you broke the cable. Fun to the max!
A computer once beat me at chess, but it was no match for me at kick boxing.
I've never used it, but I know it can be done with Linux, a couple of SCSI controllers and other nifty things. I don't know the speed or latency of it, though.
I have a cluster of 45 dual-Xeon processing nodes. Latencies average about 210 usec, the same as could be expected on any 100 Mb/s connection, but using channel bonding my bandwidth is double that of a single Ethernet connection. I don't need anything faster; all our processes are wholly independent and don't need to do message passing.
Er, yeah. No kidding.
When I was writing applications at the San Diego Supercomputer Center, latency between nodes was the single greatest obstacle to getting your CPUs to run at their full capacity. A CPU waiting to get its data is a useless CPU.
Generally speaking, clusters that wanted high performance used something like Myrinet instead of Ethernet. It's like the difference between consumer, prosumer, and professional products you see in, oh, every industry across the board.
As a side note, many parallel apps attack the latency issue by overlapping their communication and computation phases instead of running them as discrete phases; this can greatly reduce the time a CPU sits idle.
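A minimal sketch of overlapping communication with computation (double buffering): the transfer of the next chunk is started before work begins on the current one. fetch and compute are hypothetical stand-ins that just sleep; this is not how KeLP itself is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(i: int) -> list:
    """Hypothetical stand-in for receiving chunk i over the network."""
    time.sleep(0.05)
    return list(range(i * 4, i * 4 + 4))


def compute(chunk: list) -> int:
    """Hypothetical stand-in for local work on one chunk."""
    time.sleep(0.05)
    return sum(chunk)


def run_sequential(n_chunks: int) -> list:
    # Communicate, then compute, then communicate again: the CPU idles during fetches.
    return [compute(fetch(i)) for i in range(n_chunks)]


def run_overlapped(n_chunks: int) -> list:
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)
        for i in range(n_chunks):
            chunk = pending.result()                   # wait only if the data isn't here yet
            if i + 1 < n_chunks:
                pending = pool.submit(fetch, i + 1)    # start the next transfer now...
            results.append(compute(chunk))             # ...while computing on this one
    return results
```

Both functions return the same results, but with equal fetch and compute times the overlapped version takes roughly half as long once the pipeline is full.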
The KeLP kernel does overlapping automatically for you if you want: http://www-cse.ucsd.edu/groups/hpcl/scg/kelp.html
Maybe the author meant "imperial milliseconds"?
The article's worth reading, if you're not already familiar with currently popular cluster interconnects, but the title of "Data center networks often exclude Ethernet" is totally bogus.
I guess "Some Tiny Percentage of Data Centers use Something Faster than Ethernet in addition to Ethernet" didn't fit on the page.
I'm thinking that the reason the article got the idea that milli = millionth is because the US doesn't use the metric system.
All 7th graders in Canada know that micro means millionth, and milli = thousandth...hopefully the doctors and nurses in the US know the same thing.
Crashed any rockets lately?
--- malin.vidarlo.net ping statistics ---
15 packets transmitted, 15 received, 0% packet loss, time 14003ms
rtt min/avg/max/mdev = 0.310/0.347/0.375/0.019 ms
2 hops, over 100Mb Ethernet with a cheapass switch (8-port unmanaged HP). Seems like he got no grip on numbers...
Assembling etherkillers for fun and profit
MB is a measurement of data; in this case 10^6 bytes. (MiB would be 2^20 bytes.) I think you want a measurement of data, such as perhaps MBps. Too bad your comment is a) wrong and b) wrong. Specifically, it's a) just plain wrong and b) fails to take clusters into account. Were you trying to get fp or something? Anyway, "this equipment was the Narus ST-6400, a machine that was capable of monitoring over 622 Mbits/second in real time in May, 2000... The latest generation is called NarusInsight, capable of monitoring 10 billion bits of data per second" - how do you know they're not using the current version today?
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Just had a quick ping to the beeb... via a wireless hop onto my ethernet network, two hops to my adsl router, then 6 hops around Nildram's network (ATM into their network then god knows, probably some form of gigabit ethernet) and a couple more hops to the bbc.
Average latency is around 20ms.
Now I know this isn't as plain as straight Ethernet, but I'd have guessed that, if anything, the latency from ATM plus the conversions from 802.11g to Ethernet to ATM to Ethernet to whatever would have been worse.
So either someone is using cheap hardware or has misconfigured their network.
Apart from that, if I were running a cluster, each machine would probably have two NICs depending on its use: one using gigabit Ethernet to provide the internal network between nodes on the cluster, and the other for external use. The external network would be as normal; for the internal network I'd ensure there were minimal routers/switches between the nodes and that any switches/routers were a) good quality and b) correctly configured.
--- Users are like bacteria -> Each one causing a thousand tiny crises until the host finally gives up and dies.
All this time I've been playing Quake over LAN and I thought my ping was about 5ms. Silly me, it's clearly in the range of 100ms, even worse than when I take it online!
Whoops...
People run different speeds on the same switch all the time, and for not necessarily poor reasons: If you have a SMB (in this case, that's small or medium business) with maybe one big fileserver, you don't need to run gigabit to everyone... You can run 100Mbps to the clients, and run gig to the switch only. Of course, since just about everything but laptops is coming with gig now (and probably some of them) this is becoming less valuable.
There's plenty of hardware out there that doesn't come gig-e equipped. Hell, I still deploy RS232 terminal concentrators at 10 megs now and then.
Do daemons dream of electric sleep()?
While that's true, most of the time those kinds of devices would be happiest on their own subnet for security and management reasons, or at least I'd be happiest with them there. Therefore they can live on different router interfaces, whether the router's from Cisco or is a PC from Fry's with Linux on it. The only time it's really necessary to mix speeds on the same switch is when you have multiple clients accessing a resource and their aggregate speeds make it useful.
Dude, you've been really bitchy lately, what's up with that?
Linux, you magnificent bastard, I read the fucking manual!
http://en.wikipedia.org/wiki/Fibre_Channel
No, your token is lost in the Ethernet
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
I wonder what's happening to slashdot. That's as bad as technical news can get. Ethernet latency -- 100ms?? Typical Ethernet latencies are around a few hundred microseconds. Even the ping round-trip time from my machine to google.com is about 20ms.
$ ping google.com
PING google.com (64.233.167.99) 56(84) bytes of data.
64 bytes from 64.233.167.99: icmp_seq=1 ttl=241 time=20.1 ms
64 bytes from 64.233.167.99: icmp_seq=2 ttl=241 time=19.6 ms
64 bytes from 64.233.167.99: icmp_seq=3 ttl=241 time=19.5 ms
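If you want to pull the round-trip times out of output like the above, here is a small sketch, assuming the Linux iputils ping format shown (time=20.1 ms); other ping implementations format this field differently.

```python
import re

PING_OUTPUT = """\
64 bytes from 64.233.167.99: icmp_seq=1 ttl=241 time=20.1 ms
64 bytes from 64.233.167.99: icmp_seq=2 ttl=241 time=19.6 ms
64 bytes from 64.233.167.99: icmp_seq=3 ttl=241 time=19.5 ms
"""


def rtts_ms(text: str) -> list:
    """Extract per-packet round-trip times, in milliseconds."""
    return [float(t) for t in re.findall(r"time=([\d.]+) ms", text)]


samples = rtts_ms(PING_OUTPUT)
average_ms = sum(samples) / len(samples)
print(f"{len(samples)} replies, average {average_ms:.1f} ms "
      f"({average_ms * 1000:.0f} usec)")
```

That is about 20 ms across the Internet, still well under the 100 ms the article attributes to a LAN.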
What a shame that such a post is on the front page of slashdot! Someone please s/milli/micro.
Probably just trying to release job-related stress. I've been missing my primary outlet since I damaged my race-suspension '89 Nissan 240SX and went back to driving my sloppy-suspension '81 M-B 300SD. This is a lot safer, anyway...
I'm talking performance. Store & Forward hammers your performance. In my experience, you get better performance when you run the server at 100Mb full duplex (along with all the workstations) and use Cut Through than if you have the server on a Gb port, but run Store & Forward to your 100Mb workstations.
Were you trying to get fp or something?
Yes, and you're right, I should have said Mbps and Gbps (30 Gigabit networking is going to create a packet flow that far outstrips the NSA's 622Megabit packet sniffing capability).
"MB is a measurement of data; in this case 10^6 bytes. (MiB would be 2^10.) I think you want a measurement of data, such as perhaps MBps."
No offense, but I think what you wanted to say was "I think you want a measure of data transmission speed, such as perhaps MBps".
Generally, though, it is "Mbps" for "Megabits per second", and not "MBps", which would be "Megabytes per second".
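The bits-versus-bytes distinction in a few lines: a sketch that converts a link rate in megabits per second to megabytes per second, and shows the decimal versus binary megabyte. The function name is illustrative; the 622 Mbps and 30 Gbps figures are the ones from this thread.

```python
BITS_PER_BYTE = 8
MB = 10 ** 6     # megabyte (decimal, SI)
MiB = 2 ** 20    # mebibyte (binary, IEC)


def mbps_to_MBps(megabits_per_s: float) -> float:
    """Convert megabits per second (Mbps) to megabytes per second (MBps)."""
    return megabits_per_s / BITS_PER_BYTE


print(mbps_to_MBps(622))   # the 622 Mbps sniffer, in megabytes per second
print(30_000 / 622)        # how many 622 Mbps streams fit in 30 Gbps (~48.2)
print(MiB - MB)            # a MiB is 48,576 bytes larger than a MB
```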
Good pedantry, but it's worth pointing out that InfiniBand runs at 30 Gbps, which is in fact faster than the 10 Gbps that you claim the NarusInsight can do.
The slashdot summary is wrong. If you read the actual article the author has it mostly correct except for one comment near the end.
Ethernet latency is about 100 usec through a GigE switch, round-trip. A full-sized packet takes about 200 usec (microseconds), round-trip. Single-ended latency is about half of that.
There are proprietary technologies that have much faster interconnects, such as the InfiniBand technology described in the article. But the article also mentions the roadblock that a proprietary technology represents compared with a widely-vendored standard. The plain fact of the matter is that Ethernet is so ridiculously cheap these days that it makes more sense to solve the latency issue in software, for example by designing a better cache coherency management model and better clustered applications, than it does with expensive proprietary hardware.
-Matt
100/1,000,000th sounds ok. That's what, 1/10th a millisecond?
;)
"If you have nothing to hide, you have nothing to fear." - Every fascist, ever
Well, more than one of them can do. Is there any reason they can't use three insight boxes? Or maybe four, just to have some slack :) It might require additional hardware to split up traffic, but...
..or cheap enough to use Ethernet for processor interconnect.
SGI had some kind of shared-memory-over-Ethernet protocol back in the day. Worked about as well as a steam-powered ornithopter. It was designed for customers too cheap or unconcerned about performance to use when they had to.
And I dabbled in OpenMP or whateveritwas back at a contract with just one such cheap customer, and they got what they paid for. Here's a nickel, kid.
Ethernet is Ethernet, and Infiniband et al. is Infiniband et al., dad-gummit.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Note: I do have a dog in this fight.
One thing that isn't mentioned in the article is the amount of CPU power required to send out Ethernet packets. The typical rule is that 1 GHz of processing power is required to send 1 Gb/s of data on the wire. So, if you want to send 10 Gb/s of data, you'd need 10 GHz of processor, which is a pretty steep price. Some companies have managed to get this down to 1 GHz per 3 Gb/s, and one startup (NetEffect) is now claiming roughly 0.1 GHz for about 8 Gb/s on the wire, using iWarp. With this, your system can be processing information rather than creating packets.
The problem with Infiniband, Myrinet, etc. is that they require another card in your system (with the associated heat problems, size issues, etc.), special switches and equipment, and new training for your staff on how to get it up and going. However, iWarp, which is based on TCP/IP, can use your standard DHCP, ping, tracert, ipconfig, etc., and can allow a single card to be used for networking to the outside world (TCP/IP), clustering in the data center (iWarp), and storage (iSCSI). One card, no special new software widgets, 10 Gb/s speeds.
However, you can't go and buy an iWarp card from Fry's today. Then again, you can't buy an Infiniband or Myrinet card there either.
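The rule-of-thumb arithmetic, as a sketch. The GHz-per-Gb/s ratios are the ones claimed in the comment, taken at face value, and the function name is illustrative.

```python
def ghz_needed(gbps: float, ghz: float, per_gbps: float) -> float:
    """CPU GHz needed to push `gbps`, given a cost of `ghz` per `per_gbps` Gb/s."""
    return gbps * ghz / per_gbps


print(ghz_needed(10, 1.0, 1.0))   # plain stack: 1 GHz per 1 Gb/s
print(ghz_needed(10, 1.0, 3.0))   # improved: 1 GHz per 3 Gb/s
print(ghz_needed(10, 0.1, 8.0))   # claimed iWarp offload: 0.1 GHz per 8 Gb/s
```

At 10 Gb/s that is 10 GHz, about 3.3 GHz and about 0.125 GHz respectively, which is why offloading the protocol work matters more as link speeds climb.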
Such a setup requires extremely low latency, as the processors are pulling Linux operating system images over the InfiniBand links, instead of through a local hard drive. Also, processes shared in RAM among the Linux nodes all run through the Voltaire switch.
Loading bulk data over the network (as in BOOTP) suggests high bandwidth, not latency. And it doesn't even require it; high bandwidth for BOOTP is a convenience. My 10Mb/s Ethernet hub could do it.
The author really is clueless...
tasks(723) drafts(105) languages(484) examples(29106)
I had recently considered using this Tolkien ring until I found out that deinstallation is very difficult. Something about having to take it to a smelter.
For all you seem to know about Fibre Channel, we'd think you'd have read a little further down your Wikipedia article and discovered that FC does indeed support both ATM **AND** (this is a critical one) IP for computer-to-computer communication. Furthermore, you apparently don't know that FC isn't limited to fiber optic cable interconnects, though that's often where it's found. Beyond that, you surely don't know that FC works at 1, 2, 4, and 8 Gb/s, and was designed from the ground up to be a low-latency protocol... Naturally, that makes it great at storage, AND at communication in large data centers that already have the need for large amounts of NAS.
Article is retarded/biased/misinformed/didn't even bother to do actual research; my 48-node gigabit switch has 1 ms latency and it's quite busy b/w-wise the way we abuse it... has this genius ever heard of things called full duplex and not mixing speeds in a freaking DATA CENTER? Mine sits in a closet, not even on a rack.
Hell, I think most of my gaming servers are all sub-50 ms, and that's internet via cableco_that_massively_oversells_bandwidth...
I think you have no clue about what you're saying.
1) InfiniBand is an open standard hosted by the IBTA, which is a consortium of companies. The spec is available for anyone who wants to understand/build InfiniBand hardware. Not being an IEEE standard does not make it proprietary.
2) The major roadblock with 10 Gbps is physics. You can only reach so far with copper without retiming the signal, and optics are expensive. 10 GbE has the same problem and it won't be cheap any time soon.
3) InfiniBand has already reached a volume where on-board IB chips are available in the $70-80 range .. 10 GbE is nowhere close. And IB DDR will be shipping next month (20 Gbps wire / 16 Gbps data).
4) Beowulfs are popular for a reason .. Cache Coherency is a bitch.
5) A round trip node-to-node latency in IB is 2.7 usecs (best case of course). With all the optimization in the world, you won't be able to get ethernet anywhere near that number.
6) InfiniBand is being WIDELY deployed. Sandia Thunderbird is a 9216-processor IB fabric in production. NCSA has Tungsten2, which is a 1024-processor IB fabric. NCSA also has a Microsoft Windows cluster running CCE over IB with 880 processors. There are several large firms in Oil & Gas, BioTech, banking and market data that run multi-hundred/multi-thousand processor IB clusters.
7) Just as with any technology it will take time for new technologies to be accessible to the masses .. so don't write off anything yet.
8) Do your research before you open your mouth.
I have actually designed a very big Infiniband switch (288 ports, 10Gb/s full duplex each port, fully bisectional), and that was a couple of years back. I also designed other things InfiniBand, but that's another story :)
I can assure you, if we can do it, they have some of them.
Oh, the throughput numbers (for a 7U rack) are 288 * 8 Gb/s * 2 (bidirectional, 8b/10b encoded in the link) = 4.608 Tb/s data throughput = 576 GByte/sec.
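The same switch arithmetic worked through in code, as a sketch: 10 Gb/s signaling per port per direction, 8b/10b encoding leaving 8 data bits of every 10 line bits, and both directions counted.

```python
PORTS = 288
SIGNAL_GBPS = 10.0     # line rate per port, per direction
ENCODING = 8 / 10      # 8b/10b: 8 data bits carried per 10 line bits
DIRECTIONS = 2         # full duplex, both directions counted

data_gbps = PORTS * SIGNAL_GBPS * ENCODING * DIRECTIONS
print(data_gbps / 1000, "Tb/s of data")   # about 4.608
print(data_gbps / 8, "GB/s of data")      # about 576
```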
PeteS
In the TOP500, it looks like ethernet is not yet an "outsider." Perhaps in the "top 100."
TO START
PRESS ANY KEY
Where's the 'ANY' key? I see Esk, Kitarl, and Pig-Up...
Well, except for the obligatory s/ms/us, but pretty much yeah. With PathScale (now QLogic) InfiniPath HTX cards, you can get 1.5 us latency between nodes; Myrinet 10G PCI-E can get about 2.5 us. Note that 10Gb Ethernet is now making inroads in terms of throughput (InfiniBand SDR and Myrinet are both roughly 10 Gbps), but latency is of course still problematic. One chief advantage of non-Ethernet networks is that they are source routed: every node has a full topology map of how to reach its destinations. This has the benefit of distributing the task of routing across more processors, as well as allowing intelligent routing decisions. With Ethernet, switches carry a very heavy routing burden in a busy network. Compare this to a Myrinet or InfiniBand switch, which merely needs to look at the next port tag and send the packet on. By and large, when benchmarking these technologies, we generally don't worry too much about which switch is used. Contrast that with Ethernet, where we have to be mindful of the packets-per-second capability of the switch...
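A generic sketch of source routing over a hypothetical topology map: the sender computes the whole path as a list of output-port tags with a breadth-first search, and each hop just consumes the next tag. This illustrates the idea described above; it is not the actual Myrinet or InfiniBand routing algorithm, and the node and port names are made up.

```python
from collections import deque

# Hypothetical topology: node -> {output_port: neighbour}
TOPOLOGY = {
    "A": {1: "S1"},
    "S1": {1: "A", 2: "S2", 3: "B"},
    "S2": {1: "S1", 2: "C"},
    "B": {1: "S1"},
    "C": {1: "S2"},
}


def source_route(topology: dict, src: str, dst: str) -> list:
    """Breadth-first search over the full map; returns output ports hop by hop."""
    prev = {src: None}                       # node -> (previous node, port used)
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for port, nbr in sorted(topology[node].items()):
            if nbr not in prev:
                prev[nbr] = (node, port)
                queue.append(nbr)
    if dst not in prev:
        raise ValueError(f"{dst} unreachable from {src}")
    ports, node = [], dst
    while prev[node] is not None:            # walk back from dst to src
        node, port = prev[node]
        ports.append(port)
    return ports[::-1]


def forward(topology: dict, src: str, route: list) -> str:
    """Each hop only reads the next port tag: no table lookup at the switch."""
    node = src
    for port in route:
        node = topology[node][port]
    return node


route = source_route(TOPOLOGY, "A", "C")   # [1, 2, 2]
```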
Of course, on a large-scale network it is much simpler and easier to use switch-routed frames, but for tightly controlled networks, source routing can be very advantageous.
I will say switch-routed frames have the *potential* for much better utilization of multi-port aggregations, but largely the member of a multi-port aggregation used to send a packet is chosen not by port congestion but by a hash of the MAC addresses referenced in the packet, which is nothing a source-routed network couldn't do.
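A sketch of hash-based member selection for a link aggregation group. The hash here (CRC-32 of the source and destination MACs) is an assumption; switches use a variety of vendor-specific hashes. The point it shows is that the choice depends only on the addresses, never on which member is currently congested.

```python
import zlib


def pick_member(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Choose an aggregation member from the MAC pair alone (congestion is ignored)."""
    key = (src_mac + dst_mac).lower().encode("ascii")
    return zlib.crc32(key) % n_links


# Every frame of a given flow lands on the same member, busy or not.
member = pick_member("00:11:22:33:44:55", "66:77:88:99:aa:bb", 4)
```

Keeping a flow on one member preserves frame ordering, which is the usual reason for this design, at the cost of ignoring load.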
XML is like violence. If it doesn't solve the problem, use more.
All new switches (that are decent) employ cut-through forwarding.
For the majority of applications, the answer is no.
For some (supercomputing really) where all processors in the cluster need to be kept busy and not waste any time getting their data, it is a big deal.
In those applications, you are not going to worry about buying the InfiniBand HCA (or Myrinet system) and the various devices that also speak (IB, Myrinet, your choice of highspeed interconnect here).
In those cases, you use what you need, not what you have.
PeteS
Most (all?) Ethernet hardware reads in an entire packet, looks at it, then sends it on to a destination. This makes building routers and switching hardware fairly easy but extremely slow.
If you go to a high-speed network, what you get is a packet being forwarded as it's being read. By the time the first few bits are through the switch, it should be able to figure out the next hop and have the packet moving in that direction. Phone companies have huge problems with the delays in Ethernet. This is why the ATM protocol was invented; it's hard to use, awkward and not too graceful, but it can fly through a switching network like nobody's business.
Ethernet is also extremely sloppy: any switch along the way is allowed to throw a packet away and wait for the originator to resend, causing a HUGE hiccup in the communication stream (most if not all routers do this whenever an address is not yet in their forwarding table).
IIRC the faster protocols see a "Routing" packet in the stream and set up forwarding hardware before getting the actual packet/stream, then wait until the end of the packet (or entire stream) to tear the route down again.
Ethernet, however, due to its simplicity, is bridging the gaps. It's a pretty crappy protocol in general, but we keep throwing better, smarter hardware at it in an effort to brute-force it into the parameters we require. (I work for a company that makes Ethernet-over-fiber hardware, and have worked for companies based around ATM, SONET and other interesting solutions.)
I guess the point of the article was to remind a world that is coming to believe that ethernet is the end-all be-all of networking that it was always just the simplest hack available and therefore the easiest to deal with.
Just like SNMP.
AFAIK, wireless doesn't consistently support 100Mbps compared to local Ethernet. Usually, I get around 54Mbps, or possibly 10 Mbps on a weak signal.
That's why you still see store-and-forward - Wireless and wired networks are different speeds.
#1. You're running different speeds on the same switch (why?).
lets see:
you have an older but still functional and economical to run printer with a 10base2/T combo card in it and for which a replacement card would be either expensive or unobtainable.
you have 100mbit to most of the desktops because your wiring wasn't done well enough for gig-e to cope.
you have gigabit to your servers
you have a 10 gigabit backbone link.
also, even if a switch is cutting through a lot of packets, it's still going to have to queue those that arrive while another packet is going out onto the backbone (assuming you have a hierarchical network and most traffic is client-backbone). So I can't imagine the peak memory needs would be that much lower.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
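A toy model of that point: frames from several ingress ports all heading for one egress port. Whenever a frame arrives while the port is still busy, even a cut-through switch has to hold it in memory. The function and its inputs are hypothetical; lookup time and rate mismatches are ignored.

```python
def buffered_frames(arrivals: list, rate_bps: float) -> int:
    """arrivals: (arrival_time_s, frame_bytes) for frames bound for one egress port.

    Returns how many frames found the port busy and had to be buffered.
    """
    free_at = 0.0          # time at which the egress port becomes idle
    buffered = 0
    for t, nbytes in sorted(arrivals):
        tx_time = nbytes * 8 / rate_bps
        if t >= free_at:
            start = t              # port idle: the frame can cut straight through
        else:
            start = free_at        # port busy: the frame waits in switch memory
            buffered += 1
        free_at = start + tx_time
    return buffered


# Two clients sending to the backbone at the same instant: one must be queued.
print(buffered_frames([(0.0, 1500), (0.0, 1500)], 1e9))
```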
[quote]
AFAIK, wireless doesn't consistently support 100Mbps compared to local Ethernet. Usually, I get around 54Mbps, or possibly 10 Mbps on a weak signal.
That's why you still see store-and-forward - Wireless and wired networks are different speeds.
[/quote]
uh, wireless can't be patched into an Ethernet segment; an AP is a router, or at the very least a bridge, and generally has a 100Mbit Ethernet interface. That is not the same thing as patching 100M into a gigabit switch. But yes, wireless APs should be on a different segment anyway (and probably be behind a firewall?).
What could be better than a jet powered motorcycle? http://www.youtube.com/watch?v=u8l6GTHLSWE
The NSA's network sniffer, recently discovered at an AT&T broadband center, can only sniff up to 622MB. Sounds to me like if you use an InfiniBand switch, that would effectively make the output of the NSA's network sniffers complete gibberish.
My journal has more info on the latest Narus, the Insight, as well as info gleaned from their website and links from their website on the connections between Narus and intelligence agencies and contractors. The 6400 was installed 6 years ago at AT&T by the NSA - it has likely been upgraded. The Insight can do 2.5 Gbps (OC-48) at the application layer or 10 Gbps (OC-192) at the transport layer of the encapsulated TCP/IP stream. There are no faster WAN links in use than 2.5 Gbps at the regional carrier level or OC-192 at the continental carrier level AFAIK (last personal knowledge is 2 years old, but the network was at 10% capacity then, so I doubt they have upgraded). For undersea use they might use bigger pipes (I don't know), but generally there is enough dark fiber that the less-exotic equipment is more effective - just use more physical links and get greater redundancy.
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
You idiots: don't use active backplane switches (aka: Cisco). Drop me an email and I'll fix you up with something that falls into the "doesn't suck" category.
-- Sean Chittenden
But on our network we vlan'd everything out. All servers on one vlan, I.T. on another vlan, and then major groups on their own vlans. Keeps traffic nice and segregated which is why the I.T. shop has iTunes sharing turned on full blast.
But here's where I notice some performance. We've got all the servers on a gigabit vlan. I can shift a 300MB file between servers in under 20 seconds. Transitioning a 5MB link takes five minutes.
So we did what we could to eliminate latency and we see it in the performance of our network.
Fair enough. Hope you get your groove back, though, you're more fun than this usually.
You're retarded.
"But in data centers, where CPUs may be sharing data in memory across different connected machines..."
I have re-read this bit like twenty times and still have no idea what it means. The terms used clash badly, which leads me to believe that the guy has no idea what he was talking about.
Boy, you sure have a foul mouth. I suggest washing it out with soap.
My comments stand. Start posting prices and let's see how your idea of an open standard stacks up against reality. Oh yah, and remember for every $1000 you spend on your interconnect, that's $1000 less you have to spend on cpus and programmers with a clue.
The reality is that there is only one *correct* way to do a fast interconnect, and that is to build it into the CPU itself. Oh wait, AMD intends to do just that! That's what I'm waiting for: a cheap built-in interconnect that doesn't cost an arm and a leg or eat 100 watts of power all by itself. All this other junk has a shelf life of maybe a few years at best (as does pretty much all networking gear, but the difference is that this stuff costs 10 times as much). It's a huge waste of time for all but the most extreme clustered applications for which there is no algorithmic solution to the latency issue (read: people like to throw hardware at badly written programs more often than they should).
-Matt
It is nice to see people recognizing this, but the fact that data centers are not using ethernet is old news. The latency for tightly coupled distributed applications requires near microsecond latency that only non-ethernet networks could provide. That is until 10 Gigabit Ethernet (10GigE) burst onto the scene.
For several years now, 10GigE has been researched by several teams, most notably by RADIANT at Los Alamos National Lab (www.lanl.gov/radiant) (which is now SyNeRGy at Virginia Tech (website not yet online)) and NOWLAB at Ohio State (nowlab.cse.ohio-state.edu/).
With the inclusion of TCP offload engines (TOE), ethernet hardware is becoming more similar to non-ethernet hardware. Performance numbers show that 10GigE both competes with and surpasses non-ethernet networks in different benchmarks, but not in everything, thereby making the cliche "choose the best tool for the job" more than adequate.
The edge in bandwidth and cost has always gone to ethernet due to economies of scale. However, niche markets that don't care about cost almost always choose non-ethernet to achieve the lowest latency possible. Check out Top500.org for the breakdown of the interconnects used in the top 500 supercomputers worldwide (www.top500.org/lists/2005/11/l/Interconnect_Family).
For those interested in learning more about high-performance networks such as 10GigE and InfiniBand, as well as how ethernet and non-ethernet technologies are converging, I recommend reading the plethora of technical papers found on the RADIANT and NOWLAB websites.
In the TOP500, it looks like ethernet is not yet an "outsider." Perhaps in the "top 100."
It depends on what you're doing. If your job is highly parallel, Ethernet is fine. But what happens when every CPU needs access to every other CPUs results in "real time?" Well, low latency is then a must. 1 ms latency is potentially millions of wasted cycles.
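Putting numbers on "millions of wasted cycles": a sketch assuming a 3 GHz core that does nothing useful while it waits (real CPUs can sometimes overlap part of that wait). The 3 GHz clock is an assumption; the latencies are figures quoted elsewhere in this thread.

```python
def wasted_cycles(latency_s: float, clock_hz: float = 3e9) -> int:
    """Clock cycles a core sits idle while waiting latency_s seconds."""
    return round(latency_s * clock_hz)


print(wasted_cycles(1e-3))     # 1 ms
print(wasted_cycles(210e-6))   # 210 usec, the 100 Mb/s cluster figure
print(wasted_cycles(2.7e-6))   # 2.7 usec, the InfiniBand round-trip figure
```

At 3 GHz that is 3 million cycles per millisecond of waiting, versus a few thousand for a microsecond-class interconnect.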
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
/shrug
In practice, cheap is the reality. Just like how consumer goods dominate the market, with much less prosumer and professional equipment sold.
Fast interconnects are way more expensive than ethernet. People that want the extra performance, though, pay for it.
Let me try to jerk you back into reality. Here is the press release from Mellanox announcing $69 per IB HCA chip in volume: http://www.mellanox.com/news/press_releases/pr_030105.php
.. before you run into serious data coherency issues. Intel Xeons can scale up to 4 processors per board. IBM has tweaked things so they can scale up to 8 processors on their proprietary motherboards. AMD HyperTransport allows you to scale up to 8 sockets per motherboard. To scale beyond that, you have to use a NUMA-like interconnect (pricey) or push the data over a fast pipe to other boxes (not so pricey). AMD recommends the latter, and specifically recommends InfiniBand for high bandwidth. Go talk to AMD.
HyperTransport is available *today* on the market. Go talk to Supermicro. You don't have to wait for it.
All this other junk has a shelf life of maybe a few years at best (as does pretty much all networking gear, but the difference is that this stuff costs 10 times as much).
Pretty much everything in computers, hardware and software alike, has a shelf life of a few years... what is your point?
It's a huge waste of time for all but the most extreme clustered applications for which there is no algorithmic solution to the latency issue (read: people like to throw hardware at badly written programs more often than they should).
People are not complete idiots when they are spending millions of dollars setting up hundreds or even thousands of nodes in Beowulf clusters over IB.
There are specific applications that are latency sensitive and bandwidth hungry. And that is the segment that IB addresses very well and where ethernet is considered an outsider.
Tell me you want to use 1G ethernet to process a satellite feed coming in at 3.2 Gbps and, again, I think you are either clueless or ignorant. Just because IB doesn't fit your bill doesn't mean it doesn't have its place.
Agreed, it's not widespread yet. But there are system manufacturers actively looking at deploying this silicon on the motherboard. IWILL is one company that's already started; Asus and Tyan are next.
The reality is that there is only one *correct* way to do a fast interconnect, and that is to build it into the CPU itself. Oh wait, AMD intends to do just that!
You do realize onboard interconnects can only scale so much
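A quick sketch of the link arithmetic behind the 3.2 Gbps satellite-feed example. `links_needed` is a hypothetical helper; it ignores protocol overhead unless you pass an `efficiency` factor, and the 0.9 used below is an assumed figure, not a measurement. The 8 Gbps figure is the data rate of a 4x SDR InfiniBand link (10 Gbps signalling, less 8b/10b encoding).

```python
import math

def links_needed(feed_gbps: float, link_gbps: float, efficiency: float = 1.0) -> int:
    """Minimum number of parallel links whose usable capacity covers the feed.

    `efficiency` is the fraction of raw line rate left after protocol
    overhead; 1.0 ignores overhead entirely (an optimistic bound).
    """
    usable_gbps = link_gbps * efficiency
    return math.ceil(feed_gbps / usable_gbps)

FEED_GBPS = 3.2  # satellite feed rate from the comment above

print(links_needed(FEED_GBPS, 1.0))       # Gigabit Ethernet, no overhead
print(links_needed(FEED_GBPS, 1.0, 0.9))  # GigE with an assumed ~10% overhead
print(links_needed(FEED_GBPS, 8.0))       # one 4x SDR InfiniBand link (8 Gbps data)
```

Even where four bonded GigE links add up to enough raw capacity, link aggregation typically hashes each flow onto a single member link, so a single 3.2 Gbps stream would not necessarily be spread across them.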
I see a lot of posts saying "why not just use gigabit?" Er... it's not so simple.
Bandwidth and latency are not the same thing, folks. Sometimes it doesn't matter that you can send a full gigabit in one second: if the data takes an extra millisecond to get there, that's no good for the huge data centers and applications mentioned in the story summary.
Latency is how long it takes a packet to actually get across the network.
Bandwidth is how much data you can shovel onto the network per unit of time.
High bandwidth is good for sending large files and large chunks of data.
Low latency is good for sending lots of separate small pieces of information, like simple browsing, or the control signals used in file transfers.
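The distinction above can be captured with a common first-order model: delivery time is a fixed latency plus the message size divided by the bandwidth. A minimal sketch; the 50 us latency and 1 Gb/s link are illustrative assumptions, and `transfer_time` is a hypothetical helper.

```python
# First-order model: time to deliver a message = latency + size / bandwidth.

def transfer_time(size_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """Seconds to deliver `size_bytes`: fixed latency plus serialization time."""
    return latency_s + (size_bytes * 8) / bandwidth_bps

LATENCY_S = 50e-6    # assumed end-to-end latency
BANDWIDTH_BPS = 1e9  # assumed 1 Gb/s link

for size in (64, 1_500, 10_000_000):
    t = transfer_time(size, LATENCY_S, BANDWIDTH_BPS)
    # Fraction of the total time spent on latency rather than on moving bits.
    share = LATENCY_S / t
    print(f"{size:>10} bytes: {t * 1e6:10.1f} us total, {share:.0%} of it latency")
```

For the 64-byte message nearly all of the time is latency, so doubling the bandwidth barely helps it; for the 10 MB transfer the size/bandwidth term dominates and latency is negligible.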
They corrected the article. There were more flaws there earlier than there are now.
1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
The funny thing is that there is already a solution to their problem out there. Raptor Networks Technologies already has its ethernet switches in a bunch of places and has (so far) proven that its distributed network technology runs circles around Cisco's (and others') centralized architecture while costing even less. They could probably keep up with the needs of these data centers. I've spoken to guys who use their hardware and they all say 'wow.' This sounds like a perfect network for Raptor's hardware. Has anyone else run into it?
Got sushi? The Sushi FAQ
The original post is in error... the article states that Ethernet latencies are in the low MICROSECOND range, not in the low hundreds of milliseconds.
Try it yourself... ping a machine one or two switch hops away on the local LAN.
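Where microsecond figures come from: a store-and-forward switch must receive an entire frame before forwarding it, so each hop adds at least that frame's serialization time. A minimal sketch, assuming a 1500-byte frame and ignoring preamble, inter-frame gap, propagation, queueing, and switch lookup time; `store_and_forward_delay` is a hypothetical helper.

```python
def store_and_forward_delay(frame_bytes: int, link_bps: float, hops: int = 1) -> float:
    """Serialization delay (seconds) added by store-and-forward hops.

    Each hop must receive the whole frame before it can start sending it,
    so every hop adds frame_bits / link_bps. Propagation, queueing and
    lookup time are ignored.
    """
    return hops * (frame_bytes * 8) / link_bps

FRAME_BYTES = 1500  # assumed full-size frame

for rate_bps, name in ((100e6, "100 Mb/s"), (1e9, "1 Gb/s"), (10e9, "10 Gb/s")):
    for hops in (1, 2):
        us = store_and_forward_delay(FRAME_BYTES, rate_bps, hops) * 1e6
        print(f"{name}, {hops} hop(s): {us:.1f} us")
```

At gigabit speed a full-size frame costs about 12 us per hop, several orders of magnitude below a "100-millisec" figure; a real ping will also include the host network stack's overhead on both ends.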
Wow, a mistake was made. Flame on, dumbass. At least I am aware it is Fibre Channel and not Fiber Channel. There is a difference. I am also aware of the speeds FC runs at, and that it is not restricted to "glass" cable. I am glad you could make so many ASSumptions based on the tiny snippet of what I said.
Actually, you can use FC for ATM and IP, but hardly anybody does, because of the expense and because it's something different from ethernet. What makes it great for storage is the low latency.
However, if FC doesn't embrace higher speeds such as 10 Gbit, I can see 10GigE displacing it, provided that companies can produce switching hardware with the same latency as FC.
For what I use FC for it's great (storage), but I'd be much happier if everything was just Ethernet.
Yes Francis, the world has gone crazy.
Hell, my company has an OC-192 coming in, and we're far from a "continental carrier".
Big carriers use lots of lines going lots of places and won't risk a single point of failure.
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
I went over to BlueGene/L's page to see how they manage to shuffle data around. They seem to use Gigabit Ethernet links for the I/O nodes.
Any ideas why Ethernet is not an outsider here?
* lon3st4r *
I think you have no clue whom you were talking to.
Open Standard says nothing about price.
IB HBAs might be cheap, but the switching fabric sure as fuck ain't.
As for cache coherency, you were addressing the man trying to change the cache coherency game. Watch out, skinny.
Lastly, there are some proprietary gigabit technologies (non-IP based) that, while not 2.7 usec, come very close. Numerous MPI implementations are written for these technologies, although many also require specific hardware.
I don't think anyone is writing off IB. It's just going to be a long while before we see switches that cost less than very nice houses. You can usually buy a house and stuff a couple of its rooms with gigabit switches for less.
It's great tech, just not anywhere near the realm of mortals.
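Latency figures like the 2.7 usec quoted above are usually measured with a ping-pong test: bounce a small message back and forth many times and report half the average round trip, similar in spirit to the OSU latency benchmark from the NOWLAB group mentioned earlier. The sketch below only exercises a local socket pair inside one machine, so it measures kernel and interpreter overhead rather than any real interconnect; `ping_pong_latency` and its parameters are hypothetical names for illustration.

```python
import socket
import threading
import time

def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf.extend(chunk)
    return bytes(buf)

def echo_loop(sock: socket.socket, rounds: int, size: int) -> None:
    """Peer side: receive each message in full and send it straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))

def ping_pong_latency(rounds: int = 1000, size: int = 8) -> float:
    """Average one-way latency in seconds, i.e. half the mean round trip."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_loop, args=(b, rounds, size))
    peer.start()
    msg = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(msg)       # ping
        recv_exact(a, size)  # wait for the pong
    elapsed = time.perf_counter() - start
    peer.join()
    a.close()
    b.close()
    return elapsed / rounds / 2

if __name__ == "__main__":
    print(f"one-way latency: {ping_pong_latency() * 1e6:.1f} us")
```

Reporting half the round trip assumes the path is symmetric, which is the usual convention for these benchmarks; a real measurement would run the two ends on separate nodes across the interconnect under test.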
You have some very good points. Much better than grandfather's blathering, thanks.
I'd just like to quietly point out that costs are often switching costs, not HBA. Even 10GigE is getting reasonable for the adapters, but the switching costs are still completely astronomical.
It will be very interesting to see what the future of interconnects and cache coherency holds. We're definitely rapidly approaching the cusp of something new. AMD today announced they're making a new socket for external peripherals, a great little indicator. I personally think the days of conventional HBAs are rapidly coming to a close; that's the 5-year obsolescence Dillon was talking about. It's going to be a drastic change, not just newer, better adapters and newer fabrics, as used to be the case. PCI-ASI (a switching fabric built around PCI Express, very damned cool) and HT, today internal and peripheral buses, are becoming full fabrics.
Really, I'm just glad things are finally in motion again, and that multiple groups are trying multiple approaches to this huge problem. There's been a lot of stagnation, besides the slow crawl forwards. Cache coherency and interconnection are being completely rethought, from many different angles.
FYI, Dillon is the guy doing DragonFly BSD, which is working on getting high-latency cache coherency working in the OS, across systems. He's a little biased towards "the network is the computer", but in the general case he's dead on. There are definitely cases where every usec of latency is a usec of 8,000 computers not doing any work whatsoever, in which case IB is probably your god. But for most people? I'd wager HT- or PCIe-based interconnects, already ubiquitous but not used for system interconnect, will probably take over first. A couple million PCI Express chipsets shipped this year; how many IB chips? It's advancing what's already used.
Good post mate, cheers.