How To Build a 100,000-Port Ethernet Switch
BobB-nw writes "University of California at San Diego researchers Tuesday are presenting a paper (PDF) describing software that they say could make data center networks massively scalable. The researchers say their PortLand software will enable Layer 2 data center network fabrics scalable to 100,000 ports and beyond; they have a prototype running at the school's Department of Computer Science and Engineering's Jacobs School of Engineering. 'With PortLand, we came up with a set of algorithms and protocols that combine the best of layer 2 and layer 3 network fabrics,' said Amin Vahdat, a computer science professor at UC San Diego. 'Today, the largest data centers contain over 100,000 servers. Ideally, we would like to have the flexibility to run any application on any server while minimizing the amount of required network configuration and state... We are working toward a network that administrators can think of as one massive 100,000-port switch seamlessly serving over one million virtual endpoints.'"
I hope they have invented something better than ordinary Ethernet cables to wire that thing with.
http://www.intellipool.se/ - Intellipool Network Monitor
I have nightmarish pictures popping into my head of a waterfall of Ethernet cables spewing from this, with users' ports un-numbered and no network diagrams. People bashing on the server room door in a zombie-like state muttering "MRRRHH FACEBOOK!" "TWWIIIITEEEuggggghh" with me inside screeching "NO! NO! I DONT KNOW WHAT PORT YOUR DESK IS! NO! I CAN'T MAKE THINGS GO FASTER!" before curling up in a ball listening to the hum of servers and the lamentations of the users outside the door desperately scratching to get in.
Be you Admins? nay, we are but lusers!
I would seriously hate to be the guy that tripped over that power cable.
On the plus side it would be interesting to time how long it took for the DC's phone lines to melt.
-Matt
(redundant, redundant power. I know, I know)
--- Need web hosting?
I've long been of the opinion that putting more than a few hundred hosts on a single layer 2 network is almost always a bad idea.
What do you do about broadcast storms? How do you prevent some clown from anywhere in that 100,000 machine cloud from poaching another machine's IP address (either maliciously or by an accidental typo)?
Subnets and routers were invented for a reason. Just because you can bridge the whole world together into one massive virtual Ethernet segment doesn't mean you should.
The paper is about adding a layer of addressing so that IP and Ethernet addresses can be moved from one machine to another as instances of virtual machines are migrated around. It's not about the problems of physically building a very large switch. The switch components are mostly stock items.
Have fun replacing it when it fails. In my head I imagine something like this.
...that the answer involves duct tape.
Hasn't that already been done?
Let's see... That's 100,000 ports with 2 LEDs each (link, action/fdx/speed/poe) for a total of 200,000 LEDs. Let's say they use some of the cheapest SMD LEDs on the market. We'll use Digi-Key part number 160-1183-1-ND, which is a cheap 0603-footprint green LED. At quantity 200,000 that comes out to $12,000 in cut-tape packaging or $9,450 if you buy 210,000 of them in 3,000-qty reels.
Let's say that all of the link LEDs are on 100% of the time and the activity LED is on 50% of the time. That gives us 150,000 LEDs on at any given point in time. Our example LEDs use 20 mA at 2.1 V. So 150,000 LEDs at 20 mA draw 3 kA. In total, 6.3 kW is burned by the green LEDs.
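The LED arithmetic above is easy to re-run. The following is only a rough sketch using the figures quoted in the comment (two LEDs per port, link LEDs always lit, activity LEDs lit half the time, 20 mA at 2.1 V); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the LED load, using the comment's figures.
PORTS = 100_000
LINK_DUTY = 1.0       # link LED assumed lit 100% of the time
ACTIVITY_DUTY = 0.5   # activity LED assumed lit 50% of the time
CURRENT_A = 0.020     # 20 mA per lit LED
FORWARD_V = 2.1       # forward voltage per LED


def led_load(ports=PORTS):
    """Return (LEDs lit on average, total amps, total watts)."""
    lit = ports * (LINK_DUTY + ACTIVITY_DUTY)
    amps = lit * CURRENT_A
    watts = amps * FORWARD_V
    return lit, amps, watts


if __name__ == "__main__":
    lit, amps, watts = led_load()
    print(f"{lit:,.0f} LEDs lit, {amps:,.0f} A, {watts / 1000:.1f} kW")
```

With those inputs the total comes to roughly 6.3 kW, before counting any current-limiting resistors or driver losses.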
All that blinking... Damn. I want one NOW!!! More than a girl friend!
I can't just go out and buy 33,334 d-links and turn off DHCP on all but one of them?
For justice, we must go to Don Corleone
At least it's over 9000
That's one big LAN party
It is the universe that makes fun of us all.
...would *actually* be ONE physical device!
you can just "think of" it as working like that.... unless you are the Network Engineer, and then it's still gonna mess with your head trying to make it all work. :-)
It's still ethernet.
Without getting too far into it, their brilliant plan is to insinuate a layer 2 and a half using "pseudo MAC addresses," using a directory service rather than broadcasts. They're hoping they can use this mess to paper over horrific network design.
Yeah, I'll grant you you might be able to cobble this mess together in an academic setting, and sure, you'll even be able to rig some demos that show miraculous increases in speed.
I can guarantee they'll find funding with their promise that you'll be able to hire even LESS skilled network admins, meaning Zaboomafoo the Typing Lemur now has a shot at his CCIE.
But, damn, you ignorant twits. Most corporate networks are already mashed together by the most cut-rate cable monkeys they can find. The last thing we need is some half-assed "protocol" that will guarantee even more network designs that are guaranteed to trip and break their necks over the first packet.
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
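For readers unsure what "a directory service rather than broadcasts" would look like, here is a minimal sketch. It assumes an edge switch that intercepts ARP requests and asks a central IP-to-pseudo-MAC table before falling back to a broadcast. The class and method names (`Directory`, `EdgeSwitch`, `handle_arp_request`) are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional


class Directory:
    """Hypothetical central table mapping IP addresses to pseudo MACs."""

    def __init__(self) -> None:
        self._ip_to_pmac: Dict[str, str] = {}

    def register(self, ip: str, pmac: str) -> None:
        self._ip_to_pmac[ip] = pmac

    def lookup(self, ip: str) -> Optional[str]:
        return self._ip_to_pmac.get(ip)


class EdgeSwitch:
    """Answers ARP from the directory; counts the broadcasts it could not avoid."""

    def __init__(self, directory: Directory) -> None:
        self.directory = directory
        self.broadcasts = 0

    def handle_arp_request(self, target_ip: str) -> Optional[str]:
        pmac = self.directory.lookup(target_ip)
        if pmac is not None:
            return pmac          # answered by unicast query, no flood
        self.broadcasts += 1     # directory miss: fall back to a broadcast
        return None


if __name__ == "__main__":
    directory = Directory()
    directory.register("10.2.1.5", "00:02:01:00:00:01")
    switch = EdgeSwitch(directory)
    print(switch.handle_arp_request("10.2.1.5"), switch.broadcasts)
```

The point of the design is that a hit in the directory never touches the other hosts; only a miss costs a broadcast.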
This seems to be a solution to a nonexistent problem. A big router, for example a Cisco CRS, can be a single node supporting any data center. And it is a router, so there is no need for any exotic solution (L3 inspection on a switch?). It has a max bandwidth of 80Tb/s or 80,000 Gb Ethernet nodes. The beauty is of course that you can configure your entire data center with a single router, which greatly simplifies the network configuration, and makes changes simple.
don't cut it off www.mgmbill.org
Will it go into Super-Saiyan switch mode?
I tried to think of a good sig, and this wasn't it.
I wonder if D-Link has any?
(swoooosh)
And then... let's say 10% of all computers start up an SMB share... welcome to broadcast heaven (or hell) :)
Won't there be a super huge traffic jam and collision if all the ports are in use?
this might be the opportunity for a new business. Heck, if done right, apply this on top of one of the OSS OS, and then have a modular set of boxes, you could take on Cisco.
They're basically NATting the layer two protocols. Combined with a super spanning tree for the natted addresses they're practically boosting layer two into layer three.
Before I read the paper I was thinking that it would be easier to just run all your services NATted at layer three, even using something like PPPoE (which is how cable networks solve the same basic problem, with something like half a million end-points on the same subnet). I guess it's more efficient to work with the simpler layer two protocols instead.
... they have only needed 1 port! :)
I wonder if these folks licensed SecureFAST? I only ever had a 3000 node LAN, but with the increase in bandwidth and chip performance since 1997, I'm sure 100,000 would be trivial.
I second. You are an unrecognized genius.
I like toast!
...and when this switch blows the fuses, you have 100,000 servers offline instead of 24... Brilliant!
NO! NO! I DONT KNOW WHAT PORT YOUR DESK IS! NO!
That's funny. Because right now I'm doing consulting work for a major bank. They know what port I'm on all the time. In fact, they have software that monitors my traffic and immediately cuts it off if something they don't like happens.
I just bring in my Macbook with an EVDO dongle if I want to surf.
"Welcome to LEDs Magazine, the leading global information source for the LED community."
Wow, just wow !
Squirrel!
One switch to rule them all...
bad quote: "Imagine the size of the Walmart needed to hold that thing!"
I only look human.
My mother is a halfling and my dad is an ogre, so that makes me an Ogreling
Good God! THE VLANS! A "show vlans" command would take all day to execute and print out to be thicker than War and Peace.
The game.
The past does not equal the future. Hardware improves, software improves.
Just because you were taught from birth that you should have thirty-five 100 port switches in your building and that is what you have always done does not mean you should continue to do it. Network engineers seem to LOVE buying lots of hardware (when given the money). Maybe it's just the cool factor, maybe they want job security? It WOULD be far easier to manage a single switched fabric flat network if you have the hardware and the failover to handle it.
Everyone has chimed in on the nightmare of cable management for something like this. But the idea that this would be a single point of failure for my data center scares me even more.
I regularly read Dr. Vahdat's blog. I first got interested in it after reading his paper on Epidemic Routing which can be found in his list of publications here.
If you read his blog post you will see that he accomplishes his goal by creating a hierarchical tree of MAC addresses instead of a simple table. He also states that a large part of the proliferation of MAC addresses in these systems is due to virtual machines. Therefore everyone's nightmares of cabling hell are relatively moot.
Though I haven't contacted him yet, it seems that this solution would require reassigning new MAC addresses such that they can be organized hierarchically as we are accustomed to doing with IP addresses. If this is the case then it seems one would have two choices:
Now, I am not an expert in the details of switches, routing, or NAT so I may have gotten some of the details wrong. But you get the idea.
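To make the "hierarchical tree of MAC addresses" idea concrete, here is a rough sketch of how a location could be packed into a 48-bit address so that switches forward on a prefix instead of a flat table. The field names and widths (16-bit pod, 8-bit position, 8-bit port, 16-bit VM id) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
from typing import NamedTuple


class PMAC(NamedTuple):
    pod: int       # assumed 16 bits: which pod the edge switch sits in
    position: int  # assumed 8 bits: which edge switch inside the pod
    port: int      # assumed 8 bits: which port on that switch
    vmid: int      # assumed 16 bits: which virtual machine behind the port


def encode(p: PMAC) -> str:
    """Pack the location fields into a colon-separated 48-bit address."""
    value = (p.pod << 32) | (p.position << 24) | (p.port << 16) | p.vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))


def decode(mac: str) -> PMAC:
    """Recover the location fields from a pseudo MAC string."""
    value = int(mac.replace(":", ""), 16)
    return PMAC(value >> 32, (value >> 24) & 0xFF, (value >> 16) & 0xFF, value & 0xFFFF)


def same_pod(a: str, b: str) -> bool:
    """A core switch only needs the pod prefix to pick an output port."""
    return decode(a).pod == decode(b).pod


if __name__ == "__main__":
    addr = encode(PMAC(pod=2, position=1, port=0, vmid=1))
    print(addr, decode(addr))
```

Because the address encodes where a host is, a migrated virtual machine would get a new pseudo MAC while keeping its real MAC and IP, which is why some rewriting at the edge is needed.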
How many IT staff would go mad in the sea of network wires?
At the point of 100,000+ ports I would rather invest heavily in research to make a wireless switch that can handle 100,000+ connections at Gigabit speeds (and of course a corresponding wireless device interface for each rack).
My Sig indicates the end of the comment I posted.
Lemme see - 100,000 eggs, one basket.
Good idea.
I can't wait until it reaches the limit on the MACS it can learn and just starts forwarding. :-)
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
It's not as simple as you think when your 92 Tbps router is actually 72 16-slot routers connected together with fabric chassis, where each 16-slot router shelf is over a thousand pounds and the size of a 42U rack. Then you still need to connect switches with lots of ports to it.
.
Look at Cisco's data center products, if you are looking to build a data center, for example the 18 slot Nexus 7k.
.
The 18 slot Nexus 7k can support 512 10 gig ports on a single switch, which is physically smaller than a single one of the 72 router shelves your 92 Tbps CRS-1 requires, not to mention all the fabric chassis needed to connect your CRS-1 together.
To get 512 10 gig ports it would take an 8 shelf CRS-1 multichassis configuration.
.
For a back-of-the-envelope test you can put 256 Cisco 6513 chassis on a 512 port Nexus 7k with redundant 10 gig links, and then each 6513 can support 11 slots * 48 ports = 528 ports for servers. That gives a total of 256 * 528 = 135,168 ports in the system. After this is set up, you now have about 128 42U racks full of over 100k blinking lights.
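The port count above can be reproduced in a few lines. This is only a sketch of the same back-of-the-envelope sum, taking the comment's numbers at face value (512 core ports, two uplinks per access chassis, 11 usable slots of 48 ports); the function name is made up for illustration.

```python
# Two-tier port count: one core switch feeding many access chassis.
CORE_PORTS = 512          # 10 gig ports on the core switch
UPLINKS_PER_CHASSIS = 2   # redundant 10 gig links per access chassis
SERVER_SLOTS = 11         # slots left for server-facing line cards
PORTS_PER_SLOT = 48


def fabric_ports(core_ports=CORE_PORTS, uplinks=UPLINKS_PER_CHASSIS,
                 slots=SERVER_SLOTS, ports_per_slot=PORTS_PER_SLOT):
    """Return (access chassis, server ports per chassis, total server ports)."""
    chassis = core_ports // uplinks
    per_chassis = slots * ports_per_slot
    return chassis, per_chassis, chassis * per_chassis


if __name__ == "__main__":
    chassis, per_chassis, total = fabric_ports()
    print(f"{chassis} chassis x {per_chassis} ports = {total:,} server ports")
```

Note that this counts ports only; with 528 gigabit server ports sharing two 10 gig uplinks, each access chassis is heavily oversubscribed.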
And I failed to make an old D-link wireless router into a WAP last night. Man, I stink!
I call it 'The Aristocrats'
But will it blend Linux into a beowulf cluster in Soviet Russia for great justice?
Ethernet is not always best. Ring topologies have inherent advantages in environments like this that should not be overlooked. Ethernet caught on in large part because of vendors catering to a dumbed-down market.
If it is a nonexistent problem, then why did Cisco build a solution for it?
it is certainly *not* multiple routers connected together with fabric racks. First, it is configured as a single router, and appears as a single router in the network topology. Secondly, the bandwidth behind each 40Gb/s card is about 200Gb/s, so that the entire box behaves as a single nonblocking router.
don't cut it off www.mgmbill.org
Wizards, scripts, GUIs and "automagic" are awesome tools. I love my OSPF. I love my Spanning Tree. I love my VTP. I love my Auto speed and duplex settings. I love every tool that helps me take care of tedium and drudgery.
But before you hand these tools to a network designer, they absolutely need to understand HOW and WHY those tools do what they do, lest your network ends up looking like it was built by Mickey the Wizard's Apprentice. Powerful tools require MORE skill on the part of the network admin, not less, because when those tools go wrong, they cause instant damage. Screw up a static route, and one subnet will not ping. Screw up OSPF settings, and multiple subnets may not ping. Screw up VTP settings, and your whole network can go away.
Your argument basically amounts to this. My young son doesn't have the strength yet to cut firewood safely with an ax and saw, so obviously I need to hand him a top-of-the-line Stihl chainsaw.
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
There isn't really much of a difference. Obviously you need something that will allow you to change configuration in one place and push it to all the hardware, that's a piece of software that could be developed for any kind of hardware, that can be configured. It doesn't have to say Cisco on the box in order to push configurations from some centralized location.
To make it appear as a single router in the network topology, all it has to do is to not decrement the TTL on packets as they travel through the fabric. Even if it is opaque to the outside, it is still routing packets on the inside. I have no idea if it routes based on MAC addresses, IP addresses, or some tags it puts on the packets for the purpose. And it doesn't make a difference, as long as the actual routing can be done in hardware. All three are simple enough that routing of them can be done in hardware. Obviously you will still need to prevent packets from looping somehow. Spanning trees and TTLs are not the only ways to prevent packets from looping. I don't know how Cisco does it.
What are the advantages you get from not decrementing the TTL on packets going through it? It is going to hide some information about its internals. That information is hidden from the outside world as well as from the people who actually need to debug it when it fails. Hopefully Cisco has introduced some alternative way of debugging it. Other than that, not decrementing the TTL reduces the chances that you run into a path longer than the packet TTL. Usually that is not a problem.
Obviously you need the extra capacity somewhere. But why care so much about where it is put?
Hehe, I notice you didn't mention the price of that device.
I can only imagine, when Cisco Layer 3 switches in the 6000 series used to go for a quarter million several years back.
in ignorance, but you better not try to design one that way. If your job is vacuum cleaner DESIGN, it would really help if you knew how to wind that motor, and more importantly, WHY that motor was wound that way. But certainly, if you're the janitor, then feel free to push that handle back and forth, serene in the knowledge that someone else has done the heavy lifting for you.
I'm talking about the damage this idea would do to network design when Billy-the-uberl33t-LAN-Party-Badass tries to recable his Daddy's 15-site company without understanding the implications of switchport vlan assignment (true story). You're talking about how great it would be if the average MCSE janitor could know even less than they do now.
Of course, the reason this sets me off is that I spend a lot of my time dealing with critical networks -- 911, fire, police, hospitals, airports, etc -- where people can literally die when the network goes down. In the past 10 years especially, the trend has been to cut not just corners but whole cloth instead. I used to walk into emergencies to find competent staffs bushwhacked by unsuspected bugs and subtle network design issues -- honest-to-God problems.
Now I generally walk into emergencies to find out that Cletus the 90-day-Community-College Wonder has teamed up with Zaboomafoo the Typing Lemur to bring the network down out of criminal negligence and idiocy. Where the Hell is the guy I used to work with, you know, the one who would have stopped this long before it was an issue? Oh, they let him go, they say. He was too expensive.
Was he more expensive than the clusterfrack you two idiots have belched forth upon the land?
This idea that we should invent a powertool to allow even greater ignorance in network design will wreak untold havoc and ensure that even MORE of my nights are interrupted at 3 am by yet another jackass who has stuck his head in the honeypot because he, at a fundamental level, did not understand that the crockery that makes up the honeypot wouldn't just automagically stretch to fit his head.
I'm arguing we should keep the bears of very little brain away from the honeypots.
You're arguing we should start making rubber honeypots.
I'm trying to warn you you'll end up with a bunch of asphyxiated bears that way.
You're probably thinking that we have several billion bears we can import before we run out of them.
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
the max gig ports you can have on a CRS-1 multichassis is 55,296 using 16x Sip-800 full of SPA-8x1GE-V2.
Cletus, is that you? How 'bout you, Billy and Zaboomafoo just move on out of the server room before you break something again, OK? I'd really like to sleep a whole eight hours tonight without getting yet another panicked phone call...
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
Look, you're whining about the complexity of network design without understanding WHY there's complexity. We didn't break up layer 2 and layer 3 for the simple fun of it. We did it because we HAD to. There's a long reason and history for each piece of modern networking -- yes, even a kludge as ugly as NAT -- but you're not bothering to even try to wrap your head around any of it. For someone who whines about "world-proofing children" in your sig, you've undertaken precious little of it yourself. You sound just like every AOL refugee I've ever had in my classes whining about "Why can't it all be on one vlan?"
No one in modern networking is happy with the current state of affairs, which is why if I had to guess I'd say we'll end up "routing to the edge" with IPv6 eventually. But guess what, even when we reach that Promised Land, you're still going to have to worry about MAC addresses. You'll still have to worry about routing loops. Because of some misplaced security fears, you're still going to have to learn about NAT, and yeah, you're still going to have to know how to subnet. Better networking tools aren't going to make my profession obsolete any more than better scalpels will eliminate surgeons.
A 2009 Honda Civic is a lot easier to operate than a Model T Ford.
But the design work was an order of magnitude harder and more complex.
And as bitter and pissed off as you are at paying my invoices because you can't do this yourself, get used to it. My profession isn't going anywhere. In fact, I'm busier than ever. I'm bitching about not getting enough sleep, remember?
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
Why that is so hard for law firms to understand I'll never know.
Because in the field of Law and Business, when someone says it's so, that makes it so. The judge says he's guilty, so he's guilty. A senior partner says you're wrong, so you're wrong. The highest authority in the room are twelve people who have no idea what's going on, and the highest authority in the land are nine people who can't tell you the price of milk.
People who eat, sleep and breathe in that atmosphere become extremely disconnected from reality. They tend to take it personally when someone tells them, "Just because you say it's so, doesn't make it so."
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
Yes, I'm going on and on trying to explain the technical side of it to you, but it's starting to feel a little like trying to explain math to a dog.
You're complaining about network complexity when you have no clue about WHY it's complex. You're asking that building networks be "easier," but you have no clue what you even mean by that.
So please, if you're not able to talk to the grownups about the real issues, step away from the keyboard. You're worse than the idiots showing up locked and loaded at the local healthcare discussions.
You're spouting opinions about things you know nothing about.
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."
They're not reducing complexity. They're proposing sandwiching another layer between two and three. It's not going to make things easier to design and troubleshoot. It's going to end up causing more trouble than it's worth. The only people who like this idea are salesguys like you who will have a new buzzword to sell.
But hey, by all means, implement this scheme. You're going to end up needing twice the network engineers you do now. The network explosions it will cause will be epic, the stuff of legend like Mt. St. Helens.
And for the love of Mike, I'm currently working 60-70 hours a week. We're not the Maytag repairmen. Most of us would LOVE to find a better way to do things. I have no doubt that 100 years from now, computer networking will make current schemes look slow and stupid. But those future protocols will still need to connect to the node -- layer one, identify the node -- layer two, and group the nodes together to make them easier to address -- layer three.
Look, I have no doubt you spend your week with your SE wildly gesticulating at you and shouting. I know by the time those frantic shouts get through your ears, it sounds like Charlie Brown's schoolteacher.
Show him some patience. He's trying to wedge some understanding between your ears.
He's not having much luck, apparently.
He put his boots up on the table and made a face. "The sig," he smirked. "You can waste your life in search of the sig."