A Look At the Workings of Google's Data Centers
Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting:
"'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
That's just plain bad engineering and poor technical capability. In fact, the entire tone reeks of bad engineering (hard disk failures aside). Broadcast engineering (television) trucks are wired at about twice the density of the typical data center, and if they had these kinds of failure rates we would not have any live event television at all. The failures and the attitude are shocking.
Actually, no, you don't need to increase the number of fans, because the number of fans required is a function of the total amount of heat produced, not the air volume.
No it isn't. If a machine has a 90% chance of working flawlessly for ten years and you only have one, odds are everything will always work. If you have ten, odds are at least one will die within those ten years. Things are different at large scale, and failure prediction is an important part of creating such a big cluster. But yes, even on a small scale you should always plan for failure.
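To put rough numbers on that, here is a minimal Python sketch. It assumes machine failures are independent, which real clusters violate (a dead power distribution unit takes out hundreds of boxes at once), and the function names are made up for illustration. The 90% figure is the one from this comment, and the 1,000 failures per year is Dean's figure from the summary.

```python
def prob_at_least_one_failure(survival_prob: float, machines: int) -> float:
    """Chance that at least one of `machines` fails, assuming independent failures."""
    return 1.0 - survival_prob ** machines


def failures_per_day(failures_per_year: float) -> float:
    """Average daily failure count implied by a yearly total."""
    return failures_per_year / 365.0


if __name__ == "__main__":
    # 90% chance that a single machine survives the ten-year period.
    for n in (1, 10, 100):
        p = prob_at_least_one_failure(0.9, n)
        print(f"{n:>4} machines: P(at least one failure) = {p:.4f}")

    # Dean's figure: about 1,000 machine failures in a cluster's first year.
    print(f"average failures per day: {failures_per_day(1000):.2f}")
```

With one machine the chance of a failure is 10%. With ten it is already about 65%, and 1,000 failures a year works out to a bit under three a day.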
Well.. maybe. Or Maybe not. But Definitely not sort of.
Most rackmounts that I've seen have an 'identify' LED that you can have blink (I assume you can automate this with SNMP and management software).
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
The problem comes from requirements changing. "Sorry, we designed this building for X load, now you're using X+10% load so we have to add additional cooling units to keep up."
I had this problem at the University where I worked a while ago. We rolled in a nice new SGI Altix machine. We had enough power, but the cooling system couldn't move enough cubic feet of air into the one part of the room where the box was. As soon as you reach capacity, temps skyrocket.
I have never seen a switch fail. What are you doing to them? Mine are just consumer 5-16 port devices.
And that's why. If you're using "smart hubs" or "dumb switches" (aka, your $99 Linksys switch), then you're probably not going to have issues. All it does is store a MAC table and forward data to the appropriate ports. You probably also don't have multiple other network switches/hubs/routers hanging off of those devices somewhere downstream, and if you do then it's very likely that you know what and where they are and can plan for them.
On the other hand, trying to manage an enterprise-class switch with advanced features can be a little more complicated, especially when you start allowing anybody to plug any other kind of network devices into the switch. You can easily end up with spanning tree loops, issues with frame sizing, cross-brand autonegotiation failures, and who knows what else. And that's before you even have to start worrying about bugs in various firmware revisions or some enterprising "hax0r d00dz" who passed Comp Sci 101 trying to do things that he shouldn't be doing, and spoofing addresses to try to cover his tracks.
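For anyone wondering how little a "dumb switch" actually does, here is a rough Python sketch of the store-a-MAC-table-and-forward behavior. It is an illustrative model, not how any particular vendor's firmware works. The `LearningSwitch` class and its method names are hypothetical, and it leaves out table aging, VLANs and spanning tree.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of an unmanaged switch: learn source MACs, forward by destination."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.mac_table: Dict[str, int] = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) which port the sender lives on.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcast or unknown destination: flood to every port except the ingress.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            # Destination is on the same segment as the sender: nothing to forward.
            return []
        return [out_port]


if __name__ == "__main__":
    sw = LearningSwitch(num_ports=4)
    a, b = "00:00:00:00:00:0a", "00:00:00:00:00:0b"
    print(sw.handle_frame(0, a, b))  # b not learned yet, so the frame is flooded
    print(sw.handle_frame(1, b, a))  # a was learned on port 0
    print(sw.handle_frame(0, a, b))  # b is now known on port 1
```

The flooding branch is also why loops hurt so much: with two paths between switches and no spanning tree, a flooded frame keeps getting flooded back.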
Here's what they used in 1998... A Wikipedia article explains a bit of what they're doing now...
Two of my imaginary friends reproduced once
1. You've been fantastically lucky.
2. You've not been in IT terribly long.
3. Your job doesn't involve network management and so your experience of what switches can do when they have a mind to is limited.
Solid-state simple dumb switches can and do fail, as can managed ones. If you're lucky, they fail in a fairly obvious fashion (e.g. they just stop pushing packets on some or all ports).
If you're unlucky, they start spewing corrupt frames everywhere confusing the hell out of everything else on the network and you have to figure out exactly which switch is doing this and get rid of it.
Typically Infiniband gets used for clustering and Fibre Channel for mass storage. They offer low latency 'lossless' connections. If you're putting together a data center and you talk to the vendors about how to do that, they will steer you towards IB and FC. Of course they would: the tech works and it makes them very rich (huge profit margins on the hardware).
I wouldn't be surprised to learn that Google is instead using 'Ethernet everywhere'. It has some good advantages:
1. Ethernet is cheap and scales in speed faster than anything else. 2009 will be a big year for 10GigE price reduction.
2. You have Ethernet for your networking, why not use it for your clustering and storage as well - reduce cables, physical links, power...
3. Google wants to cluster between data centers. Sounds like internetworking, but regardless, using IB or FC between data centers is going to be a PITA.
4. Ethernet currently doesn't ensure 'lossless' delivery; however, with some effort, your clustering traffic can be practically lossless anyway.
5. 10Gig Ethernet is being enhanced to ensure lossless delivery (priority-based pause frames and congestion control), enabling FCoE but not limited to FCoE.
No, the % measures the overhead of AC compared to your heat-producing costs. There is no theoretical minimum to the AC needed, so you can't report an "efficiency" number like you're trying to do.
No, they use gigabit and 10G ethernet. Infiniband is the opposite of cheap commodity hardware. Infiniband is expensive per port and not commodity.
Google has a two-vendor policy. I know some of their network gear for gig-e and 10G-e is Force10. Google and Force10 are both involved in 802.3ba (40G and 100G): Force10 is on the IEEE committee and Google is one of the customers with demand. They may have a seat on the committee too; I don't really know all the members.