A Look At the Workings of Google's Data Centers
Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting:
"'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
It's a lot easier and cheaper to make failure-tolerant software if you're looking at system functionality on a cluster/datacentre level than it is to ensure all your hardware is bulletproof.
Hardware will fail - it's up to the intelligence of the overlaid systems to mitigate that.
You could say that Google is taking advantage of the fact that hardware is unreliable to reduce cost.
With server farms the size of Google's, failures are going to occur daily regardless of how "fault-tolerant" your hardware is. Nothing is 100% failure free. Given that failures will occur, you need fault tolerance in your software, and if your software is fault tolerant, then why waste money on overpriced "fault-tolerant" hardware? If you can buy N cheapo servers for the price of 1 hardened one, then you'll typically have N times the CPU power available, and the software makes them both look as reliable.
And even if you think of Google as a whole, it is significantly more popular in Europe and the US than it is in Asia, so you would still have uneven traffic rates.
Full Tilt