Slashdot Mirror


A Look At the Workings of Google's Data Centers

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
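For a sense of scale, the first-year figures quoted above can be turned into rough rates. This is only back-of-the-envelope arithmetic: it assumes a 365-day year and failures spread evenly across it, neither of which the summary states, and the constant and function names are made up for illustration.

```python
# Figures quoted in the summary for a cluster's first year.
MACHINE_FAILURES_PER_YEAR = 1_000
RACK_FAILURES_PER_YEAR = 20
PDU_MACHINES_DOWN = (500, 1_000)   # machines taken down by one PDU failure
PDU_OUTAGE_HOURS = 6               # "about 6 hours"

# Assumption (not from the summary): a 365-day year with evenly spread failures.
DAYS_PER_YEAR = 365


def per_day(events_per_year):
    """Average number of events per day under the even-spread assumption."""
    return events_per_year / DAYS_PER_YEAR


def machine_hours_lost(machines_range, hours):
    """Low and high estimates of machine-hours lost in a single outage."""
    low, high = machines_range
    return (low * hours, high * hours)


machine_failures_per_day = per_day(MACHINE_FAILURES_PER_YEAR)
days_between_rack_failures = DAYS_PER_YEAR / RACK_FAILURES_PER_YEAR
pdu_machine_hours = machine_hours_lost(PDU_MACHINES_DOWN, PDU_OUTAGE_HOURS)

print(f"Machine failures per day: {machine_failures_per_day:.1f}")
print(f"Days between rack failures: {days_between_rack_failures:.2f}")
print(f"Machine-hours lost to the PDU failure: {pdu_machine_hours[0]} to {pdu_machine_hours[1]}")
```

Under those assumptions, a single cluster loses between two and three machines on an average day, which fits Dean's remark that something is going to die every day.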

6 of 160 comments

  1. Re:Overheating and rewiring? by William+Robinson · · Score: 4, Funny

    The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

    Each machine has a smoke detector installed right on top of it. The maintenance director stands at the gate of the data center with a pistol in both hands. As soon as an alarm is heard, a batch of maintenance engineers rushes toward the faulty machine with a keyboard, hard disk, mouse, motherboard, and other components. The faulty components are replaced to the rhythm of drumbeats they have rehearsed thousands of times. The crew has to rewire the machine, reboot it, and be back at the gate with the burnt machine in less than 5 minutes, or they are shot dead.

    The trouble is, because of this time limit, the maintenance engineers simply pull the machine out of the rack without disconnecting any wires. And that's why rewiring is needed.

  2. Re:Failure tolerance vs. failure prevention by dotancohen · · Score: 4, Funny

    At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?

    Google is not skimping on hardware. They are simply not trusting hardware to be reliable. Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....
    --
    It is dangerous to be right when the government is wrong.
  3. Re:Failure tolerance vs. failure prevention by cp.tar · · Score: 4, Funny

    Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....

    Don't worry, your secret is safe with us.

    Real Slashdotters not only fail to read TFAs, but they also completely miss any and all relevant information in other people's posts.
    Therefore, someone may latch onto your claim that Google is not skimping on hardware and try to argue that they, in fact, are. Your admission to having read TFA will go completely unnoticed.

    And before you ask yourself how come I noticed it: I didn't.
    And besides, I'm new here.

    --
    Ignore this signature. By order.
  4. Re:Traffic Patterns for Google by tristian_was_here · · Score: 3, Funny

    I bet certain trends happen at night

  5. Re:It's the same everywhere, regardless of scale by Bender0x7D1 · · Score: 3, Funny

    Sounds like you have dust in your cables. I would recommend you clean the inside of your cables with compressed air so the bits don't get stuck on the lint and other stuff in there. The bits travel very fast, so even small dust particles can be a problem.

    --
    Reading code is like reading the dictionary - you have to read half of it before you can go back and understand it.
  6. Re:Jeff Dean is the smartest guy I've ever met by thczv · · Score: 2, Funny

    Jeff, is that you?