Slashdot Mirror


A Look At the Workings of Google's Data Centers

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
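
To put the quoted first-year figures on a daily scale, here is a rough sketch. The counts are taken from the summary above; spreading them evenly over 365 days is an assumption made purely for illustration, and the function names are hypothetical.

```python
# First-year event counts quoted from the summary above.
# Averaging them evenly over the year is an illustrative assumption;
# real failures tend to cluster.
DAYS_PER_YEAR = 365

first_year_events = {
    "individual machine failures": 1000,
    "rack failures": 20,
    "wonky racks": 5,
    "power distribution unit failures": 1,
}


def per_day(count, days=DAYS_PER_YEAR):
    """Average number of events per day for a yearly count."""
    return count / days


def rack_machine_outages(racks=20, low=40, high=80):
    """Range of machine outages caused by rack failures (40 to 80 each)."""
    return racks * low, racks * high


for name, count in first_year_events.items():
    print(f"{name}: {per_day(count):.2f} per day")

low, high = rack_machine_outages()
print(f"machines knocked out by rack failures: {low} to {high} per year")
```

On these numbers a single cluster averages between two and three individual machine failures a day, which is consistent with Dean's remark that something dies every day.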

10 of 160 comments

  1. And the Network That Connects These Clusters? by eldavojohn · · Score: 4, Insightful
    A surprisingly lengthy and revealing blog posting indeed. Quite informative and interesting.

    While Google uses ordinary hardware components for its servers ... I would like to point out that the networking details were largely overlooked. Information about the servers is interesting, but when you're networking such a vast number of computers together, I would be more interested in a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines.

    I understand distributed computing and I understand distributed searching. But the fact of the matter is that at some point at the top of the chain, you're usually transferring very large amounts of data, no matter how tall your 'network pyramid' is. The coding itself is no simple feat, but I have heard rumors that Google was building their own 10-Gigabit Ethernet switches since they couldn't find any on the market. You'll notice a lot of sites are just speculating, but it certainly is a nontrivial problem to network clusters of thousands of computers, with more than 200,000 in the whole lot, and not require some serious switch/hub/networking hardware to back it.
    --
    My work here is dung.
    1. Re:And the Network That Connects These Clusters? by magarity · · Score: 4, Insightful

      a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines
       
      I'll bet they don't mess with TCP/IP - that's way too slow and bulky. Think Infiniband or some other switched fabric instead of hierarchical.

    2. Re:And the Network That Connects These Clusters? by Anonymous Coward · · Score: 3, Insightful

      Bwaahhahhahah. Are you kidding?

      1) TCP/IP isn't really slow and bulky. It's one of the best protocols ever designed. With only minimal enhancements to the original protocol as designed, a modern host can achieve nearly line speed 10Gbit with pretty minimal CPU. We can push 900+Mbyte/sec from a single host. If you need more bandwidth, then do channel bonding.

      2) Infiniband? That costs at least $250-500 per node plus more for switches. Google is not going to spend that kind of money for the limited benefits.

      I would suspect their in-house networking is actually pretty boring: standard TCP/IP with VLANs and LACP to make addressing easier and performance a bit higher.

  2. Re:Failure tolerance vs. failure prevention by Vectronic · · Score: 3, Insightful

    Interesting, but I would probably venture a guess: never.

    Unless of course you are talking about P2s and ISAs. And it's not a matter of "reliability", I don't think; it could easily be argued that a $200 [component] is just as reliable as a $500 [component]. I think mostly what they are doing, is buying 3 of something cheaper, instead of one of something greater.

    Component A: cheaper, less cutting edge (generally more reliable)

    Component B: Has 3 times the power, 3 times the load, costs 3 times as much.

    If a single component A fails, there are still 2 running (depending on the component), and thus a 33% loss in performance and a third of the total cost to replace (making it like a 6th of the costs compared to component B).

    If component B fails, 100% loss, complete downtime, 100% expense. (relatively)
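
    A minimal sketch of that comparison, assuming the cheaper setup is split into equal units that each carry an equal share of the load and of the purchase price. The function name is made up for illustration, and the sketch covers only the capacity and replacement-cost fractions, not the "6th of the costs" estimate.

```python
from fractions import Fraction


def single_failure_impact(units, total_cost=Fraction(1)):
    """Return (capacity lost, replacement cost) when one of `units` fails.

    Assumes all units are identical and share load and cost equally,
    which is an illustrative simplification.
    """
    capacity_lost = Fraction(1, units)
    replacement_cost = total_cost / units
    return capacity_lost, replacement_cost


three_small = single_failure_impact(3)  # three of component A
one_big = single_failure_impact(1)      # one of component B

print("3 x A: capacity lost", three_small[0], "replacement cost", three_small[1])
print("1 x B: capacity lost", one_big[0], "replacement cost", one_big[1])
```

    Losing one of three A units costs a third of the capacity and a third of the spend; losing the single B unit costs all of both.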

  3. Re:Failure tolerance vs. failure prevention by PerspexAvenger · · Score: 5, Insightful

    It's a lot easier and cheaper to make failure-tolerant software if you're looking at system functionality on a cluster/datacentre level than it is to ensure all your hardware is bulletproof.
    Hardware will fail - it's up to the intelligence of the overlaid systems to mitigate that.

  4. Re:Failure tolerance vs. failure prevention by SpinyNorman · · Score: 5, Insightful

    You could say that Google is taking advantage of the fact that hardware is unreliable to reduce cost.

    With server farms the size of Google's, failures are going to occur daily regardless of how "fault-tolerant" your hardware is. Nothing is 100% failure free. Given that failures will occur, you need fault tolerance in your software, and if your software is fault tolerant, then why waste money on overpriced "fault-tolerant" hardware? If you can buy N cheapo servers for the price of 1 hardened one, then you'll typically have N times the CPU power available, and the software makes them both look as reliable.
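
    A rough sketch of why the software can make cheap boxes look reliable, assuming replicas fail independently (they often don't: a shared rack or PDU breaks that assumption). The availability figures below are invented for illustration, not measurements from Google or anyone else.

```python
def at_least_one_up(availability, replicas):
    """Probability that at least one of `replicas` servers is up.

    Assumes each server is up with probability `availability` and that
    failures are independent, which is an illustrative simplification.
    """
    all_down = (1.0 - availability) ** replicas
    return 1.0 - all_down


# Hypothetical numbers: a cheap server at 99% vs. a hardened one at 99.9%.
cheap_x3 = at_least_one_up(0.99, 3)
hardened_x1 = at_least_one_up(0.999, 1)

print(f"3 cheap servers:   {cheap_x3:.6f}")
print(f"1 hardened server: {hardened_x1:.6f}")
```

    Under those assumptions, three cheap replicas are unavailable together about one time in a million, versus one in a thousand for the single hardened box, and you still get the extra CPU while they're all up.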

  5. Re:Failure tolerance vs. failure prevention by Anpheus · · Score: 2, Insightful

    You're also paying through the nose for every extra nine of uptime.

    That's not to say it's impossible: IBM, HP, any of the "big iron" companies can offer you damn near 100% uptime without major changes to your software.

    But be prepared to pull out the checkbook. You know, the REALLY BIG one that is only suitable for writing lots of zeroes and grand prize giveaways.
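
    For a sense of what each extra nine buys, here is a quick sketch that converts a number of nines into allowed downtime per year. It assumes a 365-day year and counts all downtime equally; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

    Each extra nine cuts the downtime budget by a factor of ten, from several days a year at two nines down to a few minutes at five.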

  6. Re:Traffic Patterns for Google by eebra82 · · Score: 5, Insightful

    There is no 'night' and 'day' for a worldwide internet-based organization such as Google. When you have night, someone else has day. Both of you use Google. Google consists of dozens of data centers spread out over the planet. Therefore, Asian Google users connect to Asian data centers and not American ones. Because of this, traffic will obviously vary greatly over a 12 hour period.

    And even if you think of Google as a whole, it is significantly more popular in Europe and the US than it is in Asia, so you would still have uneven traffic rates.

  7. Re:Failure tolerance vs. failure prevention by Znork · · Score: 4, Insightful

    I think mostly what they are doing, is buying 3 of something cheaper, instead of one of something greater.

    From what it looks like they're doing exactly what I do for myself; skip the extraneous crap and simply rack motherboards as they are.

    In that case we're not talking 3 of something cheaper; you could probably get up towards 5-10 of something cheaper. Then consider that best price/performance is not generally what is bought, and the difference is even wider.

    Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products. Point out to your average corporate purchaser or technical director that you could reduce CPU cycle costs to 1/25th, and that you could provide storage at 1/100th of the current per gigabyte cost and they'll whine 'but we're an _enterprise_, we can't buy consumer grade stuff or build it ourselves'.

    Ten years ago people brought obsolete junk from work home to play with. These days I'm considering bringing obsolete stuff from home to work because the stuff I throw out is often better than low-prioritized things at work.

  8. Re:Failure tolerance vs. failure prevention by jacobsm · · Score: 3, Insightful

    First let me state that I'm a mainframe systems programmer and a true believer of this technology. IMHO Google should start looking at mainframe based virtualization instead of the server farms they currently depend on.

    One z10 complex with 64 CPUs and 1.5 TB of memory can support thousands of Linux instances, all communicating with each other using hypersocket technology. Hypersockets use microcode to enable communications between environments without going to the actual network.

    A z10 processor complex is as close to 100% fault tolerant as possible, energy efficient, and cost effective when compared to the total cost of the alternatives.