Slashdot Mirror


A Look At the Workings of Google's Data Centers

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
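As a rough check on the figures Dean quotes, the per-cluster numbers can be turned into daily rates. This is only a back-of-the-envelope sketch: the failure counts come from the quote above, while the 365-day year and all function and constant names are assumptions made for illustration.

```python
# Figures quoted from Dean's talk (per cluster, first year).
MACHINE_FAILURES_PER_YEAR = 1000
RACK_FAILURES_PER_YEAR = 20
MACHINES_PER_RACK_FAILURE = (40, 80)    # low and high estimates from the quote
PDU_OUTAGE_MACHINES = (500, 1000)       # low and high estimates from the quote
PDU_OUTAGE_HOURS = 6

DAYS_PER_YEAR = 365  # assumption: a plain calendar year


def failures_per_day(failures_per_year: float) -> float:
    """Average number of failures per day, given a yearly count."""
    return failures_per_year / DAYS_PER_YEAR


def machine_hours_lost(machines: int, hours: float) -> float:
    """Machine-hours of capacity lost when `machines` are down for `hours`."""
    return machines * hours


if __name__ == "__main__":
    rate = failures_per_day(MACHINE_FAILURES_PER_YEAR)
    print(f"individual machine failures per day: {rate:.2f}")

    low, high = MACHINES_PER_RACK_FAILURE
    print("machines dropped by rack failures per year:",
          RACK_FAILURES_PER_YEAR * low, "to", RACK_FAILURES_PER_YEAR * high)

    low, high = PDU_OUTAGE_MACHINES
    print("machine-hours lost to the one PDU failure:",
          machine_hours_lost(low, PDU_OUTAGE_HOURS), "to",
          machine_hours_lost(high, PDU_OUTAGE_HOURS))
```

Even before counting drives, racks and power, the quoted 1,000 machine failures a year already average out to more than two per day in a single cluster.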

24 of 160 comments

  1. And the Network That Connects These Clusters? by eldavojohn · · Score: 4, Insightful
    A surprisingly lengthy and revealing blog posting indeed. Quite informative and interesting.

    While Google uses ordinary hardware components for its servers ... I would like to point out that the networking details were largely overlooked. Information about the servers is interesting, but when you're networking such a vast number of computers together, I would be more interested in a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines.

    I understand distributed computing and I understand distributed searching. But the fact of the matter is that at some point at the top of the chain, you're usually transferring very large amounts of data--no matter how tall your 'network pyramid' is. The coding itself is no simple feat but I have heard rumors that Google was building their own 10-Gigabit ethernet switches since they couldn't find any on the market. You'll notice a lot of sites are just speculating but it certainly is a nontrivial problem to network clusters of thousands of computers with more than 200,000 in the whole lot and not require some serious switch/hub/networking hardware to back it.
    --
    My work here is dung.
    1. Re:And the Network That Connects These Clusters? by magarity · · Score: 4, Insightful

      a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines

      I'll bet they don't mess with TCP/IP - that's way too slow and bulky. Think Infiniband or some other switched fabric instead of hierarchical.

    2. Re:And the Network That Connects These Clusters? by arktemplar · · Score: 3, Interesting

      Agreed, but their interconnect topology is what should be interesting, not just the hardware. After all, with simple topologies there is a limit to how efficiently things scale. I have been doing some work on parallel processing for supercomputers as my undergrad thesis, and believe me, the major thing that differs amongst the top 100 or so supercomputers is their interconnect topology, not just their hardware.

      Also, their search algo is based on eigenvalues, I think, a very very profitable algo to parallelize. What version of parallel libraries do they use?

      --
      blog plug -> The Darker Side of Light
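The "eigenvalues" remark above presumably refers to PageRank, whose ranking vector is the principal eigenvector of a link matrix and can be approximated by power iteration. The following is a toy, single-machine sketch under stated assumptions, not Google's code: the function name, the three-page graph and the damping value of 0.85 (a commonly cited default) are all illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to. Every link
    target is assumed to also appear as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical link graph
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Each page's contribution within one iteration is computed independently of the others, which is the property that makes this kind of computation attractive to split across many machines.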
    3. Re:And the Network That Connects These Clusters? by Nethemas+the+Great · · Score: 3, Informative

      Here's what they used in 1998... A Wikipedia article explains a bit of what they're doing now...

      --
      Two of my imaginary friends reproduced once ... with negative results.
    4. Re:And the Network That Connects These Clusters? by Anonymous Coward · · Score: 3, Insightful

      Bwaahhahhahah. Are you kidding?

      1) TCP/IP isn't really slow and bulky. It's one of the best protocols ever designed. With only minimal enhancements to the original protocol as designed, a modern host can achieve nearly line speed 10Gbit with pretty minimal CPU. We can push 900+Mbyte/sec from a single host. If you need more bandwidth, then do channel bonding.

      2) Infiniband? That costs at least $250-500 per node plus more for switches. Google is not going to spend that kind of money for the limited benefits.

      I would suspect their in-house networking is actually pretty boring: standard TCP/IP with VLANs and LACP to make addressing easier and performance a bit higher.

  2. Re:Failure tolerance vs. failure prevention by Vectronic · · Score: 3, Insightful

    Interesting, but I would probably venture a guess: never.

    Unless of course you are talking about P2's and ISA's, and it's not a matter of "reliability", I don't think. It could easily be argued that a $200 [component] is just as reliable as a $500 [component]. I think mostly what they are doing, is buying 3 of something cheaper, instead of one of something greater.

    Component A: cheaper, less cutting edge (generally more reliable)

    Component B: Has 3 times the power, 3 times the load, costs 3 times as much.

    If a single component A fails, there are still 2 running (depending on the component) and thus a 33% loss in performance, and a third of the total cost to replace (making it like a 6th of the costs compared to component B)

    If component B fails, 100% loss, complete downtime, 100% expense. (relatively)
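The comparison above can be written out as a small calculation. This is a sketch only: the `Option` class and its method names are hypothetical, and the unit costs and capacities are normalised so that one component B equals three of component A, as the comment assumes.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of buying a given amount of capacity (illustrative only)."""
    name: str
    units: int
    unit_cost: float
    unit_capacity: float

    def total_cost(self) -> float:
        return self.units * self.unit_cost

    def total_capacity(self) -> float:
        return self.units * self.unit_capacity

    def capacity_lost_on_single_failure(self) -> float:
        """Fraction of total capacity lost when exactly one unit fails."""
        return self.unit_capacity / self.total_capacity()

    def replacement_cost_share(self) -> float:
        """Cost of replacing one failed unit, as a fraction of total spend."""
        return self.unit_cost / self.total_cost()


# Normalised figures: B has 3x the power of A and costs 3x as much.
three_cheap = Option("3 x component A", units=3, unit_cost=1.0, unit_capacity=1.0)
one_big = Option("1 x component B", units=1, unit_cost=3.0, unit_capacity=3.0)

if __name__ == "__main__":
    for option in (three_cheap, one_big):
        print(option.name,
              f"capacity lost: {option.capacity_lost_on_single_failure():.0%},",
              f"replacement cost share: {option.replacement_cost_share():.0%}")
```

Both options cost the same and deliver the same capacity while everything works; they differ only in how much is lost, and how much must be re-bought, when one unit dies.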

  3. Re:Failure tolerance vs. failure prevention by PerspexAvenger · · Score: 5, Insightful

    It's a lot easier and cheaper to make failure-tolerant software if you're looking at system functionality on a cluster/datacentre level than it is to ensure all your hardware is bulletproof.
    Hardware will fail - it's up to the intelligence of the overlaid systems to mitigate that.

  4. Overheating and rewiring? by throatmonster · · Score: 4, Interesting

    The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

    --
    All pass beyond reach of medicine. None pass beyond the reach of love.
    1. Re:Overheating and rewiring? by William+Robinson · · Score: 4, Funny

      The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

      Each machine has a smoke detector installed right on top of it. The Maintenance director stands at the gate of the data center with a pistol in each hand. As soon as an alarm is heard, a batch of maintenance engineers rushes towards the faulty machine with keyboard, hard disk, mouse, motherboard and other components. The faulty components of the machine are replaced to the rhythm of drumbeats they have rehearsed thousands of times. The crew has to rewire the machine, reboot, and be back at the gate with the burnt machine in less than 5 minutes or they are shot dead.

      The trouble is, because of this time limit, the maintenance engineers simply pull the machine out of the rack without disconnecting any wires. And that's why rewiring is needed.

  5. It's the same everywhere, regardless of scale by Enleth · · Score: 3, Interesting

    I've been managing a dorm network consisting of two "servers" (routing, PPPoE, some services like network printing etc.), a single industrial rack-mounted switch and dozens of consumer switches spread all over the building.

    And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every - or every n-th, or every bigger than x - packet got mangled, misdirected or whatever. Or sometimes packets appeared just out of the blue (probably some partial leftovers from the cache) and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on a single one.

    Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.

    Regardless, the article is quite entertaining to read for a networking geek ;)

    --
    This is Slashdot. Common sense is futile. You will be modded down.
    1. Re:It's the same everywhere, regardless of scale by Bender0x7D1 · · Score: 3, Funny

      Sounds like you have dust in your cables. I would recommend you clean the inside of your cables with compressed air so the bits don't get stuck on the lint and other stuff in there. The bits travel very fast, so even small dust particles can be a problem.

      --
      Reading code is like reading the dictionary - you have to read half of it before you can go back and understand it.
    2. Re:It's the same everywhere, regardless of scale by ocbwilg · · Score: 3, Informative

      I have never seen a switch fail what are you doing to them? mine are just consumer 5-16port devices

      And that's why. If you're using "smart hubs" or "dumb switches" (aka, your $99 Linksys switch), then you're probably not going to have issues. All it does is store MAC tables and forward data to the appropriate ports. You probably also don't have multiple other network switches/hubs/routers hanging off of those devices somewhere downstream, and if you do then it's very likely that you know what and where they are and can plan for them.

      On the other hand, trying to manage an enterprise-class switch with advanced features can be a little more complicated, especially when you start allowing anybody to plug any other kind of network devices into the switch. You can easily end up with spanning tree loops, issues with frame sizing, cross-brand autonegotiation failures, and who knows what else. And that's before you even have to start worrying about bugs in various firmware revisions or some enterprising "hax0r d00dz" who passed Comp Sci 101 trying to do things that he shouldn't be doing, and spoofing addresses to try to cover his tracks.

    3. Re:It's the same everywhere, regardless of scale by jimicus · · Score: 3, Informative

      I have never seen a switch fail what are you doing to them? mine are just consumer 5-16port devices but they are in constant use quite often at maximum capacity for several hours at a time while large files are transferred over the network. I think I have had one crash needing a reboot once and had to reset another after a momentary power loss another time.

      Then at least one of the following is true:

      1. You've been fantastically lucky.
      2. You've not been in IT terribly long.
      3. Your job doesn't involve network management and so your experience of what switches can do when they have a mind to is limited.

      Solid-state simple dumb switches can and do fail, as can managed ones. If you're lucky, they fail in a fairly obvious fashion (eg. they just stop pushing packets on some or all ports).

      If you're unlucky, they start spewing corrupt frames everywhere confusing the hell out of everything else on the network and you have to figure out exactly which switch is doing this and get rid of it.
  6. Re:Failure tolerance vs. failure prevention by dotancohen · · Score: 4, Funny

    At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?

    Google is not skimping on hardware. They are simply not trusting hardware to be reliable. Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....
    --
    It is dangerous to be right when the government is wrong.
  7. Software architecture, Not hardware by howardd21 · · Score: 3, Interesting

    The fact that they attribute success to the software did not surprise me; the chunk and shard (not mentioned in the article) approach has been known for some time. But the fact that the GFS architecture works with BigTable and MapReduce was interesting, and that it handles many data/content types. What this creates is not only a structure that scales with volume and size, but also a sustainable business model. As new content types are added, regardless of size or type, they can generally be indexed appropriately. I am looking forward to searching more within types like video and audio, or even medical records like X-rays or MRI results. The possibilities are staggering.

    --
    no comment
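To make the "chunk and shard" idea mentioned above concrete, here is a toy map/reduce pass that builds an inverted index over sharded documents. It is an in-memory sketch under stated assumptions and does not use or imitate the real GFS, BigTable or MapReduce interfaces; every function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_shard(shard):
    """Map step: emit (word, doc_id) pairs for one shard of documents.

    `shard` maps document ids to their text.
    """
    pairs = []
    for doc_id, text in shard.items():
        for word in set(text.lower().split()):
            pairs.append((word, doc_id))
    return pairs


def reduce_pairs(all_pairs):
    """Reduce step: group document ids by word into an inverted index."""
    index = defaultdict(set)
    for word, doc_id in all_pairs:
        index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


def build_index(shards):
    """Run the map step over every shard, then reduce the combined output."""
    all_pairs = []
    for shard in shards:
        # Each shard is processed independently of the others, so in a real
        # cluster these calls could run on different machines.
        all_pairs.extend(map_shard(shard))
    return reduce_pairs(all_pairs)


if __name__ == "__main__":
    shards = [
        {"d1": "cheap hardware fails"},
        {"d2": "software hides hardware faults"},
    ]
    print(build_index(shards)["hardware"])
```

Because a shard's map output depends only on that shard, a lost machine means re-running one shard's work rather than the whole job, which is the sense in which the software, not the hardware, supplies the reliability.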
  8. Re:Failure tolerance vs. failure prevention by The+Second+Horseman · · Score: 5, Interesting
    It depends on the kind of applications you're running. Google is something of a singular case. A lot of businesses need to run a lot of small servers for dissimilar applications, not similar ones. If you're talking about business apps that don't play well together on a single server and you virtualize them, you can get a pair of 8-core servers (something like an HP Proliant DL380 G5) with an extra NIC, fibre channel HBA and 32 GB of RAM, plus local SAS drives.

    You can easily run a dozen large VMs on one of those with room to spare (assuming some of them have 2GB or 3GB of RAM allocated to them). If you limit it to ten per box, that's twenty VMs, and you can migrate servers between them or fail them over in case of a fault. Those DL380's (if you have dynamic power savings turned on) can average under 400 watts of power draw each - so 40 watts per server. In our environment, we've got 5 hosts running a ton of VMs, some of which don't have to fail over (layer 4-7 switch, also a VM), so we're getting closer to 25 or 30 watts per VM. We'd have the SAN array anyway for our primary data storage, so that wasn't much of an extra. We're using fewer data center network ports, and fewer fibre channel ports. We've actually been able to triple the number of "servers" we're running while actually bringing energy use down as we've retired more of the older servers and replaced them with VMs. And it's been a net increase in fault tolerance as well.
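The power arithmetic in the comment above can be checked with a couple of one-line functions. This is a sketch: the 400-watt host figure and the 10-VMs-per-host limit come from the comment, while the function names and the assumption that every host draws the same 400 watts are illustrative.

```python
def watts_per_vm(host_watts: float, vms_per_host: float) -> float:
    """Average power attributable to each VM on a host."""
    return host_watts / vms_per_host


def vms_per_host_for_target(host_watts: float, target_watts_per_vm: float) -> float:
    """VMs per host needed to reach a given watts-per-VM figure."""
    return host_watts / target_watts_per_vm


HOST_WATTS = 400  # "under 400 watts" per host, from the comment

if __name__ == "__main__":
    print("watts per VM at 10 VMs per host:", watts_per_vm(HOST_WATTS, 10))
    print("VMs per host for 30 W per VM:",
          round(vms_per_host_for_target(HOST_WATTS, 30), 1))
    print("VMs per host for 25 W per VM:",
          round(vms_per_host_for_target(HOST_WATTS, 25), 1))
```

If each host really draws about 400 watts, the quoted 25 to 30 watts per VM would correspond to roughly 13 to 16 VMs per host.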

  9. Re:Failure tolerance vs. failure prevention by cp.tar · · Score: 4, Funny

    Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....

    Don't worry, your secret is safe with us.

    Real Slashdotters not only fail to read TFAs, but they also completely miss any and all relevant information in other people's posts.
    Therefore, someone may hook on your claim that Google is not skimping on hardware and try to argue that they, in fact, do. Your admission to having read TFA will go completely unnoticed.

    And before you ask yourself how come I noticed it: I didn't.
    And besides, I'm new here.

    --
    Ignore this signature. By order.
  10. Hardware is cheap by Ritz_Just_Ritz · · Score: 3, Interesting

    It's always going to be cheaper to use anthill labor on this type of problem. Even relatively powerful 1RU and .5RU servers are dirt cheap these days. Hell, I was able to buy a pile of .5RU machines for one of my projects this week. I can't believe how cheap things have gotten:

    quad-core xeon @2.66ghz
    4gb RAM
    2 x 500gig barracudas (RAID1)
    dual gigabit ether
    CentOS 5.1
    US$1100 per unit

    They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.

    Cheers,

  11. Re:Traffic Patterns for Google by tristian_was_here · · Score: 3, Funny

    I bet certain trends happen at night

  12. Re:Failure tolerance vs. failure prevention by SpinyNorman · · Score: 5, Insightful

    You could say that Google is taking advantage of the fact that hardware is unreliable to reduce cost.

    With server farms the size of Google's, failures are going to occur daily regardless of how "fault-tolerant" your hardware is. Nothing is 100% failure free. Given that failures will occur, you need fault tolerance in your software, and if your software is fault tolerant, then why waste money on overpriced "fault-tolerant" hardware? If you can buy N cheapo servers for the price of 1 hardened one, then you'll typically have N times the CPU power available, and the software makes them both look as reliable.
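The argument above can be put in numbers with two simple probabilities. This is a sketch that assumes machines fail independently, which the story itself shows is not always true (a failed power distribution unit or rack takes out many machines at once). The failure probabilities and function names are hypothetical, chosen only to show the shape of the result.

```python
def p_any_failure(n_machines: int, p_fail: float) -> float:
    """Probability that at least one of n independent machines fails,
    given each fails with probability p_fail in the period considered."""
    return 1.0 - (1.0 - p_fail) ** n_machines


def p_all_replicas_down(p_down: float, replicas: int) -> float:
    """Probability that every one of `replicas` independent copies is down."""
    return p_down ** replicas


if __name__ == "__main__":
    # Hypothetical: each machine has a 0.1% chance of failing on a given day.
    for n in (1, 100, 10_000):
        print(f"{n:>6} machines: P(any failure today) = {p_any_failure(n, 0.001):.4f}")

    # Hypothetical: each cheap server is unavailable 1% of the time.
    for k in (1, 2, 3):
        print(f"{k} replica(s): P(all down) = {p_all_replicas_down(0.01, k):.6f}")
```

The first function shows why failures become a daily certainty at scale no matter how good each box is; the second shows how software-level replication across cheap boxes drives the chance of a visible outage down quickly.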

  13. Re:Failure tolerance vs. failure prevention by TheRaven64 · · Score: 4, Interesting

    It depends on how much downtime costs you. If Google is down for five seconds, no one will notice - they will just assume that their link is slow, blame their ISP, and hit refresh. If a telecom's billing system or a bank's transactional system is down for five seconds then they are likely to lose a lot of money. The only difference between doing this kind of thing in hardware and software is the fail-over time and the cost. Google take a slower fail-over time in exchange for lower costs. For them and for 99.9% of businesses, it makes perfect sense. The remaining 0.1% are the reason IBM's mainframe division is so profitable.

    --
    I am TheRaven on Soylent News
  14. Re:Traffic Patterns for Google by eebra82 · · Score: 5, Insightful

    There is no 'night' and 'day' for a worldwide internet-based organization such as google. When you have night, someone else has day. Both of you use google.

    Google consists of dozens of data centers spread out over the planet. Therefore, Asian Google users connect to Asian data centers and not American ones. Because of this, traffic will obviously vary greatly over a 12 hour period.

    And even if you think of Google as a whole, it is significantly more popular in Europe and the US than it is in Asia, so you would still have uneven traffic rates.
  15. Re:Failure tolerance vs. failure prevention by Znork · · Score: 4, Insightful

    I think mostly what they are doing, is buying 3 of something cheaper, instead of one of something greater.

    From what it looks like they're doing exactly what I do for myself; skip the extraneous crap and simply rack motherboards as they are.

    In that case we're not talking 3 of something cheaper; you could probably get up towards 5-10 of something cheaper. Then consider that best price/performance is not generally what is bought, and the difference is even wider.

    Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products. Point out to your average corporate purchaser or technical director that you could reduce CPU cycle costs to 1/25th, and that you could provide storage at 1/100th of the current per gigabyte cost and they'll whine 'but we're an _enterprise_, we can't buy consumer grade stuff or build it ourselves'.

    Ten years ago people brought obsolete junk from work home to play with. These days I'm considering bringing obsolete stuff from home to work because the stuff I throw out is often better than low-prioritized things at work.

  16. Re:Failure tolerance vs. failure prevention by jacobsm · · Score: 3, Insightful

    First let me state that I'm a mainframe systems programmer and a true believer in this technology. IMHO Google should start looking at mainframe based virtualization instead of the server farms they currently depend on.

    One z10 complex with 64 CPUs and 1.5 TB of memory can support thousands of Linux instances all communicating with each other using hypersocket technology. Hypersockets uses microcode to enable communications between environments without going to the actual network.

    A z10 processor complex is as close to 100% fault tolerant as possible, energy efficient, and cost effective when compared to the total cost of the alternatives.