Slashdot Mirror


A Look At the Workings of Google's Data Centers

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."

12 of 160 comments (clear)

  1. Hard drive failures by pacroon · · Score: 2, Interesting

    When looking at it on that massive scale, you really get the idea of just how fragile a hard drive really is. I wonder how much money the new generations of data storage is going to cost for large corporations like Google. And not to mention how existing corporations will handle it, once those devices goes from "super computers" to mainstream hardware.

    --
    It's all fun & games until someone loses the game.
  2. Overheating and rewiring? by throatmonster · · Score: 4, Interesting

    The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

    --
    All pass beyond reach of medicine. None pass beyond the reach of love.
    1. Re:Overheating and rewiring? by mdenham · · Score: 2, Interesting

      Yeah, the overheating part could be solved by investing in more racks, and then putting half as many units on each rack.

      This also allows for future throughput improvements from a single unit, and probably would cost less than the two days' downtime every overheat (racks are relatively cheap, time isn't).

  3. It's the same everywhere, regardless of scale by Enleth · · Score: 3, Interesting

    I've been managing a dorm network consisting of two "servers" (routing, PPPoE, some services like network printing etc.), a single industrial rack-mounted swithch and dozens of consumer switches spread all over the building.

    And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every - or every n-th, or every bigger than x - packet got mangled, misdirected or whatever. Or sometimes packets appeared just out of the blue (probably some partial leftovers from the cache) and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on a single one.

    Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.

    Regardless, the article is quite entertaining to read for a networking geek ;)

    --
    This is Slashdot. Common sense is futile. You will be modded down.
  4. Software architecture, Not hardware by howardd21 · · Score: 3, Interesting

    The fact that they attribute success to the software did not surprise me; the chunk and shard (not mentioned in the article) approach has been known for some time. But the fact that the GFS architecture works with BigTable and MapReduce was interesting, and that it handles many data/content types. What this creates is not only a scalable structure volume size, AND a sustainable business model. As new content types are added, regardless of size or type, they can generally be indexed appropriately. I am looking forward to searching more within types like video and audio, or even medical records like xRays or MRI results. The possibilities are staggering.

    --
    no comment
  5. Re:Failure tolerance vs. failure prevention by The+Second+Horseman · · Score: 5, Interesting
    It depends on the kind of applications you're running. Google is something of a singular case. A lot of businesses need to run a lot of small servers for dissimilar applications, not similar ones. If you're talking about business apps that don't play well together on a single server and you virtualize them, you can get a pair of 8-core servers (something like an HP Proliant DL380 G5) with an extra NIC, fibre channel HBA and 32 GB of RAM, plus local SAS drives.

    You can easily run a dozen large VMs on one of those with room to spare (assuming some of them have 2GB or 3GB of RAM allocated to them). If you limit it to ten per box, that's twenty VMs, and you can migrate servers between them or fail them over in case of a fault. Those DL380's (if you have dynamic power savings turned on) can average under 400 watts of power draw each - so 40 watts per server. In our environment, we've got 5 hosts running a ton of VMs, some of which don't have to fail over (layer 4-7 switch, also a VM), so we're getting closer to 25 or 30 watts per VM. We'd have the SAN array anyway for our primary data storage, so that wasn't much of an extra. We're using fewer data center network ports, and few fibre channel ports. We've actually been able to triple the number of "servers" we're running while actually bringing energy use down as we've retired more older servers and replaced them with VMs. And it's been a net increase in fault tolerance as well.

  6. Hardware is cheap by Ritz_Just_Ritz · · Score: 3, Interesting

    It's always going to be cheaper to use anthill labor on this type of problem. Even relatively powerful 1RU and .5RU servers are dirt cheap these days. Hell, I was able to buy a pile of .5RU machines for one of my projects this week. I can't believe how cheap things have gotten:

    quad-core xeon @2.66ghz
    4gb RAM
    2 x 500gig barracudas (RAID1)
    dual gigabit ether
    CentOS 5.1
    US$1100 per unit

    They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.

    Cheers,

  7. Re:Failure tolerance vs. failure prevention by TheRaven64 · · Score: 4, Interesting

    It depends on how much downtime costs you. If Google is down for five seconds, no one will notice - they will just assume that their link is slow, blame their ISP, and hit refresh. If a telecom's billing system or a bank's transactional system is down for five seconds then they are likely to lose a lot of money. The only difference between doing this kind of thing in hardware and software is the fail-over time and the cost. Google take a slower fail-over time in exchange for lower costs. For them and for 99.9% of businesses, it makes perfect sense. The remaining 0.1% are the reason IBM's mainframe division is so profitable.

    --
    I am TheRaven on Soylent News
  8. Re:And the Network That Connects These Clusters? by arktemplar · · Score: 3, Interesting

    Agreed, but their interconnect topology is what should be interesting not just the hardware, after all with simple topologies etc., there is a limit to how it scales efficiently, I have been doing some work on parallel processing for supercomputers as my undergrad thesis and believe me the major thing that differs amongst the top some 100 super computers is their interconnect topology not just their hardware.

    Also, their search algo is based on eigen values I think, a very very profitable algo to parallelize. what version of parallel libraries do they use ?

    --
    blog plug -> The Darker Side of Light
  9. Re:And the Network That Connects These Clusters? by dodobh · · Score: 2, Interesting

    AFAIK, Google uses Force10 switches for the networking infrastructure. Details are confidential though. I learnt this from the Force10 salesguy convinving me to buy their hardware.

    --
    I can throw myself at the ground, and miss.
  10. Jeff Dean is the smartest guy I've ever met by swb · · Score: 2, Interesting

    He was my friend in high school and roommate in college for a year. Smartest guy I've ever met in my life, easily smarter than any other PhDs I've known, including people I know with Harvard post-med school doctorates.

  11. Re:And the Network That Connects These Clusters? by dfj225 · · Score: 2, Interesting

    TCP is slow and bulky if dropped packets are a very rare thing. Confirming delivery of every packet results in a lot of wasted communication for the vast majority.

    My guess is that they use something else for internal communication. You can always recover from errors at the application level instead of forcing every packet to be confirmed.

    TCP is great for general communication over the Internet and not so great for specialized cases where performance is important, like at Google.

    --
    SIGFAULT