A Look At the Workings of Google's Data Centers

Posted by Soulskill on Saturday May 31, 2008 @12:07AM from the we're-gonna-need-a-bigger-boat dept.

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will 'go wonky,' with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
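
As a rough back-of-the-envelope check on the figures quoted above, the sketch below turns the first-year numbers into a daily rate and a yearly range. It assumes failures are spread evenly across the year, which the summary does not claim; the function and variable names are made up for illustration.

```python
# Back-of-the-envelope rates from the first-year figures quoted in the summary.
DAYS_PER_YEAR = 365

machine_failures_per_year = 1_000      # quoted in the summary
rack_failures_per_year = 20            # quoted in the summary
machines_per_rack_failure = (40, 80)   # quoted range per rack failure


def per_day(events_per_year: float) -> float:
    """Average events per day, ASSUMING events are spread evenly over the year."""
    return events_per_year / DAYS_PER_YEAR


machine_failures_per_day = per_day(machine_failures_per_year)

# Machines dropped from the network by rack failures over the year (low, high).
rack_outage_machines = tuple(
    rack_failures_per_year * n for n in machines_per_rack_failure
)

print(f"machine failures per day: {machine_failures_per_day:.1f}")
print(
    "machines hit by rack failures per year: "
    f"{rack_outage_machines[0]}-{rack_outage_machines[1]}"
)
```

On these numbers the average works out to between two and three machine failures a day, which is consistent with the "something is going to die every day" remark.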

Failure tolerance vs. failure prevention by Anonymous Coward · 2008-05-31 00:14 · Score: 1, Interesting

At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?
1. Re:Failure tolerance vs. failure prevention by The+Second+Horseman · 2008-05-31 01:00 · Score: 5, Interesting
  
  It depends on the kind of applications you're running. Google is something of a singular case. A lot of businesses need to run a lot of small servers for dissimilar applications, not similar ones. If you're talking about business apps that don't play well together on a single server and you virtualize them, you can get a pair of 8-core servers (something like an HP Proliant DL380 G5) with an extra NIC, fibre channel HBA and 32 GB of RAM, plus local SAS drives.
  You can easily run a dozen large VMs on one of those with room to spare (assuming some of them have 2GB or 3GB of RAM allocated to them). If you limit it to ten per box, that's twenty VMs, and you can migrate servers between them or fail them over in case of a fault. Those DL380s (if you have dynamic power savings turned on) can average under 400 watts of power draw each - so 40 watts per server. In our environment, we've got 5 hosts running a ton of VMs, some of which don't have to fail over (layer 4-7 switch, also a VM), so we're getting closer to 25 or 30 watts per VM. We'd have the SAN array anyway for our primary data storage, so that wasn't much of an extra. We're using fewer data center network ports, and fewer fibre channel ports. We've been able to triple the number of "servers" we're running while bringing energy use down as we've retired older servers and replaced them with VMs. And it's been a net increase in fault tolerance as well.
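
The per-VM power arithmetic in the comment above can be checked with a short sketch. The helper names are hypothetical, and the second calculation assumes the five-host setup also draws about 400 W per host, which the comment does not state.

```python
def watts_per_vm(hosts: int, watts_per_host: float, vms: int) -> float:
    """Average host power draw attributed to each VM."""
    return hosts * watts_per_host / vms


def vms_for_target(hosts: int, watts_per_host: float, target_watts: float) -> float:
    """Number of VMs needed to reach a given per-VM power figure."""
    return hosts * watts_per_host / target_watts


# Figures from the comment: two hosts at (under) 400 W each, ten VMs per host.
pair = watts_per_vm(hosts=2, watts_per_host=400, vms=20)
print(f"pair of hosts: {pair:.0f} W per VM")

# ASSUMPTION: the five-host environment also averages 400 W per host.
for target in (30, 25):
    count = vms_for_target(hosts=5, watts_per_host=400, target_watts=target)
    print(f"{target} W per VM implies about {count:.0f} VMs on 5 hosts")
```

Under that assumption, the quoted 25 to 30 watts per VM corresponds to roughly 67 to 80 VMs across the five hosts.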
2. Re:Failure tolerance vs. failure prevention by TheRaven64 · 2008-05-31 01:21 · Score: 4, Interesting
  
  It depends on how much downtime costs you. If Google is down for five seconds, no one will notice - they will just assume that their link is slow, blame their ISP, and hit refresh. If a telecom's billing system or a bank's transactional system is down for five seconds then they are likely to lose a lot of money. The only difference between doing this kind of thing in hardware and software is the fail-over time and the cost. Google take a slower fail-over time in exchange for lower costs. For them and for 99.9% of businesses, it makes perfect sense. The remaining 0.1% are the reason IBM's mainframe division is so profitable.
  
  --
  I am TheRaven on Soylent News
Hard drive failures by pacroon · 2008-05-31 00:28 · Score: 2, Interesting

When looking at it on that massive scale, you really get the idea of just how fragile a hard drive is. I wonder how much money the new generations of data storage are going to cost large corporations like Google. Not to mention how existing corporations will handle it once those devices go from "super computers" to mainstream hardware.

--
It's all fun & games until someone loses the game.
Overheating and rewiring? by throatmonster · 2008-05-31 00:36 · Score: 4, Interesting

The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

--
All pass beyond reach of medicine. None pass beyond the reach of love.
1. Re:Overheating and rewiring? by mdenham · 2008-05-31 00:52 · Score: 2, Interesting
  
  Yeah, the overheating part could be solved by investing in more racks, and then putting half as many units on each rack.
  
  This also allows for future throughput improvements from a single unit, and probably would cost less than the two days' downtime every overheat (racks are relatively cheap, time isn't).
2. Re:Overheating and rewiring? by Anonymous Coward · 2008-05-31 01:23 · Score: 1, Interesting
  
  So when you increase the total number of racks you increase the amount of floor space required (thus increasing the rent on the property). This in turn increases the volume of air within the room, which needs more fans to move more cool air (which increases electrical costs). Hiring a few guys to run around and fix problems will definitely be cheaper than spreading out the machinery to allow better air flow.
It's the same everywhere, regardless of scale by Enleth · 2008-05-31 00:51 · Score: 3, Interesting

I've been managing a dorm network consisting of two "servers" (routing, PPPoE, some services like network printing etc.), a single industrial rack-mounted switch and dozens of consumer switches spread all over the building.

And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every - or every n-th, or every bigger than x - packet got mangled, misdirected or whatever. Or sometimes packets appeared just out of the blue (probably some partial leftovers from the cache) and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on a single one.

Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.

Regardless, the article is quite entertaining to read for a networking geek ;)

--
This is Slashdot. Common sense is futile. You will be modded down.
Software architecture, Not hardware by howardd21 · 2008-05-31 00:54 · Score: 3, Interesting

The fact that they attribute success to the software did not surprise me; the chunk and shard approach (not mentioned in the article) has been known for some time. But it was interesting that the GFS architecture works with BigTable and MapReduce, and that it handles many data/content types. What this creates is not only a structure that scales with volume but also a sustainable business model. As new content types are added, regardless of size or type, they can generally be indexed appropriately. I am looking forward to searching more within types like video and audio, or even medical records like X-rays or MRI results. The possibilities are staggering.

--
no comment
Hardware is cheap by Ritz_Just_Ritz · 2008-05-31 01:09 · Score: 3, Interesting

It's always going to be cheaper to use anthill labor on this type of problem. Even relatively powerful 1RU and .5RU servers are dirt cheap these days. Hell, I was able to buy a pile of .5RU machines for one of my projects this week. I can't believe how cheap things have gotten:

quad-core xeon @2.66ghz
4gb RAM
2 x 500gig barracudas (RAID1)
dual gigabit ether
CentOS 5.1
US$1100 per unit

They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.

Cheers,
Re:And the Network That Connects These Clusters? by arktemplar · 2008-05-31 02:19 · Score: 3, Interesting

Agreed, but their interconnect topology is what should be interesting, not just the hardware. With simple topologies there is a limit to how efficiently a system scales. I have been doing some work on parallel processing for supercomputers as my undergrad thesis, and believe me, the major thing that differs among the top 100 or so supercomputers is their interconnect topology, not just their hardware.

Also, their search algo is based on eigenvalues, I think, a very, very profitable algo to parallelize. What version of parallel libraries do they use?

--
blog plug -> The Darker Side of Light
Re:And the Network That Connects These Clusters? by dodobh · 2008-05-31 04:03 · Score: 2, Interesting

AFAIK, Google uses Force10 switches for the networking infrastructure. Details are confidential though. I learnt this from the Force10 salesguy convincing me to buy their hardware.

--
I can throw myself at the ground, and miss.
Jeff Dean is the smartest guy I've ever met by swb · 2008-05-31 04:33 · Score: 2, Interesting

He was my friend in high school and roommate in college for a year. Smartest guy I've ever met in my life, easily smarter than any other PhDs I've known, including people I know with Harvard post-med school doctorates.
Re:And the Network That Connects These Clusters? by Anonymous Coward · 2008-05-31 12:23 · Score: 1, Interesting

How much of Google's not caring about hardware has to do with the fact that Google doesn't have any reason to? It doesn't really matter if every single page on the planet is indexed, or a few million go missing here and there, or a few terabytes of data walk off; you can just crawl new copies and be on your merry way.

I agree entirely with the gist of the argument: software-based fault tolerance and scaling is a great thing, and any meaningful scaling of applications really must be done in an application-specific context (not data tiers or hardware) with good software design.

Having said that, at a very high level most businesses with datacenters can't afford to play fast and loose with their processing loads, where total availability of data is much more valuable to them than indexing some poor fool's web site halfway around the world. In many areas a partial return, or a return of wrong data on a query, is much, much worse than no response. For practical reasons, these differences tend to make more expensive hardware a better fit for a specific, practically *finite* task in the real world outside of Google.
Re:And the Network That Connects These Clusters? by Anonymous Coward · 2008-05-31 14:08 · Score: 1, Interesting

"How much of googles not caring about hardware has more to do with the fact that for google it doesn't have any reason to? It doesn't really matter if every single page on the planet is indexed or a few million go missing here and there or a few terrabytes of data walks off you can just crawl new copies and be on your merry way."

Not much, actually. You can't lose people's Gmail, Docs, images (Picasa), or their AdWords and AdSense accounts and ads. Ad clicks must be recorded with auditable durability, and account data for advertisers and for website publishers showing ads needs the same auditable durability. Also, people get angry if you lose their custom settings in any of myriad services. Don't make the mistake of thinking that running a huge search company means search is the only thing that needs to be done. There's a lot more that has to go on to support those systems, and some of them have pretty strict ACID requirements.

If you don't believe that cheap components can still make a reliable system, read the GFS paper. GFS ends up more fault-tolerant than a bunch of RAID arrays.

In the end, it all boils down to scale. If you have two servers, go ahead and buy the best ones you can. If you have 5 disks, put them in a RAID. If you have thousands however, duplication and replication on cheaper hardware is the way to go. Projects such as Hadoop mean you don't need to roll your own anymore either.
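
To make the replication argument concrete, here is a minimal sketch of how quickly the chance of losing access to every copy shrinks as replicas are added. The 1% unavailability figure is a made-up placeholder, and the model assumes replica failures are independent, which correlated events like the rack and power failures in the story would violate.

```python
def all_replicas_down(p_down: float, replicas: int) -> float:
    """Probability that every replica is unavailable at the same time,
    ASSUMING replica failures are independent of each other."""
    return p_down ** replicas


P_DOWN = 0.01  # HYPOTHETICAL: each cheap machine is unavailable 1% of the time

for r in (1, 2, 3):
    print(f"{r} replica(s): {all_replicas_down(P_DOWN, r):.0e}")
```

The independence assumption is the weak point: copies on the same rack or power distribution unit fail together, so where the replicas are placed matters as much as how many there are.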
Re:And the Network That Connects These Clusters? by dfj225 · 2008-05-31 16:40 · Score: 2, Interesting

TCP is slow and bulky if dropped packets are a very rare thing. Confirming delivery of every packet results in a lot of wasted communication for the vast majority.

My guess is that they use something else for internal communication. You can always recover from errors at the application level instead of forcing every packet to be confirmed.

TCP is great for general communication over the Internet and not so great for specialized cases where performance is important, like at Google.

--
SIGFAULT
Re:And the Network That Connects These Clusters? by Anonymous Coward · 2008-06-01 02:24 · Score: 1, Interesting

More nonsense. TCP returns incredibly quickly when there are dropped packets. The amount of data for ACKing packets on "perfect" networks is tiny relative to the volume being pushed. Go look at the documents on TCP performance; if you can get 900+Mbyte/sec on a 10Gig host, you might as well go home for the day with a job-well-done feeling.
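
A rough sketch of the ACK-overhead claim, for anyone who wants numbers: it compares the size of a pure ACK with full-size data packets at the IP layer. It assumes a 1500-byte Ethernet MTU, IPv4 and TCP headers without options, and ignores link-layer framing, so treat the result as an order-of-magnitude estimate only.

```python
MTU = 1500        # bytes; standard Ethernet MTU (assumed here)
IP_HEADER = 20    # IPv4 header without options
TCP_HEADER = 20   # TCP header without options
ACK_SIZE = IP_HEADER + TCP_HEADER  # a pure ACK carries no payload


def ack_overhead(segments_per_ack: int) -> float:
    """Ratio of ACK bytes sent back to data-packet bytes sent forward."""
    return ACK_SIZE / (segments_per_ack * MTU)


print(f"one ACK per segment:      {ack_overhead(1):.2%}")
print(f"one ACK per two segments: {ack_overhead(2):.2%}")  # delayed ACKs
```

Either way the ACK traffic is a few percent at most, and it flows in the opposite direction to the data on a full-duplex link.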

TCP's costs are ones I'm happy to pay any day. Even internally, I really doubt that Google would have implemented their own in-order reliable delivery protocol. They might tweak the stack details a little bit to get a bit more performance, but I really doubt they've implemented GoogleCP instead of TCP.