A Look At the Workings of Google's Data Centers
Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting:
"'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?
When looking at it on that massive scale, you really get the idea of just how fragile a hard drive really is. I wonder how much money the new generations of data storage is going to cost for large corporations like Google. And not to mention how existing corporations will handle it, once those devices goes from "super computers" to mainstream hardware.
It's all fun & games until someone loses the game.
The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.
All pass beyond reach of medicine. None pass beyond the reach of love.
I've been managing a dorm network consisting of two "servers" (routing, PPPoE, some services like network printing etc.), a single industrial rack-mounted swithch and dozens of consumer switches spread all over the building.
;)
And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every - or every n-th, or every bigger than x - packet got mangled, misdirected or whatever. Or sometimes packets appeared just out of the blue (probably some partial leftovers from the cache) and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on a single one.
Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.
Regardless, the article is quite entertaining to read for a networking geek
This is Slashdot. Common sense is futile. You will be modded down.
The fact that they attribute success to the software did not surprise me; the chunk and shard (not mentioned in the article) approach has been known for some time. But the fact that the GFS architecture works with BigTable and MapReduce was interesting, and that it handles many data/content types. What this creates is not only a scalable structure volume size, AND a sustainable business model. As new content types are added, regardless of size or type, they can generally be indexed appropriately. I am looking forward to searching more within types like video and audio, or even medical records like xRays or MRI results. The possibilities are staggering.
no comment
It's always going to be cheaper to use anthill labor on this type of problem. Even relatively powerful 1RU and .5RU servers are dirt cheap these days. Hell, I was able to buy a pile of .5RU machines for one of my projects this week. I can't believe how cheap things have gotten:
quad-core xeon @2.66ghz
4gb RAM
2 x 500gig barracudas (RAID1)
dual gigabit ether
CentOS 5.1
US$1100 per unit
They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.
Cheers,
Agreed, but their interconnect topology is what should be interesting not just the hardware, after all with simple topologies etc., there is a limit to how it scales efficiently, I have been doing some work on parallel processing for supercomputers as my undergrad thesis and believe me the major thing that differs amongst the top some 100 super computers is their interconnect topology not just their hardware.
Also, their search algo is based on eigen values I think, a very very profitable algo to parallelize. what version of parallel libraries do they use ?
blog plug -> The Darker Side of Light
AFAIK, Google uses Force10 switches for the networking infrastructure. Details are confidential though. I learnt this from the Force10 salesguy convinving me to buy their hardware.
I can throw myself at the ground, and miss.
He was my friend in high school and roommate in college for a year. Smartest guy I've ever met in my life, easily smarter than any other PhDs I've known, including people I know with Harvard post-med school doctorates.
How much of googles not caring about hardware has more to do with the fact that for google it doesn't have any reason to? It doesn't really matter if every single page on the planet is indexed or a few million go missing here and there or a few terrabytes of data walks off you can just crawl new copies and be on your merry way.
I agree entirely with the jist of the argument software based fault tolerance and scaling is a great thing and any meaningful scaling of applications really must done in an application specific context (not data tiers or hardware) with good software design.
Having said that at a very high level most businesses with datacenters can't afford to play fast and loose with their processing loads where total avaliability of data is much more valuable to them than indexing some poor fools web site half way around the world. In many areas partial return or return of wrong data on a query is much much worse than no response. For practical reasons the differences in realities tends to make more expensive hardware a better fit for a specific practically *finite* task in the real world outside of google.
If you don't believe that cheap components can still make a reliable system, read the GFS paper. GFS ends up more fault-tolerant than a bunch of RAID arrays.
In the end, it all boils down to scale. If you have two servers, go ahead and buy the best ones you can. If you have 5 disks, put them in a RAID. If you have thousands however, duplication and replication on cheaper hardware is the way to go. Projects such as Hadoop mean you don't need to roll your own anymore either.
TCP is slow and bulky if dropped packets are a very rare thing. Confirming delivery of every packet results in a lot of wasted communication for the vast majority.
My guess is that they use something else for internal communication. You can always recover from errors at the application level instead of forcing every packet to be confirmed.
TCP is great for general communication over the Internet and not so great for specialized cases where performance is important, like at Google.
SIGFAULT
More nonsense. TCP returns incredibly quickly when there are dropped packets. The amount of data for ACKing packets on "perfect" networks is tiny relatively to the volume being pushed. Go look at the documents on TCP performance, if you can get 900+Mbyte/sec on a 10Gig host, you might as well go home for the day with a job well done feeling.
TCP's costs are ones I'm happy to pay any day. Even internally, I really doubt that Google would have implemented their own in-order reliable delivery protocol. They might tweak the stack details a little bit to get a bit more performance, but I really doubt they've implemented GoogleCP instead of TCP.