Slashdot Mirror


When the Power Goes Out At Google

1sockchuck writes "What happens when the power goes out in one of Google's mighty data centers? The company has issued an incident report on a Feb. 24 outage for Google App Engine, which went offline when an entire data center lost power. The post-mortem outlines what went wrong and why, lessons learned and steps taken, which include additional training and documentation for staff and new datastore configurations for App Engine. Google is earning strong reviews for its openness, which is being hailed as an excellent model for industry outage reports. At the other end of the spectrum is Australian host Datacom, where executives are denying that a Melbourne data center experienced water damage during weekend flooding, forcing tech media to document the outage via photos, user stories and emails from the NOC."

135 comments

  1. Nothing to see here people. by Anonymous Coward · · Score: 0

    Google was open. What exactly is the issue?

    1. Re:Nothing to see here people. by teknopurge · · Score: 1

      The fact there was one.

  2. Really Quiet by Anonymous Coward · · Score: 0

    It gets really quiet.

  3. what about having people onsite? by alen · · Score: 2, Insightful

    aren't there any people in the data center to tell them that yes there has been a power outage, so and so machines are affected, etc? sounds like all they have is remote monitoring and if something happens then someone has to drive to the location to see what's wrong

    1. Re:what about having people onsite? by Anonymous Coward · · Score: 0

      You are thinking too small-scale. Of course there are people on-site. Google has data centers all over the world -- how are they going to drive there?

    2. Re:what about having people onsite? by dch24 · · Score: 1, Troll

      What I want to know is, what caused the outage?

      The post on the google-appengine group details all the things they did wrong and are going to fix, after the power went out. Fine, I have to plan for outages too. But what caused the unplanned outage?

    3. Re:what about having people onsite? by Anonymous Coward · · Score: 1, Funny

      You are thinking too small-scale. Of course there are people on-site. Google has data centers all over the world -- how are they going to drive there?

      http://en.wikipedia.org/wiki/DUKW

      'nuff said.

    4. Re:what about having people onsite? by Anonymous Coward · · Score: 0

      No silly, those are for trans-oceanic Street View.

    5. Re:what about having people onsite? by nedlohs · · Score: 3, Insightful

      Who cares?

      Power failures are expected, what you can do is have plans for when they occur - batteries, generators, service migration to other sites, etc, etc. Those plans (and the execution of them) are what they had problems with.

    6. Re:what about having people onsite? by Anonymous Coward · · Score: 0

      Google doesn't use traditional data centers. They build theirs out of modules constructed from shipping containers. cf. Google Data Center Video, data center secrets revealed.

      So, remote monitoring, and then someone goes to check the module the alarm came from. They may have to walk 100 meters to get to the module, though.

    7. Re:what about having people onsite? by hedwards · · Score: 2, Insightful

      My parents once lost power for several hours because a crow got fried in one of the transformers down the street. People around here lose power from time to time when a tree falls on a line. Unplanned power outages are going to happen. Even though line reliability is probably higher now than at any time in the past, it still happens and companies like Google that rely upon it being always there should have plans.

      This isn't just about keeping the people that use Google services informed, this is an admission that there's something to fix and that they're going to fix what they can. There isn't any particular reason why they need to disclose such plans beyond being a huge player and not wanting to scare away the numerous people that count on them for important work.

    8. Re:what about having people onsite? by afidel · · Score: 1

      No, the question is why did the end users *see* the power outage? I would guess Google's insistence on using cheap motherboards with local battery and non-redundant PSUs bit them in the butt here. In a properly designed and maintained datacenter the loss of main power and a single generator won't take out a single server or piece of networking gear, but Google has gone with the RAED (redundant array of expensive datacenters) model instead of the traditional dual PSU, dual PDU, dual UPS, dual generator with redundant data paths setup typical of an HA datacenter.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    9. Re:what about having people onsite? by dave562 · · Score: 0, Flamebait

      This just goes to show that Google is as "incompetent" as anyone else. There was a discussion on here the other day and a poster asked why Microsoft, with all of their resources, hasn't come up with a secure OS yet. It was suggested that the know-how to create such an OS is out there, and it would just take money and will on Microsoft's part. This seems like the Google equivalent.

      Google is trying to push Apps as a replacement for Exchange and Office. They are trying to push it as a replacement for hosting in house. I steered my organization away from Apps for the time being because I wasn't impressed with their support and there are a whole slew of other people who feel like they are being jerked around by Google for what should be simple support issues. It is not reassuring that Google hasn't gotten high availability down yet for one of their flagship products. I'm glad that they are being transparent about where they screwed up, but come on now, really? They haven't figured out fail-over yet? This is Google, the multi-hundred billion dollar organization. They can't fail-over one of their core offerings?

    10. Re:what about having people onsite? by vakuona · · Score: 3, Insightful

      Cheap doesn't mean not properly designed! Google doesn't do redundancy on a micro scale. For them it's pointless. In fact, from what I know, Google knows their hardware will fail, so they have written their software to handle hardware failures gracefully. When something like this happens, they write a report, and get someone to work out a fix so that the outage doesn't recur.

    11. Re:what about having people onsite? by afidel · · Score: 1

      Yes, but from the description given in the report, Google App Engine works much more like a traditional application stack, with manual datacenter failover and asynchronous data replication (with slower sync data replication available real soon now). It's not the stateless application that typifies Google's larger datacenter experience, and it calls for a much more traditional HA setup to give better than 3-4 9's of uptime.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    12. Re:what about having people onsite? by RockWolf · · Score: 1

      My parents once lost power for several hours because a crow got fried in one of the transformers...

      Hah - payback for the LHC baguette incident. One-all now, Mother Nature!

      --
      February 9th, 2009 8:55pm: Slashdot becomes self-aware.
  4. Oh, my lifestream by bigredradio · · Score: 1

    My lifestream was interrupted and I didn't even notice! (see http://tech.slashdot.org/story/10/03/08/0024205/Time-To-Take-the-Internet-Seriously for reference)

  5. An "Incident"? by No+Lucifer · · Score: 0, Offtopic

    Jack must've forgotten to enter the code...

    1. Re:An "Incident"? by GNUALMAFUERTE · · Score: 1

      So, rewatching season 2? The addiction is terrible. I recommend a dose of Flashforward ...

      --
      WTF am I doing replying to an AC at 5 A.M on a Friday night?
  6. Isn't this part of their SLA? by HerculesMO · · Score: 0, Troll

    I thought that contracts required Google to disclose the cause and time of their downtime, and this disclosure is part of that.

    Right now though, Google is making Microsoft look like they have better uptime for SaaS.

    --
    The price is always right if someone else is paying.
    1. Re:Isn't this part of their SLA? by hedwards · · Score: 2, Insightful

      That's the downside: any time you acknowledge a mistake, you look like you make more of them than the idiots who make hundreds of mistakes and don't disclose them until caught.

    2. Re:Isn't this part of their SLA? by eth1 · · Score: 1

      Depends on who's doing the shopping.

      If you're looking for a serious hosting facility, then incident response should be one of the things you look at. If they haven't had an incident*, then you have no idea how they'll handle it when (not if!) one happens. They can hand you all the documentation in the world, but that can't speak to execution.

      * that they've admitted to

  7. Read the comments by RaigetheFury · · Score: 5, Insightful

    I pity EvilMuppet. Guy is a tool. There are contractual agreements that are in place to prevent pictures, aka the "rules" but when the data center blatantly LIES they are breaking the trust and violating the agreement. Case Law exists where contracts can be violated when one accuses the other of violating said contract.

    That's what happened. The data center was lying about what happened to avoid responsibility for the equipment it was being paid to host. Pictures were taken and are being used to prove the company did violate the trust of the contract.

    You can argue the semantics and legality of it but if this goes to court the pictures will be admissible and the data center will lose.

    1. Re:Read the comments by Anonymous Coward · · Score: 0

      EvilMuppet might just be the DC's SockPuppet :)

    2. Re:Read the comments by houghi · · Score: 1

      An interview with him from a previous 'non-event' : http://www.youtube.com/watch?v=WcU4t6zRAKg

      --
      Don't fight for your country, if your country does not fight for you.
    3. Re:Read the comments by 1_brown_mouse · · Score: 1

      I love those guys.

      They did a comedy show building up to the 2000 Sydney Olympics.

      http://en.wikipedia.org/wiki/The_Games_(Australian_TV_series)

      Spawned "the Office" style of pseudo documentary. Excellent show.

      Search: clarke dawe "the games" on youtube to see some clips.

    4. Re:Read the comments by DragonWriter · · Score: 1

      There is no such thing as case law.

      Yes, there is.

      There is legal precedent, which is not law.

      Whether and to what extent legal precedent is binding and, as such, "law" depends on the legal system; there are some legal systems in which it is not, and some in which, in specific circumstances, it is. The latter is true of the UK and many former British colonies, including, e.g., the USA, Canada, Australia, among others.

      Judges and lawyers who follow precedent are lazy, spineless fucks.

      Judges who follow precedent that is binding on them are doing their job.

      Lawyers who fail to recognize and cite applicable precedent are doing poor service to their client, and are potentially, in extreme cases, liable for malpractice.

    5. Re:Read the comments by sexconker · · Score: 0

      Did you vote on that precedent?
      Did your representatives pass that precedent?
      Is that precedent in the Constitution?

      Precedent is not binding.
      Law is.
      There is a difference.

      The fact that people like you believe precedent has any legal weight to it is disgusting.

    6. Re:Read the comments by spartacus_prime · · Score: 1

      If not for precedent, our legal system would collapse. The reason precedent exists is so that judges (who have a better idea of what the law is than the representatives and senators in Washington) have something to be guided by, rather than just pulling something out of their ass and calling it law.

      In a system without precedent, you could theoretically have judges decide cases based on the individual facts of each case, rather than on what the law says. You would have judges throwing the book at defendants that rub them the wrong way, or judges ruling a particular way because they woke up on the wrong side of the bed that day.

      Yes, I knew you were trolling, but you'll have to do better than that.

      --
      If you can read this, it means that I bothered to log in.
    7. Re:Read the comments by Th3+3vil+Mupp3t · · Score: 1

      Looking over the contract we have with Datacom, you'd be hard pressed to have the Managing Director's statements be material in affecting a contract violation. The fact that the photos were taken well before any statement was made to the public by a Datacom representative takes at least some of the basis away from your argument of trust.

      As for evidence, colleagues of mine have damaged equipment and I have remote monitoring, MRTG graphs and other means of validating facts. How do you think that a particular public statement can absolve a provider of responsibility or compliance to a contract precisely?

      The pictures being admissible is another matter altogether - one I'm definitely not qualified to speculate on!

      I'm definitely not employed by Datacom, and the fact that I've had to alter my work practices based entirely on photos such as these being published in the past is part of the basis for my contention regarding the issue of these specific photographs being taken. I'm very much not alone in this.

    8. Re:Read the comments by Th3+3vil+Mupp3t · · Score: 1

      Not quite. My evil overlord is another!

    9. Re:Read the comments by Anonymous Coward · · Score: 0

      Plus he makes two contradictory arguments.

      First, he states that since you're not allowed photography in the datacentre, the actions of the person posting the images have meant that he and his colleagues:

      "...had to make major changes to their work practices (and a number of us have had major production-level impacts as a result) due to no mobiles being allowed on the data floor."

      Now, either the contract he's so obsessive about the observance of contains a clause forbidding photographic equipment of any kind on the floor (inc. mobiles with cameras) or it doesn't, and in the case of the latter, there's no good reason why he can't take his mobile into the building with him - unless he's abiding by contracts that don't even exist now.

  8. title should read "Google App Engine NOT a Cloud" by Anonymous Coward · · Score: 1, Funny

    Obviously if the power goes out, and the service goes offline, then it WASN'T a cloud. If it's a cloud, it can't go down. If it goes down, it wasn't a cloud.

    What's there to get?

  9. Re:and what about openess during the incident? by theIsovist · · Score: 3, Funny

    Glenn Beck, is that you!?

  10. They had a perfect contingency plan for this case by juanjux · · Score: 5, Funny

    ...but it was stored on Google Docs.

  11. Re:title should read "Google App Engine NOT a Clou by Anonymous Coward · · Score: 1, Insightful

    Even a cloud isn't effective if all the nodes go down, it's not magic.

  12. Significantly higher latency? by nacturation · · Score: 2, Interesting

    A new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency

    Anyone know some numbers around what "significantly higher latency" means? The current performance looks to be about 200ms on average. Assuming this higher availability model doesn't commit a DB transaction until it's written to two separate datacenters, is this around 300 - 400ms for each put to the datastore?

    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
    1. Re:Significantly higher latency? by sexconker · · Score: 0

      No, because the writes should happen in parallel.
      No need to write, confirm, write, confirm, commit.

      Just write both, confirm both, commit.

      Then of course, you have to commit twice. And commit your commits.

      And get a receipt for your husband.
      And give the nice government man a receipt for the receipt you received.

    2. Re:Significantly higher latency? by DragonWriter · · Score: 1

      Anyone know some numbers around what "significantly higher latency" means?

      I suspect not, since the feature hasn't been implemented yet.

    3. Re:Significantly higher latency? by nacturation · · Score: 1

      No, because the writes should happen in parallel.

      Correct, the writes should happen roughly in parallel but for the offsite datacenter there will be the latency of the round-trip to send the data and receive the commit confirmation. So latency to the offsite datacenter will need to be factored in plus any overhead involved with managing simultaneous writes to geographically disparate datastores. That's my best guess as to why they said "significantly higher latency".
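
      A rough sketch of that reasoning in Python. Only the ~200ms average put time comes from this thread; the 80ms round trip, the remote write time and the 20ms coordination overhead are made-up numbers, and the function names are hypothetical, not anything from the App Engine API.

```python
def parallel_commit_latency_ms(local_write_ms, remote_write_ms,
                               round_trip_ms, coordination_ms=0.0):
    """Commit waits for BOTH replicas, but the two writes are issued at once.

    The remote path costs one network round trip plus the remote write;
    the commit completes when the slower of the two paths has confirmed.
    """
    remote_path_ms = round_trip_ms + remote_write_ms
    return max(local_write_ms, remote_path_ms) + coordination_ms


def sequential_commit_latency_ms(local_write_ms, remote_write_ms,
                                 round_trip_ms, coordination_ms=0.0):
    """Write and confirm locally first, then write and confirm remotely."""
    return local_write_ms + round_trip_ms + remote_write_ms + coordination_ms


# Assumed figures: 200 ms per write (from the thread), 80 ms round trip
# and 20 ms coordination overhead (both invented for illustration).
parallel = parallel_commit_latency_ms(200, 200, 80, coordination_ms=20)
sequential = sequential_commit_latency_ms(200, 200, 80, coordination_ms=20)
print(f"parallel: {parallel} ms, sequential: {sequential} ms")
```

      Under those assumptions the parallel commit pays roughly one extra round trip plus overhead instead of doubling, which fits the guess that inter-datacenter latency is what makes it "significantly higher".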

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  13. Don't they have by Anonymous Coward · · Score: 0

    UPSs and backup generators? or some other onsite emergency power? (wind turbine, batteries, bunch of illegals on treadmills, etc.)

    1. Re:Don't they have by johnncyber · · Score: 5, Informative

      Dude RTFA (I know, I know, shame on me). The backup generators kicked in, but 25% of the machines in the data center did not receive power before crashing.

    2. Re:Don't they have by ElectricTurtle · · Score: 1

      lol no UPS = fail

      --
      I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
    3. Re:Don't they have by lucifuge31337 · · Score: 1

      lol you = don't know how datacenters work

      --
      Do not fold, spindle or mutilate.
    4. Re:Don't they have by Glendale2x · · Score: 1

      Actually no, Google doesn't use UPS systems if this is one of their designs that uses one small sealed lead acid battery per server.

      --
      this is my sig
    5. Re:Don't they have by Critical+Facilities · · Score: 1

      I have to say, I think ElectricTurtle is right. If the generators came online as they're claiming, how could it be that 25% of the load dropped during transfer?? There's more to this story than is being told, and instead, they're focusing on how they came back online rather than why they went offline in the first place. I'd be willing to bet you that heads are rolling behind closed doors. If there were properly functioning UPSs in the building (either the large ones or the server-mounted batteries Google sometimes likes), then there shouldn't have been any outage on the transfer to generator.

    6. Re:Don't they have by Critical+Facilities · · Score: 1

      I've heard a few rumors that they're re-thinking this strategy. I'm betting this event might keep those conversations going.

    7. Re:Don't they have by ElectricTurtle · · Score: 0, Troll

      Yeah, I just imagined working in raised-floor, climate-controlled rooms. You don't know shit about me, nor could you from a four word drive-by. You just want to put people down because it does something for you. That behavior demonstrates that you are a pitiful excuse for a decent human being, congratulations! Piss off.

      --
      I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
    8. Re:Don't they have by lucifuge31337 · · Score: 1

      You are also assuming that all datacenters have and need UPSes. This is simply not the case. More and more facilities are going to flywheel generators as maintaining batteries for transfer time between mains and generator power is insanely expensive in floor space, labor, and replacement costs. Nothing in any of the linked content says what kind of generators they have, or anything about a UPS. Based on the simple fact that Google can afford and makes it a priority to hire top-notch talent and build things the right way, are you really telling me that you believe you and ElectricTurtle are smarter than the combined brainpower set loose by Google for building and maintaining this facility?

      --
      Do not fold, spindle or mutilate.
    9. Re:Don't they have by lucifuge31337 · · Score: 1

      That's a nice try at another troll.

      You demonstrated that you don't know enough about modern data center design based on your 4 word comment. No further information was necessary.

      Plenty of people who have worked in data centers wouldn't know this, so the fact that you may have worked in one is a moot point.

      See the reply to the guy who also doesn't know this stuff that was trying to stick up for you. http://slashdot.org/comments.pl?sid=1575066&cid=31403320

      --
      Do not fold, spindle or mutilate.
    10. Re:Don't they have by Critical+Facilities · · Score: 3, Interesting

      First of all, the "flywheel generators" you're referring to are actually either standalone UPS systems or a part of a DRUPS (Diesel Rotary UPS). Here is some information on one of the leading manufacturers of such equipment.

      However, all of this is moot, since even if they had a flywheel setup as you're speculating, it still doesn't explain why 25% of the floor went down. If the equipment was installed, maintained and loaded properly, they should've been able to get to the generators with no problem.

      are you really telling me that you believe you and ElectricTurtle are smarter than the combined brainpower set loose by Google for building and maintaining this facility?

      No, I'm telling you that I manage a data center, and I know first hand how they work (or in this case, should work). I fail to see an adequate explanation of how this was unavoidable.

    11. Re:Don't they have by ElectricTurtle · · Score: 1

      You would do better to see his reply to your reply. He's already putting you in your place so well that any similar effort by me would be redundant.

      --
      I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
    12. Re:Don't they have by Vancorps · · Score: 1

      The argument is simply that going without adequate battery power to handle transfer switching is asinine and you seem to think that's normal data-center behavior. You would be the only one that thinks that would be proper redundancy, and all the data-centers I'm in have battery backed transformers to handle the load while they switch to alternate power.

      The most expensive data center I'm in even goes so far as to have an hour of battery time to handle generator failures during a power outage.

      ElectricTurtle and Critical Facilities both have comments that mirror my own experience and echo every data center best practice. People without this power are asking for problems. Google tried something against best practice and despite us individually not having more brain power than Google, collectively the likes of IBM, Microsoft, and every other large corporation with many large data centers have come to this conclusion. Many (and I'm looking at those lovely Texas data centers) keep trying to buck the best practice and, surprise surprise, it bites them in the ass.

      That said, Google has a great track record so I'm not going to call any of their practices into question. It sounds like the event was mishandled and that's why there was a service outage. Sometimes events are mishandled due to unforeseen circumstances or because someone didn't have their morning cup of coffee. That's why companies do post-mortems, and the fact that Google was so open about it is a good sign that the same situation won't lead to another outage, which is what matters given their stellar uptime.

    13. Re:Don't they have by DragonWriter · · Score: 2, Insightful

      There's more to this story than is being told, and instead, they're focusing on how they came back online rather than why they went offline in the first place.

      That's because they are focussing on what went wrong. Power losses, including ones that take down the whole data center, are accepted risks and part of the reason they have redundant data centers and failover procedures.

      The failure wasn't that they had a partial loss at a datacenter. The failure was that the impact of that loss wasn't mitigated properly by the systems that were supposed to be in place to do that.

    14. Re:Don't they have by jbengt · · Score: 1

      What I inferred was that the real problem wasn't that they failed - complete failure they would have recovered from. Unfortunately, they did not understand what their state was when only some of them failed, and did not figure out how to recover.

    15. Re:Don't they have by Critical+Facilities · · Score: 1

      Power losses, including ones that take down the whole data center, are accepted risks and part of the reason they have redundant data centers and failover procedures. The failure wasn't that they had a partial loss at a datacenter. The failure was that the impact of that loss wasn't mitigated properly by the systems that were supposed to be in place to do that.

      I must respectfully disagree. Power losses that take down the whole data center are definitely NOT accepted risks. The entire reasoning for spending millions upon millions of dollars to have UPS systems, Static Switches, Automatic Throwover Switches, Diesel Generators with thousands of gallons of fuel, etc isn't because you think downtime is acceptable, it's because downtime is not an option.

      We almost agree though. I do agree that the failure was in improper mitigation of the risk as opposed to mitigation of the outage once it happened. There is no reason given that explains why 25% of their floor(s) went down, and in a properly run data center (at least a Tier 3 or higher), there is no reason any of the Critical Load should ever go down.

    16. Re:Don't they have by awyeah · · Score: 1

      I was under the impression that Google's servers all had small individual batteries in each chassis to provide power during generator spin-up in lieu of full-on UPSs. Maybe some of them didn't last as long as they were supposed to? Or maybe the generator took longer to warm up than it should have?

      --
      Why, no, I haven't meta-moderated lately. Thanks for asking!
    17. Re:Don't they have by Anonymous Coward · · Score: 0

      You're still not thinking at Google scale. Your mistake is presuming that a single data center is vital to Google. If Google has many other data centers that can happily take traffic away from the bad one, why would you spend a lot more money trying to get another "9" of availability for that single data center? Could it be that Google views the loss of an entire single data center as an acceptable risk?
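
      A back-of-the-envelope version of that trade-off, sketched in Python. It assumes site failures are independent and that failover between sites is instant and flawless, which is exactly the part that went wrong in this incident. The availability figures are invented for illustration, not Google's numbers.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(single_site, sites):
    """Availability of `sites` independent sites where any one can serve.

    The service is down only when every site is down at the same time,
    so the unavailabilities multiply.
    """
    return 1.0 - (1.0 - single_site) ** sites


# Hypothetical comparison: one heavily hardened site versus two cheaper ones.
hardened = 0.9999                            # assumed "four nines" single site
cheap_pair = combined_availability(0.99, 2)  # two assumed "two nines" sites

print(f"hardened site: {downtime_hours_per_year(hardened):.2f} h/year down")
print(f"cheap pair:    {downtime_hours_per_year(cheap_pair):.2f} h/year down")
```

      On paper the two approaches come out about even, which is the argument for treating a single site as expendable. The catch is that the formula only holds if the failover actually works.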

    18. Re:Don't they have by Critical+Facilities · · Score: 1

      Yeah, I've seen those reports, and I am curious as to whether or not they have the server-mounted batteries at all of the data centers, or just some of them. You and I are thinking right along the same lines. If indeed they did have the server-mounted batteries, I'd be curious to know why they didn't hold all the load. It seems to me that either the server-mounted battery strategy is less reliable than traditional UPS Systems, or as you suggested, perhaps something happened with the generators (those generators should have spun up and been carrying load within 15 seconds). Either way, something didn't work as designed, and TFA doesn't touch on any of it.

    19. Re:Don't they have by awyeah · · Score: 1

      Yep. I mean, as it's been stated in other comments, I think Google's way of hedging its bets is to have redundant data centers, so I think they correctly focused on the procedural issues.

      However... as a current programmer and former IT guy, I'd like to know more about what caused the failures in the first place.

      --
      Why, no, I haven't meta-moderated lately. Thanks for asking!
  14. App Engine down again? by bjourne · · Score: 2, Insightful

    App Engine must be Google's absolutely most poorly run project. It has been suffering from outages almost weekly (the status page doesn't tell the whole truth unfortunately), unexplainable performance degradations, data corruption (!!!), stale indexes and random weirdness for as long as it has been run. I am one of those who tried for a really long time to make it work, but had to give up despite it being Google and despite all the really cool technology in it. I pity the fool who pays money for that.

    The engineers who work with it are really helpful and approachable both on mailing lists and irc, and the documentation is excellent. But it doesn't help when the infrastructure around it is so flaky.

  15. ISO9001 by Anonymous Coward · · Score: 1, Insightful

    This should be standard practice... It's like the good bits of ISO9001 with a bit more openness. When done right, ISO9001 is a good model to follow.

  16. the worst nightmare of data center peeps by filesiteguy · · Score: 3, Interesting

    i don't run a data center, but manage systems that rely on the data center 18 hrs/day 6 days/week. we pass upwards of $300m through my systems. I've yet to get a satisfactory answer as to exactly what would happen if - say - a water line breaks and floods all the electrical (including the dual redundant UPS systems) in the data center.

    1. Re:the worst nightmare of data center peeps by SmilingBoy · · Score: 2, Informative

      First, your servers will shut down ungracefully, and then they will be destroyed with little chance of recovery. You will then have to rebuild your systems and restore the data from the offsite backup. This will of course take time. If this is too much of a risk, you should run an alternate datacentre mirroring your primary databases that can go live within minutes.

    2. Re:the worst nightmare of data center peeps by mjwalshe · · Score: 1

      switch to the alternate DC - I worked for BT and they set up an alternate DC across town for Telecom Gold just in case the Thames flooded

    3. Re:the worst nightmare of data center peeps by Anonymous Coward · · Score: 0

      To summarize: you won't get much sleep for the next few weeks and your bosses will find a way to throw you under the bus for not "covering your bases".

    4. Re:the worst nightmare of data center peeps by Hurricane78 · · Score: 1

      Well, I’m no expert, but it’s not very hard to get a building water tight, now is it?

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    5. Re:the worst nightmare of data center peeps by Anonymous Coward · · Score: 0

      it is really hard:
      1-water can slowly eat the concrete away
      2-water can go through microscopic openings
      combine 1 and 2 and you've got water infiltration

    6. Re:the worst nightmare of data center peeps by afidel · · Score: 1

      *across town*!? Hmm, here in the states best practice (and legal requirements for certain industries) requires significantly more distance than that between DCs. Ours is just inside of reasonable driving range (6 hours) but is on a different power grid, different core services from our Tier-1 ISP, etc.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    7. Re:the worst nightmare of data center peeps by FlexAgain · · Score: 1

      Across town could be 20 miles away in London. On the other side of the Thames is very likely to have its power and data coming from completely independent systems, even a different power station and over a different part of the national grid.

      Since BT was historically the only telecoms provider, even now they are plenty big enough to easily be in a position to have multiple independent data feeds, and if they all fail, nothing else in the capital is working anyway, so a DC's survival would be a minor issue.

      A six hour drive from London going North would almost put you in Scotland, and in the other direction, you would have run out of land, and be well on your way to Paris if you crossed the Channel.

      --
      Actually it is rocket science...
    8. Re:the worst nightmare of data center peeps by Eil · · Score: 1

      I've yet to get a satisfactory answer as to exactly what would happen if - say - a water line breaks and floods all the electrical (including the dual redundant UPS systems) in the data center.

      Simple: the power equipment gets an unscheduled watering and your servers go down.

      If you want to minimize the impact that a disaster can wreak on your servers in a datacenter, then you need to have your entire setup running and synchronously replicated in another datacenter.

    9. Re:the worst nightmare of data center peeps by filesiteguy · · Score: 1

      Funny you mention that. I've been trying to get two solutions going. (Remember, I have zero actual power over server budgets other than recommendations.)

      I have set up all servers under my responsibility in VMs (using VirtualBox) and am ready to deploy on a minimum of servers with only databases available. (I have roughly 3 TB of data and about 22 TB of images.)

      I've been patiently standing by, waiting for a data center agreement to be formalized, whereby we'll have a hot-site setup in a center about twenty miles away. (There are multiple DS3 and OC3 lines between my office and the remote data center.)

    10. Re:the worst nightmare of data center peeps by Anonymous Coward · · Score: 0

      Not necessarily hard, but try steering clear of patents

    11. Re:the worst nightmare of data center peeps by mjwalshe · · Score: 1
      well the main point there was to mitigate against the Thames Barrier failing.

      Though our network designer did comment about the slow 10M link we had between the two DCs. His comment (this is mid-80s) was "don't knock it, I had Oxford Street dug up for that"

    12. Re:the worst nightmare of data center peeps by mjwalshe · · Score: 1

      oh it was just the Telecom Gold (Dialcom) service. Proper phone networks have orders of magnitude more redundancy.

  17. Re:title should read "Google App Engine NOT a Clou by Anonymous Coward · · Score: 1, Funny

    Whoosh.

  18. "no online database will replace your daily news" by SlappyBastard · · Score: 1

    OMFG! There's swinging at an outside pitch and there's trying to hit one that was thrown in the fuckin' stands!!

    --
    I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
  19. Huh? by SlappyBastard · · Score: 2, Informative

    How did I end up in this article? Ah!!!

    --
    I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
  20. Generators plus UPS FTMFW by Anonymous Coward · · Score: 2, Insightful

    Epic fail.

    Any data center worth its weight in dirt must have UPS devices sufficient to power all servers plus all network and infrastructure equipment, as well as the HVAC systems too, for a minimum of at least 2 full hours on batteries, in case the backup generators have difficulty in getting started up and online.

    Any data center without both adequate battery-UPS systems plus diesel (or natural gas or propane powered) generators is a rinky-dink, mickey-mouse amateur operation.

    1. Re:Generators plus UPS FTMFW by ElectricTurtle · · Score: 1

      Yeah, seriously. I worked for a mid-size company that had a very modest server farm (it was a retail-related business), and even we had everything switch to diesel at the instant the grid might go down. Since our switches were PoE, and our phones were VoIP, and our computers were laptops, it was like there was no power outage at all. We'd be on the phone with one of our stores and just say 'oh, the power went out, well, back to your issue...'

      It's hard to believe that freakin' Google wouldn't be at that level...

      --
      I support the Slashcott and will not be reading or commenting from 2/10/14 to 2/17/14. Beta is steaming pile of dog shit
    2. Re:Generators plus UPS FTMFW by mjwalshe · · Score: 1

      Quite. There's a comment somewhere else about how 365 Main was highly regarded, lol - if everything isn't running off the batteries 24/7 it ain't a real datacentre.

    3. Re:Generators plus UPS FTMFW by Anonymous Coward · · Score: 0

      Two full hours requires a MASSIVE battery capacity. It's far more feasible to count on 10-15 minutes from the batteries and make sure your generators start up promptly.
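
      To put rough numbers on that, here is a back-of-the-envelope sketch. The 1 MW load, 90% inverter efficiency and 80% usable battery capacity are made-up assumptions for illustration, not figures from Google or from anyone in this thread:

```python
def battery_kwh_needed(load_kw, minutes, inverter_efficiency=0.9, usable_fraction=0.8):
    """Nameplate battery energy (kWh) needed to carry load_kw for `minutes`.

    inverter_efficiency and usable_fraction are illustrative assumptions.
    """
    delivered_kwh = load_kw * minutes / 60.0
    return delivered_kwh / (inverter_efficiency * usable_fraction)


load_kw = 1000.0  # hypothetical 1 MW critical load
bridge = battery_kwh_needed(load_kw, 15)      # just long enough to start generators
two_hours = battery_kwh_needed(load_kw, 120)  # the "2 full hours" proposal
print(f"15 min: {bridge:.0f} kWh, 120 min: {two_hours:.0f} kWh, "
      f"ratio: {two_hours / bridge:.0f}x")
```

      Under this simple linear model, going from a 15-minute bridge to two hours multiplies the battery plant by eight before you even count floor space, cooling and replacement cycles.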

      Also, some of Google's datacenters (not sure if this is one of them) dispense with many centralized batteries in favor of building the battery into each server alongside the PSU. This avoids some issues with AC->DC->AC conversion, leaving them with just AC->DC at each server. I'm just speculating, but it's possible that generator startup went as planned and the 25% of servers that didn't survive the outage turned out to have too-short battery life on their local battery packs. Hard to verify battery performance without a live test...

    4. Re:Generators plus UPS FTMFW by Tynin · · Score: 3, Insightful

      You are so cute. I know very little about UPS systems, but when I was working in a datacenter that housed 5000 servers we had a two story room that was twice the size of most houses (~2000 sq ft) with rows and rows of batteries. I was told that in the event of a power outage, we had 22 minutes of battery power before everything went out. The idea of having enough for 2 hours would have been an interesting setup considering how monstrously large this one already was. Besides, I'm unsure why you'd ever need more than that 22min since that is plenty of time for our on site staff to gracefully power down any of our major servers if the backup generator failed to kick in.

    5. Re:Generators plus UPS FTMFW by Darth_brooks · · Score: 2, Funny

      Yeah, and when the guys at the Jesus Christ of Datacenters that you describe have to do something like, say, switch from generator to utility power manually, and the document that details that process is 18 months old and refers to electrical panels that don't exist anymore, you get what you had here. A failure of fail-over procedures. If the lowliest help desk / operator can't at least understand the documentation you've written, then you've failed.

      The only equipment failure listed is a "power failure." Granted, that can be as simple as "car hits a telephone pole and knocks out a chunk of the grid, leaving your office in the dark", which should be an easily survivable event. But how do you handle a failure like "50kva inline UPS shits the bed leaving nothing but a smoking chassis that no one wants to go anywhere near?" or "HVAC unit fails on christmas eve when only a skeleton staff is on duty and fills the raised floor with 8 inches of water, shorting everything within an inch of its life and making it impossible to bring any hosted services back online?"

      There's nothing like a little bit of "we had no idea these three or four unrelated circumstances could happen simultaneously" disaster porn to make you realize that A. Outage / DR / fail-over planning is more than just throwing money at stuff (UPS's, generators, redundant lines, etc) and B. No matter how good your plan is, it will never be 100% effective.

      --
      There are some people that if they don't know, you can't tell 'em.
    6. Re:Generators plus UPS FTMFW by Glendale2x · · Score: 1

      That's what I was thinking; the local battery design that was previously praised became the fault. A large central UPS can monitor and test its batteries more than just plugging an SLA battery into the DC side of a server power supply and patting yourself on the back for being a genius. A UPS gives more telemetry, too. How did Google monitor those individual batteries? Not all SLA batteries are perfect. Were they tested and maintained? I'm guessing "no" to both if 25% of the servers lost power before the generators started (probably 10 to 30 seconds, which isn't that much). How long was the generator start window? TFA doesn't say anything about that.

      Google fails to address what caused the outage (beyond that the power went out). I've read some comments here saying that's not important, just how they handled it afterward is important. I disagree; if their no-UPS design has some fundamental flaws in it, they should admit it and address it, even if that means going back to a traditional centralized UPS.

      --
      this is my sig
    7. Re:Generators plus UPS FTMFW by Pentium100 · · Score: 1

      Of course you can verify battery performance safely. My UPS has battery test (checks if the batteries can still be used, if it fails, batteries need replacing) and run time calibration (discharges batteries to 25% and monitors how long it took, based on that it can estimate how long will it be able to hold the load). The whatever system google is using should be able to check the batteries while power is on, so that you don't end up with batteries that have 20% of their original capacity when the power goes down.
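
      For what it's worth, the run time calibration boils down to arithmetic like the sketch below. It assumes runtime scales linearly with depth of discharge and inversely with load, which real lead-acid batteries only roughly follow, and the function names are made up for illustration, not any UPS vendor's API:

```python
def estimate_full_runtime_minutes(minutes_to_reach, remaining_fraction=0.25):
    """Extrapolate total runtime from a partial discharge.

    If it took `minutes_to_reach` to go from 100% down to `remaining_fraction`,
    a linear model says a full discharge takes proportionally longer.
    """
    used_fraction = 1.0 - remaining_fraction
    return minutes_to_reach / used_fraction


def estimate_runtime_at_load(full_runtime_minutes, calibration_load_w, new_load_w):
    """Scale the calibrated runtime to a different load (linear assumption)."""
    return full_runtime_minutes * calibration_load_w / new_load_w


full = estimate_full_runtime_minutes(30)          # 30 min to reach 25% charge
doubled = estimate_runtime_at_load(full, 300, 600)  # same battery, twice the load
print(f"full runtime at calibration load: {full:.0f} min, at double load: {doubled:.0f} min")
```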

    8. Re:Generators plus UPS FTMFW by DragonWriter · · Score: 1

      Any data center worth its weight in dirt must have UPS devices sufficient to power all servers plus all network and infrastructure equipment, as well as the HVAC systems too, for a minimum of at least 2 full hours on batteries, in case the backup generators have difficulty in getting started up and online.

      Google's setup appears to rely on the fact that they have redundant data centers, so failover to another data center addresses this problem. The problem here, as identified in their post-mortem, is that for training and other reasons, the fail over wasn't handled correctly.

      Since there are sources of data center failure that having UPS + Generator backup won't help with at all, for something like this redundant data centers are essential whether or not you use UPS + Generator backup. Once you have redundant data centers, this problem should be solvable with failover. So, I think Google's general approach was reasonable from the start, as are their plans (detailed in the post-mortem) to address the failure by addressing the training and other issues which prevented failover plans from being properly executed.

    9. Re:Generators plus UPS FTMFW by DragonWriter · · Score: 1

      Google fails to address what caused the outage (beyond that the power went out).

      This is false. Google details at some length the causes of the customer-facing outage. The power going out is an early problem, but it's not a particularly important issue because that's an accepted risk in their plans. The failure was in the fact that the procedures that are intended to prevent a power loss at a data centre from producing a customer-affecting outage had inadequate coverage of partial losses of power, and on top of that were not executed properly (in part due to inadequate training on the procedures, in part because of outdated and incomplete documentation of the procedures.)

      Google has redundant data centers for a reason -- so that anything that causes one to go down doesn't affect operations that rely on them. It's not a "failure" if a data center goes down due to one of the risks accepted in the design of the individual data center. It only becomes a failure if the redundancy doesn't work as intended. The reasons for that failure -- and Google's plans on dealing with them -- are addressed, at some length, in the published post mortem.

    10. Re:Generators plus UPS FTMFW by Richard_at_work · · Score: 2, Informative

      In your rush to criticise 'Microsoft land', you must have overlooked his closing statement regarding 'if the backup generator failed to kick in'.

      You cannot have uptime without power. A mains outage coupled with an unexpected generator failure *will* result in downtime - your decision now is whether you wish your servers to be shut down gracefully, or just have the rug pulled from under them and hours or days of potential angst as a result. Which is it?

      And before you suggest larger UPSes for longer protection, consider why you have both a generator and a UPS in the first place - UPSes cost a lot, they cost a lot to buy, and they cost a lot to maintain, and then they cost a lot to replace after only a few years. A generator in comparison costs a lot less all round.

    11. Re:Generators plus UPS FTMFW by Glendale2x · · Score: 1

      This is false. Google details at some length the causes of the customer-facing outage.

      I only sort of skimmed over TFA to get the big points, but if you can point out the part where they explain why 25% of the servers lost power, I'd appreciate it.

      --
      this is my sig
    12. Re:Generators plus UPS FTMFW by Anonymous Coward · · Score: 0

      You consider powering down major servers to be a good option? Smells like an opinion from microsoft land (where "planned downtime" counts as "uptime", and an "uptime" of 95% is "acceptable"...)

      Yeah, pretty much the best option when it is obvious all your servers are getting ready to fall over from a lack of power. This was ~9 years ago and we ran mostly an IRIX and System V shop but had a few NT4 and Win2k servers; there were some servers we wanted to be absolutely sure were sync'd when they went down. Even with 2 independent power feeds, a large UPS system, and a backup generator that was tested quarterly, we still had a total loss of power occur during a hurricane when our backup generator failed to start (Murphy's law was working long hours that week). After that we started renting a 2nd generator anytime a hurricane even looked at our State wrong. That was one of the few years we missed our 4 9's SLA with some customers.
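
      For anyone wondering how little slack a 4 9's SLA leaves, here is the arithmetic as a quick sketch. It assumes a 365-day year and counts every minute of downtime against the budget; real SLAs usually carve out exceptions:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(availability):
    """Minutes per year a service may be down and still meet `availability`."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, availability in [("95%", 0.95), ("99.9%", 0.999), ("99.99%", 0.9999)]:
    minutes = downtime_budget_minutes(availability)
    print(f"{label}: {minutes:.1f} minutes/year ({minutes / 60:.1f} hours)")
```

      At four nines the whole year's budget is under an hour, so a single total power loss during a hurricane is enough to blow it.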

    13. Re:Generators plus UPS FTMFW by Tynin · · Score: 1

      Bleh, somehow I posted that anon...

    14. Re:Generators plus UPS FTMFW by evilviper · · Score: 1

      My UPS has battery test (checks if the batteries can still be used, if it fails, batteries need replacing) and run time calibration (discharges batteries to 25% and monitors how long it took, based on that it can estimate how long will it be able to hold the load).

      Yes, but you HAVE A UPS (an APC Smart-UPS by the sound of it). Google does not. They have SLA batteries hard-wired into the PSUs of each individual server. Now, it would be possible for there to be special circuitry to do an online battery test inside the PSU, but considering Google's focus on extra cheap commodity equipment, and redundancy as the answer to all problems, I fully expect that they've forgone the extra circuitry needed for such features...

      The whatever system google is using should be able to check the batteries while power is on, so that you don't end up with batteries that have 20% of their original capacity when the power goes down.

      "should" is a strong word, and this could very well have been the issue.

      They may have decided that swapping batteries after X years, or Y power outage events is accurate enough to forego tests, and the number of exceptions (early failures) is small enough to make this a viable strategy. Doing spot-checks in addition to this would make this an even better strategy, still.

      Battery tests could be performed by automatically instructing various PDUs to power-down 1 out of X units, or racks, preferably in a round-robin, or least-recently tested first. Knowing that each battery should power each system for, say, 15 minutes, any which aren't up and running 10 minutes later should be pulled and replaced. Additionally, on-board circuitry monitoring voltage could be used as an indicator that the battery is likely at 100% capacity (over 12.5V) 50% capacity (~11V), or nearly dead (below ~10.5V).

      How often these tests take place would depend on how many servers Google is willing to put at risk at any one time, and perhaps even how much of an impact powering off/on a large number of systems at about the same time would have on their in-house circuitry, and upstream power provider...
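
      The rotating test described above could be sketched roughly like this. The voltage thresholds are the ballpark figures from this comment, and every name here is hypothetical; nothing in this sketch reflects how Google actually does it:

```python
def classify_battery(voltage):
    """Rough state-of-charge bucket for a 12V SLA battery (ballpark thresholds)."""
    if voltage > 12.5:
        return "likely full"
    if voltage < 10.5:
        return "nearly dead"
    return "partially discharged"


def pick_test_batch(last_tested, one_in):
    """Pick 1 out of every `one_in` units, least-recently tested first.

    last_tested maps unit id -> timestamp of its previous battery test.
    """
    batch_size = max(1, len(last_tested) // one_in)
    ordered = sorted(last_tested, key=lambda unit: last_tested[unit])
    return ordered[:batch_size]


def units_to_replace(minutes_survived, required_minutes=10):
    """Units whose battery did not keep them up for the required time."""
    return sorted(unit for unit, minutes in minutes_survived.items()
                  if minutes < required_minutes)


last_tested = {"rack-a": 30, "rack-b": 10, "rack-c": 20, "rack-d": 40}
batch = pick_test_batch(last_tested, one_in=2)
print("cut power to:", batch)
print("pull and replace:", units_to_replace({"rack-b": 14, "rack-c": 3}))
```

      Testing only a fraction of the units at a time caps how many servers are at risk if the batteries turn out to be worse than expected.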

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    15. Re:Generators plus UPS FTMFW by shish · · Score: 1

      you must have overlooked his closing statement regarding 'if the backup generator failed to kick in'.

      If the backup generator fails, then you want 2 hours, not 20 minutes, so that you can have a third set of generators and specialist engineers flown in to install them; if the outage looks long-term, get a tanker truck running between the building and the fuel depot, etc.

      "20 minutes to shut down gracefully" might be better than "nothing", but it's certainly not great

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    16. Re:Generators plus UPS FTMFW by Richard_at_work · · Score: 1

      Obviously you do not understand just how expensive UPSes actually are - you just added several tens of millions of dollars to the annual running cost of your data center.

      UPSes are *that* expensive. If you really wanted to guard against a generator failure, you double up on generators but you do not spend more than you need on UPSes. A UPS should be there to smooth the transition to the generator, not to run the site for any significant length of time.

  21. Re:title should read "Google App Engine NOT a Clou by Davorama · · Score: 1

    Sounds more like fog to me.

    --

    Davo -- Free speech, free software, AND free beer.

  22. When the Power Goes Out At Google... by binaryseraph · · Score: 3, Informative

    ...a fairy dies.

    1. Re:When the Power Goes Out At Google... by Colz+Grigor · · Score: 1

      ...a fairy dies.

      I suspect that this will result in a large overpopulation of fairies. Since Google would be to blame for this, perhaps they should begin some sort of fairy mitigation program?

    2. Re:When the Power Goes Out At Google... by binaryseraph · · Score: 0, Flamebait

      We might need to contact Enron to instigate more rolling black-outs (like they did in the late 90's). This might help keep the population under control.

    3. Re:When the Power Goes Out At Google... by Anonymous Coward · · Score: 0

      They already have an agreement with San Francisco.

    4. Re:When the Power Goes Out At Google... by Anonymous Coward · · Score: 0

      Cool Fairy Hunting!
      Mmmmm... Tasty.

    5. Re:When the Power Goes Out At Google... by binaryseraph · · Score: 1

      How the hell is this flamebait? Enron, really? Is someone upset about an Enron crack?

  23. Useless for large scale problems by mcrbids · · Score: 5, Interesting

    Of COURSE there are people onsite. Most likely they have anywhere from a dozen to a hundred people onsite. But what's that going to do for you in the case of a large-scale problem?

    The otherwise top rated 365 Main facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

    So there you are, with perhaps 25 staff in a 4-story building with tens of thousands of servers, the power is out, nobody can figure out why, and the phone lines are so loaded they're worthless. Even when the power comes back on, it's not like you are going to get "hot hands" in anything less than a week!

    Hey, even with all the best planning, disasters like this DO happen! I had to spend 2 wracking days driving to S.F. (several hours drive) to witness a disaster zone. HUNDREDS of techs just like myself carefully nursing their servers back to health, running disk checks, talking in tense tones on cell phones, etc.

    But what pissed me off (and why I don't host with them anymore) was the overly terse statement that was obviously carefully reviewed to make it damned hard to sue them. Was I ever going to sue them? Probably not, maybe just ask for a break on that month's hosting or something. I mean, I just want the damned stuff to work, and I appreciate that even in the best of situations, things *can* go wrong.

    So now I host with Herakles data center which is just as nice as the S.F. facility, except that it's closer, and it's even noticeably cheaper. Redundant power, redundant network feeds, just like 365 Main. (Better: they had redundancy all the way into my cage, 365 Main just had redundancy to the cage's main power feed)

    And, after a year or two of hosting with Herakles, they had a "brown-out" situation, where one of their main Cisco routers went partially dark, working well enough that their redundant router didn't kick in right away, leaving some routes up and others down while they tried to figure out what was going on.

    When all was said and done, they simply sent out a statement of "Here's what happened, it violates some of your TOS agreements, and here's a claim form". It was so nice, and so open, that out of sheer goodwill, I didn't bother to fill out a claim form, and can't praise them highly enough!

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Useless for large scale problems by Critical+Facilities · · Score: 4, Insightful

      The otherwise top rated 365 Main [365main.com] facility in San Francisco went down a few years ago. They had all the shizz, multipoint redundant power, multiple data feeds, earthquake-resistant building, the works. Yet, their equipment wasn't well equipped to handle what actually took them down - a recurring brown-out. It confused their equipment, which failed to "see" the situation as one requiring emergency power, causing the whole building to go dark.

      I think you made the right decision in changing providers. I remember that story about the 365 Main outage, and while I am too lazy to look up the details again, I recall it being as you're telling it. To that end, I'd simply say that they most certainly did have the proper equipment to handle the brown out, but obviously not the proper management. If you're having regular (if intermittent) power problems (brown outs, phase imbalances, voltage harmonic anomalies, spikes, etc), just roll to generator, that's what they're there for.

      I'm sick of people making the assumption that the operators of the facility were just at the mercy of a power quality issue because they have redundant power feeds and automatic transfer switches. Yes, in a perfect world, all the PLCs will function as designed, and the critical load will stay online by itself. However, it takes some foresight and some common sense sometimes to make a decision to mitigate where necessary. I direct all my guys to pre-emptively transfer to our generators if there are frequent irregularities on both of our power feeds (i.e. during a violent thunderstorm, simultaneous utility problems, etc).
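
      As a rough illustration of that kind of rule (a sketch only: the 10-minute window and 3-event threshold are invented numbers, and a real site would make this an operator decision, not a script):

```python
from collections import deque


class FeedMonitor:
    """Tracks timestamps of power-quality events (sags, spikes, ...) on one feed."""

    def __init__(self, window_seconds, max_events):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self.events = deque()

    def record_event(self, timestamp):
        self.events.append(timestamp)

    def is_unstable(self, now):
        # Drop events that have aged out of the sliding window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.max_events


def should_transfer_to_generator(feed_a, feed_b, now):
    """Recommend a pre-emptive transfer only when both feeds look unstable."""
    return feed_a.is_unstable(now) and feed_b.is_unstable(now)


# Hypothetical policy: 3 or more events within 10 minutes marks a feed unstable.
feed_a = FeedMonitor(window_seconds=600, max_events=3)
feed_b = FeedMonitor(window_seconds=600, max_events=3)
for t in (0, 120, 300):
    feed_a.record_event(t)
for t in (60, 200, 310):
    feed_b.record_event(t)
print(should_transfer_to_generator(feed_a, feed_b, now=320))
```

      The point is that the trigger is repeated irregularities on both feeds, not a clean outage, which is exactly the case the automatic transfer gear can misjudge.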

      In other words, I'm agreeing with you that the service you received was unacceptable. Along with that (and in rebuttal to the parent post), I'm saying that it's not enough to talk about how they came back from the dead, but why they got there in the first place.

    2. Re:Useless for large scale problems by interkin3tic · · Score: 1

      But what pissed me off (and why I don't host with them anymore) was the overly terse statement that was obviously carefully reviewed to make it damned hard to sue them. Was I ever going to sue them? Probably not, maybe just ask for a break on that month's hosting or something.

      You wouldn't but come on, you know how we Americans are. We sue when we can't play Halo for a few days.

      Chances aren't bad that someone was looking for a lawsuit, heading it off at the pass had a chance to prevent some stupid lawsuits which would waste time and only benefit lawyers, possibly requiring some invasive, poorly thought-out court-ordered hindrance which would have slowed the recovery.

    3. Re:Useless for large scale problems by listentoreason · · Score: 1

      Of COURSE there are people onsite. Most likely they have anywhere from a dozen to a hundred people onsite.

      and as long as you're quiet and don't try to damage the control systems, you can move about freely and they'll generally ignore you

    4. Re:Useless for large scale problems by Ocker3 · · Score: 1

      And yet another lesson in customer service, whether tech related or not. Own up to the problem early, apologise, explain what happened, how you fixed it, and how you're going to prevent it from happening again. Any half-way intelligent business customer knows that shite happens, no backup plan is fail proof, what you Really want besides five 9s is a hosting company who's going to be up front and honest. Information is power, sharing information increases that power, it doesn't reduce it, so having your customers know more about your company and how it handles problems (assuming you're good at it) will increase their confidence and encourage them to keep working with you. You never know, you may impress them enough that they'll tell others about how well you handled the problem, and a good reputation is pure gold in any business.

  24. try employing the right people by mjwalshe · · Score: 2, Interesting

    try hiring some staff with telco experience instead of kids with perfect GPA scores from Stanford and design the fraking thing better !

  25. Has anyone from Ubisoft read this? by Ben4jammin · · Score: 1

    I think it would do them good, considering the recent downtime with Assassin's Creed 2. Has anyone seen any info on that outage?

  26. Re:and what about openess during the incident? by dburkland · · Score: 0, Insightful

    Keith Olbermann, is that you!?

    Fixed that for you

  27. Lucky they have multiple datacenters by Anonymous Coward · · Score: 0

    Google is lucky they have a second, third, n number of datacenters to fail over to in the first place. You might be surprised how many large companies still rely on truck-shipped tapes or other "cold" disaster recovery methods even for their most critical business data. If you had to restore your systems via tape, would your company still be alive by the time you came back up? Or would the negative publicity from the event lead to a slow and untimely death? Although this was an eye-opening experience for Google, it should be even more so for companies who haven't had to experience this type of event. Unfortunately in my experience many companies (Google included) will not change disaster recovery policies (or many other IT policies for that matter) until a significant event has occurred. The question is, should you bet your business and be reactive, or protect your business and be proactive? In Google's case, they were able to be reactive and will come out alright. Many others probably wouldn't be so lucky. All-in-all I believe this will be a great learning experience for Google, and as a side-effect will hopefully direct more people to looking at cloud technology to protect their business from outages.

  28. multiple datacenters by Colin+Smith · · Score: 1

    Power failures are expected, what you can do is have plans for when they occur - batteries, generators, service migration to other sites, etc, etc

    Too small scale, too complex, too much human intervention and too unreliable. Minimum of 2 datacenters on opposite sides of the world and you only send half the traffic to each. When the first vanishes the second picks up the traffic. The exact mechanism depends on the level of service you want to provide.
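
    A minimal sketch of the idea, assuming some health check already tells you which sites are alive (the site names and the even split are made up for illustration; a real setup would do this in DNS or at the load balancers):

```python
def split_traffic(site_health):
    """Share traffic evenly among healthy sites; dead sites get nothing.

    site_health maps site name -> True if the site is currently reachable.
    """
    healthy = [name for name, ok in site_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy datacenter left to serve traffic")
    share = 1.0 / len(healthy)
    return {name: (share if ok else 0.0) for name, ok in site_health.items()}


print(split_traffic({"dc-east": True, "dc-west": True}))
print(split_traffic({"dc-east": True, "dc-west": False}))
```

    The catch is capacity: each site normally runs at half load precisely so it can absorb the full load when the other one vanishes.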

     

    --
    Deleted
  29. Lessons Like by Greyfox · · Score: 2, Funny

    Don't have all your shit in one data center, maybe? I'd have thought that one would be pretty fundamental. Of course, knowing Google they're going to decide that what they really need is power generation right on site, then they'll just pop off and invent nuclear fusion before lunch.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

    1. Re:Lessons Like by cpghost · · Score: 1

      they'll just pop off and invent nuclear fusion before lunch.

      And they'll call it gFusion?

      --
      cpghost at Cordula's Web.
  30. floods by zogger · · Score: 2, Insightful

    Did you ever actually see a big flood? Freaking awesome power, like a fleet of bulldozers. Smashes stuff, rips houses off foundations, knocks huge trees over, will tumble multiple ton boulders ahead of it, etc. Just depends on how big the flood is. We had one late last year here, six inches of rain in a couple of hours, just tore stuff up all over. The "building" that can withstand a flood of significant size exists, it is called a submarine. Most buildings of the normal kind just aren't designed to deal with anything that destructive. Some can resist minor floods, but not too many.

    1. Re:floods by Ant+P. · · Score: 1

      The structure that can withstand a flood has existed for a lot longer than submersible warships - it's called a "hill". If you don't have one conveniently nearby to use you can even build an artificial one.

    2. Re:floods by zogger · · Score: 1

      A hill isn't a building. He was talking about water proofing a building. Under normal conditions, sure, buildings are pretty good to keep you from the weather, but in big floods, most will suffer leakage or outright destruction. That's why you always see people trying to save their homes or businesses with sand bags. It just isn't that common for buildings to be built bad flood tough. Some probably exist, but not too many. And yep, a good building on top of the biggest hill around would be the safest. I was just going for the cheap laugh mentioning a submarine, they are our tightest and strongest man made structures built to deal with keeping humans away from too much water. So, if you built a building like a submarine, it might make it through a big flood.

    3. Re:floods by DragonWriter · · Score: 2, Informative

      The structure that can withstand a flood has existed for a lot longer than submersible warships - it's called a "hill". If you don't have one conveniently nearby to use you can even build an artificial one.

      An "artificial hill" intended to protect an area from floods is usually called a "levee", and while certainly extremely useful for their intended purpose, they aren't exactly an ironclad guarantee. So having contingency plans for the case where they fail isn't a bad idea.

    4. Re:floods by Bryan3000000 · · Score: 1

      The structure that can withstand a flood has existed for a lot longer than submersible warships - it's called a "hill". If you don't have one conveniently nearby to use you can even build an artificial one.

      An "artificial hill" intended to protect an area from floods is usually called a "levee", and while certainly extremely useful for their intended purpose, they aren't exactly an ironclad guarantee. So having contingency plans for the case where they fail isn't a bad idea.

      Buildings that are built _on top_ of a hill (even an artificial one) don't have quite the same set of severe flooding problems that occur in low-lying areas. ;)

  31. Back around 2005... by kilodelta · · Score: 1

    We decided to move three of our divisions into one facility: two business-facing units and the I.T. division.

    I was charged with laying out the design for data, telecom and electrical for the project. Also had engineering of our little NOC.

    Nice setup - redundant power in the I.T. division, nice big APC UPS for the entire room, had its own 480V power drop, dual HVAC units, a natural gas fired generator. It's nice to have the money to do this.

    Since we were a state agency we had to use state DNS services. And one day the city had a massive power outage. We were up and running happy as a clam but we found the Achilles heel in all our plans. Without DNS we couldn't get in or out. I had floated the idea of maintaining our own DNS server but nobody wanted to hear that. We had the decent network connection, and the redundant power (Yes, we even placed a UPS/Generator backed up outlet in the MDF for Cox's Marconi router) so why the hell not replicate the state DNS services?

    Let that be a lesson. We tried to plan for all contingencies and we completely missed our dependence on an outside state agency. Of course since a river runs right behind we also raised the NOC floor by about a foot.
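
    A minimal sketch of the kind of fallback that would have helped here, in Python. The resolver names and the probe function are hypothetical placeholders, not anything from our actual setup: the idea is simply to try the upstream (state) resolver first and fall back to a locally replicated one when it stops answering.

```python
from typing import Callable, Iterable, Optional


def first_working_resolver(
    resolvers: Iterable[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first resolver that the probe reports as answering.

    `resolvers` is an ordered preference list (upstream first, local
    replica last). `probe` is supplied by the caller; in practice it
    might send a test query to the resolver with a short timeout.
    Both are assumptions made for illustration.
    """
    for resolver in resolvers:
        try:
            if probe(resolver):
                return resolver
        except OSError:
            # Treat network errors as "resolver is down" and keep going.
            continue
    return None


if __name__ == "__main__":
    # Simulate the outage: the upstream resolver is dark, the local one is up.
    up = {"local-replica.example": True, "state-dns.example": False}
    order = ["state-dns.example", "local-replica.example"]
    print(first_working_resolver(order, lambda r: up[r]))
```

    The probe is passed in rather than hard-coded so the same logic can be exercised in a test without touching the network.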

    1. Re:Back around 2005... by Richard_at_work · · Score: 3, Insightful

      Let me add my own little story, which happened back in the good old days of June 2009.

      The company had spent the past year rearchitecting the entire IT infrastructure, as the complete core application suite for the business was, other than your standard peripheral utilities like Office et al, green screen based, using a proprietary language from the early 1980s that was barely still maintained and wasn't going anywhere fast.

      It was my job to handle the systems infrastructure side of the deal, while another team handled software development, and I was way ahead of them - the core business applications were still in the planning stages while the infrastructure to handle and host them was well advanced. The platform we chose was well designed, with onsite redundancy built into the base cost and easily scalable - dare I say it myself, it was a good job. The only thing I had no hand in on the hardware side was the actual building infrastructure, as we had moved to custom built offices about 5 years prior, and there was someone else on the team that handled telecoms and the building. But we had a UPS and a generator, so all seemed well in the world.

      Alongside the new infrastructure came the new business continuity plan. Well, I say 'new' - I can't really say there was an 'old' BCP. Sure, we rented space at a major BC facilities provider, but there had never been any test, and there wasn't even any written documentation as to what to do.

      Here is where I must admit my first failure - the BCP was not treated as an integral, tied-in-like-a-knot part of the infrastructure, it was a separate project running alongside. Sure, the new infrastructure was designed to take a local server failure through redundancy, or even allow ease of moving to an offsite location. That part of it was all in place. My failure was in not ensuring that the offsite location actually existed as the new infrastructure grew.

      However, by the start of 2009, the basic infrastructure needs of the BCP were well known, costed and presented to the company board of directors. And there it sat. Every month I would ask them if it had been signed off, if I could spend the money. Every month I received a negative answer, it just hadn't been discussed at these busy directors meetings.

      And that was my second failure. I had no sponsor in those meetings, there was basically no IT representation (the IT director had resigned after the modernisation was pushed through, he wanted no part in it as he had not been taking the business forward himself). With no sponsor, no one wanted to raise the potential spending of a hundred thousand pounds themselves. And so it sat.

      Then one day in June, we had a routine fan replacement on the UPS. The engineer was signed in, did the replacement under the watchful eye of a senior helpdesk technician, and flipped the UPS back from maintenance bypass to full protected mains. And that was when the first bang happened.

      And all the lights went dark. All the whirring stopped. All the phones stopped ringing. All the people stopped talking.

      It was blissfully quiet for a few precious seconds. And then it was painfully quiet for about another 5. And then all hell broke loose.

      The core business applications did not fare well. The 30-year-old architecture essentially had no failsafe for database writes, and as the server had quit in the midst of several thousand writes, we knew we had just lost a significant amount of data.

      It's worth taking several seconds out to explain how the core application language does its job. Firstly, there is no database server; it's all C-ISAM datafiles directly read from and written to by each individual application. Locks are handled by each application internally, with OS level locking preventing concurrent writes to the same record in the data file. No database engine, no transaction logging, no rollbacks, no error correction, nothing. There was nothing in the language to protect those poor l

    2. Re:Back around 2005... by kilodelta · · Score: 1

      Yes, we also had a failover site for our Central Voter Registration System. Never tested, of course, because nobody wanted to be the one whose head would roll.

      Apparently it did work though. When we had that DNS fail we were able to see that the hot standby site came up without a hiccup.

      But you make a good point: unless you have someone high up who's going to shepherd your project through, you'd better be prepared for some ugly times.

      Luckily we had full buy-in on ours. Another thing happened though. I.T. moved before anyone else which involved getting data circuits up, ordering new switch level hardware, etc.

      Now the reason we moved is because our space was 90% complete. So we put our new PIX firewalls into the rack, the new HP4180GL switch in the rack, etc.

      The building contractor was stringing in coaxial for TV distribution (we home ran everything to the NOC) and managed to fry the power supply on the 4180GL!

      Building owner was very good about it though, spent the $1,500 for the new power supply. I made arrangements with other state agencies and got us 96 ports worth of switches in the space of an hour. Our total downtime, 2 hours.

  32. Re:title should read "Google App Engine NOT a Clou by Anonymous Coward · · Score: 0

    Wow, that cloud's on the move!

  33. Re:title should read "Google App Engine NOT a Clou by RoFLKOPTr · · Score: 1

    Obviously if the power goes out, and the service goes offline, then it WASN'T a cloud. If it's a cloud, it can't go down. If it goes down, it wasn't a cloud.

    The cloud got too big and it rained.

  34. Re:and what about openess during the incident? by Spazztastic · · Score: 1

    $Political_Pundit_I_Disagree_With, is that you!?

    Fixed that for you

    No, I think I got it right this time.

    --
    Posts not to be taken literally. Almost everything is sarcasm.
  35. ups battery systems can fail by Anonymous Coward · · Score: 0

    I have seen data centers crash. I do not think it is likely in this case, but twice I have seen issues with the UPS system take a datacenter down.

    Once a battery in the UPS system blew up and sprayed acid on the wall as it crashed. Once a component inside died.

    I have also seen them go down due to too much stuff being plugged in, which is not really a UPS issue.

    Funny thing is in the seven years I was working there, I never saw the mains drop. It could have happened for a few seconds but they came back up before I was paged.

    Usually our data center was affected by HVAC issues.

  36. Outside or the TV by Anonymous Coward · · Score: 0

    I turn on my TV or go outside to return to a normal life.

  37. Post Mortem Missed the Problem by photonrider · · Score: 1

    I read the post-mortem and I think they completely missed the mark. Power failed to some machines. They only noticed because "...traffic has problems..." They should have been monitoring the power to detect this situation. They didn't say whether they have the data center power supply on a UPS or not. If it was, it was dying and no one noticed. If they had been monitoring the power they might have avoided the whole mess.
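
    As a rough illustration of what I mean by monitoring the power, here is a small Python sketch. It assumes a hypothetical feed of input-voltage readings; the nominal voltage, tolerance and sample count are made-up defaults, and nothing here reflects how Google actually instruments its facilities. It flags a loss of mains only after several consecutive low readings, so a single bad sample doesn't page anyone.

```python
from typing import Sequence


def mains_lost(
    readings: Sequence[float],
    nominal: float = 230.0,    # assumed nominal supply voltage
    tolerance: float = 0.10,   # assumed allowed sag: 10% below nominal
    consecutive: int = 3,      # assumed number of low samples before alerting
) -> bool:
    """Return True if `consecutive` readings in a row fall below the floor."""
    floor = nominal * (1.0 - tolerance)
    run = 0
    for volts in readings:
        # Extend the run of low samples, or reset it on a healthy reading.
        run = run + 1 if volts < floor else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    print(mains_lost([230.0, 229.5, 0.0, 0.0, 0.0]))    # sustained drop
    print(mains_lost([230.0, 0.0, 230.0, 0.0, 230.0]))  # isolated glitches
```

    An alert like this would fire from the power side directly, instead of waiting for traffic problems to show up.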

  38. The Datacom response model by puppet10 · · Score: 1

    Repeat to yourself: "All is well, All is well, All is well" and everything will be exactly like you wish it to be.

    Note: the originators of the response model are not responsible for anyone being taken away to a psychiatric facility because of a belief that the response model user is psychotic.

    --
    -------- This space intentionally left blank --------
  39. Maybe... by mu51c10rd · · Score: 1

    They'll come out with it when Apple releases iFusion...

  40. What happened by Anonymous Coward · · Score: 0

    The local power utility accidentally tripped the only bus supplying the 30MW of power to the data center.

    Google's data center lost all power for 8 seconds before the generators, and not all of them, came online.

    The UPS system failed completely.

    This facility is about 2 years old in SC.