Slashdot Mirror


Lightning Strike KOs Amazon, Microsoft EuroClouds

1sockchuck writes "A lightning strike has caused power outages at the major cloud computing data hubs for Amazon and Microsoft in Dublin, Ireland. The incident has caused downtime for many sites using Amazon's EC2 cloud computing platform and Microsoft's BPOS (Business Productivity Online Suite)."

43 of 189 comments

  1. So Cloud v Cloud.... by artor3 · · Score: 5, Funny

    ...nature wins?

    1. Re:So Cloud v Cloud.... by Torinir · · Score: 3, Funny

      Flawless Victory!

    2. Re:So Cloud v Cloud.... by Trilkin · · Score: 3, Funny

      There is bad fanfiction depicting this event. Lots of it.

      --
      Nobody cares what the CAPTCHA for your post was.
    3. Re:So Cloud v Cloud.... by CrankyFool · · Score: 2

      US is not another availability zone -- it's a different region. There are multiple AZs per region and -- if Amazon is doing their job -- a lightning strike should not take down more than one AZ in a region.

    4. Re:So Cloud v Cloud.... by Z00L00K · · Score: 2

      And with all eggs in one basket you can be sure to crack them all in one punch.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    5. Re:So Cloud v Cloud.... by rtfa-troll · · Score: 2

      I dislike "the cloud" term as much as the next /.er but right now it's working as intended.

      That seems to be true for Amazon; the outage is exactly what they documented in the case of the loss of a data centre. I'll give them (tentative) points for a job acceptably done*. You can't, however, say the same for Microsoft. They had a user visible application level outage for the loss of a single location. That's a screw up and clearly shows why you shouldn't trust anything "business critical" to just one cloud.

      * we still don't have clarity about the physical separation between their generator and phase synchronisation system. I don't know if they could have saved themselves with a proper physical layout of their separated power supplies. Also we have no idea what caused the transformer to be struck instead of just some lightning protector.

      --
      =~ s,(.*),<sarcasm>$1</sarcasm>,g if any_point_you_wish();
    6. Re:So Cloud v Cloud.... by Guspaz · · Score: 3, Informative

      That might be true if Amazon didn't have multiple AZs in single datacenters.

      The fault isn't necessarily Amazon's for stuff like this. The whole point of cloud infrastructure is that you use many cheaper instances to scale load and provide high availability despite the failure of any one node (or group of nodes). Take Netflix, for example. While they do have their share of outages, they were completely unaffected by Amazon's big EC2 failure a few months ago, despite the fact that a significant portion of Netflix's infrastructure was hosted out of the affected region. Why? Because they built failure into their system, to the extent where they have a process that goes around killing random instances to keep them on their toes. They've planned for and built their system around the possibility that large chunks of the system might just up and vanish without warning.

      If you're building a large-scale cloud system, *geographic* diversity should obviously be a part of any high availability plan. I'd also say that having provider diversity isn't a bad idea, but it seems like a lot of big cloud customers just stick with one provider.
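The "process that goes around killing random instances" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the `web-N` instance names, and the `min_healthy` threshold are all hypothetical, and nothing here touches a real cloud API or reflects how Netflix actually implements it.

```python
import random

def kill_random_instances(instances, kill_count, rng):
    """Return the instances that survive `kill_count` random terminations.

    `instances` is a list of hypothetical instance names; `rng` is a
    random.Random so that a run can be reproduced with a fixed seed.
    """
    victims = set(rng.sample(instances, kill_count))
    return [name for name in instances if name not in victims]

def service_is_up(survivors, min_healthy):
    """Assumed rule: the service is up while at least `min_healthy` instances remain."""
    return len(survivors) >= min_healthy

# Simulated fleet of ten instances, three of which are killed at random.
fleet = [f"web-{i}" for i in range(10)]
survivors = kill_random_instances(fleet, kill_count=3, rng=random.Random(42))
print(len(survivors), service_is_up(survivors, min_healthy=5))
```

Running the check regularly against a test fleet is one way to find out whether the capacity margin assumed in `min_healthy` actually holds before a real outage does it for you.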

    7. Re:So Cloud v Cloud.... by SMoynihan · · Score: 4, Informative

      I live in Dublin, and that was some seriously targeted lightning. No sign of storms here, that I saw...

    8. Re:So Cloud v Cloud.... by c0mpliant · · Score: 2

      I was thinking the EXACT same thing... I know it was raining pretty hard on Saturday, but I didn't see any lightning or hear any thunder

      --
      There is no -1 disagree
    9. Re:So Cloud v Cloud.... by 2phar · · Score: 2

      The lights were flickering here in South Dublin last night around 10pm and there had been an outage of some sort 3-4 hours earlier. I was wondering what was going on, because brownouts/blackouts are extremely rare here. The only real memories I have of power outages in Dublin were rolling blackouts due to industrial action back in the 1970s.

    10. Re:So Cloud v Cloud.... by c0mpliant · · Score: 2

      Yeah it was daecent weather, too bad I was still recovering from Saturday night and couldn't enjoy it!
      Yeah I just asked around the office here and no one heard any either. I'm beginning to wonder if someone was just playing with a set of balloons and had an unfortunate accident involving static...

      --
      There is no -1 disagree
  2. Just imitating Verizon. by DWMorse · · Score: 5, Funny

    I see how it is. Verizon workers go on strike, MSFT and Amazon gotta call in for something strike-related that's bigger and flashier. Show-offs.

    --
    There's a spot in User Info for World of Warcraft account names? Really?
  3. My Sympathies by smpoole7 · · Score: 3, Insightful

    Considering that my radio stations have been getting hammered for weeks now by this horrible weather in the Southern United States, my sympathies are with them.

    I don't care how much protection you put on your system (and when you have giant lightning rods that are hundreds of feet tall, like we do, you DO try to protect things), an occasional strike is going to slip through. When it does it can get ... messy. :)

    --
    Cogito, igitur comedam pizza.
    1. Re:My Sympathies by Hartree · · Score: 2

      "When it does it can get ... messy. :)"

      I'm often amazed at all the weird failures even when lightning doesn't hit directly and just induces currents.

      Used to see long RS232 runs that wouldn't fail instantly, but would act flaky soon after a near strike, and then take a day or two to fail completely.

    2. Re:My Sympathies by kent_eh · · Score: 5, Interesting

      I've been in the same spot. (10 kW, 3 tower array). It's amazing how far the parts of a capacitor on a P&M panel can spread when propelled by a lightning strike.
      Even with ball gaps, chokes, and all the other effort, ultimately the transmitter has to be connected to the tower. 50 ohms is not that much different than "the shortest path to ground" when you put a few thousand kV against it.

      It took several years after my career change to enjoy the spectacle of a lightning storm

      --
      "I can't complain, but sometimes still do..." Joe Walsh
    3. Re:My Sympathies by adolf · · Score: 2

      Used to see long RS232 runs that wouldn't fail instantly, but would act flaky soon after a near strike, and then take a day or two to fail completely.

      And that, my friend, is why the good lord gave us RS-422 to use for long runs instead of RS-232.

      Balanced, differential signalling and sensible grounding FTW.

  4. The Cloud by keithpreston · · Score: 3, Funny

    Sounds about like
    http://xkcd.com/908/

  5. Cloud fail by Baloroth · · Score: 5, Insightful

    My understanding of the point of cloud computing was that it would be distributed. I.e. the failure of any one data or computing center would mean the data was still available. Hence, the term "cloud": nebulous, non-localized. Apparently, someone forgot to tell Microsoft and Amazon what the buzzwords they were using actually mean. I more or less expected that of M$, but the fact that Amazon failed too, well, that's a little surprising. I guess it's kinda the norm for all large corporations.

    Glancing at the article, it looks like this outage affected only a certain area, but still, cloud should mean other data centers would take over. I particularly love the quote "Dublin has become a key cloud computing gateway." If one city serves as a "gateway", it's not a cloud system. I understand using it as one data center, but others should take over automatically for that area in case of a failure. If you don't have a failover system, you don't have a real cloud computing platform. You have a wannabe cloud computing platform. Or maybe they are just taking a buzzword and redefining it to suit their purposes. That's... exactly what we should expect, I suppose.

    Or am I completely misunderstanding the meaning of this latest buzzword? It's quite possible, I never quite got down what "Web 2.0" was supposed to mean either. Beyond lots and lots of Flash.

    --
    "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    1. Re:Cloud fail by HTMLSpinnr · · Score: 4, Informative

      For EC2, it's only distributed if you pay to have your "service" running in more than one availability zone.

      --
      $ man woman *
      -bash: /usr/bin/man: Argument list too long
    2. Re:Cloud fail by TubeSteak · · Score: 2

      Or am I completely misunderstanding the meaning of this latest buzzword?

      The main feature of cloud computing is the ability to almost instantly increase your capabilities and capacities.

      At its most basic level, you're talking about a pool of computing resources that can be doled out without regards to the underlying hardware.
      There's no promise that this cloud (the aforementioned pool of hardware) is geographically distributed.
      Like any other hosting, you have to pay extra if you want your data replicated at another hosting center.

      --
      [Fuck Beta]
      o0t!
    3. Re:Cloud fail by Wolfling1 · · Score: 4, Insightful

      Ah, yes. There is that.

      At the moment, my company is aggressively encouraging our customers to avoid the Cloud at all costs. Let me explain why.

      Whilst the technology exists for the cloud to deliver fault tolerant distributed storage, when you choose to put data in the cloud, you are choosing to relinquish control of the data. You are placing it in the hands of someone else. Quite probably an organisation that you do not know intimately. Quite probably an organisation that is based in a different legislative region - probably another country.

      You have little to no capacity to audit their system. You have little to no capacity to test their fault tolerance. And here's the sucker punch - you have little to no legal comeback for the consequences if something bad happens.

      If your data contains any personal information about another person, you are placing the privacy of that person in the hands of an organisation you do not control, and upon whom you cannot enforce any legislative restrictions.

      So, unless you are seriously geared up to investigate and audit your prospective cloud provider - and they are willing for you to do so, the only data you can safely put in the cloud is data that would be basically irrelevant to your core business anyway. Until the fundamental issues of privacy, security and accountability are resolved - or dramatically improved - placing core business data in the Cloud is a massive corporate risk.

      They should not have called it the 'Cloud'. They should have called it the 'Arse' - because if your management are planning to stick their heads in one, they may as well stick their heads up the other. I don't imagine that 'Arse Computing' would be as popular though.

    4. Re:Cloud fail by mcrbids · · Score: 2

      Perhaps it would be a good idea to start by defining what exactly "cloud computing" means?

      Because looking at the Wikipedia article I see only a brief mention of reliability through redundancy: Reliability is improved if multiple redundant sites are used, which makes well-designed cloud computing suitable for business continuity and disaster recovery.

      As CTO of a data hosting "cloud services" provider myself, I'm proud of our track record for reliability and redundancy. All our systems are backed up offsite to multiple locations every 24 hours. At all times we provide off-site hosting facilities at about 50% of the capacity of our primary production cluster for disaster recovery scenarios. All our systems are redundant on site; the loss of any single system can be healed within an hour or so. Our average system uptime approaches 4 nines over years in an industry where 2 nines is considered exceptional.
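To put the "nines" figures in perspective, the downtime each level permits can be worked out directly. The snippet below is a rough sketch: the function name is made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored

def allowed_downtime_minutes(nines):
    """Minutes of downtime per year permitted by an availability of `nines` nines.

    Two nines means 99% availability, four nines means 99.99%, and so on,
    so the permitted unavailability is 1 / 10**nines of the year.
    """
    return MINUTES_PER_YEAR / 10 ** nines

for n in (2, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Under these assumptions, 2 nines allows about 5,256 minutes (roughly 3.65 days) of downtime a year, while 4 nines allows a little under 53 minutes.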

      Having systems replicate in near-real-time to multiple locations and autonomically route around large network outages on extremely short notice is an extremely tough and expensive thing to do. It sounds simple enough, but the devil is in the details. The number of things that can go wrong is simply staggering, and proactively accounting for every possible thing that can go wrong is simply infeasible.

      What do you do when the problem is due to a router outage offsite? What if 70% of routes work, and 30% don't due to a BGP burp? What if latency is 50ms? 100ms? 200ms? At what point is it "down"? What if network conditions are excellent but there's a problem with DNS? What if a power surge generated by your UPS (yes, it happens!) takes out 20% of your production cluster? Do you heal or switch to hot backup? What if it takes out 30% of your production cluster? 40%? 60%?

      People think "online" is a boolean yes/no question, but it's not. We calculate our uptimes based on expected end-user experience, but rarely is our production cluster actually working at 100%. Nearly always, some system or other is in need of attention and it doesn't constitute an emergency because there's still additional redundancy in the system.
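One way to picture "online" as something other than a boolean is a graded health check over the signals mentioned above. This is a hypothetical sketch: the function name, the three input signals, and every threshold are assumptions chosen for illustration, not anything the poster describes using.

```python
def classify_health(routes_ok_fraction, latency_ms, nodes_up_fraction):
    """Grade service health as 'healthy', 'degraded', or 'down'.

    All thresholds below are illustrative assumptions; a real operator
    would tune them to the expected end-user experience.
    """
    # Assumed rule: losing half the routes or half the cluster counts as down.
    if routes_ok_fraction < 0.5 or nodes_up_fraction < 0.5:
        return "down"
    # Assumed rule: partial route loss, high latency, or reduced capacity
    # is a degraded state that needs attention but is not an emergency.
    if routes_ok_fraction < 0.95 or latency_ms > 200 or nodes_up_fraction < 0.8:
        return "degraded"
    return "healthy"

# The "70% of routes work" case with low latency and a full cluster.
print(classify_health(routes_ok_fraction=0.7, latency_ms=50, nodes_up_fraction=1.0))
```

With these assumed thresholds, the 70%-of-routes scenario is graded "degraded" rather than "down", which is the kind of middle state a yes/no uptime number hides.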

      --
      I have no problem with your religion until you decide it's reason to deprive others of the truth.
    5. Re:Cloud fail by Sarten-X · · Score: 2

      I quit paying attention to your explanation/rant after one particular choice of words:

      avoid the Cloud at all costs.

      I immediately envision a scenario where the cost of setting up management, infrastructure, and equipment is significantly larger than the cost of losing a portion (or perhaps even all) of a company's data or processing capacity. Rejecting cloud services as a viable option regardless of the actual cost is just as asinine as rejecting the option for turning on hot water in the bathroom sink, because it just might be too warm for somebody.

      When I risk my sanity long enough to pick out a few more words, I find you dismissing the cloud as being only suitable for "irrelevant" data. Apparently, all data is either "core business data" or "irrelevant", and there's no such thing as "nice to have around", "those old backups", or simply "not worth handling on our own". Of course, the existence of special-purpose clouds is ignored, along with private and internal clouds.

      As for auditing, uptime, and legal consequences, you've apparently never dealt with a service contract. If the contract mandates five nines of uptime, and includes a clause making them liable for all damages and loss, that's a pretty hefty legal comeback.

      I do sincerely hope I'm never a customer of your company.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    6. Re:Cloud fail by jimicus · · Score: 3, Informative

      Cloud computing is a buzzword meaning "don't run your own hardware, run your business on someone else's". Which might mean anything from a virtual server that you manage at one end of the sophistication scale to a SaaS product at the other.

      All sorts of aspects of this are optional. Including:

      1. Whether or not you manage the underlying operating system - including things like security patches and hardening. You can choose a cloud computing provider that has sysadmins deal with that for you and just run the application yourself; they are a LOT more expensive than Amazon.

      2. How much effort your provider puts into making their systems geographically redundant. Few will talk openly about this; I'm prepared to bet hard cash this is because the vast majority that offer you a virtualised server are just using a web interface to expose a fairly vanilla Xen-with-a-SAN infrastructure to the world with everything sat in one place. Providers that will run the OS for you and can honestly say their infrastructure accounts for complete data centre loss are like hen's teeth.

      3. If you've gone for a SaaS provider - how much effort their developers went to to ensure their application can stand up to everything up to and including total loss of a data centre. And whether or not they test for such an occurrence.

    7. Re:Cloud fail by jimicus · · Score: 3, Interesting

      As for auditing, uptime, and legal consequences, you've apparently never dealt with a service contract. If the contract mandates five nines of uptime, and includes a clause making them liable for all damages and loss, that's a pretty hefty legal comeback.

      I agree with you entirely, it's an absolutely beautiful piece of legal comeback. But every service contract I've ever seen is so full of ifs, buts and other assorted get-outs that it's very rare to actually be able to hold someone to it.

      The one time I have seen an SLA that was actually quite good, the company in question didn't refuse to honour it. Oh no. They went one better - they hadn't even told their staff that it existed, so if you asked about it you'd get a response along the lines of "What's an SLA, then?" The only way you'll get an SLA honoured in those circumstances is to take your provider to court, and you can bet that if you do they'll drop you like a hot potato. So you probably wouldn't bother in any but the most egregious of circumstances.

    8. Re:Cloud fail by Overzeetop · · Score: 2

      (1) How would you not be backing up to a local resource?
      (2) Why would you not be using encryption at your user site before sending data to the remote server?

      If you're small, choose a small service like SpiderOak.
      If you're large, build a custom front end.

      I don't have particularly sensitive data, so I don't encrypt at my end, though in theory I could (but it would be a pain) and SO had issues with their distributed synchronization back then (note: if I were big enough, I could have hired an IT person to manage my servers and set up a client-side sync that worked). I have no less than 4 copies of my data at two separate local locations, at least one of which is always off-line, in addition to the remote service.

      Distrust is a good thing, but it can, imho, be managed.

      --
      Is it just my observation, or are there way too many stupid people in the world?
  6. Re:two words : Surge Protector by sjames · · Score: 2

    Have you ever seen a surge protector after a direct strike? The MOVs don't help much once they vaporize.

    A surge protector is mostly useful against the more common near misses.

  7. Microsoft renamed its product by countertrolling · · Score: 5, Funny

    Office 364

    --
    For justice, we must go to Don Corleone
  8. Serves them right by RobinEggs · · Score: 2, Informative

    Those massive data centers only existed because Microsoft and Amazon channeled profits through Irish subsidiaries to avoid US taxes. They serve some legitimate functions for customers in the UK as a matter of convenience (why build two data centers?), but they're primarily money laundering centers.

    I'd call a few lightning strikes the least of the punishments those data centers - and the entire infrastructures to which they're attached - really deserve.

    1. Re:Serves them right by XaXXon · · Score: 2

      The AWS services out of Dublin aren't through an Irish subsidiary. It's just regular AWS.

      If you know differently, please document it.

  9. Re:Don't say I didn't warn you! by fuzzyfuzzyfungus · · Score: 2

    True enough. I was just trying to eke out a static + clouds = lightning joke.

  10. Power Co-Generation by anubi · · Score: 4, Interesting

    While working at Chevron Oil Pascagoula Mississippi refinery, I noted Chevron had the same problem. Loss of electrical power to the refinery would be catastrophic. No one wants to be around tons of petrochemical products undergoing serious chemical reactions when one loses control.

    To mitigate this threat, Chevron worked with Mississippi Power to operate a power generation facility at the refinery.

    I would think that anywhere there is a substantial "data processing farm" with critical power requirements, business arrangements should be made with the power generation utilities to run a natgas power plant in the immediate area.

    The utilities often run these plants as "topping" plants, as they are needed anyway to even out short-time load variances on the line.

    But, in the event of a serious loss of grid power, it can be awful handy to have a few megawatts of power coming from down the street.

    --
    "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]

    1. Re:Power Co-Generation by antifoidulus · · Score: 2

      RTFA! They do have generators! The lightning strike was apparently so powerful that it affected the backup generators' synchronization equipment.

      "Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators," Amazon said in an update on its status dashboard. "The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them."

    2. Re:Power Co-Generation by Billly+Gates · · Score: 2

      When I was an undergrad I worked a crappy job at a Florida amusement park. Lightning capital of the world is in Pasco County Florida which was about 30 minutes away.

      They generated their own power and the power lines and even the roller coasters were designed to be struck by lightning. Let me tell you they were struck every 3 or 4 days during the rainy season in summer from the monsoons from the Caribbean and Gulf. During a bad storm the power lines and rides could be struck 2 to 3 times each in a 30 minute period. They always kept running as soon as the storm would pass.

      If a shitty amusement park can handle it I would think a datacenter would have much more expensive and critical components. Disney World in Orlando even generates its own power and powers part of Orlando.

  11. Re:Ouch... by lgarner · · Score: 2

    Yes. When an EC2 instance is "turned off" it's destroyed, along with any data- unless you're using an EBS volume (the special provision) which is persistent, or S3. Not to say that there couldn't be issues with the data on those either, in the face of an extremely sudden, unexpected shutdown. Shutting off an EC2 instance is equivalent to deleting a VMware VM. You then have to start a new one from a template (AMI).

  12. Re:Don't say I didn't warn you! by afidel · · Score: 2

    Depends on your needs, if you need capacity only occasionally or have a workload where the peak is an order of magnitude or more from the base then it can make perfect sense to use a cloud provider, it's not like multisite replication and large amounts of bandwidth are cheap when you do them yourself.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  13. Re:at that level the safety's tipped foreing a man by Animats · · Score: 4, Informative

    Also Surge Protectors can't really take a direct lightning strike.

    But lightning arrestors can. A serious lightning arrestor is a spark gap (sometimes open air, sometimes in an inert gas) to ground, with a very heavy cable or busbar to multiple ground rods, and no sharp turns in the path to ground. This is followed up by an inductor which is a few turns of busbar. This gear is usually placed where power lines or antenna feeds enter a building. MOV-type protection is further downstream.

    Antenna towers are struck by lightning frequently, and the associated radio gear routinely continues to operate. This isn't rocket science. It's big hunks of copper.

    The Hartford Steam Boiler Inspection and Insurance Company, in their publication "The Locomotive" (they've been at this since 1867) has a good article on lightning protection. Hartford Steam Boiler insures not only against boiler explosions, but things like downtime due to lightning strikes. But only after their inspectors (they have 1200) have visited the plant and are satisfied with the equipment.

    A question to ask your "cloud" provider - who handles your business interruption insurance, and do they inspect your facilities?

  14. THAT's NOT a CLOUD by Crypto+Gnome · · Score: 2

    Despite ALL the market-hype and brouhaha going on, the simple fact remains:

    If ALL your computing power is in ONE SINGLE DATACENTRE then what you have is a DAMP SPOT not a CLOUD.

    --
    Visit CryptoGnome in his home.
  15. marketing bullshit by Tom · · Score: 4, Insightful

    And there is the marketing bullshit revealed. All the promises of the cloud - down by one lightning strike.

    Because, let's face it, the whole "cloud" thing as they sell it is just advanced virtual hosting with a different name. The only real cloud capabilities are those the big companies build for themselves, and they did things like that 10 years ago already, when nobody had ever heard the term "cloud" used in computing contexts.

    In the end, it's about selling something to people who already have the older version and convincing them to buy the new one. So you give it a different name because a "new" product sells easier than the upgraded version of an "old" product.

    Anyone remember when "Web 2.0" was all the hype? It really wasn't a 2.0 as we all know. There was nothing new in it, all components had been around for a long time. It was a conceptual bundle, but not a new version like the name suggested. But "we're doing more Javascript now" doesn't sell nearly as well as "we're moving to Web 2.0 now".

    --
    Assorted stuff I do sometimes: Lemuria.org
    1. Re:marketing bullshit by aug24 · · Score: 2

      But they weren't down. I have servers in Dublin. I also have striped redundancy across other EC2 datacentres. We had 100% uptime last night and I'm now watching the Dublin based machines recover gracefully.

      Yes, I agree it's "advanced virtual hosting with a different name". But it didn't break its promises.

      --
      You're only jealous cos the little penguins are talking to me.
  16. Re:Don't say I didn't warn you! by dkf · · Score: 2

    Looks like the old "Good, fast, cheap: pick two" adage might need a little rewording. How about "Fast, reliable, cheap: pick two"?

    Since when was "reliable" anything other than one of the classic metrics for "good"? The old adage needs no changes at all.

    --
    "Little does he know, but there is no 'I' in 'Idiot'!"
  17. Microsoft's BPOS by markbark · · Score: 3, Funny

    Business Productivity Online Suite?
    I always thought it stood for "Big Piece of Sh..... never mind