Slashdot Mirror


Car Hits Utility Pole, Takes Out EC2 Datacenter

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."

250 comments

  1. Murphy's law by pwnies · · Score: 1, Redundant

    Whatever can go wrong will rings pretty true here. Makes for an exciting day of work for them though I suppose; unlike yours truly.
    *Goes back to reading /.*

    1. Re:Murphy's law by turing_m · · Score: 5, Funny

      Nice try, but you still fail to grammar.

      This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.

      --
      If I have seen further it is by stealing the Intellectual Property of giants.
    2. Re:Murphy's law by Fex303 · · Score: 0, Offtopic

      Um.... Whooosh?

      Using 'grammar' as a verb is one of those linguistics jokes I love. (Actually, I love all linguistics jokes.) My usual explanation for when I've done a grammar edit on my posts (on forums which support it) is 'Edit: I'm don't grammar'.

      A key clue to this being a joke is the use of the word 'fail' which these days is often associated with LOLcats. Those damn cats have raised the use of deliberately bad grammar to an artform in and of itself.

    3. Re:Murphy's law by onionman · · Score: 0, Offtopic

      Nice try, but you still fail to grammar.

      This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.

      Here's a wacky thing: the plural form of someone else is actually someone's else which I only discovered one day when the spell checker kept underlining else's. I'm not correcting you, by the way. My own grammar and spelling are atrocious, so I nearly always fail to grammar. I just thought I'd point out an oddity of the language in case anyone else found it humorous.

    4. Re:Murphy's law by morty_vikka · · Score: 1

      Here's a wacky thing: the plural form of someone else is actually someone's else.

      Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though.. does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...

    5. Re:Murphy's law by DarkTempes · · Score: 1, Offtopic

      I think automated spell checking is a poor way to learn grammar and that such tools are frequently wrong.

      A quick review makes me suspect that the correct possessive form is still someone else's. (Sources: a dictionary, a writing guide, and a google test)

    6. Re:Murphy's law by The+Hatchet · · Score: 1

      It is efficient for money-making to centralize everything, especially consumer services/money. But when you put all your eggs in one basket, only one basket needs to be broken to scramble all eggs.

      Every day we put more in the cloud. Every day we have less power. What happens when everything is in the cloud? All you need is a truck and a utility pole and hell breaks loose. You can then do as you please while you wait for people to remember how to function without working electronic devices. Imagine the field day organized criminals will have when the police move to the cloud.

      --
      Where is the mod rating for "scary"? Also, ...
    7. Re:Murphy's law by The+Hatchet · · Score: 0, Redundant

      Fine grammar is just a formality. Language is a wonderful, ever changing tool. We can use it however we please. Sure, some mistakes are terrible, and accidental, but that does not mean that grammar need be as valuable as gold. We doth need remember that language cannot be controlled without losing that which it is used to create.

      We can say things like "I don't grammar" and they convey meaning just as well as saying "I don't pay great attention to or check my grammar." It might sound a bit off, but it does what language is meant to do: convey meaning. The faster and better we can convey meaning, the better we are language-ing. So indeed the phrase "I'm don't grammar" may be terribly flawed, but it conveys meaning quicker and more efficiently than the other statement, so it can easily be said that it has fulfilled its purpose better than the grammatically correct phrase.

      Every time I meet a grammar nazi in person I spend about half an hour giving them a speech on why they should go to hell.

      Also: I might note that /. comments are terrible for correcting grammar, using crappy comparisons, and crappy attempts at being condescending. It is so much, it often covers up or ignores the important points of a debate. It is just as bad as watching intelligent debates degrade to anger or degrade to moronic babble. I would seriously like to see more focus on what is important, and less on this kind of crap, as a general rule. Maybe then someone could learn something besides how to be better at being a useless, progress impeding grammar nazi.

      I suggest you cease and desist. Then we can all get on with our lives.

      --
      Where is the mod rating for "scary"? Also, ...
    8. Re:Murphy's law by Anonymous Coward · · Score: 0

      So when the cloud fails.....Does it evaporate?

    9. Re:Murphy's law by fuzzyfuzzyfungus · · Score: 1

      There are both economies and diseconomies to centralization. The real issue (in many "cloud" cases) is that some of the things that could be economies of centralization are being skipped in the mad rush for low costs, and since everything is hidden under a shiny layer of web APIs, people don't notice in time.

      In this case, for instance, the cost per server, or per unit work done, to have Real Serious Redundant Power (batteries, generators, multiple utility links, etc.) plummets as the number of servers in a given location increases. As long as people keep in mind that "cloud" equals "buzzword for a set of methods of reducing the transaction costs of outsourcing a variety of IT functions" rather than "magic place where computations are done by happy computrons and packets are carried by unicorns" and ask the appropriate questions, the average reliability of power, bandwidth and other useful stuff should go up. If they fail to keep that in mind and start falling into the stupid (but seductive, and not exactly discouraged by vendors) assumption that "the web API makes it look super easy, so it must be super reliable", market forces will quickly drive the lowest common denominator down to a pretty grim level of service.

    10. Re:Murphy's law by carp3_noct3m · · Score: 2, Funny

      Karma and Murphys law, a deadly combination.

      --
      "It's ok, I'm completely secure as long as my iron is off"
    11. Re:Murphy's law by JWSmythe · · Score: 4, Insightful

          Funny thing, I thought "cloud" computing means that you're placed into an automatically redundant network of machines, so if there's a site wide outage it didn't interfere with the operations.

          Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means hosting provider with no DC power room, N+1 generators and regular testing to ensure the fallback systems actually work.

          That kind of reminds me of a company (who will remain nameless) who did tape backups, but never verified their tapes. When the data was lost, a good percentage of the tapes didn't work.

          I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.

      --
      Serious? Seriousness is well above my pay grade.
    12. Re:Murphy's law by Anonymous Coward · · Score: 0

      oh noes! finally people is coming down from clouds to reality!

      I've been saying this since the cloud buzzword started spreading. Cloud computing will not magically make availability/redundancy/replication problems fade away. It will help you externalize them, promising a reliability they could not attain: but if you're externalizing a critical component hoping the hosting service will provide you magical availability, then you're doing it wrong from the start.

    13. Re:Murphy's law by Anonymous Coward · · Score: 0

      This is less Murphy's Law, and more the results of Corp Execs cutting costs by any combination of:

      1. Not keeping enough personnel to maintain equipment or respond
      2. Being unwilling to pay for required maintenance
      3. Not paying enough to keep people around who know what they are doing.

      My own company had a whole bunch of power failures. Everything from "Gee, the 15-year old batteries didn't last as long as we thought they would!" and/or "Oh, you mean we have to filter/rotate/do-stuff with the fuel?" and/or "Hey, we need to take the batteries offline to work on them. But, let's not start the generator (just in case) for some reason".

      It seems this is starting to swing around. That happens when 40% of your collocation customers and 30% of your dedicated customers leave because of power failures. 99.99% uptime my ass...

      Posting anonymously for various obvious reasons. I wish it weren't necessary to hide, but you know how it is.
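
To put the "99.99% uptime" figure above in perspective, here is a minimal sketch of the arithmetic. It assumes a 365-day year and treats availability as a simple fraction of wall-clock time; the function name `downtime_budget_minutes` is made up for illustration and does not come from any provider's SLA.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Return the minutes of downtime allowed over `days` at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = days * 24 * 60  # minutes in the period
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Compare a few common availability targets over one year.
    for pct in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.99% target leaves under an hour of downtime per year, so a single outage of about an hour would already use up the whole annual budget.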

    14. Re:Murphy's law by turing_m · · Score: 1, Informative

      A key clue to this being a joke is the use of the word 'fail' which these days is often associated with LOLcats. Those damn cats have raised the use of deliberately bad grammar to an artform in and of itself.

      I did give the meaning of his post a good deal of thought before replying. I am familiar with LOLcats, LOLCODE and more than my share of despair.com inspired "fail" jpegs. Whether his reply was a joke or not could go either way. On one hand, structuring the response so that the "joke" is where a punchline would be is one clue. The "fail" is another minor indication, but not enough to sway. If he had made the degree of epicicity of the fail explicit (even in a binary fashion), that would be sufficient for me to conclude that it was a joke - but he didn't.

      OTOH, there was nothing particularly egregious about the OP's grammar or post to warrant the overly pedantic and snarky reply. If the OP hadn't used the semicolon, there wouldn't have been enough of a pause for "unlike yours truly." to have much comedic effect. While not Strunk and White kosher, surely it's acceptable for slashdot. At this point, you have a pointless post and what may or may not be a joke that succeeds or fails based on whether the OP has ridiculously bad grammar. Which equates to pointless post + unfunny joke (fail), or pointless post + unintentional grammatical error in the context of a grammatical correction (epic fail). Which in turn further condenses down to pointless post + fail, and if you consider a pointless post to be a fail, then fail + fail = fail, QED.

      --
      If I have seen further it is by stealing the Intellectual Property of giants.
    15. Re:Murphy's law by The+Hatchet · · Score: 1

      I think perhaps a better lens would be "If all the info and apps are run in one place, then I can disable that place and run this giant bank without worry" or something else terrible of that sort.

      If it ever happens, would you want all your proverbial guns in the safe that is standing between you and the badass that wants your wallet, identity, and liver/kidneys?? Or would you want to be holding them, ready to kick some ass? In a world where all the power lies in centralized systems, an authority of some kind can have your resources shut down in the blink of an eye, and you are done. It doesn't matter if you are good or bad or strange or up or down or charmed, you are screwed. Or if you have to drop a bill and internet comes up, do you want the hundreds of dollars in applications and gear to be worth absolutely nothing?

      You are talking from the business point of view here, and for them there is profit to be had centralizing. It reduces costs and boosts profits if done right. But it always hurts the consumer. It robs them of physical ability and power. It would be like having to log into your house, which exists in China, every night when you get home. Having to mail your laundry to Ethiopia to be washed on boards instead of in the washer sitting two rooms over. I don't want everything I have the ability to use and get use from riding on my ability to keep a job and pay insane, ever rising bills.

      --
      Where is the mod rating for "scary"? Also, ...
    16. Re:Murphy's law by The+Hatchet · · Score: 1

      That is within the cloud, and those problems will always exist because redundancy is expensive and inefficient in a next-quarter stock price sense. And next quarter's stock price is the only thing that matters to the average shareholder that matters. That would be like giving a drug addict a pound of pot and telling him there are no consequences now, but there is a chance that someday something might go terribly wrong and you will be screwed. It just doesn't work. There is no incentive, no motivation for the company to be redundant in its power, storage, or anything.

      But regardless of that, do you want someone standing between you and your applications, data, abilities? I wouldn't want someone standing between me and a gun, me and my money, me and my life, me and my anything mine really. Regardless of the technical efficiency it still fails as a good concept. It only helps shareholders of the clouding company, everyone else be damned.

      And when we realize that we are paying a company that says we are to be damned, we should probably stop. Fat chance though. As in BioShock, he who has the ADAM owns the place. He who has the IP owns the place.

      You own data? worthless. Corporation with resources owns data? priceless. = problem.

      --
      Where is the mod rating for "scary"? Also, ...
    17. Re:Murphy's law by nmg196 · · Score: 1

      > but you still fail to grammar.

      We *all* fail to grammar occasionally. Especially you.

    18. Re:Murphy's law by Anonymous Coward · · Score: 0

      Mind explaining this joke for us not gifted with English as a native tongue?

    19. Re:Murphy's law by FuckingNickName · · Score: 0, Offtopic

      You are correct that a semicolon is inappropriate here. But

      it's improper to begin a sentence with a conjunction

      is pure bullshit, the sort of arbitrary rule magicked up by English teachers which has frustrated people what speak good English but feel that a language exists for the purposes of communicators, not bureaucrats.

      And to quote the oft-cited New Fowler's, "There is a persistent belief that it is improper to begin a sentence with And, but this prohibition has been cheerfully ignored by standard authors from Anglo-Saxon times onwards. An initial And is a useful aid to writers as the narrative continues."

    20. Re:Murphy's law by Anonymous Coward · · Score: 1, Informative

      I have seen that myself. When a PHB tells me "security has no ROI", I die a little.

      I have been at well architected datacenters. There is a reason they have two physical drops for lines on the building, and it is exactly to deal with the backhoe problem. This way, if someone cuts their primary connection, it would fail over to the peer. Power was distributed the same way, so even though most of the building might go dark, the CRAC/HVAC system would keep running. If both lines got cut, then that is what the diesel generator and batteries were for (batteries were replaced every 2 years.)

      Having multiple drops for anything is a requirement for anything Tier III or above, and in reality needs to be a part of Tier II. For Tier I it doesn't need to be an issue, but people have to realize that the risk is there and decide whether the company can handle multi-day downtime. Failing to heed this is a misrepresentation of quality of service.
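
The failover order described above (primary drop, then the second drop, then generator and batteries) can be sketched as a simple priority list. This is only an illustration under stated assumptions: the source names and the `select_power_source` helper are hypothetical, and a real transfer switch is hardware with timing and load concerns that this ignores.

```python
from typing import Dict, Optional

# Preference order taken from the comment above; the names are hypothetical labels.
POWER_SOURCES = ["primary_drop", "secondary_drop", "generator"]


def select_power_source(healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy source in preference order, or None if all are down."""
    for name in POWER_SOURCES:
        if healthy.get(name, False):
            return name
    return None  # nothing left: the site goes dark


if __name__ == "__main__":
    # A backhoe cuts the primary drop; the second drop should take over.
    status = {"primary_drop": False, "secondary_drop": True, "generator": True}
    print(select_power_source(status))
```

The point of the ordering is that the generator is only reached when both utility drops are gone, which is the case the comment says the diesel and batteries exist for.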

    21. Re:Murphy's law by imakemusic · · Score: 1

      Yeah. My spelling is perfect but I still get my grammer wrong sometimes.

      --
      Brain surgery - it's not rocket science!
    22. Re:Murphy's law by dougisfunny · · Score: 1

      Grammar Nazi's are nothing, its the Grammaton Clerics you have to look out for.

      --
      This is not the funny you're looking for.
    23. Re:Murphy's law by tehcyder · · Score: 1

      Mind explaining this joke for us not gifted with English as a native tongue?

      There was no joke, so no amount of explanation would help.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    24. Re:Murphy's law by jo42 · · Score: 1

      Without fail, there is no learn.
      Some are just too stupid to learn.

    25. Re:Murphy's law by Wovel · · Score: 1

      Do you keep all your money in a mattress?

    26. Re:Murphy's law by Anonymous Coward · · Score: 0

      Don't worry. I'm a technical writer and I need the work. Send it all my way.

    27. Re:Murphy's law by Bing+Tsher+E · · Score: 1

      Or if you have to drop a bill and internet comes up, do you want the hundreds of dollars in applications and gear to be worth absolutely nothing?

      That can already be said about your Electric Power bill.

      I keep a non-electric pencil sharpener on hand. It's not screwed to the wall, but I still have one or two manual screwdrivers I could use to install it in a pinch. Even a hand crank drill to drill the screw holes, actually.

    28. Re:Murphy's law by Loki_1929 · · Score: 1

      So how do you enjoy working at the 1&1 datacenter in Kansas?

      --
      -- "Government is the great fiction through which everybody endeavors to live at the expense of everybody else."
    29. Re:Murphy's law by Dishevel · · Score: 2, Funny

      TLDR ; lol?

      --
      Why is it so hard to only have politicians for a few years, then have them go away?
    30. Re:Murphy's law by Dishevel · · Score: 1

      You are correct that a semicolon is inappropriate here. But

      it's improper to begin a sentence with a conjunction

      is pure bullshit, the sort of arbitrary rule magicked up by English teachers which has frustrated people that speak English well but feel that a language exists for the purposes of communicators, not bureaucrats.

      And to quote the oft-cited New Fowler's, "There is a persistent belief that it is improper to begin a sentence with And, but this prohibition has been cheerfully ignored by standard authors from Anglo-Saxon times onwards. An initial And is a useful aid to writers as the narrative continues."

      That is all.

      --
      Why is it so hard to only have politicians for a few years, then have them go away?
    31. Re:Murphy's law by Anonymous Coward · · Score: 0

      Conclusion:

      If you gonna make bad intentional grammar, not subtle too much, so it not flies above /. reader head. Make did whooshes over moderators even, you sees.

      AC to no mod undoes, this thread no, other.

    32. Re:Murphy's law by Smallpond · · Score: 1
    33. Re:Murphy's law by TheRaven64 · · Score: 1

      Funny thing, I thought "cloud" computing means that you're placed into an automatically redundant network of machines, so if there's a site wide outage it didn't interfere with the operations.

      Cloud has no technical meaning. It has a management meaning, which is very valuable. The management meaning is 'someone else's problem'. The phrase 'put it in the cloud' is management buzzword speak for 'make it someone else's problem'. It doesn't mean it's more reliable (it can mean quite the reverse), it means that management has someone outside the company to blame when it breaks.

      --
      I am TheRaven on Soylent News
    34. Re:Murphy's law by The+Hatchet · · Score: 1

      Banks are legally forced to make sure I am allowed my own money, and with only a checking/debit account, I am not taking the risks that many take with their money. I always have enough on me for a week or so, in case something happens.

      Banking is highly regulated, extremely so to avoid panic and terrible things. Plus I don't keep all my value in one medium or one place. Just a smart, convenient way to live.

      On the other hand, cloud computing and app licensing are about as regulated as product quality in China. There is barely anything stopping them from escalating the bullshit, and watching this escalation I avoid the bullshit. If you were a Jew in Germany in 1938, would you have stayed because you made a comparison to keeping your money in the bank? An escalation of hostilities is what it is, regardless of where. My bank shows no intention of raping me for my money, whereas these companies are basically screaming "HAIL DRM! HAIL DRM! HAIL DRM!"

      What is it about you /.ers and making terrible comparisons??? Every situation is not the same just because you make a shitty comparison. Just because you say "well, if you drive a red car, it's like eating sushi so you're a communist". The logic behind the comparisons falls apart the second anyone actually would bother to engage their brain. Do everyone a favor and engage it before commenting again.

      --
      Where is the mod rating for "scary"? Also, ...
    35. Re:Murphy's law by The+Hatchet · · Score: 1

      Yea but an electric bill is not gone till after you drop more superfluous services. I doubt anyone would cancel their electric before their internet. Again, with the shitty comparisons.

      Hell, most on welfare can even afford electric bills. That is not exactly much of a risk. But internet? If it has to go it has to go.

      Let me ask you, what happens when the police guns have handprint verification to fire, and link back up to centralized servers to check the handprint? You want to rob a bank, just knock out power to the police station for a couple of hours. What are they gonna do, throw the bullets?? The reality is we are currently pushing some of these systems into place.

      You can say it about an electric bill, but it is a different situation, and due to the fact that most can pay their electric, it's a terrible comparison. Really it is. Slashdotters love making shitty comparisons that are invalid to the original subject.

      --
      Where is the mod rating for "scary"? Also, ...
    36. Re:Murphy's law by JWSmythe · · Score: 1

          When I worked for a large provider, we actually did what the "cloud" was supposed to offer. We ran redundant servers in geographically diverse locations. No one knew if I yanked a server, or a switch went down, or even the datacenter had a power outage. The site just worked. I'd know, because I'd get paged. I'd see the bandwidth shift on our graphs. Other than that, no one cared. Well, I cared. If it was a whole datacenter, I'd call, let them know we saw the problem, and get the estimated time for the repair, for my own reference.

          All of our "magic" ability was our redundancy, and enough extra bandwidth to take whatever was thrown at us. I don't know if they even do it the same way any more. People have reported outages to me, to which I simply respond, "I don't work there any more. I don't care. I couldn't do anything about it if I did care."

      --
      Serious? Seriousness is well above my pay grade.
    37. Re:Murphy's law by JWSmythe · · Score: 1

          So you are one of those damned sushi eating communists, aren't you? :)

          On the money, the banks have other rules that can be nasty.

          On the Friday before Christmas one year, Wells Fargo decided that there was fraudulent activity on my account. All I had done was deposit my paycheck that cleared fine, and written bills from it. I had bought a plane ticket home, and reserved a rental car there. When I went Friday to deposit my last paycheck before I left, the ATM seized my card. They were nice enough to inform me that my account was frozen due to fraud. Two hours later and a lot of screaming (I was polite at first, but that didn't last beyond the first half hour), they agreed to give me some of my money in cash, but the account itself was still frozen. Every bill check that I sent out bounced. My plane ticket was cancelled (by them). A friend of mine covered my expenses for the trip, and it wasn't until about two weeks later that I finally got it straightened out. Over the following month I got all my bills straightened out by paying them from a new bank account at another bank.

          You're guaranteed your money, but it isn't always immediate access. Sometimes it is easier to keep your money under your mattress. I keep enough cash handy to survive for a week or so, just in case it happens again.

      --
      Serious? Seriousness is well above my pay grade.
    38. Re:Murphy's law by SleazyRidr · · Score: 1

      Or as I like to call it, Muphry's law.

      (I'm lazy, you can find the page on wiki.)

    39. Re:Murphy's law by FuckingNickName · · Score: 1

      *Facepalm*. You've missed a pre-Internet meme as old as the hills, and haven't even corrected it proper. You mean:

      people who speak English well

    40. Re:Murphy's law by The+Hatchet · · Score: 1

      Yea, but even that is smooth enough, if you keep enough on you. Errors like that occur typically due to protective measures, and Wells Fargo ain't exactly the most wonderful company out there. With my Bank of America account, there have been potential issues, but their system for straightening it out has allowed me to do everything I need to quickly, easily, and without hassle.

      We are talking about applications though, and similarly, if you keep all of your software in a bank account far away, that has none of the restrictions our banks do, and have little to none in your own power, you are much more at risk. It would be more like keeping all your money in a bank account with a loan shark in the next state. Sure it is still technically yours, but some tough policies can rob you out, just like cloud based computing.

      --
      Where is the mod rating for "scary"? Also, ...
    41. Re:Murphy's law by onionman · · Score: 2, Insightful

      Here's a wacky thing: the plural form of someone else is actually someone's else.

      Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though.. does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...

      Yes, I certainly meant "possessive," not "plural," and I don't claim any expertise at all with language. (I'm a math professor in part because I was always so bad at writing.)

      Anyway, an English professor whom I asked about the puzzle explained to me that the correct, although archaic form is indeed someone's else. I pointed out to her that many on-line references use someone else's as the possessive form, and she explained that many on-line references are written by individuals who are catering toward the "business writer."

      Evidently, the business audience isn't so much concerned with what is correct grammatically as opposed to what sounds correct because it is used most frequently. Hence, sites like dictionary.com will often list the most common usage even if it isn't technically correct.

      For example, if you want to refer to the car belonging to the attorney general, it would be the attorney's general car not the attorney general's car. However, most readers would find the first form off-putting, so a business writer would prefer the second.

      Of course, this leads to an endless digression as to grammar being a fixed set of rules to hold the language together as a standard or an amorphous description of common usage which must change with the times.

      Well, I should probably stop commenting on this before I get too many more "offtopic" mods.

    42. Re:Murphy's law by JWSmythe · · Score: 1

          You're so right. On anything, when you trust others to manage your resources, you have to trust them.

          I just helped someone move their domain. Something happened (long story), and they couldn't do anything with it. They didn't have a web page, but they did have lots of email going through there. It took a while to get it released to my control, but it finally got done.

          I prefer not to trust anyone else with things I can do myself. That includes hosting, money, and fixing my car (I had to throw that in).

      --
      Serious? Seriousness is well above my pay grade.
    43. Re:Murphy's law by bmullan · · Score: 1

      EC2 is designed as an IaaS which lets you, the consumer, decide how much HA and redundancy you want to implement.

      AWS S3 ensures all data is stored redundantly in separately powered, geographically dispersed facilities.

      AWS's EC2 supports load-balancing and auto-scaling. Using both of those you can ensure that failure of any one "instance" is quickly repaired by starting another instance... usually 5-6 minutes. You can and do control which "zones" an instance starts in, so if you want safety you can implement whatever level you need.

      I doubt major corporations have any better operations than IaaS providers like AWS, Rackspace etc. You plan for the unexpected.
      I've helped work on major outages at dozens of Fortune 100 companies so I know that no one is immune.
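
The idea above of spreading instances across "zones" and starting replacements when one fails can be sketched as a small simulation. This is a toy model under stated assumptions, not AWS's API: the zone names, `place_instances`, and `recover_from_zone_failure` are all hypothetical, and real auto-scaling also involves health checks, launch delays, and capacity limits.

```python
from collections import Counter
from itertools import cycle
from typing import List


def place_instances(count: int, zones: List[str]) -> List[str]:
    """Spread `count` instances round-robin across `zones`; returns one zone per instance."""
    if not zones:
        raise ValueError("need at least one zone")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def recover_from_zone_failure(placement: List[str], failed_zone: str, zones: List[str]) -> List[str]:
    """Drop instances in `failed_zone` and start replacements in the least-loaded survivors."""
    survivors = [z for z in zones if z != failed_zone]
    if not survivors:
        raise RuntimeError("no surviving zone to start replacements in")
    kept = [z for z in placement if z != failed_zone]
    lost = len(placement) - len(kept)
    load = Counter({z: 0 for z in survivors})
    load.update(kept)  # count instances still running in each surviving zone
    for _ in range(lost):
        target = min(survivors, key=lambda z: load[z])  # least-loaded surviving zone
        load[target] += 1
        kept.append(target)
    return kept


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
    placement = place_instances(6, zones)
    print(Counter(recover_from_zone_failure(placement, "zone-a", zones)))
```

In this sketch the total instance count is restored after a zone is lost, which is the behavior the comment attributes to combining load-balancing with auto-scaling; if every instance had been placed in one zone, there would be nothing left running while replacements start.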

    44. Re:Murphy's law by turing_m · · Score: 1

      gj, lol

      --
      If I have seen further it is by stealing the Intellectual Property of giants.
    45. Re:Murphy's law by daybot · · Score: 1

      LOLcats? srsly? I guess you're one of those oldies who still says ROTFL instead of rofl or pastes a link to the roflcopter...

    46. Re:Murphy's law by Fex303 · · Score: 1

      fail + fail = fail, QED

      Nope. fail + fail = win

      But thanks for playing.

  2. Farmville updates on Facebook stop by kriston · · Score: 5, Insightful

    And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
    Nothing of value was lost.

    --

    Kriston

    1. Re:Farmville updates on Facebook stop by Anonymous Coward · · Score: 0

      Except for meeeeee.....my stuff Wilted!!!! QQ /sarcasm

  3. Where's your cloud now? by TooMuchToDo · · Score: 4, Funny

    "The cloud" doesn't solve everything. Film at 11.

    1. Re:Where's your cloud now? by Anonymous Coward · · Score: 1

      Amazon EC2 is just a Xen VM, not true cloud computing. Discuss amongst yourselves.

    2. Re:Where's your cloud now? by Anonymous Coward · · Score: 0

      "The cloud" doesn't solve everything. Film at 11.

      Hey, you. Get off of my cloud.

    3. Re:Where's your cloud now? by Anonymous Coward · · Score: 0

      'cloud computing' is completely meaningless.

    4. Re:Where's your cloud now? by GaryOlson · · Score: 1

      But a thick cloud with high density can cover up a lot of ugly infrastructure no one wants to see. Just ask the people who live in San Francisco.

      --
      Every mans' island needs an ocean; choose your ocean carefully.
    5. Re:Where's your cloud now? by Sarten-X · · Score: 3, Funny

      The definition is a very nebulous concept.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    6. Re:Where's your cloud now? by plover · · Score: 4, Funny

      I'm kind of foggy on the details myself.

      --
      John
    7. Re:Where's your cloud now? by dakameleon · · Score: 1

      Not much chance it'll be reigned in any time soon, though.

      --
      Man who leaps off cliff jumps to conclusion.
    8. Re:Where's your cloud now? by SeaFox · · Score: 1

      Read the article, and enlightenment will hit you like a bolt out of the blue.

    9. Re:Where's your cloud now? by roman_mir · · Score: 1

      well the details are a bit obscure but I hear that what it is, is billions of tiny droplets of water that are suspended in the air above ground, not sure how that helps with computing though.

    10. Re:Where's your cloud now? by L4t3r4lu5 · · Score: 4, Funny

      I'm sorry, I don't get the joke. I must have mist something.

      --
      Finally had enough. Come see us over at https://soylentnews.org/
    11. Re:Where's your cloud now? by Richard_at_work · · Score: 1

      Nice generalisation - *this* cloud doesn't solve anything (but I never considered EC2 to be properly cloud anyway). It's the equivalent of how one vendor's badly designed RAID card doesn't invalidate the entire concept of RAID.

    12. Re:Where's your cloud now? by Anonymous Coward · · Score: 0

      Aw, dude... Rough pun.

    13. Re:Where's your cloud now? by arndawg · · Score: 2, Funny

      yeah the cloud concept is pretty ashy these days.

    14. Re:Where's your cloud now? by tylerni7 · · Score: 2, Informative

      Sure, but a thick cloud with high density can also cover up a lot of important things, like roadways and utility poles.

    15. Re:Where's your cloud now? by MistrBlank · · Score: 1

      Agree. Cloud Computing is a salesman to manager keyword.

    16. Re:Where's your cloud now? by mortonda · · Score: 1

      weighing the options, I don't see the con - dense, I am.

    17. Re:Where's your cloud now? by mortonda · · Score: 1

      What can you dew with this technology?

    18. Re:Where's your cloud now? by Rich0 · · Score: 1

      I think the key is to build more tolerance into the app layer.

      The biggest problem is replicating the data. If your data is replicated across multiple EC2 availability zones then in theory you can just launch new instances in the new zones and be up and running again. If you have a plan for doing that then you could be up fairly quickly I'd think.

      There are different ways to approach this sort of thing. One is trying to keep servers from ever going down. The other is to let them go down anytime they want to and be ready to handle that. The latter really seems more cloud-like to me.

      Ah, the beauty of using ephemeral terminology. We can both use the same words and completely disagree about what they mean... :)

    19. Re:Where's your cloud now? by Anonymous Coward · · Score: 0

      You can't be Cirrius.

    20. Re:Where's your cloud now? by StikyPad · · Score: 2, Funny

      Don't worry, it's over your head.

    21. Re:Where's your cloud now? by PowerEdge · · Score: 1

      No, joke. Cirrusly. This is a very cirrus situation. I am cirrus.

    22. Re:Where's your cloud now? by eggman9713 · · Score: 1

      I'm sorry, I don't get the joke. I must have Myst something.

      --There, fixed that for you. ;)

  4. The poll... by Dayofswords · · Score: 1

    The poll goes perfectly with the story.

    The cloud is nice, but unreliable, it is.

    --
    Someday we'll hit the human carrying capacity. And the band will just play on.
    1. Re:The poll... by Jeian · · Score: 1

      It has nothing to do with "the cloud", other than that the datacenter affected happened to host one. It could've been a dedicated server and it would've had the same problem.

    2. Re:The poll... by jamesh · · Score: 1

      Is it actually worse than maintaining your own server room though? I test our UPS every 6 months and it works flawlessly every time, but the last two power failures caused it to drop the load instantly. Nothing is perfect, the difference with the cloud is when it goes wrong it can go wrong on a large scale and you are more likely to read about it in the news.

    3. Re:The poll... by Richard_at_work · · Score: 1

      At my last job, our UPS worked flawlessly - until a fan needed replacing, and the switch back from maintenance bypass to protected flow caused a massive overvolt condition on two of the phases, killing a large chunk of our switches, PCs and a lot of redundant power supplies in the servers.

      When safety equipment goes bad, there's not a lot you can really do.

    4. Re:The poll... by Anonymous Coward · · Score: 0

      Why, do you think that they could do without a datacenter? Take cloud computing and remove the buzzword.

      Now you have this cluster of virtual machines that ought to be able to migrate live, following demand and resource availability. And this is nothing new.

      If the connection to the datacenter is severed, all the VMs back there will be dead. No chance for a planned migration and a nice shutdown. All the jobs currently running on those machines are aborted (but I hope you run concurrent virtual machine job queues with a distributed and redundant control network to replay failed jobs - pretty much standard practice for high availability/grid computing stuff).

      If the job dispatcher/coordinator was in the same datacenter, that is lost too. Any shared resource not synced with all the geographically distributed VMs before the outage has to be considered lost, probably corrupted.

      Those were the limits of grid computing back then, and those are its limits today, even with the cloud computing buzzword in between. Now, if you have an infrastructure critical enough to require the premium price of a geographically distributed redundant network, what would you choose, removing the marketing buzzword:

      The new player, with only two years of actual service, no proven resilience, no public service availability statistics?

  5. It's failure on multiple levels by GilliamOS · · Score: 5, Insightful

    Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection to the power grid, and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

    --
    "There might be intelligent beings created by God in outer space even if there are none here on Earth." -Anonymous
    1. Re:It's failure on multiple levels by Anonymous Coward · · Score: 0

      Civilized countries bury their cables underground.
      Only the US seems to use poles anymore.

    2. Re:It's failure on multiple levels by OnlineAlias · · Score: 4, Insightful

      You said it. They failed to test. I design/run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers...you could nuke one of them and everything would still be up.

      I'm actually a little blown away that the all powerful Amazon could possibly let this kind of thing happen. They are supposed to be pro team, a power failure is high school ball.

    3. Re:It's failure on multiple levels by fractalVisionz · · Score: 4, Informative

      It seems you didn't RTFM. Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

    4. Re:It's failure on multiple levels by GaryOlson · · Score: 4, Interesting

      Most Americans these days are over-pampered self-absorbed malcontents. If the poles are not out in front where crews can service without going on property -- or even using predefined right of ways -- too many people complain or sue for negligible property damage.

      Where I grew up, the power poles ran on the property lines behind and between the houses. Once, lightning took out the transformer on the power pole [great light show and high speed spark ejection] ; and people were willing to take down the fence, put the dogs in a kennel, and remove landscaping which had encroached on the power pole so the crew could replace the transformer and other service. Today, I expect everyone shows up with a digital camera to document "property damage" to file for compensation for landscaping which has illegally encroached on the equipment.

      In many places, various issues prevent burying the power cable: a high water table, daytime temperatures which do not cool the ground (and the power cables), or even fire ants.

      --
      Every mans' island needs an ocean; choose your ocean carefully.
    5. Re:It's failure on multiple levels by Itninja · · Score: 1

      No outage, hardly anyone even noticed.

      So, how is this different? A teeny, tiny percentage of users even noticed this and no data was lost. It's foolish to think one's data center is immune to outages (power or otherwise) from time to time, no matter how well it's designed. But apparently this is the latest in several outages over the past few weeks which is kind of like amateur hour.

      --
      I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
    6. Re:It's failure on multiple levels by MichaelSmith · · Score: 0

      Civilized countries bury their cables underground.
      Only the US seems to use poles anymore.

      You need to get out more.

    7. Re:It's failure on multiple levels by ToasterMonkey · · Score: 1

      From the summary I gathered the problem was with the mechanical switch that disconnects external power when the generators are brought online, not a lack of capacity. Still requires testing, but it isn't going to be done often because isn't this the result when the power doesn't transition smoothly?

    8. Re:It's failure on multiple levels by DogDude · · Score: 1

      Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

      That's why Google is locating all of their datacenters near natural power sources and is a registered utility agent whatchamajigger. I think that they agree with you.

      --
      I don't respond to AC's.
    9. Re:It's failure on multiple levels by profplump · · Score: 1

      It is, but you can test during pre-defined maintenance windows when downtime is expected, or you can migrate active services to other hosts and leave these running as backups during the test, so that a failure does not bring down the primary.

    10. Re:It's failure on multiple levels by omglolbah · · Score: 1

      It is much better to have a scheduled test with people ready to take care of any issues that may or may not pop up than to have a piece of equipment fail at a random time with few prepared...

      I sure as hell would rather have a blip every now and then than knowing that the system might fail catastrophically when something unexpected happens..

    11. Re:It's failure on multiple levels by TubeSteak · · Score: 4, Insightful

      Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

      Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
      Or is it normal to just plug in mission critical hardware and not check that it is set up properly?

      "We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.

      I guess TFA answered that question.
      If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.

      --
      [Fuck Beta]
      o0t!
    12. Re:It's failure on multiple levels by afidel · · Score: 1

      Very true, especially since burying the kind of cables that a datacenter requires is very expensive and can lead to some interesting failure modes that don't happen with tower mounted high voltage lines =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    13. Re:It's failure on multiple levels by RoFLKOPTr · · Score: 1

      Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection to the power grid, and the power grid for not having redundancies.

      It's not a matter of testing. These systems aren't things that you can just "test", because what if there is a problem? Then you have intentionally shut off power to your entire datacenter. Otherwise you could have scheduled downtime and just assume everything will fail, so have everybody shut off their servers in advance just in case, but then how often can you do that?

      No, it's a problem with the fundamental design of the power backup systems. I know somebody in charge of the electrical end of constructing CaliforniaISO's new headquarters and datacenter. They manage California's entire power grid. They have two redundant utility power connections that stay separate all the way to each server rack which each have two power inputs and a fail-over switch. Each side runs off utility IN PARALLEL to battery backup. If utility fails, no switching needs to be made because power will already be running through battery backup, and can stay that way for 24 hours per side. As soon as the utility fails, both backup generators will start and will be able to power the entire building within two minutes and I think they each have enough fuel for two weeks. Each generator is tied into separate sides of the power system, and each has their own separate facilities and their own separate fuel tanks the size of tanker trucks. Each rack will also have a local UPS unit that can keep the power flowing for about 10 minutes... enough to at least get the servers shut off safely.

      That, my friend, is how you set up a UPS system. No testing required, because the entire system can't possibly fail short of some(body/thing) getting around the incredible building security and destroying cables or equipment. But no amount of testing or redundancy could possibly foresee or stop that. Pretty much the only fail points would be the switches in each rack, but the entire datacenter also has multiple points of data and processing redundancy making that a non-issue as well.

    14. Re:It's failure on multiple levels by Anonymous Coward · · Score: 0

      I design/run datacenters

      Can I have a job please?

    15. Re:It's failure on multiple levels by DerekLyons · · Score: 2, Insightful

      Amazon for not load-testing their emergency backup power on a regular basis

      And you know they don't test it how? Oh, right. Testing is a magic wand that solves everything - except it doesn't. I've seen stuff fail literally seconds after being successfully tested. Welcome to the real world.
       

      and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

      Horseshit. This has nothing to do with the grid, and everything to do with local supplies - which rarely if ever have redundancy. (Mostly because it increases the difficulty and cost of maintenance and considerably increases the capital cost - while only providing a benefit in a one-in-a-million situation.)

    16. Re:It's failure on multiple levels by jibjibjib · · Score: 1

      I'm in Melbourne, Australia and we have almost all our power cables above ground.

    17. Re:It's failure on multiple levels by L4t3r4lu5 · · Score: 1

      you could nuke one of them and everything would still be up.

      How dramatic. The upstream ISP could also disconnect one for lack of payment of bills, but that wouldn't be nearly as exciting as nuking it, would it!

      Bravo on designing in redundancy, though. Maybe Amazon is hiring...

      --
      Finally had enough. Come see us over at https://soylentnews.org/
    18. Re:It's failure on multiple levels by bbn · · Score: 1

      Some countries, like the one I live in, have all cables in the ground. That includes the high power cables powering industries much more power hungry than a small datacenter.

      It is slightly more expensive. Therefore countries that value the esthetics of the landscape will bury the cables. Less developed countries will have them on poles...

    19. Re:It's failure on multiple levels by maxume · · Score: 1

      That's also well explained by them trying to get electricity as cheaply as possible.

      --
      Nerd rage is the funniest rage.
    20. Re:It's failure on multiple levels by tehcyder · · Score: 1

      It seems you didn't RTFM. Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.

      You're missing the point. If a chain has one million links, it doesn't matter how good the other 999,999 are, if just one fails you've got a fucked chain.
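
      The chain argument can be put in numbers. A rough sketch, assuming links fail independently and using made-up reliability figures (0.999 per link, 0.9 per redundant path) that are not from the article: a series chain holds only if every link holds, while redundant parallel paths hold if at least one does.

```python
def chain_reliability(link_reliability: float, links: int) -> float:
    """Probability that every link in a series chain holds (independent links)."""
    return link_reliability ** links


def parallel_reliability(path_reliability: float, paths: int) -> float:
    """Probability that at least one of several independent redundant paths holds."""
    return 1.0 - (1.0 - path_reliability) ** paths


# Illustrative numbers only: the longer the chain, the lower the odds it holds.
for n in (1, 10, 100):
    print(n, "links:", round(chain_reliability(0.999, n), 4))

# Redundancy works the other way: two mediocre paths beat one.
print("two redundant paths:", round(parallel_reliability(0.9, 2), 4))
```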

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    21. Re:It's failure on multiple levels by mlts · · Score: 1

      Actually, any professional grade UPS is set up this way. Online systems have the data center power always coming off the batteries 24/7. Only if there is a complete battery failure does power get switched to the utility company. This is opposed to standby UPSes that have power going from the utility, and try to switch over to battery. This also provides very clean power, so brownouts or spikes do not affect the equipment.

      Any serious data center has an online UPS like this where someone can flip off the utility power, and the DC will not skip a beat until the batteries croak. However, by then, the diesel generator should be on and working.
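
      The difference between the two topologies can be sketched as a small truth-table model. This is a simplification built only from the description above (no generator, no battery runtime), and the function and parameter names are hypothetical.

```python
def load_stays_up(topology: str, utility_ok: bool, transfer_switch_ok: bool,
                  battery_ok: bool = True) -> bool:
    """Return True if the protected load keeps power in this simplified model."""
    if topology == "online":
        # Load always runs from the battery path; utility only feeds the batteries.
        if battery_ok:
            return True
        # Only a battery failure forces a switch over to raw utility power.
        return utility_ok and transfer_switch_ok
    if topology == "standby":
        # Load runs from utility; the switch must work the moment utility drops.
        if utility_ok:
            return True
        return transfer_switch_ok and battery_ok
    raise ValueError(f"unknown topology: {topology}")


# Utility power lost and the transfer switch misbehaves:
print("online: ", load_stays_up("online", utility_ok=False, transfer_switch_ok=False))
print("standby:", load_stays_up("standby", utility_ok=False, transfer_switch_ok=False))
```

      In this model a bad transfer switch takes down the standby design as soon as utility power drops, while the online design only depends on the switch after its battery path has already failed.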

    22. Re:It's failure on multiple levels by afidel · · Score: 1

      Less developed like Germany, France and the US? Burying 120 kV lines isn't just a little more expensive, it's a LOT more expensive and like I said the failure modes and causes are interesting. I'm assuming from your posting history you're from Denmark. Denmark is about 1/4 the area of the state I live in. It's easy to do ridiculously expensive things if you only have to do them on a small scale.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    23. Re:It's failure on multiple levels by bbn · · Score: 1

      Germany has higher population density than Denmark.

      I do not see any powerlines on every streetcorner in the cities in Germany like is often seen in USA. Those have been buried for half a century in most of Europe. Also in some areas in the states.

      Apologies for the choosing of words "less developed". It was meant as a joke on countries or states that choose to spend less on esthetics and not as a statement to the usefulness of spending money on such things.

      Cables in the ground do tend to be better protected however, which is part of why we bury them here. Long term it might even be more cost effective that way.

    24. Re:It's failure on multiple levels by drinkypoo · · Score: 1

      If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.

      If they were smart, they wouldn't have this problem: It makes basically zero sense to have network hardware preconfigured for you. You could script the initial config load with expect, it's not like you need a brain.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    25. Re:It's failure on multiple levels by egcagrac0 · · Score: 1

      If a system isn't tested, you can't know that it works.

      Test it before you go live, if you have to. Manually switch to a backup system first if you must, to test only the offline parts.

      If you don't test it, you don't find problems. You must assume that it won't work.

      Push the button once a week. Make sure the generators start.

    26. Re:It's failure on multiple levels by david_thornley · · Score: 1

      Technically, the power grid is the overall distribution system, based on lines of about 57 kilovolts on up to whatever they can switch these days. That's a true grid, and not what was at fault.

      Once that power hits a substation, and gets turned into something halfway usable, it's distributed in what is topologically a tree rather than a grid. Running an actual grid is tricky, and is generally only worth it on the large scale. (It's entirely possible that new computerized control systems will allow distribution to be on a grid, since they'll be able to deal with instabilities and possible feedback loops. I don't think such are in general use today.)

      The tree will of course have switches that can change the topology and provide power to certain sections by different routes, but that is done by human decisions, once the problem is well-enough known. They aren't in any way a substitute for a good power-switching system in the data center.

      Moreover, this was a utility pole near the data center. You will never get multiple paths at the really local level as a matter of course. All the houses on my block run from the same transformer, and if it had any redundancy it'd be in getting power to the transformer. Take out the pole with the transformer on it, and the block goes dark.

      It's possible that the data center could have had redundant connections to the distribution system, but unless they were from somewhat separated sections it wouldn't do much good. (If it was from somewhat separated sections it would have problems with the reporting: the dispatcher, on being told that half the power was out at the data center, wouldn't know which half. Last time I was involved with an outage management system, the decision was to not even try to disambiguate.)

      All of this is well known to anybody in the power distribution business, and Amazon had no business setting up a data center without knowing this. I see no reason to blame the utility company.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
    27. Re:It's failure on multiple levels by Slashdot+Parent · · Score: 1

      Actually, the failure is your own, for not reading the fine EC2 manual.

      Amazon is clear on this point: they are providing cheap VDSs on commodity hardware. Any individual VDS may terminate at any given point in time. As a result, you need to architect your app so that it tolerates the failure of a node.

      If you think about it, Amazon kind of has a point on this one. We can divide servers into two types: those that have failed, and those that have not failed yet. Too many applications are architected as though hardware and network failures never happen. They do happen (even in the best-run datacenters), and EC2 forces you to plan for that inevitable failure.

      What's your plan when a node or a switch fails in your physical datacenter? Or if your internet link is severed by an intoxicated backhoe operator? Will it cause you more acid indigestion than typing 'ec2-run-instances'?
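
      One way to picture "tolerates the failure of a node" on the client side, as a minimal sketch: try each node in turn and only give up when all of them have failed. The node names and the fake request function are hypothetical stand-ins, not anything from the EC2 tooling.

```python
class AllNodesFailed(Exception):
    """Raised when no node in the list could serve the request."""


def call_with_failover(nodes, request):
    """Try each node in order; return the first successful response."""
    errors = []
    for node in nodes:
        try:
            return request(node)
        except ConnectionError as exc:
            # Treat a dead node as routine: record it and move on.
            errors.append(f"{node}: {exc}")
    raise AllNodesFailed("; ".join(errors))


def fake_request(node: str) -> str:
    """Hypothetical stand-in for a real network call; node-1 is 'terminated'."""
    if node == "node-1":
        raise ConnectionError("instance terminated")
    return f"ok from {node}"


result = call_with_failover(["node-1", "node-2"], fake_request)
print(result)
```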

      --
      They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
    28. Re:It's failure on multiple levels by RoFLKOPTr · · Score: 1

      Test it before you go live, if you have to.

      Well no shit. Of course you test before you go live, but you can't exactly just shut off utility power to test things after you're up and running. And the generators are tested every week, but they can't do an actual live test and force the datacenter to run off genny power. They have a couple load banks outside next to the generators which are essentially massive toaster ovens the size of small U-Haul vans. They use those to put load on the generators.

      Still, though, you cannot test the ENTIRE power backup system on a mission-critical application, and if the system is designed right and has proper real-time monitoring equipment, you won't NEED to.

    29. Re:It's failure on multiple levels by egcagrac0 · · Score: 1

      If the system works, there's nothing to worry about.

      Testing transfers really shouldn't be a scary thing, particularly when grid power is available.

      I'd rather find out about a problem on a sunny afternoon when I can have more than enough techs standing by to fix the problem, than at 2am on a Sunday during a thunderstorm/earthquake. I know I don't speak for everybody.

    30. Re:It's failure on multiple levels by RoFLKOPTr · · Score: 1

      If the system works, there's nothing to worry about.

      But if it doesn't, then you just knowingly shut off power to your entire datacenter. You can't do that. You'd need scheduled downtime where you assume that the entire system will fail... and there are many applications where downtime is not an option. A system like I described above has almost no chance whatsoever of failure except in the event of massive catastrophe in which the building's structure is severely damaged, or equipment overheats, batteries explode, and the generators blow head gaskets. Those are not things you can discover or predict in procedural testing.

    31. Re:It's failure on multiple levels by Anonymous Coward · · Score: 0

      Are you deliberately missing the point? That's why you run power in parallel via multiple methods.

      Even my "datacenter" does this, and we only have about 1000 servers or so.

    32. Re:It's failure on multiple levels by egcagrac0 · · Score: 1

      I have to agree with AC here. You're missing the point.

      In any mission critical high availability system, testing is required.

      If you're not willing to unplug the power cord that supplies your UPS, you don't trust your UPS. Redesign your system.

      If you're unsure of the outcome, I have no problem with you saving all your files first... maybe even syncing the drives and remounting filesystems read-only... but you still need to pull the plug, or knock out a switch, or... whatever it takes to verify that things will fail appropriately.

      Don't be afraid to pull the plug. Thinking you have a backup when you don't is much worse than thinking you don't have a backup when you do. Even worse is selling a backup that you don't have.

      I know they do this in hospitals - everything switches to generator power (monthly, I believe). They find the glitches in the system. They fix the glitches. They want to be sure that when lightning takes out the substation 3 blocks away, their lights (and MRI machines, and the machine that goes "Ping!") come back on.

      If you're afraid of your backup systems, your backup systems do not provide a safety net. If your application is critical, you need to have absolute confidence that it will keep going. Testing is how you get that confidence.

      (Ever see a tightrope walker or trapeze artist? Ever seen them deliberately test their net by falling into it? I have.)

  6. Obvious solution by nebaz · · Score: 4, Funny

    Utility poles clearly need countermeasures. Hellfire missiles and such. That'll teach 'em to mess with a poor defenseless pole.

    --
    Rhymes that keep their secrets will unfold behind the clouds.There upon the rainbow is the answer to a neverending story
    1. Re:Obvious solution by binarylarry · · Score: 2, Informative

      Think of the poor strippers man!

      --
      Mod me down, my New Earth Global Warmingist friends!
    2. Re:Obvious solution by FooAtWFU · · Score: 1
      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    3. Re:Obvious solution by Ocyris · · Score: 1

      Install a CIWS on every utility pole and transformer. http://en.wikipedia.org/wiki/Phalanx_CIWS

    4. Re:Obvious solution by wronskyMan · · Score: 1, Insightful

      You know who else messed with poor defenseless Poles?

      --
      --- You shall know the truth, and the truth shall make you mad- Neal (not Cowboy) Boortz
    5. Re:Obvious solution by Anonymous Coward · · Score: 0

      "poor defenseless pole" and not one reply about Hitler or Nazi Germany? for shame!

    6. Re:Obvious solution by f3rret · · Score: 1

      Strippers don't use utility poles, gotta watch out for splinters you know.

      --
      Admit nothing. Deny Everything. Make Counter-accusations.
    7. Re:Obvious solution by Darth_brooks · · Score: 1

      That'll teach 'em to mess with a poor defenseless pole.

      I remember him. Wes Kowalski. Coke bottle glasses taped up in the middle, Pocket protector. The class bullies never left the poor kid alone. I didn't know he went to work for the utility company.....

      --
      There are some people that if they don't know, you can't tell 'em.
  7. ...in soviet russia by Konster · · Score: 0, Offtopic

    In Soviet Russia, utility pole hits YOU!

  8. What are you doing, Dave? by Bob_Who · · Score: 1

    Stop driving like a dork, Dave...I'm getting sleepy...

  9. An untested DR plan is a worthless DR plan by realmolo · · Score: 3, Interesting

    Seriously, Amazon screwed up in a fairly major way with this.

    What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?

    Answer: Nothing. You'd be surprised how many US small-to-medium-sized businesses are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.

    1. Re:An untested DR plan is a worthless DR plan by FictionPimp · · Score: 1

      The place I work just had the exact same problem. DC caps went bad and nobody noticed. Power went out and the backup didn't have enough juice to let the batteries kick in and move to the generator. At least I don't feel really bad now, just bad.

    2. Re:An untested DR plan is a worthless DR plan by Albanach · · Score: 3, Insightful

      Seriously, Amazon screwed up in a fairly major way with this.

      What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?

      What on earth leads you to suggest they don't have working disaster recovery? They experienced some disparate power outages and say they're implementing changes to improve their power distribution.

      I've hosted in data centers where the UPS was regularly tested, yet on a real live incident switchover failed. Even though the UPS did come up there was a brief outage shutting down all the racks. Each rack needs brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.

      I've hosted in another data center where someone hit the BIG RED BUTTON underneath the plastic case, cutting off power to the floor.

      I'm sure Amazon could have done things better and will learn lessons. That's life in a data center.

      Nonetheless, Amazon allow you to keep your data at geographically diverse locations. As a customer you can pay the money and get geographic diversity that would have mitigated. If you don't take advantage of that, you can hardly blame Amazon for your decision.

    3. Re:An untested DR plan is a worthless DR plan by AK+Marc · · Score: 1

      Same with data backups. People just put in untested redundancy, a backup program that says "completed" and live happy. At least until the first time something fails.

      Testing costs time and money. It's easier to point to a job status that says "completed" or an invoice for the right pieces and say "it was the vendor's fault."

    4. Re:An untested DR plan is a worthless DR plan by lena_10326 · · Score: 1

      Amazon has an insane amount of redundancy with dozens of physical data centers spread over the world. They regularly perform game day disaster scenarios taking out entire data centers to test the recovery of the infrastructure and Amazon applications.

      In this instance, you'll note only a few clients were impacted because a switch had an incorrect configuration. There is not much you can do about some types of human error, which can come from all sorts of unexpected angles. Regardless, a number of EC2 nodes were lost but were replaceable with EC2 nodes in other data centers. If clients lost data, then it was due to clients not following the principle of building redundancy into their applications. Amazon can only implement redundancy in the infrastructure, not in client applications. Amazon advises EC2 customers not to build single-node dependencies into their apps. This could not be made more clear in their documentation and support.

      You are very ignorant regarding your speculation about Amazon's infrastructure.

      --
      Camping on quad since 1996.
    5. Re:An untested DR plan is a worthless DR plan by crazybit · · Score: 1

      What on earth leads you to suggest they don't have working disaster recovery?

      The fact that their service was partially cut due to a power failure. We know accidents DO happen and power failures DO happen, like the explosion in The Planet's power control room [1]. The guys at Amazon cloud should be prepared for predictable problems like a power outage, especially when one of their selling arguments is service continuity.

      [1]

      --
      - Human knowledge belongs to the world
    6. Re:An untested DR plan is a worthless DR plan by TubeSteak · · Score: 1

      I've hosted in data centers where the UPS was regularly tested, yet in a real live incident the switchover failed. Even though the UPS did come up, there was a brief outage that shut down all the racks. Each rack needs to be brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.

      Doesn't the "U" in "UPS" stand for "Uninterruptible"?
      Soooo.. Forgive my ignorance, but how does hardware hooked up to a UPS have "a brief outage"?

      --
      [Fuck Beta]
      o0t!
    7. Re:An untested DR plan is a worthless DR plan by afidel · · Score: 1

      Uh, this is why Amazon tells you up front that if you want true HA you have to have VMs in multiple zones to ensure that they are served from different datacenters with no shared point of failure.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    8. Re:An untested DR plan is a worthless DR plan by afidel · · Score: 1

      That's what maintenance contracts with at least biannual preventative maintenance are for =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    9. Re:An untested DR plan is a worthless DR plan by thegarbz · · Score: 5, Interesting

      It is exactly that level of understanding that can cause most outages (and even failures of safety critical systems). There is one part of the UPS that is uninterruptible and that is the voltage at the battery. Between the voltage at the battery and the computer you have cables, electronics, control systems, charging circuits, and inverters. Beyond that if it's an industrial sized UPS there'll be circuit breakers, distribution boards, and other such equipment, each adding failure modes to the "uninterruptible" supply.

      I'll give you an example of what went wrong at my work (a large petrochemical plant in Australia). As in a lot of plants, most pumps are redundant and fed from two different substations. That doesn't prevent loss of power, but the control circuits in those substations run from 24V. Those 24V come from two different cross-linked UPS units (cross-linked meaning that both redundant boards are fed from both redundant UPSs). So in theory not only is there a backup at the plant, backup substations, and backup UPSs, but any component can fail and still keep upstream and downstream systems redundant.

      Anyway we had to take down one of the UPS for maintenance reasons following a procedure we'd used plenty of times before. The procedure is simple: 1. Check the current in the circuit breakers so that the redundant breakers can withstand the load, 2. close the circuit breakers upstream of the UPS that is being shut down, 3. Close main isolator to the UPS. So that's exactly what we did, and when we isolated one of the UPS, the upstream circuit breaker tripped from the OTHER UPS and control power was lost to half the plant as it was now effectively not only isolated from battery backup, but from the main 24V supply.

      So after lots of head scratching we did some thermal imagery of the installation. The circuit breaker which tripped in sympathy when we took down its counterpart was running significantly hotter than the main one. The cause was determined to be a loose wire. So even though the load through the circuit breaker was much less than 1/2 of the total load, when we took down the redundant supply and the circuit breaker got loaded, the temperature pushed it over the edge.

      A carefully designed dually redundant UPS system providing 4 sources of power failed when we took down 2 of them in a careful way, due to a loose wire in a circuit breaker. A UPS is never truly uninterruptible, and even internal batteries in servers would be protected by a fuse of some kind to ensure the equipment goes down, but ultimately survives a fault.

    10. Re:An untested DR plan is a worthless DR plan by FictionPimp · · Score: 1

      Yea, our fault was that we let our maintenance department handle the power. They apparently let everything slide. It's back under IT's control now. The thing had been emailing them failed test messages for some time.

    11. Re:An untested DR plan is a worthless DR plan by Albanach · · Score: 1

      The guys at Amazon cloud should be prepared for predictable problems like a power outage, especially when one of their selling arguments is service continuity.

      If you're looking for service continuity and using Amazon EC2, I'd hope you read their SLA first, since they offer a 99.95% SLA over 365 days for availability in a particular region. That's about four and a half hours downtime per year.

      Amazon will do their bit to meet that goal. If you want something more reliable, you need to host your service in more than one location. The Amazon cloud allows you to do this.
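
      The "four and a half hours" figure can be checked with a little arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the two-location figure assumes failures in the two locations are fully independent, which real outages do not guarantee.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in the 365-day SLA window


def allowed_downtime_hours(availability: float, hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted by an availability target over the window."""
    return (1.0 - availability) * hours


def combined_availability(availability: float, locations: int) -> float:
    """Availability of N locations, assuming independent failures (an idealization)."""
    return 1.0 - (1.0 - availability) ** locations


if __name__ == "__main__":
    single = allowed_downtime_hours(0.9995)
    print(f"99.95% over 365 days allows about {single:.2f} hours of downtime")
    both = combined_availability(0.9995, 2)
    print(f"Two independent locations: {both:.6%} available")
```

      Under the independence assumption, a second location shrinks the expected unavailability from hours per year to seconds, which is the argument for hosting in more than one place.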

    12. Re:An untested DR plan is a worthless DR plan by sr8outtalotech · · Score: 1

      That's a fair assessment. At most of the small companies I've worked for, whenever I suggest implementing some sort of IT risk management I get treated like I was trying to shake down the owner for his kid's lunch money.

    13. Re:An untested DR plan is a worthless DR plan by Slashdot+Parent · · Score: 1

      What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?

      Actually, Amazon is quite clear on this issue: any individual EC2 instance should be considered to be disposable. Your sole recourse if an instance fails lies within the RunInstances and StartInstances APIs (read: if your instance fails, launch a new one).

      If you think about it, this makes sense. You can plan for common hardware and network faults, but the funny thing about failures is they don't always happen the way you envisioned. Furthermore, they don't always happen in recoverable ways. Consider the fire at ThePlanet in '08. Backup generators were working great until the fire marshal arrived and ordered them powered down. What's your recovery plan when the fire marshal shows up and says, "If you want my firefighters to risk their lives putting out your fire, you're cutting all power."?

      The EC2 plan is to simply launch a replacement instance. In a different datacenter, if you wish.

      --
      They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
  10. UPS's by MichaelSmith · · Score: 4, Interesting

    The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.

    It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.

    1. Re:UPS's by seanvaandering · · Score: 4, Funny

      Get that guy out of your datacenter pronto... no one can be THAT bloody unlucky in one shot.

    2. Re:UPS's by e9th · · Score: 1, Insightful

      Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

      Strips of steel with holes in them? You're kidding, right?

    3. Re:UPS's by MichaelSmith · · Score: 2, Informative

      Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

      Strips of steel with holes in them? You're kidding, right?

      No. It would be 50*15*5 mm steel with a 10mm hole drilled in each end. A bolt goes through each hole into a threaded attachment point.

      Now that you mention it, I recall that a four inch nail is good for 100A slow blow, but that's cylindrical so it conducts nicely. You'd think the rectangular cross section would not conduct quite as well (sharp corners, etc.) but maybe it is also tuned for the desired current. A little saw cut halfway between the holes would do that.
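
      A rough check of how much heat a 50*15*5 mm strip would produce at 100 A, using R = rho * L / A and P = I^2 * R. This is a sketch under stated assumptions: the resistivity values are typical room-temperature handbook figures (carbon steel varies with the alloy), the function names are hypothetical, and the calculation ignores the bolt holes, contact resistance, temperature rise, and any saw cut.

```python
# Assumed typical room-temperature resistivities in ohm-metres (not from the thread).
RESISTIVITY_OHM_M = {
    "carbon steel": 1.5e-7,
    "copper": 1.7e-8,
}


def strip_resistance(resistivity: float, length_mm: float,
                     width_mm: float, thickness_mm: float) -> float:
    """Resistance in ohms of a rectangular strip: R = rho * L / A."""
    length_m = length_mm / 1000.0
    area_m2 = (width_mm / 1000.0) * (thickness_mm / 1000.0)
    return resistivity * length_m / area_m2


def dissipated_watts(current_a: float, resistance_ohm: float) -> float:
    """Power turned into heat by a current: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


if __name__ == "__main__":
    for metal, rho in RESISTIVITY_OHM_M.items():
        r = strip_resistance(rho, 50, 15, 5)
        watts = dissipated_watts(100, r)
        print(f"{metal}: {r * 1e6:.1f} micro-ohm, {watts:.2f} W at 100 A")
```

      Under these assumptions the full-width steel strip dissipates on the order of one watt at 100 A, so any fusing behaviour would have to come from a deliberately narrowed section such as the saw cut.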

    4. Re:UPS's by voodoo+cheesecake · · Score: 1

      I guess they didn't have enough chewing gum wrappers on hand!

    5. Re:UPS's by grcumb · · Score: 1

      Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.

      Strips of steel with holes in them? You're kidding, right?

      Yeah, so what? I mean what could possibly go wro

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    6. Re:UPS's by fenix849 · · Score: 1

      He must be because all the other fuses are ceramic or glass containing a specific thickness wire made from some sort of conductive material that will melt at a given amount of current..

      The melting bit is optional if you don't mind being killed by your toaster.

    7. Re:UPS's by mystik · · Score: 1

      Funny -- Something almost exactly like that happened @ my datacenter last year. Some 'licensed' electrician accidentally shorted something in the main Power junction, which took the whole damn thing offline, generators, batteries and all. We were down 1+ hrs while they had to have the techs come on site to ensure that things were safe and online. Meanwhile, a small group of admins just outside w/ pitchforks (myself included) were waiting for the all clear to swarm the datacenter to get our equipment back online ....

      --
      Why aren't you encrypting your e-mail?
    8. Re:UPS's by dbIII · · Score: 1

      Steel is an incredibly bad conductor as metals go - just think of a blacksmith safely holding a bit of steel that is red hot at the other end if you've never done that yourself. Electrical conductivity is related to thermal conductivity.

    9. Re:UPS's by e9th · · Score: 1

      If his 100A fuses are made of steel, he's probably using them as a toaster.

    10. Re:UPS's by Black+Gold+Alchemist · · Score: 1

      Only in metals. In other systems, quantum mechanics takes over.

      --
      Responsibility is an addiction
      Virtue is a temptation
      Community is a cartel
    11. Re:UPS's by tokul · · Score: 1

      Strips of steel with holes in them? You're kidding, right?

      Nope.

    12. Re:UPS's by seanadams.com · · Score: 1

      shorted the white phase to ground

      What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?

    13. Re:UPS's by omglolbah · · Score: 2, Informative

      Just be glad nobody got killed...

      Shorting out something in a main power junction could easily have created a fairly nasty fire...

    14. Re:UPS's by MichaelSmith · · Score: 2, Informative

      shorted the white phase to ground

      What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?

      You have three actives (red, white, dark blue here in .AU), a neutral and an earth. The wikipedia page says different countries have different color codes so maybe that is the confusion.

    15. Re:UPS's by ShakaUVM · · Score: 1

      The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.

      To be perfectly frank, I'm a bit scared of what sort of security system your datacenter has when the system can cause a blowout of that magnitude.

    16. Re:UPS's by MichaelSmith · · Score: 1

      The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.

      To be perfectly frank, I'm a bit scared of what sort of security system your datacenter has when the system can cause a blowout of that magnitude.

      I don't know what they were up to. Probably just installing a new proximity card reader or something and they wanted to use UPS power.

      The facilities department there had their own rules. Once they installed partitions in the computer room and they had a guy grinding aluminium so little particles sprayed on our monitors and fell into the ventilation slots.

      Parts of the security system would not let you out of the building or into the working areas without operator intervention. You could be stuck in the stairwell over the weekend and go thirsty.

    17. Re:UPS's by seanadams.com · · Score: 3, Informative

      The hots are black, red, and blue (in that order of prevalence) in the US.

    18. Re:UPS's by ColdWetDog · · Score: 1

      The hots are black, red, and blue (in that order of prevalence) in the US.

      Remember, Aussies are upside down (because of their location) and backwards (as a result of their English upbringing).

      White it is....

      --
      Faster! Faster! Faster would be better!
    19. Re:UPS's by drinkypoo · · Score: 1

      Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.

      I don't know about yours, but mine are strips of copper. They could even be plated copper. I hope you replaced them later. Otherwise you may have installed bus bars where there's supposed to be fuses.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    20. Re:UPS's by Anonymous Coward · · Score: 1, Informative

      Only on small stuff are the hot wires black, red or blue.
      Brown, Orange and Yellow are used to mark out 3 phase systems.
      Still, a white wire is typically neutral.

      Oh, and on larger fuses, they're often a copper (or even better, silver) strip in a barrel filled with sand. When the strip melts, the resulting arc melts the surrounding sand, fusing everything together into a non-conductive mess. The result is a fuse that can quickly and reliably interrupt a large current, even when it's highly inductive. Simply melting a strip of metal can allow an arc to linger across the gap for a dangerous amount of time.

    21. Re:UPS's by sjames · · Score: 1

      Sounds like you guys also didn't have color codes right. White is supposed to be neutral.

  11. Unreasonable expectations by KGBear · · Score: 4, Interesting

    I expect this is just a scaled up version of the problems I deal with every day. And I'm sure I'm not the only one. Users have grown so dependent on system services and management has grown so apart from the trenches that completely unreasonable expectations are the norm. Where I work for instance it's almost impossible to even *test* backup power and failover mechanisms and procedures because users consider even minor outages in the middle of the night unacceptable and managers either don't have the clout or don't understand the problem well enough to put limits to such expectations. As a result often times the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.

    1. Re:Unreasonable expectations by SheeEttin · · Score: 1

      users consider even minor [test] outages in the middle of the night unacceptable

      ...and this is why we have redundancy.
      Test the backup hardware. Works? Switch over to it, test the main hardware. Works? All good, no (or negligible) downtime.

    2. Re:Unreasonable expectations by omglolbah · · Score: 1

      Yes, but what the management probably worries about is "what if the redundant system fails while you are testing the primary?".

      So they won't let us lowly engineers do the test... opting instead for the chance of a disaster...

      I'm glad I work in the oil business... safety is ALWAYS the most important thing... since any failure will be horribly expensive :-p

    3. Re:Unreasonable expectations by Anonymous Coward · · Score: 0

      But what about the automated switch over system?
      Doesn't work
      *downtime and your customer wants your balls*
      *"Maybe downtime the customer wont let you try it*

    4. Re:Unreasonable expectations by Rob_Bryerton · · Score: 1

      Not to worry, they will learn the hard way during a failure when kilo- or mega-$ are lost. Then you'll get your scheduled outages :) ...and some middle management will get new jobs

  12. Hurrr, durrr by planetoid · · Score: 1, Informative

    Stop building those things so fucking close to the roads, maybe?

    --
    Slashdot requires you to wait longer between hitting 'reply' and submitting a comment.
    1. Re:Hurrr, durrr by MichaelSmith · · Score: 2, Insightful

      Stop building those things so fucking close to the roads, maybe?

      What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.

    2. Re:Hurrr, durrr by JeanBaptiste · · Score: 1

      Cloud computing? sounds scary. Not in MY backyard.

    3. Re:Hurrr, durrr by plover · · Score: 5, Funny

      What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.

      That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.

      --
      John
    4. Re:Hurrr, durrr by MichaelSmith · · Score: 2, Funny

      What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.

      That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.

      In that same job we had a bunch of CCTV cameras on St Kilda road in Melbourne right outside the arts center. It's a mess of tram gear and traffic signals, and there is a lot of fibre under the road.

      Ever stuck your fork into a plate of spaghetti, then spun it around? This guy had to bore a hole straight down right in the middle of the road. There is a number where you can "dial before you dig" but he omitted that. He wound up with ~50 metres of fibre wrapped around his borer. Quite a mess.

    5. Re:Hurrr, durrr by tehcyder · · Score: 1

      That gets my unofficial +1 snorts-coffee-over-the-keyboard mod.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  13. Transfer switches suck? by pavera · · Score: 2, Interesting

    The DC that my company colos a few racks in had this same thing happen about a year ago (not a car crash, just a transformer blew out). But the transfer switch failed to switch to backup power, and the DC lost power for 3 hours.

    What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out vs a controlled "ok we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...

    1. Re:Transfer switches suck? by jroysdon · · Score: 1

      I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), one on transfer switch A and the other on transfer switch B, each connecting to different backup power sources. Further, these should be tested on a regular basis (at least monthly). Test transfer switch A on the 1st, and transfer switch B on the 15th.

      Seems like a design flaw here and/or someone was just being cheap.

    2. Re:Transfer switches suck? by aaarrrgggh · · Score: 1

      The problem usually isn't the transfer switch itself, but how it works with everything else. Transfer switches usually only really fail with contact damage.

      Cascading failures are a bigger problem for most co-lo's, as they try and maximize infrastructure utilization to a fault.

      Restoring power can be quite difficult.

    3. Re:Transfer switches suck? by Technonotice_Dom · · Score: 2, Interesting

      I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), .... [snip]

      Seems like a design flaw here and/or someone was just being cheap.

      It would be the latter. The AWS EC2 instances aren't marketed or intended to be high availability individually. They're designed to be cheap and Amazon do say instances will fail. They provide a good number of data centres and specifically say that systems within the same zone may fail - different data centres are entirely independent. They provide a number of extra services that can also tolerate the loss of one data centre.

      Anybody who believes they're getting highly available instances hasn't done a basic level of research about the platform they're using and deserves to be bitten by this. Anybody who does know the basics of the platform will know the risks and will be able to recover from a failure, possibly even seamlessly.

    4. Re:Transfer switches suck? by Renraku · · Score: 2, Interesting

      It's not really the DC power system that's the issue.

      The people are the issue.

      Example: You're the lead technician for a new data center. You request backup power systems be written into the budget, and are granted your wish. You install the backup power systems and then ask to test them. Like a good manager, your boss asks you what that will involve. You say that it'll involve testing the components one by one, which he nods in agreement with. However, when you get to the 'throw the main breaker and see if it works' part and he realizes that this one test might make them less than 99.99999% reliable if it fails, he disagrees and won't approve the testing.

      I can see where they're coming from here. They don't want downtime. They just aren't thinking far enough ahead. Ten minute test downtime or hours of unmitigated downtime. I abso-fucking-lutely guarantee you that the technicians will be blamed. Not management.

      --
      Job? I don't have time to get a job! Who will sit around and bitch about being broke and unemployed then?
    5. Re:Transfer switches suck? by Jeffrey+Baker · · Score: 2, Interesting

      The answer is "yes". Transfer switches often fail and are rarely tested. This is also true of other power equipment. If it's rarely used, the probability of it working in an emergency is somewhat low.

      However, in this case the transfer switch worked fine, but it had been misconfigured by Amazon technicians. According to their status email from yesterday (posted in their AWS status RSS feed) the outage was a result of the fact that one transfer switch had not been loaded with the same configuration as the rest of the transfer switches in the datacenter. The "failed" switch performed as configured and powered down.

  14. Oil's Well by Aeonite · · Score: 2, Insightful

    It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?

    1. Re:Oil's Well by omglolbah · · Score: 1

      Yep, but what is important to keep in mind is that if an oil rig has to shut down for a day due to a power issue the oil will still be in the ground.
      The company might lose money due to having promised a certain supply (especially with gas!) but the resource is not lost.

      In a datacenter the uptime is all there is. Value is lost.

      The oil rigs in Norwegian waters are fairly secure when it comes to power faults. If the system cannot guarantee power it goes into a shutdown sequence to set everything in a "safe" position.
      Shutdowns are actually not that rare. Minor shutdowns of parts of a platform or refinery are not very dramatic. It is just a case of getting the bugger up and running in a safe way.

      I'd give details but unfortunately I am under NDA :-p

    2. Re:Oil's Well by Anonymous Coward · · Score: 0

      If a car takes out an oil rig we've got bigger problems than just some spilled oil.

    3. Re:Oil's Well by Anonymous Coward · · Score: 1, Informative

      I do believe there is a wooshing sound you missed. He is referring to the BP gulf oil spill. Although that was not caused by a power failure.

    4. Re:Oil's Well by DNS-and-BIND · · Score: 1

      Who knows what might happen if one of them ever had a problem like this?

      *WHOOSH* He's not talking about the BP oil spill.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    5. Re:Oil's Well by Anonymous Coward · · Score: 0

      Uhh, Big time Whoosh!

    6. Re:Oil's Well by Sulphur · · Score: 1

      In a chemical plant where downtime runs $30,000 an hour, they have two high lines bringing in power. If one is down for repair, then a crane hitting the other can kill power.

      True story: A dual process control computer consisted of two complete computers with two separate power supplies. The only thing they shared was the case, until they were wired into the same breaker.

      --

      ++good said Winston.

    7. Re:Oil's Well by mjwx · · Score: 2, Informative

      It's a good thing that oil rigs are better managed than data centres. Who knows what might happen if one of them ever had a problem like this?

      I have a friend who is an engineer on one of the projects in the North West Shelf (of Western Australia). A few weeks back he asked, "How can they build a rig in the Gulf of Mexico for one third of our costs?" Two days later one blew up and he got his answer.

      --
      Calling someone a "hater" only means you can not rationally rebut their argument.
    8. Re:Oil's Well by rvw · · Score: 1

      Yep, but what is important to keep in mind is that if an oil rig has to shut down for a day due to a power issue the oil will still be in the ground.

      Yeah right. Don't know where you've been the last week or so, but right now millions of barrels of oil are spilling in the Gulf of Mexico. You may think that's not a power issue, but it was an uncontrolled system failure, in many senses comparable. And that situation is probably much worse than this incident at Amazon.

    9. Re:Oil's Well by Anonymous Coward · · Score: 0

      Seeing as we are unfamiliar with sarcasm, I shall close the register at this point.

    10. Re:Oil's Well by Anonymous Coward · · Score: 0

      Maybe his numbers are a bit off or the North West Shelf is especially expensive, but Gulf costs are about in line with prices in Brazil and Norway.

    11. Re:Oil's Well by omglolbah · · Score: 1

      No. The problem in the Gulf of Mexico was -not- a power issue.

      It was a complete and utter failure of procedure.

      The blowout preventer had a hydraulic leak. This should have caused major alarm in any decent control system.

      On top of this apparently they claimed that loss of communication to the device could have caused it to not fire... Eh excuse me but these things are on "fail closed" circuits in ANY decent control system.

      And seriously, a platform shutdown is quite different from one blowing up. There have been quite a few shutdowns on the rigs I've worked with and none of them dramatic. The control system is designed so that if any part of it fails the whole system -will- go to a safe state without being told to by an operator.
      Hell, most of the safety valves can't be opened from the control room. You have to actually go out in the field and manually reset them with a key..

      The system failure on Deepwater Horizon was a combination of shitty engineering, shitty maintenance and shitty management. What happened should not be possible with a properly designed control system.

      And no, I'm not talking out of my arse. I actually work in the control system business and have enough inside knowledge to know that this incident was a monumental cockup and not just an "accident".

      I'd go into detail but I'm under NDA ;)

  15. Oh noes by Anonymous Coward · · Score: 0

    Is it just me or is the placement of a lot of recent links aggravating? Shouldn't the link be from "Amazon cloud computing data center lost power" and not the bit about a utility pole getting struck?

    I think it should be, but I'm no Angus Mickleburger.

  16. I'm confused by OverlordQ · · Score: 3, Funny

    Why couldn't they just get power from the cloud?

    --
    Your hair look like poop, Bob! - Wanker.
    1. Re:I'm confused by MichaelSmith · · Score: 1

      The cloud should have a faxing service so I can get free paper from my fax machine.

    2. Re:I'm confused by martin-boundary · · Score: 1

      Why couldn't they just get power from the cloud?

      The Indian who usually does the Rain and Lightning Dance was on vacation.

    3. Re:I'm confused by bpcomp · · Score: 1

      Why couldn't they just get power from the cloud?

      They didn't have the right security key attached to the kite.

    4. Re:I'm confused by Anonymous Coward · · Score: 0

      Because it isn't a stormcloud.

    5. Re:I'm confused by roman_mir · · Score: 1

      They needed 1.21 Gigawatts and the cloud had it, but the Delorean didn't start this time.

    6. Re:I'm confused by Darth_brooks · · Score: 1

      Because power from a cloud tends to come in "flash" transmissions which studies have shown tend to be damaging to electronics....

      --
      There are some people that if they don't know, you can't tell 'em.
  17. Not really by Sycraft-fu · · Score: 4, Informative

    All a fuse is is a piece of metal that will melt fairly quickly when a given amount of current is passed through it. Idea being that it heats up and melts before the wires can. So, the bigger the current, the more robust the metal connecting it. A 100A fuse is usually a fairly large strip of steel.

    Now I'll admit that just grabbing an approximate size of steel and placing it in as the GP did isn't going to yield a nice precise fuse. It may have been too high a current. However, it'd work for getting things running again and probably provide a modicum of protection in the event of a short.

  18. stupid mods, trickz are for kidz by Anonymous Coward · · Score: 0

    Funny != insightful

    1. Re:stupid mods, trickz are for kidz by Coopjust · · Score: 3, Interesting

      Often, mods will give a funny post "insightful" instead of "funny" because it gives the user positive karma (whereas funny does not affect karma). Not a use intended by CmdrTaco, I'd imagine, but it's a common practice.

  19. Redundancies, Redundancies by Anonymous Coward · · Score: 0

    That is why datacenters should have (and my company does) dual UPSes, dual transfer switches, and dual generators. They also should not load any circuit over 50% to ensure a cascading failure won't happen if power is lost on one side.

    1. Re:Redundancies, Redundancies by mirix · · Score: 4, Insightful

      Redundancy costs money. If it costs more than downtime, you don't get it.

      --
      Sent from my PDP-11
    2. Re:Redundancies, Redundancies by afidel · · Score: 1

      Actually 40% since the NEC says to derate all circuits to 80% to accommodate inrush current.
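
      The arithmetic, as a rough Python sketch. It takes the parent's 50% per-side limit and the 80% derating at face value; the function name and the 30 A example circuit are invented for illustration.

```python
def max_planned_load_fraction(derate=0.80, redundant_feeds=2):
    """Fraction of a circuit's nameplate rating you can plan to use when the
    surviving feed(s) must carry the whole load after one feed is lost."""
    # After losing one of N feeds, the remaining N-1 carry everything,
    # so each feed may only carry (N-1)/N of its usable capacity.
    share_per_feed = (redundant_feeds - 1) / redundant_feeds
    return derate * share_per_feed


fraction = max_planned_load_fraction()
print(f"{fraction:.0%} of nameplate")              # 40% of nameplate
print(f"{30 * fraction:.1f} A on a 30 A circuit")  # 12.0 A on a 30 A circuit
```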

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  20. Cloud is a poor metaphor anyway by dbIII · · Score: 1

    It's also completely expected by those not sold on pure science fiction.
    All it can take is a backhoe in the wrong place at the wrong time or an anchor cable dragging to cut you off from the very real single or two bits of infrastructure that people fantasise is their bit of the "cloud".

  21. Failure is often not a boolean by mcrbids · · Score: 5, Interesting

    For years, I co-located at the top-rated 365 Main data center in San Francisco, CA until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed so that any *one* of the redundant generators could fail and there wouldn't be any problem.

    So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)

    And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.

    When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Littered with bits of network cable, boxes of freshly-purchased computer equipment, pizza boxes, and other refuge were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.

    All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick, we were fully back up and all tests performed (and the system configuration returned to normal) in about an hour.

    Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:

    1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.

    2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.

    3) At no point did I have any less than two backups off site in two different locations, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.

    4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.

    Moral of the story? Never, EVER have all your eggs in one basket.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Failure is often not a boolean by Rob_Bryerton · · Score: 1

      Get your servers some remote access cards & save yourself a drive to the dc to press a power button ;)

    2. Re:Failure is often not a boolean by drinkypoo · · Score: 2, Interesting

      2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.

      Having had the spade connector that carries power from the jack at the back of a machine in a 1000W power supply fail and apparently (from the pattern of smoke in its case) actually emit flames I can say that brownout protection is definitely worth some money.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    3. Re:Failure is often not a boolean by michaelmalak · · Score: 1

      pizza boxes, and other refuge were to be found in every corner

      Malapropism alert!

      P.S. The previous Slashdot Princess Bride meme has been replaced.

    4. Re:Failure is often not a boolean by Anonymous Coward · · Score: 0

      Just imagine if those were Windows boxes. The time it would take to get them back up and cleaned would be measured in days, not hours. Another lurking TCO variable that gets left out of Microsoft's "studies".

    5. Re:Failure is often not a boolean by Anonymous Coward · · Score: 0

      That is exactly why each of my racks has its own UPS to pick up any 10-30 minutes of failure the DC may have. You can bitch at the DC and try to explain to your customers how it was not your fault (who really cares?) or just make sure you have a backup plan.

      Also I set my servers to restart when normal voltage is restored in case they do need to shut down. I mean, why should someone need to be around to push the little button when it can be automated?

  22. Totally Unexpected by NicknamesAreStupid · · Score: 1

    Who, while driving through a cloud, would ever expect to hit a utility pole? Clouds do not have utility poles. Now, tule fog has utility poles. That is not why they call it 'tule' (not a nickname for utility, but for a grass), but many a utility pole has been unduly undone because someone drove through the tule fog and into the utility pole.

    If Amazon is going to put utility poles in its 'cloud', then they are really in a fog. Call it fog computing.

  23. Who Cares? by Aerosiecki · · Score: 2, Interesting

    Doesn't EC2 let you request hosts in any of several particular datacentres (which they call "availability zones") just so you can plan around such location-specific catastrophes? No matter how good the redundant systems, some day a meteor will hit one datacentre and you'll be S.O.L. no matter what if you put all your proverbial eggs in that basket.

    Only a fool cares about a single-datacentre outage. This is why it's called "*distributed*-systems engineering", folks.

    --

    Cherish. Live. Dream.
  24. The real threat by Anonymous Coward · · Score: 0

    And here was me thinking planes were the real threat to cloud computing. I'm beginning to think I don't quite get this newfangled technology lark...

  25. Best Laid Plans by Zygamorph · · Score: 1

    Reminds me of a company I worked for. They had a data centre divided into 5 zones; each zone had a UPS. Each zone was connected to the neighbouring zone with a transfer switch and each UPS could handle 2 zones until the diesel generators kicked in. Each year for 5 years management decided that the cost of the downtime to do annual maintenance was too high so it wasn't done. Outside power finally goes away and 4 of the 5 zones stay up. The investigation determined that the battery-powered (natch, the power is out) transfer switches on both neighbouring zones failed because their batteries failed. Turns out putting in new batteries was part of the annual maintenance checklist, and they had a shelf life of 4+ years.

    How about the company with the diesel generator that has 5 hours of fuel. They test it for 1 hour every year. On year 5 the power goes out and the generator runs for one hour before running out of fuel. Seems the test procedure didn't include refuelling the generator.
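
    The generator story is simple enough to put in a few lines of Python. This is only a sketch of the arithmetic; the function name is made up, and it assumes the power failed after four annual tests had already been run.

```python
def runtime_left_hours(tank_hours, test_hours_per_year, years_without_refuel):
    """Hours of generator runtime left if tests burn fuel and nobody refills."""
    burned = test_hours_per_year * years_without_refuel
    return max(tank_hours - burned, 0)


# Five hours of fuel, a one-hour test each year, never refuelled.
for tests_done in range(6):
    print(tests_done, "tests run ->", runtime_left_hours(5, 1, tests_done), "hours left")
```

    After four tests there is exactly one hour left, which matches the outage in year 5.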

    The point is that even with what you think is the best of planning and testing, sometimes stuff happens.

    1. Re:Best Laid Plans by egcagrac0 · · Score: 1

      Good planning includes maintenance.

  26. It is just the same... But it shouldn't be! by Anonymous Coward · · Score: 0

    Yeah, it isn't worse than maintaining a server in your room. But the thing is, it should be orders of magnitude better. One of the main reasons to use the cloud is that there, with loads of servers centralized, redundancy and the like are a lot better taken care of than what you could achieve by yourself.

    Now, I don't claim that Amazon fucked up here. It seems that they only had one faulty switch and most of things worked exactly as they should have. Good enough IMO. But the argument "You have the same problems if you do it yourself" is kinda useless if one of the reasons to use the cloud is to get rid of those risks.

  27. He hit the cloud! by Anonymous Coward · · Score: 0

    He hit the cloud! The Cloud! OMG..

    And the 'undefined' location of services suddenly seems very much 'defined'..

  28. Again: The IT Uptime Lightweights by RobotRunAmok · · Score: 3, Insightful

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?

  29. Re:Again: The IT Uptime Lightweights by jiteo · · Score: 1

    I think part of the problem might be money. TV Networks and hospital emergency rooms realize that their business and people's lives depend on their uptime. Many bosses in IT, not so much. So while I suspect most IT guys would love n+1 everything with regular tests, that takes resources, which are sometimes not allocated. Note: quantification is left deliberately vague because I have 0 numbers to back up my point with. Except for that previous 0.

  30. Re:Again: The IT Uptime Lightweights by Mr.+Flibble · · Score: 1

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?

    Us IT guys? We run around like we are wearing self-important Star Trek Blue Shirts alright. We just don't realize that our shirts are actually Red.

    --
    Try to hack my 31337 firewall!
  31. Re:Again: The IT Uptime Lightweights by Shimbo · · Score: 4, Informative

    When was the last time anyone heard of a TV Network going dark for an hour?

    Hmm, let me think. How about yesterday?

  32. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    People would die if an ER lost power; that would cost the hospital and doctors an absolute fortune. Same with TV stations, they'd lose all that ad revenue. But this isn't unheard of. I've seen plenty of screwed up channels in the last few years.

    IT depts rarely have the budgets to allow for decent failover systems. Just a bunch of UPSes and the odd backup generator to allow graceful shutdowns. You can bet hospital servers are the same; they're not on the same circuits as medical equipment. You can see special power sockets for the decent supplies, with labels warning they shouldn't be used for anything other than the intended life support equipment.

  33. Re:Put critical power infrastructure underground by psbrogna · · Score: 1

    Really? Going subterranean with the infrastructure is your example of how to do things right? Doesn't the underground part of California periodically move around in unpredictable & dramatic ways? And not just 10 o'clock news dramatic, but the kind of Earth ripping that scoffs at the works of man.

  34. Cloud Wars? by psbrogna · · Score: 1

    It wasn't a Google street mapping car was it? Now THAT would be a good story.

  35. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    Who do you think manages those TV Networks? Those Hospital Emergency Rooms? It's IT guys. The up-time tends to meet the needs. The stock markets are kept up quite well. I would be willing to bet the internal systems at Goldman Sachs, Fidelity, Bank Of America, etc. are all quite stable. Some little $100k revenue business website I run on EC2 with no geographical load balancing on different trunks is *not* as important. And usually it's not the IT guys that set the up-times, it's the management/budgeting process that makes those decisions. It's usually not worth the cost to maintain the uptimes you're talking about.

  36. Re:Again: The IT Uptime Lightweights by mr_nazgul · · Score: 2, Informative

    It's not a matter of I.T. guys not taking the proper steps.
    It's a matter of price versus "what if". YOU try to convince a pointy haired boss to spend thousands and thousands of extra dollars on something that "may" happen.

    It's often hard enough to convince higher ups to just upgrade old infrastructures that are maxed out on resources. Even if you have proof of issues or near failures. The ONLY time they will happily spend money on upgrades and making your infrastructure more robust is after there has been a critical failure and they actually see their bottom line being hurt and even then if you don't get the approval and dollars fast enough, you run the risk of "What are the chances THAT will happen again?"

    More often than not, infrastructure is patches built on patches, one I.T. guy coming in trying to "correct" mistakes of his/her predecessor (who they then realize was working with an underwhelming budget), THEN realizing that it's such a mish mash of bubblegum and duct tape, that any serious fixes would require serious downtime with a complete overhaul. Otherwise you run the risk of the whole thing imploding like a blackhole.

    How many I.T. guys seriously have the guts to walk up to their boss after being on the job for only a week and say, "I need 50k and your network will be going up and down for two weeks as I rebuild and fix it all."

    I tried it. I, however, had the ammunition that my company went from 3 people to 40 people in 18 months with another 20 predicted in the next 6 months and that the two box servers were maxed out AND that we were renovating a newly purchased building so we could plan everything from cabling, to telephony to security and future planning for 250+ people.

    It also didn't hurt that my boss knows that I.T. is an investment when done right and NOT an expense. Even then with everything on my side it still took 3 months of planning, proving, mapping, designing and quoting from vendor after vendor before approval went through.

    --
    Good.. Bad.. I'm the guy with the gun.
  37. Re:Again: The IT Uptime Lightweights by MistrBlank · · Score: 1

    I think money plays a huge part in this too. I love when I tell someone that making their $200k current setup 24-7 with five 9's (or greater) uptime will cost millions (usually due to ridiculous network costs across sites), and they quickly sign off to keep things the way they are. Yet I still take the flak when it does go down.

    Or worse, they don't realize that a 5 nines uptime doesn't mean that the system never comes down for maintenance. Then failure ensues from that disaster.
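
    For anyone who hasn't run the numbers, a quick Python sketch of what "N nines" allows per year. The function name is made up, and whether planned maintenance counts against the budget depends on how a particular SLA is written, which is an assumption either way.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def downtime_budget_minutes(nines):
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

    Five nines works out to a bit over five minutes a year, so a single maintenance window blows the budget unless something else takes over while the system is down.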

  38. What kind of car? by halcyon1234 · · Score: 1

    So a car just happens to take out a strategic Amazon datacenter? By any chance, was the car a Mini Cooper? Were the paramedics able to attach a neck-brace to the driver over his black turtleneck? And what's with the strange email sent moments before? "The JOB will be done in a flash [sent from my iPhone]"

  39. Re:Again: The IT Uptime Lightweights by seven+of+five · · Score: 1

    People would die if an ER lost power; that would cost the hospital and doctors an absolute fortune.

    True? People die in hospitals from hospital-borne infections all the time. I'm not seeing a tidal wave of lawsuits that would motivate them to clean up.

  40. Terrorists! by jeroen94704 · · Score: 1

    Clearly this was a test run for an upcoming terrorist attack! If one downed pole can bring down a data center, imagine what 19 downed poles (the number of hijackers on 9/11) could do! It would destroy the economy and lead to famine in the land! Clearly, we need to restrict who can drive a car!! We need government tracking of all cars!!! Background checks for everybody requesting a driver's license!!!! A kill switch on every vehicle!!!!! It's a dirty job, but someone's gotta do it!

    --
    He who laughs last, thinks slowest.
  41. Re:Again: The IT Uptime Lightweights by TooMuchToDo · · Score: 2, Informative

    Usually, TV stations (that get fined for being off the air for not using their spectrum) and hospitals (which, you know, you can die at if the power goes out depending on your circumstances) have an easier time getting money for redundancy because the bad results are more expensive than if LOLcats is down.

  42. Re:Again: The IT Uptime Lightweights by MadGeek007 · · Score: 1

    IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    Never send someone from medical to do an engineer's job.

  43. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    Budget.

    I'm sure the IT guys could accomplish it, but it's a matter of priorities on where the money goes. Redundant grids and generators don't grow on trees.

    The major hospitals where I live are usually connected to 2 and often 3 different grids--the electrical grid is / was often designed around their needs (some have been around for 100+ years, so they were around even before the grid was). Hell, my house was built before electrical service (there are still some pipes in the walls that were used for gas lighting).

    It's all about risk analysis.

  44. Re:Again: The IT Uptime Lightweights by mortonda · · Score: 1

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.

    I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?

    Probably a combination of them, depending on the location, but I have seen many, many times where the money is not there to do it right - and because that happens so often, too many IT admins don't *know* what *right* is.

  45. Re:Again: The IT Uptime Lightweights by LWATCDR · · Score: 1

    Well for one thing power companies go way out of their way to keep hospitals up. Also I did work in Hospital IT. Guess what?
    The backup gen set didn't have the power to handle the AC! We had to be in the machine room ready to do a power down on the System 38 if the temp got too high. That was just during a test!
    During a real power outage we were to shut down the S38 to keep the lab system on the DG Eclipse up and running.
    Yes it was a long time ago.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  46. Re:Again: The IT Uptime Lightweights by Jawnn · · Score: 2, Insightful

    When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? The people who set the budget for IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to set the proper priorities to ensure -- really ensure -- their uptime.

    There. Fixed that for you.

    The reason you rarely see an ER go down for want of power is that, knowing that lives depend on it, the people responsible for providing for it are willing to spend what it takes, in capital investment and in manpower for ongoing maintenance and operation so that an acceptable level of availability is guaranteed. Amazon and (last year) Rackspace, not so much.

  47. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    In a world with Windows clients who needs uptime?

  48. A Summary of Comments by Maarx · · Score: 1

    For anyone too lazy to read all the comments, let alone the article, allow me to summarize them:
    1) "I am a guy or know a guy in a datacenter who once sneezed and took down the entire eastern seaboard. This is perfectly understandable."
    2) "I am a guy or know a guy in a datacenter. You could set off a nuclear bomb inside it and everyone could keep playing Farmville without missing a beat. This is unacceptable."
    3) "This is capitalism's fault. The United States is a terrible country. People are fat and lazy."
    4) "I'm pretending I don't know what cloud computing is so I can make a pun about meteorological clouds."
    5) "I honestly don't even know which thread I'm posting in. The comment I'm responding to is already entirely off topic."

  49. False by Slashdot+Parent · · Score: 1

    "The cloud" doesn't solve everything. Film at 11.

    Actually, "The Cloud" totally bails you out in cases like this. Consider the rackspace outage mentioned in TFA, or ThePlanet's huge outage back in '08. If you were affected by one of those events, you were totally hosed.

    On the other hand, had your app been running in EC2, you could simply relaunch your dead instance in another datacenter. You can use Amazon's automated service to do this, or you can roll your own, if you'd like.

    EC2 users who had adverse outcomes due to the power outage simply failed to architect their application for their underlying hardware. Amazon is frank with users (it's all over their user guides and FAQs) that they are providing cheap instances on nodes built with commodity hardware. If you run your app on EC2, the redundancy and failover is the responsibility of your app, because AWS is not providing this. AWS is very clear about this: any given node might, at any given point in time, simply vanish. In practice, you get pretty decent uptime with EC2, but you cannot depend on this!

    With a cluster of EC2 instances running across different Availability Zones and some decent monitoring/failover, it's actually pretty easy and cheap (compared with running your app in multiple physical datacenters) to achieve respectable uptime for your mission-critical apps. On the other hand, if you have single points of failure all over the place, or if you (gasp) just run your app on a single instance with no automated monitoring/failover (in other words, you have architected your app exactly how AWS recommends against), you are going to be really disappointed.

    But even if you do have an application running on a single instance, if you use an EBS-backed instance, you should be able to relaunch your instance, and be back up and running as though nothing happened. Obviously your app would be down in the meantime, but you have way more flexibility than if your physical node goes down in a traditional datacenter.
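
    The "roll your own" failover logic can be sketched in a few lines of Python. To be clear, this is not Amazon's API: the zone names, the health check and the function name are all placeholders, and a real version would call whatever monitoring and instance-launch tooling you actually use.

```python
def pick_zone(zones, is_healthy, current=None):
    """Stay in the current zone if it is healthy; otherwise return the first
    healthy alternative, or None if nothing is healthy."""
    if current is not None and is_healthy(current):
        return current
    for zone in zones:
        if zone != current and is_healthy(zone):
            return zone
    return None


# Hypothetical zones; pretend zone-a just lost power.
zones = ["zone-a", "zone-b", "zone-c"]
down = {"zone-a"}
chosen = pick_zone(zones, lambda z: z not in down, current="zone-a")
print(chosen)  # zone-b
```

    The decision is the easy part. The work is in having data the relaunched instance can pick up (that is where EBS-backed instances help) and in a monitor that itself lives outside the zone that failed.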

    --
    They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
  50. Re:Again: The IT Uptime Lightweights by Aut0mated · · Score: 0

    Insightful, more like troll... Do you have any idea how hard it is to run an ISP/Data Center? Do you work in one? I do, and let me tell you what we always say "If it was easy, everyone would do it".

    TV Network going dark... yes satellite stations in markets do go down for hours when their towers are hit.

    Emergency Rooms... how many times have you heard of them having to be evacuated due to power loss in severe weather. It happens.

    Self Important Blue Shirts?... come on man... WTF does that even mean... you just had to add that extra bit of holier than thou...

    I get so tired of all these arm chair IT experts getting modded insightful etc... when it's nothing more than a rant about something they really have no clue about. Running an ISP isn't like plugging your SOHO router in at home. ( it aint like dusting crops back home kid, tit for tat on the pun )

    Granted this company obviously doesn't run proper disaster recovery drills monthly like we do, or maybe they do and simply had a coincidental switchover fail.

    However, to make a lump statement then add "I'm sure there are exceptions" in a holier than thou statement and be modded insightful, just goes all over me, and if you haven't noticed the Internet IS a 'critical system' in today's world. Not everyone uses it for just gaming and torrents like so many on here think the Internet is solely built for. Most people use it to run their business, pay their bills, use their phone, go to school, check their bank/credit card accounts, etc... To me that is a VERY "critical system" and your comment to me is nothing more than arrogant and ignorant.

  51. took our site down by christovas · · Score: 1

    Took our site down battleempire.com

    --
    War is not determined by who is right, but who is left.
  52. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    When was the last time anyone heard of a TV Network going dark for an hour?

    Hmm, let me think. How about yesterday?

    This was probably the result of a server or application failure!

  53. Re:Again: The IT Uptime Lightweights by Archangel+Michael · · Score: 1

    "really ensure -- their uptime."

    That's because blueshirt IT guys are really redshirts who are expendable.

    But seriously, most IT guys can't put into dollar and cents what the cost of data and power redundancy really is.

    When failure happens, the PHBs want to know "how much does this cost"; the problem is getting the PHB to realize what it costs BEFORE failure happens.

    And while you can't prevent failure 100% of the time, you can have contingencies in place to deal with failures and mitigate against more common types of failures.

    The actual cost of stuff not running during power failures is higher than most people know, but they don't ever think about it in the proper way.

    Where I work, we lose power for periods of longer than 1 hour a couple three times a year. Often it can last several hours and occasionally a day or two.

    While the power is gone, NOBODY works. ANYWHERE. The manpower costs alone are huge, but hidden. Until IT can put that cost into REAL numbers (average salary per hour x average hours per outage), the PHB will never realize the need for redundancy UNTIL it's too late.
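
    Putting that into REAL numbers takes about five lines of Python. Every figure below is invented for illustration, and the headcount and outages-per-year multipliers are my reading of what the formula above implies.

```python
def idle_labor_cost(headcount, avg_hourly_wage, hours_per_outage, outages_per_year=1):
    """Wages paid to people who cannot work while the power is out."""
    return headcount * avg_hourly_wage * hours_per_outage * outages_per_year


# Made-up example: 200 staff at $30/hour, three 4-hour outages a year.
cost = idle_labor_cost(headcount=200, avg_hourly_wage=30.0,
                       hours_per_outage=4, outages_per_year=3)
print(f"${cost:,.0f} per year in idle labor")  # $72,000 per year in idle labor
```

    That figure only counts wages. Lost sales and missed deadlines come on top, but even the wage number alone is something you can set next to a generator quote.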

    UPSes were once a solution for graceful shutdowns; now they are not, as shutting systems down is not really an option at all. We've grown dependent upon the technology to be there ... always.
     

    --
    Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
  54. Crazy people by Anonymous Coward · · Score: 0

    What kind of masochists would work in datacenters? You get no notice at all when things go right, but you get reamed when things go wrong - even things outside of your power (no pun intended). I hope you all get paid well. :)

  55. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    I worked on electronic equipment made by a major player that went into network TV stations. They closely measured 'seconds of dead air' caused by their equipment. Apparently the networks thought that was important to not annoy their viewers - since many viewers have their fingers poised on remote control channel changing buttons.

  56. Does any of this failover stuff actually work??? by Wabbit+Wabbit · · Score: 1

    I have never once read about a data center catastrophe where a fancy shmancy failover system --or ANY kind of failover system for that matter-- actually worked. Does anyone test or maintain this stuff?

    Anyone got any success stories to share?

    --
    Nothing is inexplicable; only unexplained -Tom Baker, Doctor Who
  57. Inappropriate for May, but whatever... by Dogtanian · · Score: 1

    My spelling is perfect but I still get my grammer wrong sometimes.

    Perhaps your grammar got run over by a reindeer?

    --
    "Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
  58. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    If anyone actually thought that "Cloud Computing" wasn't just a synonym for "co-location in a datacenter", then they're a fucking fool.

  59. Re:Again: The IT Uptime Lightweights by lsatenstein · · Score: 1

    If the TV station is out due to a power failure, so are the sets in all the homes. It is time for the FM radio to be turned on. In Montreal, one station runs at 500 watts and covers a radius of about 10 km from its antenna.

    --
    Leslie Satenstein Montreal Quebec Canada
  60. Re:Again: The IT Uptime Lightweights by Anonymous Coward · · Score: 0

    Everyone has a budget to work within. Even Amazon.

  61. Re:Does any of this failover stuff actually work?? by Thundersnatch · · Score: 1

    This stuff probably works 90% of the time. We lose utility power several times a year for one reason or another (usually some construction gaffe). It's a small DC in a downtown Chicago high-rise, so it doesn't have the complexity of the mega-facilities, but the basic pieces are the same. The UPSs and other equipment function as designed, and nobody notices. Who would read a blog post titled "UPS and generator not a waste of money"?

    That said, so many external parties are involved and the complexity is so high that failures do happen. Our longest outage in this facility was caused by a plumber working on another floor for another tenant. He had all the credentials to get into the right spaces, and actually fixed what he was there to fix, but also somehow managed to interrupt the redundant chilled water supply on both sides of the building at the same time.