More Uptime Problems For Amazon Cloud
1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
Nuf said
For me, it is far better to grasp the Universe as it really is than to persist in delusion
I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.
We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.
[Fuck Beta]
o0t!
It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
You can only argue that the extra costs and admin involved with cloud hosting outweigh the extra costs of self-hosting and paying competent IT staff for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud=reliable, and Google's/Amazon's/whoever's people are smarter than your in-house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.
So this is the second time this month Amazon's cloud has gone down. There should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.
They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.
You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.
Now before people say "well this was a major storm system that killed 10 people, what do you expect", my response is that cloud computing is expected to do work for customers hundreds and thousands of kilometres/miles from the actual data centre so this is a somewhat crucial thing that we're talking about - millions of people literally depend on these services; that's my first point.
My second point is it's not like anything happened to the data centre, it simply lost mains energy. It's not like there was a fire, or flood, or the roof blew off the building, or anything like that; they simply lost power and failed to bring all their millions of dollars in equipment up to the task of picking up the load.
If I were a corporate customer, or even a regular consumer, I would be seriously questioning the sustainability of at least Amazon's cloud computing. Google and Facebook seem to be able to handle it, but not Amazon. Granted, they don't offer identical products, but their data centres overall seem to stay up 100 or 99.9999999% of the time, unlike Amazon's.
However "Netflix, which uses an architecture designed to route around problems at a single availability zone." seems to have efficiently spread the pain of a North Eastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.
Nullius in verba
http://www.pepco.com/home/emergency/maps/stormcenter/
-- IANAL, this isn't legal advice, and definitely isn't legal advice for you. Also, Squee!
Instagram's servers in that data center were also affected, and more people griped about that on my Facebook feed than about Netflix.
As for "an electrical storm", that's a bit of an understatement. The issue was actually more the 80 mph wind gusts, as well as the lightning continuing on for 2 hours after the wind and rain had passed (meaning crews couldn't get out there overnight).
The result is some 2 million people without power, 1 million around DC alone. Dominion Power (which services the area where the data center resides, about 5 miles from my house) lost power for more than half of its northern Virginia customers, and even now has only restored power to about 60,000* out of 461,000 that lost it. On the Maryland/DC side of the Potomac, half a million people may be without power for days through a heat wave with 100-degree days (and more storms like last night's coming...).
* Fortunately that would include me... though I'm writing this via my Sprint phone as a wifi hotspot 'cause our cable modem is still down ;-)
"But remember, most lynch mobs aren't this nice." (H.Simpson)
-- Joe
It seems like the switching system failed and/or the backup power generators did not kick on.
Maybe natural gas ones are better. The firehouses have them. I also see them at a big power sub station as well.
I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).
According to TFA, they were only without power for half an hour, and the ongoing problems were related to recovery, not the actual power loss. So their problems are more "bad disaster planning" than "bad disaster".
Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.
With cable, the nodes need power and their batteries will run down; then the cable co needs to have on-site portable generators at the nodes with no power.
The phone systems have RTs (fewer of them than cable systems have nodes) that are the same way.
Amazon is a huge target - but how many other data centers went down in the Virginia area also? Did they come back up as fast as Amazon?
And Netflix is an Amazon Cloud customer... What's the matter with them? Are they just too dumb to host in house?
Why exactly would a cable operator bother with backup power? I mean, if the neighborhood has no power then people aren't running TVs or computers (unless laptops, but still their modem would be down). It is probably a different beast with something the size of an Amazon datacentre though; they probably can go to the ISP and say "hey look, we'll buy 5M a month of internet from you but we need redundancy. Piss on all your home users for all we care but we get internet no matter what.".
"If they don't have proper backup generators, they have no business running a data center."
*Or* they are in a business that recognizes that shit happens, even at the datacenter level, and provides services so you can spread your load across more than one datacenter, making the x10 expenditure needed to go from a "decent" datacenter to a "top notch" one moot and avoidable.
Hey, doesn't that look like this funny "cloud" concept they are waving around so often?
Well, there are long runs from the headend to each neighborhood, so some areas may have power but hours later the cable goes out, as the lines pass through areas that don't have power.
Which is the problem. Not the power outage itself. If the power outage happened, and the servers were back in, let's say, 30 minutes, 1 hour... alright, but 9 freakin' hrs?
In my specific case I didn't suffer as much because I have another instance in a different zone with db replication and all that, serving as a backup server, and my project there, although very critical (20 people are getting wages out of it), is very low on resource usage... I can imagine there were quite a lot of people that lost quite a lot of money because of this. It's really unacceptable for a DC to have 9 hrs of downtime, whatever the reason is... because... that's just the standard people are used to.
I never experienced anything like this at any other company in the 10 years I've been working as a Linux admin... although at all those companies, I used real servers.
Because that cable operator also provides phone service.
The revolution will be mocked
whoops, I forgot to say OUTSIDE of a city you can use a propane generator FROM A PROPANE TANK. Which, of course, means it can still function after a 'quake. And if you live in someplace where it's legal to have a tank AND where you can get city gas, you can get the best of both worlds.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
You lucked out, then. I've driven around Fairfax, Arlington and PG counties as well as DC today. I haven't seen a major road without some kind of debris blocking it, nor an area which has 100% power restored at this point.
This was a bad storm, but could certainly have been far worse. Even still, the grocers and stores are out of ice and people are swarming out of their homes like rats abandoning ship in some areas. These same people would be fucked if the S really HTF.
What else can happen when an unstoppable force collides with an immovable object?
My company uses Amazon Web Services to host some of our product, and I got a call at 7 am to help bring our stuff back up. A bunch of our instances were stopped, and a bunch of Elastic Block Store volumes were marked Impaired. We're working on making our environment more "cloudy" to make better use of multiple availability zones, regions, and automation to better survive an outage like this, but we're not there yet.
Didn't you get the memo? Netflix barely runs now and this is working as planned. Time Warner had four internet outages in Raleigh THIS WEEK.
Everything everywhere is slowly grinding to a halt. So let's send more work to China and India. Who cares anymore.
We don't have downtime. We have "uptime problems."
systemd is Roko's Basilisk.
"No, if you are a professional stuff doesn't 'happen'"
No, if you are a professional you evaluate risks and adjust your behaviour to an acceptable level and you don't expend a bazillion to protect half a bazillion.
For example, Google designed their applications in a way that tolerates a failing server: what's the benefit in their case of going with RAID10, doubled PSUs and hot-swappable RAM and CPUs? What does that bring to the table but lost money?
Amazon offers people outside the Fortune 100 the ability to do the same, only at the datacenter level. But then, if you can withstand a whole datacenter failure by properly using the services they offer, what's the advantage of spending what it takes to make their datacenters five nines instead of four?
"They are still amateurs"
They are there for the money and they are making a lot of money: that's what makes them professionals.
I'll tell you who's being unprofessional: all those who think that their critical services are properly protected within a single datacenter just because they read it was "the cloud" in a colourful brochure.
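To put rough numbers on the four-nines versus five-nines argument above, here is a minimal sketch. It assumes datacenter failures are independent, which real outages often are not (shared software, shared control planes, one regional storm), so treat the result as an upper bound. The function name and the 0.9999 figure are illustrative, not anything Amazon publishes.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that stays up if at least one of
    `replicas` datacenters is up, ASSUMING independent failures."""
    return 1.0 - (1.0 - single) ** replicas


four_nines = 0.9999  # illustrative per-datacenter availability

for n in (1, 2, 3):
    print(f"{n} datacenter(s): {combined_availability(four_nines, n):.10f}")
```

Under that independence assumption, two four-nines sites already come out ahead of a single five-nines site, which is the trade-off the comment is describing.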
To migrate Click Here!
At least for those that have a DR migration plan.
Got Code?
What are you, 14? Democracies don't like War, because they don't like their sons, fathers, brothers, and husbands getting killed. It generally takes quite a lot to motivate Democracies into war, because of the hatred of casualties. Even when it is the best option. Example: going to war against Hitler in 1934, or 1936, or in 1938.
Out here in the real world, the sum total of human experience suggests a strong military is like insurance or a seat belt. You hope you never have to use it, but it's a godsend if you need it. Indeed, having a strong military deters attacks. Nobody goes down to Venice Beach to pick fights with body builders, or down to the Gracies' gym to start fights.
Like insurance, working out, eating right, and avoiding bad areas, a strong military is a pain in the ass. It costs a lot. It is a pain and non-productive to maintain. And sure, you could save a lot by going without auto or health insurance. You could eat more cheaply at McDonald's than by cooking healthy meals at home. It's cheaper to live in the ghetto than in a nice area.
As far as market value of defense stocks, the market capitalization of Lockheed Martin is 28.27 Billion, of Apple Computer 546.08 Billion. The market value of L'Oreal at 54.83 billion is about twice that of Lockheed Martin, suggesting lipstick pays a lot more than military avionics. Defense firms since their inception have been very cyclical, made relatively little money, and are merging like crazy as war spending winds down. But unless you're going to change human nature with Harry Potter's magic wand, carrying otherwise unprofitable defense firms is worth it because making drones, airplanes, missiles, tanks, ships, and helicopters to kill well-armed enemies is a very narrow engineering niche with knowledge quickly lost.
As soon as your computer runs on unicorn farts and rainbows, we can all forget about dominance in the Persian Gulf and other oil areas. Until then, I'd prefer to drive to work and run the AC not live like a dirty smelly hippie. That AC making life bearable in 118F Kansas? Runs on oil not tree-hugging and drum circles.
I have a UPS for my cable modem, router, Ooma box, and wireless phone so VOIP will still work in an outage, if the cable signal is up (i.e. even with my computer turned off). Whether I can actually expect the cable to be up in an outage, I have no idea.
Well, this is America, you are welcome to your belief, even if it's horribly wrong.
---- Booth was a patriot ----
I personally think it's funny that people would even say that (the "if you're a professional, stuff doesn't happen" BS). As someone who works in the infrastructure business, I can tell you with 100% certainty that no design, location or setup will be perfect. Regardless of how well you plan, you are one natural disaster away from a service interruption, and any single point in the system can be taken down by some guy in a backhoe digging where he shouldn't.
Even if you designed a data center with 100 layers of redundancy on power and connectivity, there is a damn good chance that all those communication, power and other lines go through a single point somewhere miles away from the data center, probably where they all cross the interstate or a river. Infrastructure just doesn't have that much redundancy, and in the real world there are lots of places where there is a single crossing, be that a river, interstate or any other property or natural feature that restricts access. So one guy in a backhoe digging where he shouldn't can do things like take out a whole city's power and communication lines. It's not common but it does happen.
"You have exceptionally low standards."
No, I don't.
I have standards tied to reality and I know Amazon offers[1] "a Monthly Uptime Percentage [...] of at least 99.9% during any monthly billing cycle". On top of that, I know what the value of an SLA exactly is.
[1] http://aws.amazon.com/s3-sla/
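For a sense of scale, here is a small sketch of what a monthly uptime percentage translates to in minutes. It assumes a 30-day billing cycle, and the function names are made up for illustration; the 540 minutes is the nine-hour downtime another commenter in this thread reports for their instances.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a cycle of `days` days."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_percent) / 100.0


def uptime_percent_for(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage over `days` days given total downtime in minutes."""
    total_minutes = days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {allowed_downtime_minutes(pct):.2f} minutes down")

print(f"9 hours down in 30 days is {uptime_percent_for(9 * 60):.2f}% uptime")
```

So 99.9% leaves roughly 43 minutes a month, and a nine-hour outage on its own drops a 30-day month to about 98.75%.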
"I personally think it's funny that people would even say that (the "if you're a professional, stuff doesn't happen" BS). As someone who works in the infrastructure business, I can tell you with 100% certainty that no design, location or setup will be perfect."
Truly. In fact, the professional is the one who knows that shit happens, how often a certain kind of shit recurs, how it will impact the business, and what countermeasure gives the best bang for the buck: sometimes it's preventing the shit from happening, sometimes it's accepting that it will happen but keeping it from affecting the business, and sometimes it's letting the shit happen and just crossing your fingers that it doesn't happen on your watch. Are you really covering a global nuclear war scenario, really?
It's easy to say "this wouldn't happen to me" ...when you are not in a position that this could happen to you.
It isn't Amazon's job to set up a DR plan for their entire datacenter. That is the customer's job. You can pay for Amazon support and they will gladly help you set up a DR plan for your setup, but there is no way whatsoever that it is their responsibility.
They provide you with an easy way to rent their hardware from 1 or more locations, and you can build in redundancy into your own system.
Meh, it's PEPCO for the most part. They wouldn't have been working Friday night anyway. Ought to be an interesting bit of discussion with the utility commission regarding their current desired rate hike.
Jesus was all right but his disciples were thick and ordinary. -John Lennon
I'm sorry, but if your service is taken down by a single data center failure, you are not using the cloud to its full potential. Data centers do go down, drop out of sight, or otherwise become unusable now and again. Plan on it, design for it, and use the tools available to manage it.
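As a very rough sketch of "plan on it, design for it": pick the first endpoint that answers a health probe and fall through to the next zone when one is dark. Everything here is hypothetical (the zone hostnames, the stubbed probe, the function name); a real setup would use an actual health check plus DNS or load-balancer failover rather than a loop in the client.

```python
from typing import Callable, Iterable, Optional


def first_healthy(endpoints: Iterable[str],
                  is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None."""
    for endpoint in endpoints:
        try:
            if is_up(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    return None


# Hypothetical zone hostnames; the stub probe stands in for a real health check.
ZONES = ["zone-a.example.com", "zone-b.example.com", "zone-c.example.com"]
DOWN = {"zone-a.example.com"}


def stub_probe(endpoint: str) -> bool:
    return endpoint not in DOWN


print(first_healthy(ZONES, stub_probe))
```

The point is only that the fallback path has to exist and be exercised before the outage, not invented at 7 am during one.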
And your ip phone is going to work when your house has no power?
Yes, they are supposed to. That's why a VoIP cable modem has a battery in the unit, to ensure you can still communicate during normal power outages. If you're going to be without power for a week nothing short of a generator or POTS is going to help (ok some voice only cellphones can go a week in standby).
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.