More Uptime Problems For Amazon Cloud
1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
'Nuff said
For me, it is far better to grasp the Universe as it really is than to persist in delusion
I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.
We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.
[Fuck Beta]
o0t!
It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
You can only argue for so long that the extra costs and admin involved with self-hosting and paying competent IT staff outweigh the extra costs of cloud hosting. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud=reliable, and Google's/Amazon's/whoever's people are smarter than your in-house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.
So this is the second time this month Amazon's cloud has gone down. Given its extremely poor uptime record and extremely large customer base, serious questions should be asked about the sustainability of this service.
They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.
You would think having redundant power would be a fundamental, crucial thing to get right when owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.
Now, before people say "well, this was a major storm system that killed 10 people, what do you expect?", my first point is that cloud computing is expected to do work for customers hundreds or thousands of kilometres/miles from the actual data centre, so this is a crucial thing we're talking about: millions of people literally depend on these services.
My second point is that it's not like anything happened to the data centre; it simply lost mains power. There was no fire or flood, and the roof didn't blow off the building; they simply lost power and failed to bring their millions of dollars' worth of equipment up to the task of picking up the load.
If I were a corporate customer, or even a regular consumer, I would be seriously questioning the sustainability of at least Amazon's cloud computing. Google and Facebook seem to be able to handle it, but not Amazon. Granted, they don't offer identical products, but their data centres overall seem to stay up 100% or 99.9999999% of the time, unlike Amazon's.
However, Netflix, "which uses an architecture designed to route around problems at a single availability zone," seems to have efficiently spread the pain of a Northeastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.
Nullius in verba
http://www.pepco.com/home/emergency/maps/stormcenter/
-- IANAL, this isn't legal advice, and definitely isn't legal advice for you. Also, Squee!
Instagram's servers in that data center were also affected, and more people griped about that on my Facebook feed than about Netflix.
As for "an electrical storm", that's a bit of an understatement. The issue was actually more the 80 mph wind gusts, as well as the lightning continuing for 2 hours after the wind and rain had passed (meaning crews couldn't get out there overnight).
The result is some 2 million people without power, 1 million around DC alone. Dominion Power (which serves the area where the data center resides, about 5 miles from my house) lost power for more than half of its Northern Virginia customers, and even now has restored power to only about 60,000* of the 461,000 who lost it. On the Maryland/DC side of the Potomac, half a million people may be without power for days through a heat wave hitting 100 degrees every day (with more storms like last night's on the way...).
* Fortunately that would include me... though I'm writing this via my Sprint phone as a Wi-Fi hotspot 'cause our cable modem is still down ;-)
"But remember, most lynch mobs aren't this nice." (H.Simpson)
-- Joe
It seems like the switching system failed and/or the backup power generators did not kick on.
Maybe natural gas ones are better. Firehouses have them, and I also see them at a big power substation.
I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).
According to TFA, they were only without power for half an hour, and that the ongoing problems were related to recovery, not actual power-lossage. So their problems are more "bad disaster planning" than "bad disaster".
Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.
If they don't have proper backup generators, they have no business running a data center.
---- Booth was a patriot ----
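The UPS-versus-generator point above can be made concrete with back-of-the-envelope arithmetic. This is a hypothetical Python sketch, not Amazon's actual setup: the battery capacity, load, 0.9 inverter efficiency, generator start time and shutdown time are all made-up numbers, and it assumes runtime scales linearly with stored energy, which real batteries only approximate.

```python
# Back-of-the-envelope UPS check. All figures are hypothetical, and the
# linear energy/load model ignores battery aging and discharge-rate effects.

def ups_runtime_minutes(usable_wh: float, load_w: float,
                        inverter_efficiency: float = 0.9) -> float:
    """Estimate minutes of battery runtime for a constant load."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return usable_wh * inverter_efficiency / load_w * 60.0


def outage_outcome(outage_min: float, runtime_min: float,
                   generator_start_min: float, shutdown_min: float,
                   generators_ok: bool = True) -> str:
    """Classify what happens to the load during a mains outage."""
    if runtime_min >= outage_min:
        return "rides through on battery"
    if generators_ok and runtime_min >= generator_start_min:
        return "bridged to generator"
    if runtime_min >= shutdown_min:
        return "clean shutdown"
    return "hard power loss"


if __name__ == "__main__":
    # Hypothetical: 20 kWh of usable battery feeding a 100 kW load.
    runtime = ups_runtime_minutes(20_000, 100_000)
    print(f"UPS runtime: {runtime:.1f} min")
    # A 30-minute outage (per TFA) where the generators fail to pick up the load.
    print(outage_outcome(30, runtime, generator_start_min=1,
                         shutdown_min=5, generators_ok=False))
```

Under these assumed numbers, even with failed generators the battery lasts long enough for a clean shutdown rather than a hard power loss, which is the difference the comment is pointing at.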
With cable, the nodes need power and their batteries will run down, and then the cable company needs to have portable generators on site at the nodes that have no power.
Phone systems have remote terminals (RTs), fewer of them than cable systems have nodes, that work the same way.
Amazon is a huge target - but how many other data centers went down in the Virginia area also? Did they come back up as fast as Amazon?
And Netflix is an Amazon Cloud customer... What's the matter with them? Are they just too dumb to host in house?
Why exactly would a cable operator bother with backup power? I mean, if the neighborhood has no power, then people aren't running TVs or computers (unless on laptops, but their modem would still be down). It is probably a different beast with something the size of an Amazon datacentre, though; they can probably go to the ISP and say "hey, look, we'll buy 5M a month of internet from you, but we need redundancy. Piss on all your home users for all we care, but we get internet no matter what."
Not only did I get annoyed for like 3 whole minutes last night at the tail end of the Netflix downtime, but I also couldn't download an important software patch from a vendor on Friday because it was hosted on the Amazon cloud download service thing. By the way, Netflix apparently doesn't have a damn thing for single-point-of-failure adaptation, seeing as how their entire website was down and wouldn't even respond to a ping. They can't even load a freaking "sorry, we're having problems" page on a backup host? Yeah, real adaptive. Oh, and good call hosting your site on the same service that your videos stream from. That's really smart.
How did this happen anyway? The cloud is magic...MAGIC!!!!!! You cannot destroy magic! It must have been a dark wizard. That or all the cloud product salesmen are full of shit.
Well, there are long runs from the headend to each neighborhood, so some areas may have power but hours later the cable goes out, as the lines pass through areas that don't have power.
If the power outage happened and the servers were back in, let's say, 30 minutes or 1 hour... alright. But 9 freakin' hours? That is the problem, not the power outage itself.
In my specific case I didn't suffer as much, because I have another instance in a different zone with DB replication and all that, serving as a backup server, and my project there, although very critical (20 people earn their wages from it), is very low on resource usage... I can imagine there were quite a lot of people who lost quite a lot of money because of this. It's really unacceptable for a DC to have 9 hours of downtime, whatever the reason... because that's just the standard people are used to.
I never experienced anything like this at any other company in the 10 years I've been working as a Linux admin... although at all those companies, I used real servers.
Because that cable operator also provides phone service.
The revolution will be mocked
Whoops, I forgot to say that OUTSIDE of a city you can use a propane generator FROM A PROPANE TANK, which, of course, means it can still function after a 'quake. And if you live someplace where it's legal to have a tank AND where you can get city gas, you can get the best of both worlds.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
You lucked out, then. I've driven around Fairfax, Arlington and PG counties as well as DC today. I haven't seen a major road without some kind of debris blocking it, nor an area which has 100% power restored at this point.
This was a bad storm, but could certainly have been far worse. Even still, the grocers and stores are out of ice and people are swarming out of their homes like rats abandoning ship in some areas. These same people would be fucked if the S really HTF.
What else can happen when an unstoppable force collides with an immovable object?
What datacenter was this? Was it a private Amazon datacenter or was it someone else's?
What real datacenter can't operate for a week without power? That's ridiculous!
All my data and media are still accessible. I never swallowed the cloud Kool-aid, though.
My company uses Amazon Web Services to host some of our product, and I got a call at 7 am to help bring our stuff back up. A bunch of our instances were stopped, and a bunch of Elastic Block Store volumes were marked Impaired. We're working on making our environment more "cloudy" to make better use of multiple availability zones, regions, and automation to better survive an outage like this, but we're not there yet.
Didn't you get the memo? Netflix barely runs now and this is working as planned. Time Warner had four internet outages in Raleigh THIS WEEK.
Everything everywhere is slowly grinding to a halt. So let's send more work to China and India. Who cares anymore.
We don't have downtime. We have "uptime problems."
systemd is Roko's Basilisk.
It seems to me the real problem here is an understanding of exactly what the term 'cloud' implies (and doesn't). As evident from comments across the web, cloud implies automated availability, redundancy, scalability, and management. Unfortunately, this is where the primary misunderstanding seems to occur with AWS: while it does provide all four of those features, it provides them within individual geographic regions and availability zones, not automatically across them. Whether this misunderstanding is a result of Amazon's marketing efforts or not, I cannot say, and I'm honestly too lazy right now on a Saturday to find citable sources.
In any case, as I've observed, many users are quick to note that AWS best practices recommend running boxes across multiple availability zones (and regions) to prevent service interruption in an event such as this. I have no idea what percentage of AWS users actually follow that best practice, but judging from the number of websites and services encountering problems, it seems to be somewhat low (full disclosure: completely my opinion here). Anyway, Amazon obviously has a huge PR and marketing problem here that it seems it can address in one of two ways: either better educate users and mandate that they follow the "optional/recommended/gonna-be-mandatory-soon" practice, or take the choice out of users' hands and make the system do it on its own. The latter would better match the term 'cloud' and all that it implies...
Oh, sure, you can have gov't come up with work projects, but none of them will be sustainable and useful
How do you figure highways and freeways to not be useful? Even if you don't drive on them, you use products that are transported on them. I don't see Coca-Cola or WalMart funding road building projects.
This boggles the mind. I work in power backup industry and data centers of the scale of Amazon's have redundancies on top of redundancies. Somebody isn't doing their job if they lost the critical load in a thunderstorm.
To migrate Click Here!
At least for those that have a DR migration plan.
Got Code?
What are you, 14? Democracies don't like war, because they don't like their sons, fathers, brothers, and husbands getting killed. It generally takes quite a lot to motivate democracies into war, because of the hatred of casualties, even when it is the best option. Example: going to war against Hitler in 1934, 1936, or 1938.
Out here in the real world, the sum total of human experience suggests a strong military is like insurance or a seat belt. You hope you never have to use it, but it's a godsend if you need it. Indeed, having a strong military deters attacks. Nobody goes down to Venice Beach to pick fights with bodybuilders, or down to the Gracies' gym to start fights.
Like insurance, working out, eating right, and avoiding bad areas, a strong military is a pain in the ass. It costs a lot. It is a pain and non-productive to maintain. And sure, you could save a lot by going without auto or health insurance. You could eat more cheaply at McDonald's than by cooking healthy meals at home. It's cheaper to live in the ghetto than in a nice area.
As far as market value of defense stocks, the market capitalization of Lockheed Martin is 28.27 Billion, of Apple Computer 546.08 Billion. The market value of L'Oreal at 54.83 billion is about twice that of Lockheed Martin, suggesting lipstick pays a lot more than military avionics. Defense firms since their inception have been very cyclical, made relatively little money, and are merging like crazy as war spending winds down. But unless you're going to change human nature with Harry Potter's magic wand, carrying otherwise unprofitable defense firms is worth it because making drones, airplanes, missiles, tanks, ships, and helicopters to kill well-armed enemies is a very narrow engineering niche with knowledge quickly lost.
As soon as your computer runs on unicorn farts and rainbows, we can all forget about dominance in the Persian Gulf and other oil areas. Until then, I'd prefer to drive to work and run the AC rather than live like a dirty smelly hippie. That AC making life bearable in 118°F Kansas? Runs on oil, not tree-hugging and drum circles.
I have a UPS for my cable modem, router, Ooma box, and wireless phone so VOIP will still work in an outage, if the cable signal is up (i.e. even with my computer turned off). Whether I can actually expect the cable to be up in an outage, I have no idea.
is a good idea? If I live in Vancouver, Canada, and I'm visiting you online, and you're a company based in Manchester, England, and a storm in Virginia, United States results in me not being able to access your site, something is very wrong with someone's business model. I'm the customer, so it's not me... who does that leave?!?
Clouds by their nature are insubstantial when it comes to being a solid footing upon which to build a business. Maybe this will be the wake-up call people need to pull their heads out of the cloud, and their bums, and host their own content rather than outsourcing it somewhere else.
Also worthy of note, a thunderstorm is more powerful than Anonymous, which you'll recall tried taking down Amazon over the Wikileaks donation stoppage thing, and had to admit they couldn't. Just saying.
Bummer.. no entertainment. If the outage lasts more than 24 hours, they'll credit you 1/30th of the month's fees.
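For the curious, the credit rule as described in that comment works out to very little. A minimal Python sketch, assuming the rule is exactly "more than 24 hours down earns 1/30th of the monthly fee" (the commenter's description, not a quote of any official policy) and using a hypothetical $7.99 monthly fee:

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def outage_credit(monthly_fee: Decimal, outage_hours: float,
                  threshold_hours: float = 24.0, divisor: int = 30) -> Decimal:
    """Credit 1/divisor of the monthly fee only if the outage exceeds threshold_hours.

    The threshold and divisor mirror the rule as described in the comment;
    both are assumptions, not a documented policy.
    """
    if outage_hours <= threshold_hours:
        return Decimal("0.00")
    # Decimal avoids float rounding surprises when dealing with money.
    return (monthly_fee / divisor).quantize(CENT, rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    fee = Decimal("7.99")  # hypothetical monthly fee
    print(outage_credit(fee, 9))   # an outage shorter than a day
    print(outage_credit(fee, 30))  # an outage longer than a day
```

Under this reading, any outage shorter than a full day earns nothing at all, and a longer one earns about 27 cents on the assumed fee.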
This is why you shouldn't be buying your mission-critical network services from an entertainment company. The mindset is wrong.
Meh, it's PEPCO for the most part. They wouldn't have been working Friday night anyway. Ought to be an interesting bit of discussion with the utility commission regarding their current desired rate hike.
Jesus was all right but his disciples were thick and ordinary. -John Lennon
I'm sorry, but if your service is taken down by a single data center failure, you are not using the cloud to its full potential. Data centers do go down, drop out of sight, or otherwise become unusable now and again. Plan on it, design for it, and use the tools available to manage it.
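"Design for it" can be as simple as keeping copies of a service in more than one zone and sending traffic to the first one that passes a health check. A minimal Python sketch of that idea; the zone names and addresses are hypothetical placeholders (documentation-range IPs), and the `is_healthy` callback stands in for whatever real probe (HTTP check, load balancer status) a deployment would actually use:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Endpoint:
    zone: str      # hypothetical zone label
    address: str   # documentation-range IP, not a real server


def pick_endpoint(endpoints: Sequence[Endpoint],
                  is_healthy: Callable[[Endpoint], bool]) -> Endpoint:
    """Return the first healthy endpoint, trying zones in preference order."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("no healthy endpoint in any zone")


if __name__ == "__main__":
    endpoints = [
        Endpoint("us-east-1a", "192.0.2.10"),
        Endpoint("us-east-1b", "192.0.2.20"),
        Endpoint("us-west-2a", "198.51.100.10"),
    ]
    down_zones = {"us-east-1a"}  # simulate losing one data center
    chosen = pick_endpoint(endpoints, lambda ep: ep.zone not in down_zones)
    print(chosen.zone)
```

Listing an endpoint in a second region as the last resort is what protects against a whole region going dark; the catch, as other commenters note, is that the data has to be replicated there for that copy to be useful.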
And your IP phone is going to work when your house has no power?
Yes, they are supposed to. That's why a VoIP cable modem has a battery in the unit, to ensure you can still communicate during normal power outages. If you're going to be without power for a week nothing short of a generator or POTS is going to help (ok some voice only cellphones can go a week in standby).
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.