Major Outage At the Amazon Web Services

Posted by ryuzaki0 on Thursday April 21, 2011 @03:45AM from the but-the-cloud-fixes-everything dept.

ralphart writes "The Northern Virginia datacenter for Amazon Web Services appears to be having a major outage that affects EC2 services. The Amazon Forums are full of reports of problems. Latest update from the status page: 2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution."

32 of 247 comments

No Way! by Frosty Piss · 2011-04-21 03:48 · Score: 5, Funny

But how can this be possible? It's The Cloud. This sort of thing simply doesn't happen.

--
If you want news from today, you have to come back tomorrow.
1. Re:No Way! by alphatel · 2011-04-21 03:51 · Score: 2
  
  It didn't happen. The cloud can erase history in a planck!
  
  --
  When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
2. Re:No Way! by Anonymous Coward · 2011-04-21 04:05 · Score: 2
  
  But it's not supposed to happen, because "if" (when!) it does, the impact is HUMONGOUS. "You're welcome to store all your data in our fast, easy and safe cloud storage. Downtime? Don't worry, it'll only experience hour long outages intermittently." Yeah, that's how they sold it in the first place, isn't it?
  This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
3. Re:No Way! by cduffy · 2011-04-21 04:08 · Score: 2, Insightful
  
  This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
  You can buy from one provider -- every major cloud provider has multiple availability zones. But yes, lots of people buy in only one zone because it's cheaper, and then suffer for that mistake -- in situations just like this.
4. Re:No Way! by 0123456 · 2011-04-21 04:10 · Score: 4, Informative
  
  A major outage on most professional cloud setups means it is down for a few hours. A major outage at work means the full day. It is like saying driving my car is so much safer than flying because I never got into an accident.
  Last time I remember a day-long outage at work was 1994, and that was because the license server failed so we couldn't run our own software (we couldn't recompile it to remove the DRM because the compiler also needed a license to run).
  I seem to remember that the Mac guys at the company also had a long outage when they couldn't connect to one of their Mac servers, but eventually someone actually went to the server room and discovered that it had been stolen.
  Back on topic, I just don't see all these day-long outages that apparently seem to happen all the time in companies that haven't moved their servers to The Cloud(tm).
5. Re:No Way! by TooMuchToDo · 2011-04-21 04:33 · Score: 2
  
  But when it's your gear, you have some control over the situation. When it's "in the cloud", you sit and get yelled at by the CXO and sweat if you'll still have a job while cloud provider X works to fix the problem (and their liability? whatever you paid for the service).
6. Re:No Way! by Synn · 2011-04-21 04:44 · Score: 2, Insightful
  
  "Back on topic, I just don't see all these day-long outages that apparently seem to happen all the time in companies that haven't moved their servers to The Cloud(tm)."
  You must not get out much. With Atlantic.net I had an 11-hour outage due to the staff not understanding how to update a Cisco router. Then a 4-hour outage when they screwed up billing and shut down our service with no warning. Then there was that time they didn't like our DNS traffic and shut down DNS with no warning or notice. That was a fun hour or so of me trying to figure out why our applications were having issues.
  So.. we go to replace them and at one of the places we visit (that was highly recommended), the first words outta their mouth were "So I'm sure you heard about that 6 hour outage we had xxx back. Here's what we've done to make sure it doesn't happen again..."
  Long outages happen. I'm pretty much in the firm belief that if your app can't scale out across large geography automatically the only thing giving you solid uptime is pure luck. Any time someone wants 5 9's with a single data center I just laugh. But setting up your app and data to work that way takes Work.
  Which, btw, would've also prevented outages for companies here. Only 1 zone was affected, EC2 also has zones on the west coast, Ireland and Asia pacific. If you built your app to use those and balanced via ELB, you likely wouldn't be impacted with this outage. But again, that takes $$$. Most companies don't want to spend that and frankly most companies probably don't need to.
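  To put rough numbers on the "5 9's with a single data center" point, here is a minimal Python sketch. It assumes that sites fail independently and that the service stays up as long as any one site is up. Both are simplifying assumptions, and the function names and example figures are illustrative only, not anything published by a provider.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of a service that is up whenever at least one site is up.

    ASSUMPTION: site failures are statistically independent. Correlated
    failures (shared control plane, shared network) make the real figure worse.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - site_availability) ** sites


def sites_needed(site_availability: float, target: float, max_sites: int = 20) -> int:
    """Smallest number of independent sites whose combined availability meets target."""
    for n in range(1, max_sites + 1):
        if combined_availability(site_availability, n) >= target:
            return n
    raise ValueError("target not reachable within max_sites")


if __name__ == "__main__":
    # Hypothetical figures: one "three nines" site versus a "five nines" target.
    print(sites_needed(0.999, 0.99999))
```

  Under these assumptions a single three-nines site cannot reach five nines, while two such sites can on paper. The independence assumption is exactly what tends to break in practice.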
7. Re:No Way! by lgw · 2011-04-21 05:19 · Score: 4, Insightful
  
  This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
  The cloud doesn't have to be perfect - it just has to be as good in the eyes of VPs as the contractors they'd otherwise hire to run their internal datacenter. What's the value of an IT guy in the eyes of an MBA? Yeah, this sort of reality check won't faze them at all.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
8. Re:No Way! by im_thatoneguy · 2011-04-21 06:01 · Score: 2
  
  We were out for a good portion of the day Monday after a bird flew into the telephone pole outside our office and then caused a critical server to go wonky after the UPS battery ran out and we didn't have the auto-shutdown settings correct.
Re:Reddit is down because of this by cobrausn · 2011-04-21 03:50 · Score: 5, Funny

You're posting on Slashdot, so I believe you already found the answer.

--
How does it feel to be a liar with pants constantly on fire?
Severe weather in Virginia likely the culprit by stopacop · 2011-04-21 03:50 · Score: 3, Informative

Severe weather hit the area. They shut down Surry Power Station in Surry County, Virginia after a tornado took out the power that supplies the station.

--
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
1. Re:Severe weather in Virginia likely the culprit by getagrip · 2011-04-21 03:59 · Score: 4, Informative
  
  I am in Northern Virginia. There is no power outage or severe weather here.
2. Re:Severe weather in Virginia likely the culprit by pdbaby · 2011-04-21 04:00 · Score: 2
  
  Amazon's Availability Zones are designed to have separate power, cooling and network so I don't think this is the issue. It was (is) a problem with their disk subsystem in multiple availability zones so I suspect they were in the process of pushing out some new storage controller code and some bug didn't appear until the later stages of their rollout. From their status log it looks like they're manually correcting the issue with each disk.
  
  --
  Global symbol "$deity" requires explicit package name at line 2. - If only $scripture started "use strict;"
3. Re:Severe weather in Virginia likely the culprit by metrometro · 2011-04-21 04:02 · Score: 2
  
  Amazon's comments on the outage do not mention weather as a cause: http://status.aws.amazon.com/
  "8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them."
4. Re:Severe weather in Virginia likely the culprit by xnpu · 2011-04-21 04:10 · Score: 2
  
  Those news reports do not rule out the possibility that he's in a place in Northern Virginia without severe weather or a power outage. How do you conclude that he is wrong?
5. Re:Severe weather in Virginia likely the culprit by Anonymous Coward · 2011-04-21 04:15 · Score: 2, Informative
  
  First: Please look at a map. Surry County is east of Richmond on the way to VA Beach. An outage at Surry Power Station would not affect a data center over in Dulles, VA. That power station does not serve this area at all.
  Second: Read the news. Every comment above is wrong in one way or another. Here is a local news article about what happened down there, if you are curious:
  http://www.examiner.com/progressive-in-richmond/surry-power-station-under-repair-the-aftermath-of-tornado
  You people know nothing, and you post crap without doing any research at all.
6. Re:Severe weather in Virginia likely the culprit by inject_hotmail.com · 2011-04-21 06:03 · Score: 2
  
  Why not put them on the roof? I think any datacenter designer would say that, first thing...I mean, they stored their precious depleted uranium and plutonium on the roof...why not the generators too?
  
  The real problem everywhere...and I do see it everywhere...is that the people paid to be the people that 'know' simply don't know, or have no sense of creativity or foresight. I mean come on, they built a tsunami wall because they have a high probability of tsunamis, and then they go and put the most mission-critical, life-saving, life-altering power generators in the path of a tsunami-we-are-protected-from+1. For crying out loud, when I moved into my house I very easily decided that I won't put anything in my basement that I -really- care about...and I'm not even -near- a flood plain...let alone on a coastal fissure infested area known as the frickin' ring-of-fire!
  
  I should be in charge of everything...that way crap would get done, it wouldn't be obsolescently planned, no one would die from corporate gree^H^H^H^H"mistakes", and life would be easy for everyone.
Re:Reddit is down because of this by jpmoney · 2011-04-21 03:59 · Score: 2

People still go to digg? Oh, I see what you did there.
I actually went to Digg this morning since Reddit is down. I haven't been in months since I removed them from my RSS reader. All I have to say is "ouch". Front page stories with a whopping 5 comments? It's pretty sad.

--
unf.
The dark side of outsourcing by HangingChad · 2011-04-21 04:03 · Score: 2

Slashdot and Digg have one day traffic surges because Reddit is down. I'm getting way too much done today not being distracted by the GoneWild girls. This productivity must cease at once!
It does go to show what can happen when your business depends on an outsourced provider. Everyone has to depend on service providers to some extent, but sometimes it's a good exercise to see how many of your company eggs are in one basket. Redundancy is expensive, but so is losing business. Even Google has had Gmail interruptions, lost some customer data and experienced slowdowns.

--
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
Re:Reddit is down because of this by badran · 2011-04-21 04:03 · Score: 2

Productivity in Offices will reach record levels today.
Emergency Plan by sycorob · 2011-04-21 04:11 · Score: 4, Interesting

I didn't even realize that one of our partners was using Amazon AWS until suddenly they were down all day. Amazon is really stable historically, but it's frustrating when you're out of business and all you can do is wait and see if Amazon will fix it soon.
In the "old school" thinking, smart companies have a redundant data center somewhere, humming along and waiting to be switched on if the main data center ever goes down. "The cloud" was supposed to solve that - massive redundancy within Amazon's services was supposed to protect you from outages. Not the case, apparently, since it looks like Amazon is going to fall below their promised 99.95% uptime (4.38 hours per year downtime).
I think the answer is to have redundant cloud services online, so you could switch from Amazon to Google or DevGrid if you had issues. The problem is, there's nothing quite like Amazon right now, it's not easy to switch from Amazon to some random service. This might be the biggest argument against virtual services - lack of standardization makes it hard to move from one to another, and hard to set up backup services in case of emergency.
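The 4.38-hour figure follows directly from the uptime percentage. A minimal Python sketch of the arithmetic, assuming a 365-day year and treating the SLA as a simple percentage of wall-clock time (the helper names are made up for illustration, and the actual SLA terms, such as what counts as "unavailable", are not modelled here):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0


def exceeds_budget(outage_hours: float, uptime_percent: float) -> bool:
    """True if the given outage total is larger than the yearly budget."""
    return outage_hours > allowed_downtime_hours(uptime_percent)


if __name__ == "__main__":
    # 0.05% of 8760 hours is the roughly 4.38 hours quoted above.
    print(round(allowed_downtime_hours(99.95), 2))
    # A hypothetical all-day (say 11-hour) outage would exceed that budget.
    print(exceeds_budget(11, 99.95))
```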
1. Re:Emergency Plan by MariusBoo · 2011-04-21 04:26 · Score: 3, Insightful
  
  Actually in the case of EC2 the smart thing would have been to have your instances spread over different availability zones...
2. Re:Emergency Plan by ron_ivi · 2011-04-21 04:27 · Score: 2
  
  Just using Amazon West as well as Amazon East would have saved customers from this outage.
  I think Amazon actually does great at covering all the technological single-points-of-failure.
  The only reason I'd want a second cloud vendor is for the sales/account related single-point-of-failure of the Amazon Account being frozen due to a sales miscommunication or an MPAA/RIAA takedown notice, etc.
3. Re:Emergency Plan by Alarash · 2011-04-21 04:31 · Score: 3
  
  Even by using only AWS you can set up redundancy across multiple North American regions. Even across continents, with one data center in Ireland and one in Singapore. But obviously it costs extra as they bill you the bandwidth between the regions. That's how you use The Cloud (c) (tm) (R). Using a single data center to set up redundancy is dumb because it's not redundancy. You need high availability for your VMs, but also for your data center.
  
  This is why banks or large businesses, for instance, have two or more data centers they always keep synchronized and have at least 50 kilometers between them. Thinking "well it's in one AWS data center so it's safe" is wrong, and this incident is a fine example of that.
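  The "spread your instances" advice in this subthread can be illustrated with a small Python sketch. It places instances round-robin across failure domains and counts what survives when one domain goes dark. The zone names are hypothetical placeholders, no cloud API is called, and the model assumes zones fail independently of each other.

```python
from itertools import cycle


def spread(instance_count: int, zones: list[str]) -> dict[str, int]:
    """Round-robin placement of instances across the given zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: 0 for zone in zones}
    # zip stops after instance_count items, so cycle() never runs away.
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return placement


def surviving(placement: dict[str, int], failed_zones: set[str]) -> int:
    """Number of instances left running when failed_zones are unavailable."""
    return sum(n for zone, n in placement.items() if zone not in failed_zones)


if __name__ == "__main__":
    # Hypothetical zone names, not real provider identifiers.
    single = spread(6, ["zone-a"])
    multi = spread(6, ["zone-a", "zone-b", "zone-c"])
    print(surviving(single, {"zone-a"}))  # everything was in the failed zone
    print(surviving(multi, {"zone-a"}))   # two thirds of capacity remains
```

  The sketch only helps if the failure domains really are independent; if one failure takes out several zones at once, the surviving count drops accordingly.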
4. Re:Emergency Plan by Anonymous Coward · 2011-04-21 05:01 · Score: 2, Informative
  
  50km is not a far enough distance. I witnessed this first hand for the employer I worked for on the Gulf Coast during Katrina. That storm jacked up about 120 miles, took down our primary AND failover sites.
5. Re:Emergency Plan by mikeytag · 2011-04-21 05:02 · Score: 2
  
  Nail on the head here. We were affected today and while I have full offsite backups of everything we don't have a second datacenter to switch on because of cost and complexity. It's not too difficult to have webservers span different parts of the globe, but DB servers like MySQL are a whole different story and usually very crucial.
6. Re:Emergency Plan by hey! · 2011-04-21 05:09 · Score: 2
  
  Actually, I'm more concerned about the *organization* as a single point of failure. If you rely on, say, Oracle (ugh), and Oracle goes bankrupt or a court orders them to stop selling their database or they simply decide to stop supporting some feature, you're still in business, and have a pretty good shot at moving to some similar database management system.
  If you built a mission critical system on Amazon's cloud services, a single court order not aimed at you could put you out of business. If Amazon was forced or decided to get out of the cloud hosting business, you'd have a heck of a time transitioning over to another cloud service because Amazon's services are so architecturally unique.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
7. Re:Emergency Plan by Slashdot Parent · 2011-04-22 03:36 · Score: 2
  
  It's also worth pointing out that all cloud SLAs are basically useless: if Amazon falls below their advertised uptime they'll refund you some of your charges - but they'll never refund more than what you've paid them: they don't compensate you for all the money you're losing (and the AWS charges are likely pocket change compared to this)
  FYI, I don't think this outage even falls under EC2's SLA. The Region was still technically on line. Only EBS was down.
  Granted, many customers depend heavily on EBS, but the SLA doesn't cover an outage in just one specific EC2 feature. That being said, I wonder if AWS will honor SLA claims anyway, as a PR move. This outage is just so clearly Amazon's fault: a network hiccup causes EBS to overload in one Availability Zone, which cascades into all Availability Zones in the Region.
  Personally, I think that they should honor SLA claims. But you're right, any money recovered would be chump change compared to the cost of the downtime.
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
6 weeks before the AWS summit 2011 by grapeape · 2011-04-21 04:26 · Score: 3, Interesting

Gotta wonder what kind of flak Amazon is going to take for this one. I've had a couple of clients looking into cloud services including moving to AWS and have already had one of them call me and cancel a meeting about it. While I understand stuff happens, the entire sales pitch for AWS was redundancy and build as you grow. Redundancy has obviously not worked in this case. While I usually support cloud services, this is definitely going to be a hard example to counter when trying to sell it to potential customers.
1. Re:6 weeks before the AWS summit 2011 by TooMuchToDo · 2011-04-21 04:39 · Score: 4, Informative
  
  It's not short sighted at all. When someone else runs your gear, all you can do is sweat until they get things back online, and they can take their time under what's known as "commercially reasonable SLAs". When you own your own gear, your own colo, etc., how much effort you use to get back up and running is up to you.
  "The Cloud" for mission critical businesses is a joke.
2. Re:6 weeks before the AWS summit 2011 by Slashdot Parent · 2011-04-22 03:54 · Score: 2
  
  Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.
  Not true. My application works just fine in multiple Availability Zones, yet it was knocked out yesterday due to an entire Region getting knocked offline.
  And before you tell me that the application should have been multi-Region, I'm not buying it. AWS has always maintained that deploying an app across multiple AZs is HA. AZs are supposed to be considered as separate datacenters: separate power, separate uplink, etc. And yes, separate EBS infrastructure (you can't attach an EBS volume to an instance that was launched in a different AZ). Multi-Region is for geographic reasons (reduced latency, compliance with EU data laws, etc.) or Disaster Recovery.
  In yesterday's case, a network hiccup triggered EBS to eat itself in one AZ. Fine, I'm totally cool with that. I understand that stuff happens. But for that EBS failure to bring EBS down in all Availability Zones, I am absolutely not cool with. That that happened reveals a serious architectural flaw in the supposed isolation between AZs. Make no mistake about it, it is a huge egg to the face of AWS's EBS team.
  Would making my app multi-Region have saved my bacon? Sure. And so would have deploying across multiple providers, etc. But the point is, I shouldn't have to do that. AWS told their customers that we don't have to do that. So as far as I'm concerned, nobody gets to say, "Well, you should have been multi-Region." That's just hindsight's 20/20 vision talking.
  Personally, it didn't take much effort to get my app back online. Most of the effort was me trying to decide whether to wait it out or go into DR mode. Around the time I decided to go ahead and restore in us-west-1, AWS got EBS-backed instances working in an AZ, so I just relaunched in us-east-1. In all, I didn't lose much. But some people really got hosed by this, and I can't say I blame them for being upset. They did the Right Thing, and they still got hosed.
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Re:Reddit is down because of this by Richard_at_work · 2011-04-21 04:46 · Score: 4, Informative

Don't worry - Slashdot just did something similar. When I try to reply to comments through my account's comment history page, it's horribly, horribly broken. Each attempt to click in the reply box loads a new comment further up in the comment tree, and scrolls the page to the newly loaded comment. Scroll back down, click in the box again and it loads another comment and shunts me back up the page. It can get really fucking annoying when you are trying to reply to a comment that's quite a way down a long tree.