Slashdot Mirror


Amazon EBS Failure Brings Down Reddit, Imgur, Others

Several readers have sent word of a significant Amazon EBS outage. Quoting: "Amazon Web Services has confirmed that its Elastic Block Storage (EBS) service is experiencing degraded service, leading sites across the Internet to experience downtime, including Reddit, Imgur and many others. AWS confirmed on its status page at 2:11 p.m. ET that it is experiencing 'degraded performance for a small number of EBS volumes.' It says the issue is restricted to a single Availability Zone within the US-East-1 Region, which is in Northern Virginia. AWS later reported that its Relational Database Service (Amazon RDS) and its Elastic Beanstalk application platform also experienced failures on Monday afternoon."

9 of 176 comments

  1. Low Availability? by mkosmo · · Score: 4, Interesting

    I have to admit, due to this outage I just logged in to Slashdot for the first time in a year. We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages. This makes me wonder why Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage.

    1. Re:Low Availability? by segedunum · · Score: 5, Interesting

      We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages.

      I used to be an admin working on AWS through some of these outages, and it's not pleasant, let me tell you. The amount of redundancy you need to get through this makes putting stuff in the cloud prohibitively expensive, and things are basically out of your hands. When you run your own servers you know how long it will take to replace a piece of hardware or take emergency measures to keep things running. At least you know you have control over the process. Amazon? They recover what they can of your EBS disks in a few days without telling you anything, and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

      Using AWS for throwaway computing where you just want some computing power for a few weeks of the year? Yes, fine. Permanently running stuff in it? Nope.

    2. Re:Low Availability? by segedunum · · Score: 5, Interesting

      They do. It's a multi-AZ outage, despite what Amazon is saying.

      Amazon's multiple availability zones stuff is total bullshit. It has become painfully apparent during every single one of these outages that the so-called availability zones are not separate, because an EBS problem propagates everywhere. No one can actually work the availability zones out either, because what Amazon cunningly does is call zones by different letters for different customers. Availability zone 'a' for one customer might be availability zone 'c' for another, so no one can actually compare. That fact alone sent my bullshit meter off the scale. It just seems excessively evasive and sneaky for my taste.
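To make the naming point concrete, here is a minimal sketch of why two customers cannot compare notes by letter alone. It assumes each account gets its own shuffled mapping from letters to underlying zones; the zone names, account ids and the seeded shuffle are all hypothetical and are not a description of how Amazon actually assigns names.

```python
import random

# Hypothetical underlying zones and the letters customers see.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4"]
LETTERS = ["a", "b", "c", "d"]


def zone_map(account_id):
    """Return a per-account mapping from zone letter to underlying zone.

    The shuffle is seeded with the account id, so each account always
    sees the same mapping, but different accounts may see different ones.
    """
    rng = random.Random(account_id)
    shuffled = PHYSICAL_ZONES[:]
    rng.shuffle(shuffled)
    return dict(zip(LETTERS, shuffled))


if __name__ == "__main__":
    # Two hypothetical customers both reporting trouble in "their" zone 'a'.
    for account in ("customer-one", "customer-two"):
        print(account, "sees 'a' as", zone_map(account)["a"])
```

Under this assumption, two customers who both report problems in 'a' may or may not be talking about the same underlying zone, which is why forum reports keyed by letter are hard to reconcile.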

      If you want redundancy you are going to have to go to completely geographically separate zones. Keeping those zones in sync is prohibitively expensive for the vast majority. Either that or you have a backup cloud provider, but again you have to be so paranoid and trust Amazon so little that you have to be able to have your data out and off Amazon's infrastructure at least nightly at a moment's notice. Sorry, but that just doesn't work.

    3. Re:Low Availability? by segedunum · · Score: 4, Interesting

      ....and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

      I felt this was worth emphasising. These are EBS snapshots, not just the EBS disks - the ones supposedly stored in S3 and immune to corruption. Your backups, in other words. If you use RDS you rely on these completely for backup.

      AWS is OK to get yourself up and running without paying huge amounts up front for hardware, but be aware that you just simply cannot trust this infrastructure.

    4. Re:Low Availability? by segedunum · · Score: 4, Interesting

      Multi AZ IS "completely geographically separate zones" and yes...

      Availability zones are not geographically separate, and judging by the nature of every major EBS outage there has been, there is no evidence that they are even logically separate.

      Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East)

      a, b, c and d are availability zones. US East, West etc. are different regions. I'm afraid you're not understanding just what is meant by availability zones or just muddying the waters.

      I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.

      As I've said above, Amazon makes it as difficult as possible to verify availability zone failures because AZ 'a' for one customer might be 'c' for another and 'b' for another, so you can't verify anything with others. However, it becomes very clear when you get on Amazon's forums and look at major sites that have deployed in multiple zones: from their perspective they are down and have EBS problems in the different zones they use. You don't get much more evidence than that.

      If you're not concerned when looking at that then I smell some apologism I'm afraid.

  2. multi AZ? by i_hate_robots · · Score: 3, Interesting

    An honest question: why don't these large, big-name sites utilize the Multi Availability Zone failover that Amazon offers? It seems these AWS outages make for good headlines, but shouldn't any large site be co-located in multiple physical locations to ensure uptime? If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...

    1. Re:multi AZ? by segedunum · · Score: 4, Interesting

      Do you have any evidence of this? Because I haven't seen any. And it sounds tin-foil-hat.

      Sites that are deployed across multiple zones are down, and the forums are full of customers complaining about EBS slowdowns and problems regardless of the availability zones they personally use. You're an apologist if you haven't grokked this yet.

      Actually, I run a load-balanced, redundant site on AWS. I ask the question because Multi-AZ (as defined by AWS) means geographically different...

      This is total rubbish. Availability zones are not geographically separate, and don't give me that 'as defined by AWS' crap to give yourself a back door (they don't, anyway). Expanding to multiple regions, which is the only thing you can do, is not the same thing.

      as in US West (in Oregon) vs US East (in Virginia) - NOT just the difference between US-East-1a,b,c,d (which Amazon makes very clear are in the same data center). That's why it's odd that Virginia's issues would affect Oregon (or any of the other AZs)

      No, Amazon is very, very clear on what an availability zone actually is. Stop trying to make AZs out to be separate regions to get yourself out of this. They are not.

      Try being helpful next time and answering the genuine question instead of smarting off because you can't get on reddit.

      I'm afraid you don't run any geographically separate system that spans multiple regions because it is prohibitively expensive to do so. You don't maintain AMIs and backups in different regions and you don't pay for the extremely large amount of bandwidth you need to keep those regions mirrored and synchronised.

      Sorry, but you aren't doing what you say you're doing and you don't know what the difference between availability zones and regions actually is, which was central to the question you asked. You were called out on it.

  3. Re:Bright and Sunny Skies Today! by gstoddart · · Score: 4, Interesting

    Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?

    My first thoughts as well.

    A friend was recently telling me about an issue they were having at work ... they host stuff for other people, and have very high-availability SLAs. Unfortunately, the support they have from some of their own internal people is "weekdays 9-5". So when an outage happened, they were dead in the water, because their own people basically said "sorry, we don't do after hours support".

    Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.
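The weakest-link point can be put in rough numbers. This is a minimal sketch, assuming every component in the chain must be up for the service to be up and that failures are independent; the availability figures are made up for illustration and do not come from any real SLA.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities
    simply multiply together.
    """
    return prod(components)


if __name__ == "__main__":
    own_stack = 0.9999  # hypothetical: your own "four nines" service
    vendor = 0.995      # hypothetical: an external vendor's service level
    combined = serial_availability([own_stack, vendor])
    # The combined figure can never exceed the weakest component.
    print(f"combined availability: {combined:.4%}")
    print(f"weakest component:     {min(own_stack, vendor):.4%}")
```

However good your own stack is, the product stays at or below the vendor's figure, which is the sense in which the vendor caps your service level.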

    For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

    --
    Lost at C:>. Found at C.
  4. Re:I hope this doesn't affect Facebook. by sortius_nod · · Score: 4, Interesting

    I'm just glad I moved my hosting away from AWS. It seems they've had a few problems lately in their datacentres. Local Aussie hosting seems to have better bandwidth anyway.