Slashdot Mirror


Amazon EBS Failure Brings Down Reddit, Imgur, Others

Several readers have sent word of a significant Amazon EBS outage. Quoting: "Amazon Web Services has confirmed that its Elastic Block Storage (EBS) service is experiencing degraded service, leading sites across the Internet to experience downtime, including Reddit, Imgur and many others. AWS confirmed on its status page at 2:11 p.m. ET that it is experiencing 'degraded performance for a small number of EBS volumes.' It says the issue is restricted to a single Availability Zone within the US-East-1 Region, which is in Northern Virginia. AWS later reported that its Relational Database Service (Amazon RDS) and its Elastic Beanstalk application platform also experienced failures on Monday afternoon."

25 of 176 comments

  1. Productivity up by Phisbut · · Score: 5, Funny

    Productivity reached a record high this afternoon.

    --
    After 3 days without programming, life becomes meaningless
    - The Tao of Programming
    1. Re:Productivity up by fustakrakich · · Score: 5, Funny

      Should we expect a baby boom in nine months?

      --
      “He’s not deformed, he’s just drunk!”
    2. Re:Productivity up by MichaelSmith · · Score: 4, Funny

      Not from reddit users (or slashdotters for that matter).

  2. But But But by Anonymous Coward · · Score: 5, Insightful

    It's the cloud! It's like never like down, and webscale!

  3. Interestingly enough... by Anonymous Coward · · Score: 5, Funny

    Since no one can go on reddit, they will come back to /. only to find out why reddit is down!

  4. Other Victims by Revotron · · Score: 4, Informative

    Coursera is also down as a result.

  5. define "leading" ... by magarity · · Score: 4, Funny

    /. is working just fine.

    Are those karma points in the mail?

  6. Oblig by sortius_nod · · Score: 4, Funny

    It's as if millions of geek voices cried out in terror & were suddenly silenced.

  7. Same region as the storm in June by bill_mcgonigle · · Score: 4, Informative

    Bad luck if you're hosted in the US-East-1 Region, I guess.

    Heh, I should really start advertising the LVS clusters I tend to as 'private clouds with better uptime than Amazon'.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  8. Low Availability? by mkosmo · · Score: 4, Interesting

    I have to admit, due to this outage I just logged in to Slashdot for the first time in a year. We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages. This makes me wonder why Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage.

    1. Re:Low Availability? by Anonymous Coward · · Score: 4, Informative

      >Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage

      They do. It's a multi-AZ outage, despite what Amazon is saying.

    2. Re:Low Availability? by segedunum · · Score: 5, Interesting

      We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages.

      I used to be an admin working on AWS through some of these outages, and it's not pleasant let me tell you. The amount of redundancy you need to get through this makes putting stuff in the cloud prohibitively expensive and things are basically out of your hands. When you run your own servers you know how long it will take to replace a piece of hardware or take emergency measures to keep things running. At least you know you have control over the process. Amazon? They recover what they can of your EBS disks in a few days without telling you anything and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

      Using AWS for throwaway computing where you just want some computing power for a few weeks of the year? Yes, fine. Permanently running stuff in it? Nope.

    3. Re:Low Availability? by segedunum · · Score: 5, Interesting

      They do. It's a multi-AZ outage, despite what Amazon is saying.

      Amazon's multiple availability zones stuff is total bullshit. It has become painfully apparent during every single one of these outages that the so-called availability zones are not separate because an EBS problem propagates everywhere. No one can actually work the availability zones out either because what Amazon cunningly does is call zones by different letters for different customers, so availability zone 'a' for one might be availability zone 'c' for another so no one can actually compare. That fact alone sent my bullshit meter off the scale. It just seems excessively evasive and sneaky for my taste.

      If you want redundancy you are going to have to go to completely geographically separate zones. Keeping those zones in sync is prohibitively expensive for the vast majority. Either that or you have a backup cloud provider, but again you have to be so paranoid and trust Amazon so little that you have to be able to have your data out and off Amazon's infrastructure at least nightly at a moment's notice. Sorry, but that just doesn't work.

    4. Re:Low Availability? by segedunum · · Score: 4, Insightful

      Remember that availability zone 'a' might be 'd' for others. Amazon does not let you work out what availability zones everyone really has.
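
To illustrate the point above about zone letters differing between customers, here is a minimal sketch. It assumes, purely for illustration, that each account gets its own stable shuffle of letters onto physical zones. The zone identifiers, account names, and seeding scheme are all hypothetical and are not a description of how AWS actually assigns letters.

```python
import random

# Hypothetical physical zone identifiers; the real internals are not public.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4"]
LETTERS = ["a", "b", "c", "d"]


def az_mapping(account_id: str) -> dict[str, str]:
    """Return an illustrative name -> physical-zone mapping for one account.

    The shuffle is seeded by the account ID, so each account gets a stable
    but account-specific assignment. This is an assumption of the sketch.
    """
    rng = random.Random(account_id)
    zones = PHYSICAL_ZONES[:]
    rng.shuffle(zones)
    return {f"us-east-1{letter}": zone for letter, zone in zip(LETTERS, zones)}


def name_for_zone(mapping: dict[str, str], physical_zone: str) -> str:
    """Find which account-visible name points at a given physical zone."""
    for name, zone in mapping.items():
        if zone == physical_zone:
            return name
    raise KeyError(physical_zone)


if __name__ == "__main__":
    # Each hypothetical account may report a different letter for the same
    # failing physical zone, so comparing letters tells customers nothing.
    for account in ("account-A", "account-B", "account-C"):
        mapping = az_mapping(account)
        print(account, "sees failing zone-2 as", name_for_zone(mapping, "zone-2"))
```

Under these assumptions, two customers who both report trouble in 'a' cannot tell from the letter alone whether they share a physical zone.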

    5. Re:Low Availability? by eln · · Score: 4, Funny

      That's old Web-2.0 thinking. We're in the era of the cloud now, and the cloud is magic. Trust the cloud.

    6. Re:Low Availability? by segedunum · · Score: 4, Interesting

      ....and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

      I felt this was worth emphasising. These are EBS snapshots, not just the EBS disks - the ones supposedly stored in S3 and immune to corruption. Your backups, in other words. If you use RDS you rely on these completely for backup.

      AWS is OK to get yourself up and running without paying huge amounts up front for hardware, but be aware that you just simply cannot trust this infrastructure.

    7. Re:Low Availability? by hawguy · · Score: 4, Insightful

      Seems to me that the answer is just to host things yourself, instead of relying on another company's infrastructure.

      How do you host anything without relying on another company's infrastructure? Do you purchase rights-of-way between your site and all of your customers and string your own fiber? Do you run your own power plant? Do you build your own UPS, right down to the batteries so you don't need to trust a UPS vendor? Do you build and service your own CRACs?

      It's impossible for any company to *not* rely on another company's infrastructure even if just for internet connectivity, the only question is where to draw the line - do you really want to rack and stack your own servers? Do you trust a vendor to do periodic preventative maintenance on your generators, or do you use your own staff? Do you certify your own staff to service your fire suppression system, or do you contract out to a vendor? Do you want to own your own network equipment and do your own network admin? Do you want to swap out servers and disk drives when they fail? Do you keep staff electricians on-hand to take care of electrical issues? Do you want to run a 24x7 NOC to monitor and maintain your datacenter?

      While a large company may be able to keep many of these tasks in-house, many small companies can't afford the staff it would take to control all of their infrastructure.

    8. Re:Low Availability? by segedunum · · Score: 4, Interesting

      Multi AZ IS "completely geographically separate zones" and yes...

      Availability zones are not geographically separate nor is there any evidence that they are geographically or even logically separate from the nature of every major EBS outage there has been.

      Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East)

      a, b, c and d are availability zones. US East, West etc. are different regions. I'm afraid you're not understanding just what is meant by availability zones or just muddying the waters.

      I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.

      As I've said above, Amazon makes it as difficult as possible to verify availability zone failures because AZ 'a' for one customer might be 'c' for another and 'b' for another, so you can't verify anything with others. However, it becomes very clear when you get on Amazon's forums and look at major sites that have implemented in multiple zones from their perspective that they are down and have EBS problems in different zones they have. You don't get much more evidence than that.

      If you're not concerned when looking at that then I smell some apologism I'm afraid.

  9. Bright and Sunny Skies Today! by IonOtter · · Score: 4, Insightful

    Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?

    --
    [End Of Line]
    1. Re:Bright and Sunny Skies Today! by gstoddart · · Score: 4, Interesting

      Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?

      My first thoughts as well.

      A friend was recently telling me about an issue they were having at work ... they host stuff for other people, and have very high-availability SLAs. Unfortunately, the support they have from some of their own internal people is "weekdays 9-5". So when an outage happened, they were dead in the water, because their own people basically said "sorry, we don't do after hours support".

      Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.
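
The weakest-link point can be made concrete with a little arithmetic. This is a rough sketch that assumes every component in the chain must be up and that failures are independent, which real outages often are not. The availability figures are hypothetical and chosen only for illustration.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures; correlated failures make things worse.
    """
    return prod(components)


def downtime_hours_per_year(availability: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    # Hypothetical figures: a well-run service sitting on a weaker vendor.
    own_service = 0.9999
    vendor = 0.995
    combined = serial_availability([own_service, vendor])
    print(f"combined availability: {combined:.4%}")
    print(f"expected downtime: {downtime_hours_per_year(combined):.1f} hours/year")
```

Because the product can never exceed its smallest factor, the combined figure is always at or below the vendor's own availability.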

      For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

      --
      Lost at C:>. Found at C.
    2. Re:Bright and Sunny Skies Today! by Anonymous Coward · · Score: 5, Funny

      For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

      That's because you don't have an MBA.

    3. Re:Bright and Sunny Skies Today! by hawguy · · Score: 4, Insightful

      Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.

      For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

      Because many companies are not willing to spend what it takes to get availability greater than what they can get at Amazon - especially if they take advantage of multi-AZ or multi-region redundancy.

      Sure, having a physical server at the office that you know you can fix by buying parts at the local computer store sounds attractive. Until the day you find that your building has burnt to the ground. Or a truck knocked over the utility pole providing network and electricity to your building. Or you discover that when you looked at the flood maps to make sure you weren't in a flood zone, the maps didn't account for a water main breaking and flooding the basement where your telecom equipment is... or the clogged roof drains that let 20,000 gallons of water build up on the roof during a rainstorm until the roof collapsed and flooded your datacenter. Or the earthquake (or hurricane or tornado or flood or whatever) that takes down your site for days or weeks or even months, and your employees are more concerned with surviving than trying to get your critical systems back online.

      Meeting an SLA for your own facility only works when that facility is running, and often the company that rents office space has little control over the facility.

      My company has a number of critical services running in one Amazon region with replication to a second region for failover. The second region costs very little, just a single instance to hold data replicated from the primary instance, then if we need to spin up the servers in the secondary region, it takes about 10 minutes to push the data from the local copy to the other servers once we start them up.

      We could automate the whole process, but Amazon problems are rare enough that it hasn't been worth it.

      We do have a couple servers in us-east-1a but so far those servers appear to be fine, although the AWS management interface has not been working for managing servers in that region/AZ. If we ran servers out of our local office instead of Amazon, we would have had at least 2 instances of complete downtime in the past year - one 3 hour internet outage, and a 48 hour power failure on a weekend when a transformer blew and the power company didn't have an available spare and had to truck it in from out of area.
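
For a sense of scale, the two office outages mentioned above (3 hours and 48 hours) can be turned into a yearly availability figure. This is a back-of-the-envelope sketch that assumes a 365-day year and counts only those two incidents.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_from_outages(outage_hours: list[float],
                              period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the site was up."""
    return 1.0 - sum(outage_hours) / period_hours


if __name__ == "__main__":
    # The two office outages from the comment: 3 hours and 48 hours.
    office = availability_from_outages([3, 48])
    print(f"office availability over the year: {office:.3%}")
```

That is 51 hours down out of 8760, or roughly 99.4% availability for the hypothetical office-hosted setup.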

  10. Re:I hope this doesn't affect Facebook. by sortius_nod · · Score: 4, Interesting

    I'm just glad I moved my hosting away from AWS. It seems they've had a few problems lately in their datacentres. Local Aussie hosting seems to have better bandwidth anyway.

  11. wow, mainframe problems in the cloud by Dan667 · · Score: 4, Insightful

    If only there were some lessons learned over decades and decades of mainframe use that could be applied to the cloud.

  12. Re:multi AZ? by segedunum · · Score: 4, Interesting

    Do you have any evidence of this? Because I haven't seen any. And it sounds tin-foil-hat.

    Sites that are deployed across multiple zones are down and the forums are full of customers who complain about EBS slowdowns and problems regardless of the availability zones they personally use. You're an apologist if you haven't grokked this yet.

    Actually, I run a load-balanced, redundant site on AWS. I ask the question because Multi-AZ (as defined by AWS) means geographically different...

    This is total rubbish. Availability zones are not geographically separate, and don't give me that 'as defined by AWS' crap to give yourself a back door (they don't, anyway). Expanding to multiple regions which is the only thing you can do is not the same thing.

    as in US West (in Oregon) vs US East (in Virginia) - NOT just the difference between US-East-1a,b,c,d (which Amazon makes very clear are in the same data center). That's why it's odd that Virginia's issues would affect Oregon (or any of the other AZs)

    No, Amazon is very, very clear on what an availability zone actually is. Stop trying to make AZs out to be separate regions to get yourself out of this. They are not.

    Try being helpful next time and answering the genuine question instead of smarting off because you can't get on reddit.

    I'm afraid you don't run any geographically separate system that spans multiple regions because it is prohibitively expensive to do so. You don't maintain AMIs and backups in different regions and you don't pay for the extremely large amount of bandwidth you need to keep those regions mirrored and synchronised.

    Sorry, but you aren't doing what you say you're doing and you don't know what the difference between availability zones and regions actually is, which was central to the question you asked. You were called out on it.