Slashdot Mirror


How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
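
As a rough illustration of what "detects traffic fluctuations and adjusts server capacity" can mean, here is a minimal sketch of a proportional scaling rule with a hard fleet ceiling. It is not Amazon's internal system: the function name `desired_hosts`, the per-host capacity and the `max_hosts` limit are all assumptions made for this example. The second return value flags the "out of capacity for scaling" situation quoted above, where demand exceeds what the available hardware can supply.

```python
import math
from typing import Tuple


def desired_hosts(requests_per_sec: float,
                  per_host_capacity: float,
                  max_hosts: int) -> Tuple[int, bool]:
    """Return (hosts to run, True if demand exceeds the fleet ceiling).

    All parameters are hypothetical; a real autoscaler would use measured
    metrics and smoothing rather than a single instantaneous reading.
    """
    # Hosts needed to serve the current load, never fewer than one.
    needed = max(1, math.ceil(requests_per_sec / per_host_capacity))
    # The fleet cannot grow past the hardware that actually exists.
    return min(needed, max_hosts), needed > max_hosts


if __name__ == "__main__":
    print(desired_hosts(1000, 100, 50))
    print(desired_hosts(10000, 100, 50))
```

Once the second value is true, no scaling rule helps; the only options left are shedding load (for example a lighter front page or blocking some traffic) or finding more hardware.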

69 comments

  1. The whole point of "prime day" by darkain · · Score: 4, Interesting

    The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it. Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.

    1. Re:The whole point of "prime day" by Anonymous Coward · · Score: 4, Interesting

      That sounds good and that's probably what used to happen.

      Last year (when I was still employed by them) Prime day was a VERY big deal; they count on that toward the value of their stocks.

      Unfortunately, Amazon is a very reactionary company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but they were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.

      I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.

    2. Re:The whole point of "prime day" by Anonymous Coward · · Score: 2, Informative

      That is rarely the case.

      I have never seen Amazon or specifically EC2 to be an organization of blame. The reality is they ask people to go mind-numbingly fast a lot of the time and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and what could have been done better. Consistently fail as an organization and higher ups start asking what you are doing as an org to solve your problems. The company as a whole? If multiple organizations have the same failure then it's usually some systemic fault and something needs to be spun up to solve this problem for the organization.

      The problems that were nightmares 5 years ago are hardly cause for concern these days. At least, so I'm told, based on very specific conversations regarding such things.

      Capacity in a system that doesn't autoscale like EC2 services is going to be a sore spot. I'm sure someone was digging into the couch cushions or trading servers for emergency capacity. In fairness, six hours is fairly good because you have to consider... time to identify issue, time to identify bottleneck, time to locate new capacity if not available (there be constraints on type) and then provisioning/deployment time. Six hours with some horse trading seems about right and that assumes their services automatically make use of expanded pools. Retail was fairly flexible before I left. I hope they didn't introduce some new bug.

      Then again... they could have just introduced one singular autoscaling failure and it took that much time to patch and deploy a fix.

      The worst thing you can do in a crisis is react and make a change without fully assessing the change. Evil begets evil... so to speak.

    3. Re:The whole point of "prime day" by geekmux · · Score: 1

      ...This is simply Amazon creating quite possibly the world's largest single day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.

      "Currently out of capacity for scaling,"

      And I'm still scratching my head trying to figure out what the real problem was, because running out of hardware is kind of what the comment above implies, not that there was some glitch to fix in "new infrastructure code".

      And scale things up by hand? On Prime Day? That's kind of like telling a NASCAR pit crew they're gonna have to change tires on the car while it's still racing around the track.

    4. Re:The whole point of "prime day" by Anonymous Coward · · Score: 1

      People were not provisioning servers manually. They were triggering provisioning actions manually instead of relying on autoscaling. In particular, autoscaling actions are throttled to avoid rapid changes that were actually needed there.

      I recognize the quotes in the article - they are from the trouble ticket for the event (all Amazon was watching it, so no wonder it leaked). So let me just say, there was no one single cause of failure.
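
To make the throttling point concrete, here is a minimal sketch of an autoscaler whose automatic actions are rate-limited, next to a manually triggered action that bypasses the limit. Everything in it is hypothetical: the class name `ThrottledScaler`, the 10-host step limit and the 300-second cooldown are assumptions for illustration, not the real internal tooling.

```python
from dataclasses import dataclass


@dataclass
class ThrottledScaler:
    hosts: int
    max_step: int = 10           # hypothetical cap on hosts changed per automatic action
    cooldown_s: float = 300.0    # hypothetical minimum gap between automatic actions
    last_action_s: float = float("-inf")

    def auto_scale(self, target: int, now_s: float) -> int:
        """Move toward target, but only one bounded step per cooldown window."""
        if now_s - self.last_action_s < self.cooldown_s:
            return self.hosts  # still cooling down: no change
        # Clamp the change so the fleet never jumps by more than max_step.
        step = max(-self.max_step, min(self.max_step, target - self.hosts))
        if step != 0:
            self.hosts += step
            self.last_action_s = now_s
        return self.hosts

    def manual_scale(self, target: int, now_s: float) -> int:
        """Operator-triggered provisioning action: skips the throttle."""
        self.hosts = target
        self.last_action_s = now_s
        return self.hosts


if __name__ == "__main__":
    s = ThrottledScaler(hosts=100)
    print(s.auto_scale(250, now_s=0.0))      # one bounded step
    print(s.auto_scale(250, now_s=60.0))     # blocked by cooldown
    print(s.manual_scale(250, now_s=120.0))  # operator jumps straight to target
```

With these made-up numbers the automatic path would need 15 cooldown windows to add 150 hosts, which is the kind of gap a person triggering the action by hand closes in one step.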

    5. Re:The whole point of "prime day" by Anonymous Coward · · Score: 0

      People were not provisioning servers manually. They were triggering provisioning actions manually instead of relying on autoscaling. In particular, autoscaling actions are throttled to avoid rapid changes that were actually needed there.

      I recognize the quotes in the article - they are from the trouble ticket for the event (all Amazon was watching it, so no wonder it leaked). So let me just say, there was no one single cause of failure.

      "Looking at scavenging hardware."

      I do understand your feedback, but once again the implication points back to a single cause of failure: improper planning. Scavenging hardware is something a two-bit startup does when their marketing suddenly goes viral. That should not be Amazon on Prime Day.

    6. Re: The whole point of "prime day" by LordWabbit2 · · Score: 2

      Welcome to IT, a work colleague once said "we can do the impossible, but miracles take a bit longer".

      --
      There are three kinds of falsehood: the first is a 'fib,' the second is a downright lie, and the third is statistics.
    7. Re:The whole point of "prime day" by jellomizer · · Score: 1

      However, since Amazon has years of data and trending information, they should have been able to predict how much demand was going to happen, and made sure the infrastructure was prepped and ready for the event.

      It is kinda a high-stakes game to play for load testing.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    8. Re:The whole point of "prime day" by Anonymous Coward · · Score: 5, Interesting

      Recent Amazon employee here, posting AC for obvious reasons.

      The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.

      No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.

      They need to be able to test AWS with massive demand on unpredictable pages

      Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.

      The big difference this year is that something didn't work right

      Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden in order to scale up your fleet for prime day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.

      How'd that work out for you, Amazon?

    9. Re:The whole point of "prime day" by Anonymous Coward · · Score: 0

      I do understand your feedback, but once again the implication points back to a single cause of failure: improper planning. Scavenging hardware is something a two-bit startup does when their marketing suddenly goes viral. That should not be Amazon on Prime Day.

      Yes, you got it. They imposed a big new paperwork burden for this year's planning, with a bunch of extra work to prove you really needed a larger fleet for Prime Day. Clearly some teams erred on the side of "I'm sick of all this paperwork; I'm supposed to be writing software".

      It's hard enough to plan this shit in the first place, without also having to battle the bean counters.

    10. Re:The whole point of "prime day" by Wolfrider · · Score: 1

      --This does not appear to be a failure of Prime Day's IT team - it's a Failure of Manglement. Stress the people under you too hard and don't give them what they need, and this is the kind of chit that happens.

      --Amazon has been in business for a fairly long time - long enough for upper manglement to become stultified. I doubt they'll learn much from this debacle (but I'd be glad to be proven wrong.) They'll probably continue to treat their warehouse workers like chattel tho.

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
    11. Re:The whole point of "prime day" by Archangel+Michael · · Score: 1

      I would suggest that if you're averse to failure to the point of never really trying something difficult, you're already failing.

      There are a lot of things that teach us, but failure is one of the greatest teachers of all.

      Or as my dad used to say (probably stolen from elsewhere), "If you aren't failing, you're not trying hard enough."

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    12. Re:The whole point of "prime day" by Anonymous Coward · · Score: 0

      The issue is that Amazon doesn't ever really solve any "problems" that they encounter, they just barrel through and hope that they can outgrow the issue.

      Correct regarding blame, but someone WILL lose their job over it...watched it happen more than once. It'll be six months before it happens, but it'll happen.

    13. Re:The whole point of "prime day" by Anonymous Coward · · Score: 0

      Paperwork is now 90% of every job there :-/ In order to "identify gaps" they require thorough documentation of the problem which costs every employee literally hours every day. Once upon a time a good tech could get a lot of work done...now the best techs still can't perform half as well as the olde guarde simply because of the fucking TPS reports.

    14. Re:The whole point of "prime day" by Anonymous Coward · · Score: 0

      For example, if an S3 bucket namespace gets slammed it needs to be partitioned at the Index tier. In a sudden load situation the auto-partition can't keep up (or will choose sub-optimally since it doesn't have Application-specific knowledge) or will have a significant delay before recognizing the need for it. Partitions are expensive so they are avoided if the situation is simply a "blip". So manually triggering a partition is not unusual. During a partition, throughput is rather degraded, so further steps are taken to throttle transactions to that namespace till it's complete.

      I don't recognize SABLE as a subsystem but it sounds like much the same concepts are in play.

    15. Re:The whole point of "prime day" by imidan · · Score: 1

      The entire point of "prime day", ... is to test out their infrastructure.

      Interesting, and that makes sense. I was thinking about it the other day. I've never bought anything from Amazon on Prime Day, mainly because every time I look at the sale items, they seem to be a bunch of junk that I have no interest in or need for. I'd started to think of it as a typical "clearance" sale, that they were trying to make space in the warehouse (for upcoming Xmas) by ditching their leftover junk at low prices.

    16. Re:The whole point of "prime day" by ayesnymous · · Score: 1

      Their test ended up being an advertisement for Azure.

  2. Tried buying something all day by SmaryJerry · · Score: 1

    I don't know about anyone else but I couldn't buy anything for a solid 5 hours or more, checking intermittently, because it wouldn't let me check out. I did see a lot of Amazon dogs though.. like a lot..

    1. Re: Tried buying something all day by desdinova+216 · · Score: 1

      I thought they were all a continuation of a proud tradition of /. trolling. going all the way back to GNAA, golden girls, Apps, cows, hot Grits, Sublaxations, Bennett Haselton, Jonathan Katz...

  3. Didn't feel like capacity was the issue by SmaryJerry · · Score: 1

    The whole site worked for me but I couldn't check out. It didn't seem like a server capacity issue to me.

  4. They should switch to a scalable Infrastructure by burki · · Score: 2

    Like Amazon Web Services for example...

  5. not enough servers? by Anonymous Coward · · Score: 0

    wtf. if that was the case, not a very good example of aws' scaling capabilities, that's for sure..

    and how the fuck is cnbc getting 'internal' (and no doubt confidential, nda-protected shit) documents? bezos is gonna hang someone for that leak.

    1. Re:not enough servers? by Fnkmaster · · Score: 5, Informative

      This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.

      Source: I work at Amazon.

    2. Re:not enough servers? by Anonymous Coward · · Score: 0

      So this was lack of testing on a new system? Someone didn't bother with htload?

      Destination: Bin

    3. Re:not enough servers? by h33t+l4x0r · · Score: 1

      So there's a private AWS that works even worse than public AWS? Tell me more. And please subscribe me to your newsletter.

    4. Re:not enough servers? by Anonymous Coward · · Score: 0

      Think of it this way: Amazon.com existed before AWS, so to switch everything over (Load Balancers, CI/CD pipeline, repos, etc) would require enormous effort. What failed wasn’t AWS but the “legacy” systems Amazon.com built on AWS.

    5. Re:not enough servers? by Anonymous Coward · · Score: 0

      Except that AWS was built for internal use with the view to make it trivial to open up later. It runs on the same system.

      What is true is that Amazon will have dedicated servers running AWS purely for their own use, and priority over other AWS users for shared resources.

    6. Re:not enough servers? by Anonymous Coward · · Score: 3, Insightful

      AWS wasn’t built for amazon.com use. It was built with excess amazon.com capacity.

      That’s a significant difference.

    7. Re:not enough servers? by cascadingstylesheet · · Score: 2

      This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.

      Source: I work at Amazon.

      Why not? The private dog food tastes better?

    8. Re:not enough servers? by Anonymous Coward · · Score: 0

      AWS wasn’t built for internal use by amazon.com. It was built with excess capacity from amazon.com. AWS was a moonshot test of IaaS. And once it became large enough, Amazon.com realized it had to eat its own dog food.

    9. Re: not enough servers? by Anonymous Coward · · Score: 0

      What happened to planning for many different scenarios...like Amazon mentioned with regard to Brexit? Please Amazon...for humanity's sake, don't do space travel and not provision enough fuel to get into space...would hate to fall back to earth with someone in ground control saying...at least we have the "beta" disclaimer.

    10. Re:not enough servers? by Anonymous Coward · · Score: 1

      Except that AWS was built for internal use with the view to make it trivial to open up later. It runs on the same system.

      Nope. Pure propaganda from the early days of AWS. AWS was an external product from the beginning, with hopes that one day the retail side would be able to use it. Not so much.

      What is true is that Amazon will have dedicated servers running AWS purely for their own use, and priority over other AWS users for shared resources.

      Nope. All "reserved instances" have highest priority, and there's never really a problem with that capacity that lasted more than a couple minutes. EC2 "on-demand" instances come next, but they'll never terminate one to give it to anyone else. The only servers they'll take away from people are "Spot" instances, and they're up front about that.

    11. Re:not enough servers? by lgw · · Score: 2

      Why not? The private dog food tastes better?

      No, the non-AWS stuff is pure garbage. But it would take engineering effort to move to AWS, and management would have to fund that effort instead of their own pet projects.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    12. Re:not enough servers? by Anonymous Coward · · Score: 0

      Think of it this way: Amazon.com existed before AWS, so to switch everything over (Load Balancers, CI/CD pipeline, repos, etc) would require enormous effort. What failed wasn’t AWS but the “legacy” systems Amazon.com built on AWS.

      Well yah, no shit, that’s pretty much what anyone in ops thinks when some bright eyed fool in dev wants to migrate things into AWS.

    13. Re:not enough servers? by AuMatar · · Score: 1

      I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that. But it makes lots of sense to keep separate pools of servers for AWS vs Amazon.com, for both security and reliability.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    14. Re:not enough servers? by lgw · · Score: 1

      I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that.

      You should doubt more. It's all deep legacy stuff now, most of it the same systems that were innovative 15 years ago. Most of the rest of the world has moved on to implicitly auto-scaling container-based solutions, where dev teams never muck with "servers" in any way. Google (which is hardly leading edge these days) has been containerized for years. Even Azure offers self-scaling fleets with some abstraction away from explicit server types.

      And the stuff you remember was never made to work on AWS, so if you're trying to build a non-legacy system on AWS to get its benefits, it's an entirely different stack to learn from scratch.

      It's a really shocking failure of senior management IMO. They should have started 5 years ago with an even better AWS-based tech stack for deployments and fleet management, then forced everyone to move over. But that wouldn't have immediate payoff.

      --
      Socialism: a lie told by totalitarians and believed by fools.
  6. Shopping is now news. by Anonymous Coward · · Score: 0

    That's pretty sad.

  7. Joke's on you by Anonymous Coward · · Score: 1, Insightful

    Joke's on you, because most of the Amazon retail doesn't actually run on AWS. It uses its own deployment system, server management, data storage, etc.

  8. The real problem with prime day by Anonymous Coward · · Score: 2, Funny

    Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.

    Thanks, Obama.

    1. Re: The real problem with prime day by queBurro · · Score: 2

      but but Hilary

      --
      sag
  9. stupid consumer whore drones by Anonymous Coward · · Score: 0

    I didn't even know this stupid event was going on and couldn't care less some consumer whores missed out on some deals.

    1. Re:stupid consumer whore drones by DontBeAMoran · · Score: 1

      But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.

      --
      #DeleteFacebook
    2. Re:stupid consumer whore drones by drinkypoo · · Score: 1

      But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.

      Only if it was discounted on prime day; the only thing I was thinking of buying from Amazon on prime day turned out to be twenty bucks cheaper from walmart. And, of course, only if you actually managed to place your order, given that Amazon failed at reliability.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    3. Re:stupid consumer whore drones by Anonymous Coward · · Score: 0

      You care enough to waste your time posting on Slashdot about it... so therefore you were looking for "could care less" and not "couldn't care less."

      Just once I wish someone would get this turn of phrase correct.

    4. Re:stupid consumer whore drones by DontBeAMoran · · Score: 1

      Well, good for Walmart then. I can't even imagine the number of sales Amazon lost over those glitches.

      --
      #DeleteFacebook
    5. Re:stupid consumer whore drones by EETech1 · · Score: 1

      Here's all they need to know!
      https://www.youtube.com/watch?...

  10. Embarrassing by Anonymous Coward · · Score: 0

    Here's a company based solely on cloud solutions and web retail marketing who plans a day which always creates a burden from the start, especially on server demand. Maybe this should not reflect on the reliability of AWS to provide business with reliable solutions. But I would certainly be asking questions on why a company like Amazon created such a bad automated system for demand that obviously didn't work.

    1. Re:Embarrassing by Fly+Swatter · · Score: 1

      If you don't stress test what you have, how do you know it will work? They use this 'day' to fix any issues now so they don't have unexpected ones at actual important times of the year.

  11. It wasn't "riddled with glitches" by Anonymous Coward · · Score: 0

    It wasn't riddled with glitches. It was a total disaster and a complete embarrassment to the company.

  12. FElon called someone a PEDO to distract by Anonymous Coward · · Score: 0

    We all saw it. He's Bezos' little PEDO bitch.

  13. Classic rookie AWS provisioning mistake... by supremebob · · Score: 1

    One of the first things you need to do when setting up an environment in AWS is to get them to increase your (artificially low) server limits for each instance type you're planning on using. Otherwise, you're going to run into those limits at the worst possible time when you need to rapidly scale your servers.

    While I understand why they do this (probably to protect themselves from having someone spin up 1,000 cryptocoin mining instances with a hacked account), it's refreshing to see Amazon get bit by their own annoying provisioning decisions.

    1. Re:Classic rookie AWS provisioning mistake... by Anonymous Coward · · Score: 0

      it's refreshing to see Amazon get bit by their own annoying provisioning decisions.

      Amazon retail mostly doesn't use AWS, but rest assured, the internal provisioning decisions are far more annoying.

  14. You had to be patient by DontBeAMoran · · Score: 1

    I tried for hours to order the Amazon Fire 7" (8GB) for the low price of CAD$40, but the page kept changing. Sometimes it would be available, sometimes it would be disabled and only the 16GB was available, sometimes the 8GB option completely disappeared as if it didn't even exist, other times it was available from a third-party non-Amazon seller for nearly twice the price.

    It kept doing that every single time the page loaded and I was reloading it roughly once per second.

    What's also weird is that once every few minutes, when it was finally available again, the estimated delivery kept going up by about two weeks. In the end I was able to order it (8GB), but I'm guessing it's not even manufactured yet since my delivery date is mid-September.

    --
    #DeleteFacebook
    1. Re: You had to be patient by Anonymous Coward · · Score: 0

      You should have taken all of these glitches as a hint to not buy a piece of junk!

    2. Re: You had to be patient by DontBeAMoran · · Score: 1

      It's still probably the best low-cost "brand-name" tablet out there for Netflix. The Fire HD 7" specifications are at least twice as good as the crap available around here at twice that price.

      --
      #DeleteFacebook
  15. Re: They should switch to a scalable Infrastructru by xxxJonBoyxxx · · Score: 1

    I was thinking of something like Microsoft Azure or Google Cloud Platform. Maybe Amazon could hire some of their Cloud Consultants to figure out how to do it.

  16. Prime day is not about infrastructure. by nimbius · · Score: 2

    Prime day is a reaction to Alibaba and Aliexpress.com. They both generate nearly a trillion dollars of revenue across the world with 11/11 day sales. 11/11 day itself is a celebration called 'singles day' in China, where students started celebrating being single around 1993 on university campuses.

    Amazon day is a pointless branded knockoff Bezos hopes will generate just as much money. Assuming no one finds out about AliExpress and they somehow magically stop competing.

    --
    Good people go to bed earlier.
    1. Re:Prime day is not about infrastructure. by Anonymous Coward · · Score: 0

      Chinese Company for Chinese Jobs or American Company and American jobs, what would you choose? I know what I would choose.

      MAGA!

  17. Who would have thought by nospam007 · · Score: 4, Funny

    Amazon got slashdotted.

    1. Re:Who would have thought by bugs2squash · · Score: 2

      Slashdotted ! Ha, well played... I have to wonder if it was all bona fide customer demand that toppled it though.

      --
      Nullius in verba
  18. Honestly by MoralCharacter · · Score: 1

    The Prime Day thing is pretty skeezy - tons of no-name brand items whose prices were inflated for the sale day so they could "slash prices" and offer you the low low discounted price of what it normally sells at - but with the bigger price it never sold at crossed out. It was entirely an exercise in preying on people's gullibility, who saw these huge "discounts" and made impulse buys thinking this super short special shopping day was saving them money. And of course, you had to buy the prime membership in the first place. You're by no means saving money on prime - you're usually paying more for the same goods - all you're getting is the "free" two day shipping.

    1. Re:Honestly by Anonymous Coward · · Score: 0

      There were some decent computing components on sale. Plus it forced other retailers like Newegg and Walmart to add sales. Yeah, 80% of the crap didn't matter, but man o man, if you wanted something nice you probably could have gotten it on discount.