Slashdot Mirror


How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar, one of the outside experts who reviewed the documents, says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
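
For readers unfamiliar with the term, here is a rough sketch of what an auto-scaling decision looks like, and of what "out of capacity for scaling" means for it. This is not Amazon's system: the function names, the 25% headroom figure, and the idea of a single free hardware pool are all assumptions made up for illustration.

```python
import math
from typing import Tuple


def desired_hosts(requests_per_sec: float, per_host_capacity: float,
                  headroom: float = 0.25) -> int:
    """Hosts needed to serve the load plus spare headroom (hypothetical rule)."""
    if per_host_capacity <= 0:
        raise ValueError("per_host_capacity must be positive")
    return max(1, math.ceil(requests_per_sec * (1 + headroom) / per_host_capacity))


def scale(current_hosts: int, requests_per_sec: float,
          per_host_capacity: float, free_pool: int) -> Tuple[int, int]:
    """Return (new_host_count, shortfall).

    free_pool is the number of spare hosts available to claim. A nonzero
    shortfall is the "out of capacity for scaling" case: the scaler knows
    it needs more hosts but there is no hardware left to give it.
    """
    target = desired_hosts(requests_per_sec, per_host_capacity)
    if target <= current_hosts:
        return target, 0  # load fits: hold or scale down to the target
    needed = target - current_hosts
    granted = min(needed, free_pool)
    return current_hosts + granted, needed - granted


# Traffic spike with plenty of spare hardware: scale straight to the target.
print(scale(100, 10_000, 50, free_pool=1000))  # (250, 0)
# Same spike with a nearly empty pool: 50 hosts short.
print(scale(100, 10_000, 50, free_pool=100))   # (200, 50)
```

In the second call the scaling logic itself behaves correctly and still falls short, which is why an empty hardware pool and a broken auto-scaler can look alike from the outside.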

15 of 69 comments

  1. The whole point of "prime day" by darkain · · Score: 4, Interesting

    The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it. Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually.

    1. Re:The whole point of "prime day" by Anonymous Coward · · Score: 4, Interesting

      That sounds good and that's probably what used to happen.

      Last year (when I was still employed by them) Prime day was a VERY big deal; they count on that toward the value of their stocks.

      Unfortunately, Amazon is a very reactive company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but they were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.

      I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.

    2. Re:The whole point of "prime day" by Anonymous Coward · · Score: 2, Informative

      That is rarely the case.

      I have never seen Amazon, or specifically EC2, to be an organization of blame. The reality is they ask people to go mind-numbingly fast a lot of the time and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and what could have been done better. Consistently fail as an organization and higher ups start asking what you are doing as an org to solve your problems. The company as a whole? If multiple organizations have the same failure then it's usually some systemic fault and something needs to be spun up to solve this problem for the organization.

      The problems that were nightmares 5 years ago are hardly cause for concern these days. At least, so I'm told, based on very specific conversations regarding such things.

      Capacity in a system that doesn't autoscale like EC2 services is going to be a sore spot. I'm sure someone was digging into the couch cushions or trading servers for emergency capacity. In fairness, six hours is fairly good because you have to consider... time to identify issue, time to identify bottleneck, time to locate new capacity if not available (there be constraints on type) and then provisioning/deployment time. Six hours with some horse trading seems about right and that assumes their services automatically make use of expanded pools. Retail was fairly flexible before I left. I hope they didn't introduce some new bug.

      Then again... they could have just introduced one singular autoscaling failure and it took that much time to patch and deploy a fix.

      The worst thing you can do in a crisis is react and make a change without fully assessing the change. Evil begets evil... so to speak.

    3. Re: The whole point of "prime day" by LordWabbit2 · · Score: 2

      Welcome to IT. A work colleague once said, "we can do the impossible, but miracles take a bit longer".

      --
      There are three kinds of falsehood: the first is a 'fib,' the second is a downright lie, and the third is statistics.
    4. Re:The whole point of "prime day" by Anonymous Coward · · Score: 5, Interesting

      Recent Amazon employee here, posting AC for obvious reasons.

      The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.

      No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.

      They need to be able to test AWS with massive demand on unpredictable pages

      Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.

      The big difference this year is that something didn't work right

      Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden in order to scale up your fleet for prime day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.

      How'd that work out for you, Amazon?

  2. They should switch to a scalable Infrastructure by burki · · Score: 2

    Like Amazon Web Services for example...

  3. Re:not enough servers? by Fnkmaster · · Score: 5, Informative

    This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.

    Source: I work at Amazon.

  4. The real problem with prime day by Anonymous Coward · · Score: 2, Funny

    Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.

    Thanks, Obama.

    1. Re: The real problem with prime day by queBurro · · Score: 2

      but but Hillary

      --
      sag
  5. Re:not enough servers? by Anonymous Coward · · Score: 3, Insightful

    AWS wasn’t built for amazon.com use. It was built with excess amazon.com capacity.

    That’s a significant difference.

  6. Re:not enough servers? by cascadingstylesheet · · Score: 2

    This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.

    Source: I work at Amazon.

    Why not? The private dog food tastes better?

  7. Prime day is not about infrastructure. by nimbius · · Score: 2

    Prime day is a reaction to Alibaba and Aliexpress.com. They both generate nearly a trillion dollars of revenue across the world with 11/11 day sales. 11/11 day itself is a celebration called 'singles day' in China, where students started celebrating being single around 1993 on university campuses.

    Amazon day is a pointless branded knockoff Bezos hopes will generate just as much money. Assuming no one finds out about aliexpress and they somehow magically stop competing.

    --
    Good people go to bed earlier.
  8. Re:not enough servers? by lgw · · Score: 2

    Why not? The private dog food tastes better?

    No, the non-AWS stuff is pure garbage. But it would take engineering effort to move to AWS, and management would have to fund that effort instead of their own pet projects.

    --
    Socialism: a lie told by totalitarians and believed by fools.
  9. Who would have thought by nospam007 · · Score: 4, Funny

    Amazon got slashdotted.

    1. Re:Who would have thought by bugs2squash · · Score: 2

      Slashdotted! Ha, well played... I have to wonder if it was all bona fide customer demand that toppled it though.

      --
      Nullius in verba