Slashdot Mirror


How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. [Matthew] Caesar [a computer science professor at the University of Illinois who reviewed the documents] says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
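
For readers unfamiliar with the term, the kind of auto-scaling rule described above can be sketched roughly as follows. This is a generic proportional rule under stated assumptions, not a description of Amazon's internal systems: the function name `plan_capacity`, its parameters, and the numbers in the usage lines are all hypothetical. The sketch also shows why an exhausted pool ("Currently out of capacity for scaling") leaves a shortfall that no scaling rule can close on its own.

```python
import math


def plan_capacity(current_hosts, load_per_host, target_per_host, available_hosts):
    """Return (granted_hosts, shortfall) for a simple proportional scaling rule.

    All names are hypothetical. The rule sizes the fleet so that the
    observed total load, spread evenly, lands at the target load per host.
    `available_hosts` is the total pool the fleet may grow into.
    """
    if current_hosts <= 0 or target_per_host <= 0 or available_hosts < 0:
        raise ValueError("hosts and target load must be positive")

    # Total load currently being served across the fleet.
    total_load = current_hosts * load_per_host

    # Hosts needed to bring per-host load back down (or up) to the target.
    wanted = math.ceil(total_load / target_per_host)

    # The rule can only grant what the pool actually contains.
    granted = min(wanted, available_hosts)

    # A positive shortfall means the pool is exhausted and someone has to
    # find hardware elsewhere or shed traffic.
    return granted, wanted - granted


# Moderate spike: the pool covers it.
print(plan_capacity(100, 90, 60, 200))

# Large spike: demand exceeds the pool, leaving a shortfall.
print(plan_capacity(100, 180, 60, 200))
```

In the second call the rule asks for more hosts than the pool holds, which is the situation where manual intervention or cutting traffic becomes the only lever left.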

2 of 69 comments

  1. Re:The whole point of "prime day" by Anonymous Coward · Score: 2, Informative

    That is rarely the case.

    I have never seen Amazon, or EC2 specifically, be an organization of blame. The reality is that they ask people to go mind-numbingly fast a lot of the time, and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and on what could have been done better. Consistently fail as an organization and higher-ups start asking what you are doing as an org to solve your problems. The company as a whole? If multiple organizations have the same failure, then it's usually some systemic fault, and something needs to be spun up to solve the problem across those organizations.

    The problems that were nightmares 5 years ago are hardly cause for concern these days. At least, so I'm told, based on very specific conversations regarding such things.

    Capacity in a system that doesn't autoscale the way EC2 services do is going to be a sore spot. I'm sure someone was digging into the couch cushions or trading servers for emergency capacity. In fairness, six hours is fairly good, because you have to consider: time to identify the issue, time to identify the bottleneck, time to locate new capacity if none is available (there are constraints on type), and then provisioning/deployment time. Six hours with some horse trading seems about right, and that assumes their services automatically make use of expanded pools. Retail was fairly flexible before I left. I hope they didn't introduce some new bug.

    Then again... they could have just introduced a single autoscaling failure, and it took that much time to patch and deploy a fix.

    The worst thing you can do in a crisis is react and make a change without fully assessing the change. Evil begets evil... so to speak.

  2. Re:not enough servers? by Fnkmaster · Score: 5, Informative

    This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.

    Source: I work at Amazon.