Slashdot Mirror


How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
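To make the auto-scaling idea above concrete, here is a minimal sketch of a scaler that sizes a fleet from observed traffic and reports when it runs out of spare hardware. It is purely illustrative and not Amazon's system: the function names, the 25% headroom, and the notion of a fixed pool of `spare_hosts` are all assumptions made for this example.

```python
import math
from typing import Tuple


def desired_hosts(requests_per_sec: float, per_host_capacity: float,
                  headroom: float = 0.25) -> int:
    """Hosts needed to serve the load plus a fractional safety margin.

    The 25% default headroom is an arbitrary illustrative value.
    """
    return math.ceil(requests_per_sec * (1 + headroom) / per_host_capacity)


def scale(current_hosts: int, requests_per_sec: float,
          per_host_capacity: float, spare_hosts: int) -> Tuple[int, int]:
    """Return (new_host_count, unmet_hosts).

    unmet_hosts > 0 means the hypothetical spare pool ran out, which is the
    point where operators would have to shed traffic or add hosts by hand.
    """
    target = desired_hosts(requests_per_sec, per_host_capacity)
    if target <= current_hosts:
        # Load fits in the current fleet: hold or scale in to the target.
        return target, 0
    # Scale out, but only as far as the spare pool allows.
    granted = min(target - current_hosts, spare_hosts)
    return current_hosts + granted, target - current_hosts - granted
```

With 10 hosts, 1,000 requests per second, 100 requests per second per host, and only 2 spare hosts, `scale(10, 1000, 100, 2)` grows the fleet to 12 and leaves 1 host of demand unmet. That shortfall is the "out of capacity for scaling" situation the status update describes.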

3 of 69 comments

  1. The whole point of "prime day" by darkain · · Score: 4, Interesting

    The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it? Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up by hand.

    1. Re:The whole point of "prime day" by Anonymous Coward · · Score: 4, Interesting

      That sounds good and that's probably what used to happen.

      Last year (when I was still employed by them) Prime day was a VERY big deal; they count on that toward the value of their stocks.

      Unfortunately, Amazon is a very reactionary company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but they were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.

      I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.

    2. Re:The whole point of "prime day" by Anonymous Coward · · Score: 5, Interesting

      Recent Amazon employee here, posting AC for obvious reasons.

      The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.

      No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.

      They need to be able to test AWS with massive demand on unpredictable pages

      Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.

      The big difference this year is that something didn't work right

      Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden in order to scale up your fleet for prime day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.

      How'd that work out for you, Amazon?