How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)
Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.
Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Matthew Caesar, a computer science professor at the University of Illinois and one of the outside experts who reviewed the documents, says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increasing the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
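To make the auto-scaling idea concrete, here is a minimal sketch of a target-utilization scaler that hits a hardware ceiling, roughly the situation the "Currently out of capacity for scaling" update describes. It is not Amazon's system: the function names, the 60% utilization target, and all traffic figures are hypothetical and chosen only for illustration.

```python
def desired_hosts(requests_per_sec, per_host_capacity, target_utilization_pct=60):
    """Hosts needed so each host runs at or below the target utilization.

    All values are integers; the 60% target is an assumed, illustrative figure.
    """
    numerator = requests_per_sec * 100
    denominator = per_host_capacity * target_utilization_pct
    return -(-numerator // denominator)  # integer ceiling division


def scale(current_hosts, requests_per_sec, per_host_capacity, spare_hosts):
    """Return (new_host_count, shortfall).

    spare_hosts is how many more hosts the pool can still hand out.
    A shortfall above zero means the scaler wants more hosts than exist,
    so someone has to find hardware by hand or shed traffic.
    """
    wanted = desired_hosts(requests_per_sec, per_host_capacity)
    ceiling = current_hosts + spare_hosts
    new_count = min(wanted, ceiling)
    shortfall = max(0, wanted - ceiling)
    return new_count, shortfall


if __name__ == "__main__":
    # Normal day: demand fits inside the existing fleet.
    print(scale(current_hosts=100, requests_per_sec=6000,
                per_host_capacity=100, spare_hosts=150))
    # Traffic surge: demand exceeds everything the pool can supply.
    print(scale(current_hosts=100, requests_per_sec=30000,
                per_host_capacity=100, spare_hosts=150))
```

In the second call the scaler asks for more hosts than the pool holds, so automation alone cannot close the gap. The remaining options are the ones the documents describe: adding hosts manually, scavenging hardware, or cutting load such as international traffic.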
Joke's on you, because most of Amazon's retail operation doesn't actually run on AWS. It uses its own deployment system, server management, data storage, etc.
AWS wasn’t built for amazon.com use. It was built with excess amazon.com capacity.
That’s a significant difference.