How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)
Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.
Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
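For readers unfamiliar with the mechanism described above, here is a minimal sketch of what an auto-scaler decides on each pass: how many hosts the current traffic calls for, and how many it can actually get. It is illustrative only and does not describe Amazon's internal system. The function name, the requests-per-second figures, and the `max_hosts` ceiling are all assumptions.

```python
from typing import Tuple


def plan_capacity(total_rps: int, target_rps_per_host: int, max_hosts: int) -> Tuple[int, int]:
    """Return (hosts_granted, shortfall) for the observed traffic.

    total_rps           -- observed requests per second across the fleet (hypothetical metric)
    target_rps_per_host -- load each host should carry when the fleet is healthy
    max_hosts           -- hard ceiling on hosts available to this fleet (hypothetical)
    """
    # Integer ceiling division: enough hosts that none exceeds the target load.
    wanted = max(1, -(-total_rps // target_rps_per_host))
    # The scaler can only grant what the pool actually has.
    granted = min(wanted, max_hosts)
    return granted, wanted - granted


if __name__ == "__main__":
    # Enough headroom: the fleet scales to the desired size.
    print(plan_capacity(90_000, 600, 1_000))
    # Ceiling reached: a non-zero shortfall is the "out of capacity for scaling" case.
    print(plan_capacity(90_000, 600, 120))
```

A non-zero shortfall is the point at which someone has to find hardware by other means, as the status updates quoted in the summary describe.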
The entire point of "Prime Day", which actually started many years ago with a massive sale selling Xbox consoles for $100 each, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it? Seriously, name a local news channel NOT hyping the Prime Day event. This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up by hand.
I don't know about anyone else but I couldn't buy anything for a solid 5 hours or more, checking intermittently, because it wouldn't let me check out. I did see a lot of Amazon dogs though.. like a lot..
The whole site worked for me but I couldn't check out. It didn't seem like a server capacity issue to me.
Like Amazon Web Services for example...
This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.
Source: I work at Amazon.
Joke's on you, because most of the Amazon retail doesn't actually run on AWS. It uses its own deployment system, server management, data storage, etc.
Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.
Thanks, Obama.
So there's a private AWS that works even worse than public AWS? Tell me more. And please subscribe me to your newsletter.
One of the first things you need to do when setting up an environment in AWS is to get them to increase your (artificially low) server limits for each instance type you're planning on using. Otherwise, you're going to run into those limits at the worst possible time when you need to rapidly scale your servers.
While I understand why they do this (probably to protect themselves from having someone spin up 1,000 cryptocoin mining instances with a hacked account), it's refreshing to see Amazon get bit by their own annoying provisioning decisions.
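The pre-flight check described above can be sketched as a comparison between the peak you plan for and the limits currently on the account. This is a minimal illustration, not a call into any AWS API: the function name and the numbers are made up, and in practice you would read the limits from your provider's console or quota tooling.

```python
from typing import Dict


def limits_to_raise(planned_peak: Dict[str, int], current_limits: Dict[str, int]) -> Dict[str, int]:
    """Return, per instance type, how many more instances the limit must allow.

    planned_peak   -- instances of each type you expect to need at peak (assumed figures)
    current_limits -- limit currently granted per type (assumed figures)
    """
    gaps = {}
    for instance_type, needed in planned_peak.items():
        allowed = current_limits.get(instance_type, 0)
        if needed > allowed:
            # Request this increase well before the traffic arrives.
            gaps[instance_type] = needed - allowed
    return gaps


if __name__ == "__main__":
    peak = {"m5.large": 200, "c5.xlarge": 10}
    limits = {"m5.large": 20, "c5.xlarge": 20}
    print(limits_to_raise(peak, limits))
```

Running a check like this ahead of a planned event turns a surprise at scale-out time into a support ticket filed in advance.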
AWS wasn’t built for amazon.com use. It was built with excess amazon.com capacity.
That’s a significant difference.
This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.
Source: I work at Amazon.
Why not? The private dog food tastes better?
I tried for hours to order the Amazon Fire 7" (8GB) for the low price of CAD$40, but the page kept changing. Sometimes it would be available, sometimes it would be disabled and only the 16GB was available, sometimes the 8GB option completely disappeared as if it didn't even exist, other times it was available from a third-party non-Amazon seller for nearly twice the price.
It kept doing that every single time the page loaded and I was reloading it roughly once per second.
What's also weird is that once every few minutes, when it was finally available again, the estimated delivery kept going up by about two weeks. In the end I was able to order it (8GB), but I'm guessing it's not even manufactured yet since my delivery date is mid-September.
#DeleteFacebook
But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.
#DeleteFacebook
I was thinking of something like Microsoft Azure or Google Cloud Platform. Maybe Amazon could hire some of their Cloud Consultants to figure out how to do it.
Prime Day is a reaction to Alibaba and Aliexpress.com. They both generate nearly a trillion dollars of revenue across the world with 11/11 day sales. 11/11 day itself is a celebration called 'Singles Day' in China, where students started celebrating being single around 1993 on university campuses.
Amazon day is a pointless branded knockoff Bezos hopes will generate just as much money. Assuming no one finds out about Aliexpress and they somehow magically stop competing.
Good people go to bed earlier.
Except that AWS was built for internal use with the view to make it trivial to open up later. It runs on the same system.
Nope. Pure propaganda from the early days of AWS. AWS was an external product from the beginning, with hopes that one day the retail side would be able to use it. Not so much.
What is true is that Amazon will have dedicated servers running AWS purely for their own use, and priority over other AWS users for shared resources.
Nope. All "reserved instances" have highest priority, and there's never really a problem with that capacity that lasted more than a couple minutes. EC2 "on-demand" instances come next, but they'll never terminate one to give it to anyone else. The only servers they'll take away from people are "Spot" instances, and they're up front about that.
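The reclaim rule as this comment describes it can be sketched in a few lines: when capacity has to be taken back, only Spot instances are eligible, and reserved and on-demand instances are left alone. This models the commenter's description only, not AWS's actual scheduler. The function name and the instance records are hypothetical.

```python
from typing import List, Tuple


def pick_instances_to_reclaim(fleet: List[Tuple[str, str]], needed: int) -> List[str]:
    """Return ids of instances to interrupt, never more than `needed`.

    fleet  -- (instance_id, purchase_type) pairs, where purchase_type is one of
              "reserved", "on-demand", or "spot" (hypothetical record shape)
    needed -- number of instances' worth of capacity to take back
    """
    # Only Spot instances may be interrupted; the other types are never touched.
    spot_ids = [instance_id for instance_id, kind in fleet if kind == "spot"]
    return spot_ids[:needed]


if __name__ == "__main__":
    fleet = [("a", "reserved"), ("b", "spot"), ("c", "on-demand"), ("d", "spot")]
    # Asking for three only yields the two Spot instances.
    print(pick_instances_to_reclaim(fleet, 3))
```

If fewer Spot instances exist than are needed, the function returns what it can rather than touching the other purchase types, which matches the "they'll never terminate one" claim.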
Why not? The private dog food tastes better?
No, the non-AWS stuff is pure garbage. But it would take engineering effort to move to AWS, and management would have to fund that effort instead of their own pet projects.
Socialism: a lie told by totalitarians and believed by fools.
But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.
Only if it was discounted on Prime Day; the only thing I was thinking of buying from Amazon on Prime Day turned out to be twenty bucks cheaper from Walmart. And, of course, only if you actually managed to place your order, given that Amazon failed at reliability.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Amazon got slashdotted.
I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that. But it makes lots of sense to keep separate pools of servers for AWS vs Amazon.com, for both security and reliability.
I still have more fans than freaks. WTF is wrong with you people?
If you don't stress test what you have, how do you know it will work? They use this 'day' to fix any issues now so they don't have unexpected ones at actual important times of the year.
I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that.
You should doubt more. It's all deep legacy stuff now, most of it the same systems that were innovative 15 years ago. Most of the rest of the world has moved on to implicitly auto-scaling container-based solutions, where dev teams never muck with "servers" in any way. Google (which is hardly leading edge these days) has been containerized for years. Even Azure offers self-scaling fleets with some abstraction away from explicit server types.
And the stuff you remember was never made to work on AWS, so if you're trying to build a non-legacy system on AWS to get its benefits, it's an entirely different stack to learn from scratch.
It's a really shocking failure of senior management IMO. They should have started 5 years ago with an even better AWS-based tech stack for deployments and fleet management, then forced everyone to move over. But that wouldn't have immediate payoff.
Socialism: a lie told by totalitarians and believed by fools.
The Prime Day thing is pretty skeezy - tons of no-name brand items whose prices were inflated for the sale day so they could "slash prices" and offer you the low low discounted price of what it normally sells at - but with the bigger price it never sold at crossed out. It was entirely an exercise in preying on people's gullibility, who saw these huge "discounts" and made impulse buys thinking this super short special shopping day was saving them money. And of course, you had to buy the Prime membership in the first place. You're by no means saving money on Prime - you're usually paying more for the same goods - all you're getting is the "free" two day shipping.
Well, good for Walmart then. I can't even imagine the number of sales Amazon lost over those glitches.
#DeleteFacebook
Here's all they need to know!
https://www.youtube.com/watch?...