Amazon EC2 May Be Experiencing Growing Pains
1sockchuck writes "Some developers using Amazon EC2 are wondering aloud whether the popularity of the cloud computing service is beginning to affect its performance. Amazon this week denied speculation that it was experiencing capacity problems after a veteran developer reported performance issues and suggested that EC2 might be oversubscribed. Meanwhile, a cloud monitoring service published charts showing increased latency on EC2 in recent weeks. The reports follow an incident over the holidays in which a DDoS on a DNS provider slowed Amazon's retail and cloud operations."
Why not say "Yes, we're way too popular. We're adding capacity as quickly as we can, but people are just lapping up our service!"
This seems like a missed marketing opportunity.
The ______ Agenda
When the news came around for EC2's DDoS around Christmas, I remembered reading how Amazon began offering their services to third parties in the first place. Turns out Amazon has a sudden peak of traffic around shopping holidays and particularly Christmas.
To prepare for that, they have added enough hardware to handle the peak, but that hardware went unused the rest of the year. So they started leasing it to third parties in the form of their web services.
This immediately makes you think: OK, what happens to their ability to handle the third-party apps around Christmas, when they need a lot more hardware to handle Amazon.com's own traffic? And then this DDoS happened, which importantly overloaded not the actual app servers, but the DNS servers pointing to the app servers. As a result, the app servers experienced lower traffic for third-party sites than they would have otherwise.
It's making me think, and this is of course just speculation, that this may not have been a genuine attack so much as a stunt to lessen the overload of their cloud services that they knew they'd experience around Christmas, while having a plausible explanation for the downtime that blames it on a malicious third party.
Reading that they have indeed had (and still have) performance issues supports that speculation.
Just the other day I came across Amazon's marketing materials explaining the benefits of EC2, in which they show a pretty graph of your datacenter capacity vs. demand over time. EC2 is supposed to scale up right alongside the demand for services. But Amazon has to use the traditional datacenter model to support EC2; it doesn't have the luxury of its datacenter automatically scaling up with demand. It is inexpensive for us customers to scale up our EC2 services, but relatively expensive for Amazon to decide to add a bunch of new servers or upgrade a bunch of existing servers in their datacenter, or maybe even add a new datacenter.
If I were Amazon and started noticing increased latency, after checking to make sure everything was in fact functioning properly, I would probably wait to see if the spike in usage is just temporary or if it will be sustained enough to warrant an increase in datacenter capacity.
Amazon needs to move their cloud into space. Yes, space! It's the next big frontier beyond clouds, and you heard it here first.
They found the person responsible - a stuffed bear covered in soot that had hovered under the honey tree via a balloon while singing.
Wouldn't want any of my important data stored on a system which has performance issues...
Or having to wait significantly longer than I would storing my data locally!
i priced out a high memory config and it's like $6000 per year or more for 32GB of RAM and 8 CPU cores. In a few months Intel will ship server CPUs with 12 logical cores per socket. RAM prices are dirt cheap and at current prices a 36GB RAM HP ProLiant DL380 G6 will run around $13,000 and 72GB of RAM another $2000. and that includes 5 year 4 hour response time support, some of the other extras like advanced iLO, and i forgot what else i added since it's so cheap.
add in the increased bandwidth costs and the supposed cost savings vanish. it's like the ghetto people that lease a lexus or a Benz because they can't afford to buy or they like the lower monthly payments. it's like 2000 all over again. hardware is expensive, so ASPs set up shop. hardware prices drop for the power you get and ASPs go out of business.
and i think this is a scam by the hardware companies. i buy an HP server i buy one machine and a few hard drives. to support me Amazon needs to buy a few servers and 5 times the raw space for DR purposes.
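To make the arithmetic in the comment above concrete, here is a rough break-even sketch in Python. It uses only the two figures quoted above ($6000 per year rented, about $13,000 plus $2000 purchased) and assumes the $2000 is the extra cost of going to 72GB. It deliberately ignores bandwidth, power, hosting, and admin time, and the rented and purchased machines are not identical configurations, so treat the result as a back-of-the-envelope number. The function names are made up for illustration.

```python
def cumulative_costs(years, rental_per_year, purchase_price):
    """Return (year, rented_total, owned_total) tuples for each year."""
    rows = []
    for year in range(1, years + 1):
        # Renting accrues every year; the purchase is a one-time cost.
        rows.append((year, rental_per_year * year, purchase_price))
    return rows


def break_even_years(rental_per_year, purchase_price):
    """Years of rental after which buying would have cost the same."""
    return purchase_price / rental_per_year


# Figures quoted in the comment above; all other costs are ignored.
EC2_PER_YEAR = 6000           # "$6000 per year or more" for 32GB / 8 cores
SERVER_PRICE = 13000 + 2000   # assumed: 36GB server plus the 72GB upgrade

if __name__ == "__main__":
    for year, rented, owned in cumulative_costs(5, EC2_PER_YEAR, SERVER_PRICE):
        print(f"year {year}: rented ${rented}, owned ${owned}")
    print("break-even after", break_even_years(EC2_PER_YEAR, SERVER_PRICE), "years")
```

On these numbers alone the purchase pays for itself partway through the third year of a five-year support contract; adding staffing, power, or a second machine for redundancy moves that point later.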
Uh... How else do you think they make money?
We use EC2 as the back-end for Netalyzr (our free, applet-based network testing and debugging service), and right now are in the middle of a minor flashcrowd with our big updated release. No recent glitches we've noticed, with long running small instances.
Test your net with Netalyzr
Okay, you spend well over $10,000 to get a great machine. Then you need to pay for maintenance. Train or hire someone to maintain it or buy the service from a third party... That costs extra. But even ignoring that... What if your company is planning a TV advertisement campaign and expects triple the strain on all public systems for a few months? Buy more hardware? Lease, perhaps? And what if there is some sort of an accident (be it fire, flooding, anything)? You suddenly need to spend over $10,000 again in addition to everything else you need to fix. (Alternatively, the server room needs to be made safe from all such hazards, which might cost extra.) And what if there is a sudden large peak in the traffic? Or alternatively, your business dwindles and then you have spent unnecessarily much on the hardware...
Trade all that for a roughly flat-rate service that is easy to scale up or down as needed. Despite all the problems that cloud computing has, I can understand why many executives choose it, despite the risk that they might end up spending more in the long run.
I seem to recall a Slashdot post titled "Amazon Introduces Bidding For EC2 Compute Time." This announcement took place on 12/14/2009, which coincides with the increase in average ping latency illustrated in Cloudkick's chart. Was Amazon unprepared for the increase in demand created by auctioning off the unused EC2 capacity?
I am sure that people came up with some pretty creative things to do with low-priced EC2 capacity.
Trinity Rescue Kit is a network-boot/CD-boot Linux that reads and writes NTFS etc.
We use it here to image and deimage Windows systems; it takes ~10 minutes boot-to-boot to bring up a raw Windows system in a known state.
I keep a small reserved instance running 24x7 and the cost is very low. I also have an EBS-bootable large instance that I run for a few hours at a time as needed. It has been a while since I used it, but Elastic MapReduce also works well and is fairly inexpensive for what you get.
About half of my customers also use EC2.
(Note: Amazon gave me a large grant to use EC2 for free for work on my last book, but my comments are my honest opinions.)
The cloud providers will have growing pains for years to come. However, cloud is a much better choice than the overhead of building and running your own data center.
Google's App Engine is so over-sold it can take 20 seconds for a page to load.
If your app takes 20 seconds to deliver a page then it's your fault. My app consistently delivers dynamic, multi-hundred kilobyte pages in 1-2 seconds anywhere in the US. See for yourself! www.TwitGrids.com
If you code for App Engine like it's Rails or Django your app will be a dog.
"Liechtenstein is the world's largest producer of sausage casings, potassium storage units, and false teeth."
Amazon's instance types page (http://aws.amazon.com/ec2/instance-types/) doesn't seem to indicate the number of cycles/sec you are guaranteed per type.
They sell instance types based on physical hardware specs, which are worthless in a cloud architecture.
What they should really be doing is stating the number of cycles/sec an instance type is GUARANTEED and then enforcing it.
If the customer doesn't use that number of cycles/sec, then fine, put the idle cycles up for bidding.
Just my $0.02
Ben
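The proposal above (a guaranteed number of cycles/sec per instance, with unused cycles put up for bidding) can be sketched as a toy allocator. This is only an illustration of the commenter's idea, not a description of how EC2 actually schedules anything; the function name, the dictionaries, and the bid values are all hypothetical.

```python
def allocate(capacity, guaranteed, demand, bids):
    """Split a host's cycles/sec among instances.

    capacity   -- total cycles/sec the host can deliver
    guaranteed -- {instance: cycles/sec it is guaranteed}
    demand     -- {instance: cycles/sec it currently wants}
    bids       -- {instance: price offered per extra cycle/sec}
    """
    if sum(guaranteed.values()) > capacity:
        # Enforcing guarantees means never promising more than exists.
        raise ValueError("guarantees exceed host capacity (oversubscribed)")

    # Step 1: every instance gets what it asks for, up to its guarantee.
    grant = {name: min(demand.get(name, 0), g) for name, g in guaranteed.items()}
    idle = capacity - sum(grant.values())

    # Step 2: idle cycles go to bidders, highest bid first.
    for name in sorted(bids, key=bids.get, reverse=True):
        want = demand.get(name, 0) - grant.get(name, 0)
        extra = max(0, min(want, idle))
        grant[name] = grant.get(name, 0) + extra
        idle -= extra
    return grant
```

For example, with a capacity of 100, guarantees of 40 each for `a` and `b`, demand of 10 for `a` and 70 for `b`, and `b` bidding for extra cycles, `b` keeps its guaranteed 40 and picks up 30 of the cycles `a` left idle.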
Does the system just feel slow or has it been measured as such? Which resources are being starved? CPU? Disk? Network? Memory? Has anyone done any benchmarking to see what the actual vs. theoretical numbers are? What tools are being used? Collectl provides a pretty good high-level summary.
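As one way to turn "feels slow" into numbers for the network case, here is a minimal Python sketch that times TCP connects and summarizes the samples. It is not a substitute for a tool like collectl and says nothing about CPU, disk, or memory; the host and port in the usage comment are placeholders, and the helper names are made up for illustration.

```python
import socket
import statistics
import time


def connect_latency_ms(host, port, timeout=5.0):
    """Time a single TCP connect to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Return min, median, nearest-rank 95th percentile and max."""
    ordered = sorted(samples_ms)
    # Nearest rank: ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Usage (placeholder host; run from the instance you suspect is slow):
#   samples = [connect_latency_ms("example.com", 80) for _ in range(20)]
#   print(summarize(samples))
```

Collecting the same summary at intervals over several days gives a baseline to compare against, which is more useful than a single reading.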