EC2 Outage Shows How Much the Net Relies On Amazon
An anonymous reader writes "Much has been written about the recent EC2/EBS outage, but Keir Thomas at PC World has a different take: it's shown how much cutting-edge Internet infrastructure relies on Amazon, and we should be grateful. Quoting: 'Amazon is a personification of the spirit of the Internet, which is one of true democracy, access to the means of distribution, and rapid evolution.'"
An article at O'Reilly comes to a similarly positive conclusion from a different angle.
This article seems to be an apology for Amazon.
Basically it says "We went down, and took down lots of important stuff. That shows just how important we are and that lots of people use us. Thus, our cloud is a good thing."
The logic of that doesn't quite work.
I agree that it's a useful tool, but there are a lot of things that don't make sense to put in the cloud.
I guess what we should learn from this is to put your failover in separate regions, not separate availability zones?
Big companies that have decided to put crucial operations on Amazon computers are apt to pay up for the equivalent of computing insurance, analysts say. Netflix, the movie rental site, has become a large customer of the Amazon cloud. Most of its Web technology — customer movie queues, search tools and the like — runs in Amazon data centers.
Netflix said it had sailed through the last couple of days unscathed. “That’s because Netflix has taken full advantage of Amazon Web Services’ redundant cloud architecture,” which insures against technical malfunctions in any one location, said Steve Swasey, a Netflix spokesman.
Sounds like it worked for some.
Totally concur with others pointing out Amazon offers redundancy if you choose to use it.
We had webservers, a database (master/slave), and other services split across usa-east and usa-west.
When usa-east started showing problems, we:
*) Took the usa-east webservers out of round robin DNS (ttl 1hr)
*) Verified the slave (in usa-west) was up to date, shut down the master (usa-east), and converted the slave to master.
*) Updated all webservers to point to the new master.
*) Cranked up new usa-west webservers / updated round robin DNS
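For anyone who wants to script the same sequence, here is a rough sketch of the four steps above as plain Python over an in-memory model. Everything in it is made up for illustration: the `Deployment` class, the host names, and the `slave_lag_seconds` field are assumptions, and it calls no real AWS or DNS API. In practice each step would be a call into whatever DNS provider and database tooling you use.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Deployment:
    """Hypothetical model of the setup described above (not a real API)."""
    dns_pool: Set[str]                 # web servers currently in round-robin DNS
    db_master: str
    db_slave: Optional[str]
    slave_lag_seconds: int             # replication lag as reported by the slave
    web_db_target: Dict[str, str] = field(default_factory=dict)  # web host -> db host


def fail_over(d: Deployment, failed_region: str,
              new_web_servers: Iterable[str]) -> Deployment:
    # Step 1: take the failed region's web servers out of round-robin DNS.
    # Assumes host names are prefixed with their region, e.g. "usa-east-web1".
    d.dns_pool = {h for h in d.dns_pool if not h.startswith(failed_region)}

    # Step 2: promote the slave only once it is verified to be up to date.
    if d.db_slave is None or d.slave_lag_seconds != 0:
        raise RuntimeError("slave missing or behind; promoting now could lose writes")
    d.db_master, d.db_slave = d.db_slave, None

    # Step 4 (done before step 3 here so new hosts get repointed too):
    # bring up replacement web servers and add them to DNS.
    d.dns_pool |= set(new_web_servers)

    # Step 3: point every remaining web server at the new master.
    d.web_db_target = {h: d.db_master for h in d.dns_pool}
    return d
```

Note that with a 1hr TTL, resolvers that cached the old records can keep sending traffic to the removed hosts for up to an hour after step 1, which is part of the service degradation a manual switch accepts.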
I believe Amazon offers mechanisms to do this automatically, or we could always write our own failover scripts, but this is the tradeoff we made. We were willing to accept some service degradation from switching over manually in exchange for avoiding the pitfalls of false-positive detection. Very much an application-specific tradeoff, not for everyone, but it worked for what we are doing.
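To make the false-positive concern concrete: a common way to soften it in an automated setup is to require several consecutive failed health checks before triggering anything. The sketch below is only an illustration of that idea, not what we run; the function name and the default threshold of 3 are assumptions.

```python
from typing import Iterable


def should_fail_over(health_checks: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive checks have failed.

    `health_checks` is a sequence of results, True meaning the check passed.
    A single passing check resets the failure streak.
    """
    streak = 0
    for ok in health_checks:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False
```

A higher threshold means fewer spurious failovers but slower reaction to a real outage, which is the same tradeoff as doing it by hand, just with different numbers.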
The key was to avoid putting all our eggs in the usa-east basket and to split across usa-west as well, even though we incur additional bandwidth fees, i.e. master/slave replication transfer is charged at full rate between regions.
We were never concerned about cascading failures affecting multiple availability zones in a given region, nor did it matter for us - our redundancy requirement was geographical diversity, not partitions within a datacenter. We were thinking natural disaster, but the architecture covered us in this case as well.
The coolest thing to me is just how quickly we were able to shuffle around these resources to avoid a problem area - a couple of hours. There's no way we could have done it so quickly with what we had before - a combination of our own colocated servers and VPS.
Don't forget the one-click patent. True democracy/spirit of the Internet my ass.