Amazon Outage Shows Limits of Failover 'Zones'
jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."
For a little extra money, you can get a seat in my biplane, with the extra wings.
Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Not ready for the desktop ;-)
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
or use a completely different company for redundancy. I think that is the lesson here.
I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
Please do not read this sig. Thank you.
or just use 4.2.2.1 and 4.2.2.2.. or 8.8.8.8 and 8.8.4.4.
...and get slow performance on anything delivered via Akamai or similar services which try to use regional data centers.
OpenDNS and Google DNS are hacks that work increasingly badly.
E pluribus unum
Outsource IT and you outsource responsiblity as well. If your own department fucks up, the top brass will come looking for you. However, If you outsource and the service provider messes up, you can shift the blame to them especially in case of big disasters like these. As long as you can show that you've managed the SLA's well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLA's were crap to begin with, which is also fine, because it's likely your boss and his boss signed off on the deal as well. Pass the buck...
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
Okay, I had to log in simply to comment on the stupidity of this statement. Aside from now being in violation of their own ToS (probably, at least in transgression of up-time guarantees), they're undoubtedly fiscally liable for refunding payment for the period of time in which services were unavailable or degraded. Additionally, this dramatically hurts their brand name - I know if I ever have to host anything on 'the cloud' (I can't believe I said it), this incident will be on my mind when the time comes for me to choose a provider. And before I stop beating this dead horse - think about what kind of liability Amazon would have, fiscally, for intentionally dropping services for revenue producing sites. One would imagine that Amazon would be fiscally liable for revenue losses during that downtime if this outage was intentional. That's no small amount of coin.
Not that you're wrong, but that's not the fault of the DNS servers, Akamai should be using geolocation by IP, not by the location of DNS servers.
Infact, I'm not sure how they could be doing geolocation by the client's DNS servers... are you sure about that?
Actually, when you're on TWC you might get BETTER performance with OpenDNS than with their own DNS. When using the TWC DNS I can't get a 1080p without 10m of loading time or even a non-stuttering 720p stream from YouTube or Netflix. With OpenDNS or Google DNS I get much better performance. Also, if you're an AT&T Business customer, OpenDNS works much better with DNS-based RBL's like Spamhaus which AT&T blocks.
Custom electronics and digital signage for your business: www.evcircuits.com
Cloud Computing generally implies redundancy and non-locality 24/7.
No. It generally implies that you can hire resources (cpu, disk) on short notice and for short amounts of time without costing the earth. You can build high-availability systems on top of that, but HA is not trivial to set up and typically requires significant investment at many levels (hardware, system, application) to attain. Pretend that you can get away with less if you want; I don't care.
"Little does he know, but there is no 'I' in 'Idiot'!"
8:54 AM PDT We'd like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.
Amazon and Microsoft have to distinctly different views of "cloud computing."
When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.
Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.
Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.
Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.
Kriston
Typically your computer asks your firewall/router for a DNS lookup. It relays that to your ISP's DNS server. Your ISP looks up the DNS server responsible for the domain and contacts that server and sends your original request. That request doesn't include your IP however, so Akamai's DNS servers are returning regional specific servers based on your ISP's DNS server IP/geo-location. That's usually perfectly acceptable, since presumably your ISP's DNS server would be located on a good route with a low ping.
So if you replace your ISP's DNS server with those of OpenDNS, google or whatever else, it is that server which determines your location when Akamai's DNS servers decide which IPs to give you.
You should be able to replace your DNS server with your own locally hosted one as well. ie, you contact the root-servers, hunt down the responsible server, then contact it directly for the IP. I'm not sure what the implications of that is though. The intent of the typical setup is that the ISP DNS servers can cache things and reduce the load on the central root servers.