Amazon Outage Shows Limits of Failover 'Zones'

Re:For a little extra money... by techsoldaten · 2011-04-21 08:06 · Score: 2

For a little extra money, you can get a seat in my biplane, with the extra wings.

Yay by recoiledsnake · 2011-04-21 08:07 · Score: 1

Yay, cloud. http://www.youtube.com/watch?v=Lel3swo4RMc

--
This space for rent.

Re:Yay by calderra · 2011-04-21 08:14 · Score: 1

So setting up a server to remote desktop into your home computer is cloud computing? I have a sinking feeling that "cloud computing" is a lot like web2.0, aka "broadband Geocities".
Re:Yay by DigiShaman · 2011-04-21 08:59 · Score: 1

Cloud Computing generally implies redundancy and non-locality 24/7. Computer hardware that makes up the cloud would normally be provisioned to act as a resource and not a point of failure for the entire infrastructure. The idea with Cloud Computing is that the Cloud is an organ while the hardware acts as cells. A few could die off and/or be replaced without any disruption to the user.
Unfortunately, everyone has their own idea and implementation for creating Cloud-based content and services. So we end up with a lot of bullshit marketing, which reduces the entire concept to nothing better than a buzzword. Do you feel like trusting it now?

--
Life is not for the lazy.
Re:Yay by schnikies79 · 2011-04-21 09:06 · Score: 1

No need to have a sinking feeling, it's always been that way. The "Cloud" is a buzzword, nothing more, nothing less.

--
Gone!
Re:Yay by marcello_dl · 2011-04-21 09:20 · Score: 1

The incident might be eye opening for some people but the cloud cannot theoretically work because it's not a paradigm. Grid computing is a paradigm. Cloud computing is, as you said, marketspeak describing how providers organize their resources internally. Well that's irrelevant because the provider is the single point of failure. Piss off Amazon for whatever reason, your data becomes unavailable no matter how cloudy it was. It's more "cloudy" to simply replicate data locally and on two different providers.

--
---- MISSING MISCELLANEOUS DATA SEGMENT --- [sigdash] trolololol
Re:Yay by Fulcrum+of+Evil · 2011-04-21 09:28 · Score: 1

the cloud cannot theoretically work because it's not a paradigm.
What the fuck is that supposed to mean? It's got words in it, but is entirely vacuous.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
Re:Yay by dkf · 2011-04-21 09:41 · Score: 2

Cloud Computing generally implies redundancy and non-locality 24/7.
No. It generally implies that you can hire resources (cpu, disk) on short notice and for short amounts of time without costing the earth. You can build high-availability systems on top of that, but HA is not trivial to set up and typically requires significant investment at many levels (hardware, system, application) to attain. Pretend that you can get away with less if you want; I don't care.

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:Yay by DigiShaman · 2011-04-21 13:49 · Score: 1

He's wrong about Cloud not being able to work. But I think I understand his POV. If I'm right, he's basically saying what I've stated. That is to say, Cloud computing is a business solution based idea with the word coined for a marketing purpose. However, Cloud computing is not required to use any specific paradigm to achieve that goal. Grid computing is such a paradigm, and I believe it to be the proper one to use for Cloud computing.

--
Life is not for the lazy.
Re:Yay by sorak · 2011-04-22 05:26 · Score: 1

Sorry if I'm dumbing it down too much, but are you arguing that Grid computing is a specific implementation to a technical problem, while cloud computing is a marketing solution to a business problem?
Re:Yay by Fulcrum+of+Evil · 2011-04-22 05:34 · Score: 1

I'd say they are all marketing terms, sort of like MS and their DNS from days of yore. That said, hosted scalable VPS instances are damn useful, provided you actually understand them.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
Re:Yay by DigiShaman · 2011-04-22 09:28 · Score: 1

Yes.

--
Life is not for the lazy.

Let us learn from Xzibit by Anonymous Coward · 2011-04-21 08:08 · Score: 4, Funny

Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.

Re:Let us learn from Xzibit by Lehk228 · 2011-04-21 08:53 · Score: 1

Yo dawg we heard you liked clouds

--
Snowden and Manning are heroes.
Re:Let us learn from Xzibit by Anonymous Coward · 2011-04-21 17:02 · Score: 1

That's funny. But this whole thing is funny. What the blazes is the cloud for, if it fails? Is it not a cloud but a lead balloon? Sure, I understand no system is perfect, but to steal a line from Seinfeld (you're a car rental agency - you're supposed to *hold* the reservation): Amazon WS - you are a cloud! You are supposed to have 99.99% uptime! That's *all* you are supposed to do! Especially when mainframes have 99.9999% uptime, I believe.
Even distributed systems - which your average web site has had available for at least ten years, and even that wasn't new technology. What is it, we move to the "cloud" and forget all the stuff we learned before? Sometimes watching these chaps, like twitter, facebook and google, reinvent and rewrite and relearn basic, standard, industry practices, you've got to really be puzzled. Let's put all our logic in the view layer, etc.
Re:Let us learn from Xzibit by aled · 2011-04-22 11:45 · Score: 1

Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Wrong! You need to put the cloud in a 'virtual' cloud!

--

"I think this line is mostly filler"

Cloud computing by stopacop · 2011-04-21 08:08 · Score: 5, Funny

Not ready for the desktop ;-)

--
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?

Re:Cloud computing by stopacop · 2011-04-21 08:32 · Score: 1

I've been using Debian since 2.0.36 kernel - definitely like Linux.

--
http://www.stopacop.so -- You have rights. How about standing up for them before they go away?
Re:Cloud computing by blair1q · 2011-04-21 09:48 · Score: 1

What do you mean?
I've had a cloud in my desktop machine for years.
The path to it is "/"

sounds like TWCs DNS servers by dltaylor · 2011-04-21 08:13 · Score: 1

I lose access to TWC's DNS servers regularly (yes, I will be setting up my own, when it becomes annoying enough). Although you can do a quick-and-dirty load-balancing by setting them up as follows, there's no redundancy for the customers when there's a link failure.

search socal.rr.com
nameserver 209.18.47.61
nameserver 209.18.47.62

Re:sounds like TWCs DNS servers by michaelhood · 2011-04-21 08:43 · Score: 2

or just use 4.2.2.1 and 4.2.2.2.. or 8.8.8.8 and 8.8.4.4.
Re:sounds like TWCs DNS servers by DaftDev · 2011-04-21 08:47 · Score: 1

Or OpenDNS: 208.67.222.222 208.67.220.220
Re:sounds like TWCs DNS servers by samkass · 2011-04-21 08:57 · Score: 2, Informative

...and get slow performance on anything delivered via Akamai or similar services which try to use regional data centers.
OpenDNS and Google DNS are hacks that work increasingly badly.

--
E pluribus unum
Re:sounds like TWCs DNS servers by trapnest · 2011-04-21 09:05 · Score: 3, Interesting

Not that you're wrong, but that's not the fault of the DNS servers, Akamai should be using geolocation by IP, not by the location of DNS servers.
In fact, I'm not sure how they could be doing geolocation by the client's DNS servers... are you sure about that?
Re:sounds like TWCs DNS servers by guruevi · 2011-04-21 09:11 · Score: 2

Actually, when you're on TWC you might get BETTER performance with OpenDNS than with their own DNS. When using the TWC DNS I can't get a 1080p without 10m of loading time or even a non-stuttering 720p stream from YouTube or Netflix. With OpenDNS or Google DNS I get much better performance. Also, if you're an AT&T Business customer, OpenDNS works much better with DNS-based RBL's like Spamhaus which AT&T blocks.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:sounds like TWCs DNS servers by Anonymous Coward · 2011-04-21 11:39 · Score: 1

No. DNS-based geo-location caching schemes are the culprit. They work off a bad assumption that makes using an alternative DNS server a pain. I don't like my ISP's DNS servers. They hijack domain typos as a revenue stream, so I consider them hostile and ignore them when I can.
Using google or opendns, however, will cause havoc for a couple of surprisingly common things I've experienced problems with:
Akamai
Itunes
Hotmail
Rackspace hosted exchange service
Netflix
Youtube
Fortunately you can configure your network's DNS server to forward requests to your ISP's DNS servers (instead of your 3rd party DNS service) for specific domains (like *.apple.com *.netflix.com). This fixed the above issues where I work.
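The per-domain forwarding described above boils down to a suffix match on the query name. A rough sketch of just that selection logic follows; it is not a real resolver config, and the function name `pick_upstream`, the placeholder ISP resolver address, and the domain list are assumptions made up for illustration.

```python
# Hypothetical sketch: choose an upstream resolver per query name.
# ISP_RESOLVER is a documentation placeholder address (RFC 5737),
# not a real ISP resolver.
ISP_RESOLVER = "192.0.2.53"
THIRD_PARTY_RESOLVER = "8.8.8.8"  # public resolver mentioned upthread

# Domains whose CDN answers depend on where the resolver sits (assumed list).
FORWARD_TO_ISP = ("apple.com", "netflix.com")


def pick_upstream(qname: str) -> str:
    """Return the resolver that should handle qname."""
    name = qname.rstrip(".").lower()
    for domain in FORWARD_TO_ISP:
        # Match the domain itself or any subdomain, but not lookalikes
        # such as "notapple.com".
        if name == domain or name.endswith("." + domain):
            return ISP_RESOLVER
    return THIRD_PARTY_RESOLVER
```

Anything under the listed domains goes to the ISP resolver so the CDN sees a nearby source; everything else keeps using the third-party service.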
Re:sounds like TWCs DNS servers by Anonymous Coward · 2011-04-21 12:38 · Score: 1

I'm personally completely sure. I run a recursive DNS server at work for DNS lookups, and I get very different answers for www.akamai.com when I manually query 8.8.8.8 vs our own recursive DNS server.
It doesn't help that a RTT to the IP returned by 8.8.8.8 is over 200 ms, but the latter is around 10 ms. (I'm in New Zealand, 200 ms RTTs to popular, US based, websites is very normal. Heck, I have 228 ms RTT pinging slashdot.org right now on my home DSL. Clearly, using our own recursive DNS, I'm hitting an in-country node of the akamai network.)
Of course, ISP-provided recursive DNS servers are usually fairly unreliable - hence, why we run our own. (When we were previously connected to a large, multinational provider (that has basically since pulled out of the country), the resolving DNS server they gave had a considerably larger RTT than the authoritative DNS servers for .nz domains themselves!)
Re:sounds like TWCs DNS servers by The+Bean · 2011-04-21 17:54 · Score: 2

Typically your computer asks your firewall/router for a DNS lookup. It relays that to your ISP's DNS server. Your ISP looks up the DNS server responsible for the domain and contacts that server and sends your original request. That request doesn't include your IP however, so Akamai's DNS servers are returning regional specific servers based on your ISP's DNS server IP/geo-location. That's usually perfectly acceptable, since presumably your ISP's DNS server would be located on a good route with a low ping.
So if you replace your ISP's DNS server with those of OpenDNS, google or whatever else, it is that server which determines your location when Akamai's DNS servers decide which IPs to give you.
You should be able to replace your DNS server with your own locally hosted one as well, i.e., you contact the root servers, hunt down the responsible server, then contact it directly for the IP. I'm not sure what the implications of that are though. The intent of the typical setup is that the ISP DNS servers can cache things and reduce the load on the central root servers.
Re:sounds like TWCs DNS servers by raju1kabir · 2011-04-22 00:06 · Score: 1

I don't believe what you say is true, at least not anymore.
I'm in Malaysia, a bandwidth-constrained country 200ms from the USA. Using the wrong CDN node makes a huge difference.
When I use 8.8.8.8 to find www.apple.com, I get e3191.c.akamaiedge.net, which is 17ms from my house. That's as good as it's going to get.
Perhaps Google has started using source IPs for its DNS queries that match the client's location?

--
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
Re:sounds like TWCs DNS servers by raju1kabir · 2011-04-22 00:08 · Score: 1

Sorry, I should have included the IP, since that's the location-sensitive part. I get e3191.c.akamaiedge.net as 118.215.101.15.

--
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS

have your own servers by Dan667 · 2011-04-21 08:13 · Score: 4, Insightful

or use a completely different company for redundancy. I think that is the lesson here.

Re:have your own servers by rudy_wayne · 2011-04-21 08:35 · Score: 4, Insightful

This incident illustrates once again why you need to put your stuff on your own servers and not someone else's. All computer systems will fail occasionally. There's no such thing as 100% uptime. However, when your own servers fail you can get your own people working on it right away and it's their number one priority. When your stuff is on someone else's servers, you're at their mercy. It will get fixed when they get around to it, and, they have more customers than just you, so you might not be first on the priority list. Or second. Or third. Or tenth.
Re:have your own servers by gad_zuki! · 2011-04-21 08:47 · Score: 4, Informative

So wait. The cloud sales pitch is "no more servers-save money-cut IT staff" but now it's:
1. Virtualized servers in zone 1
2. Virtualized servers in zone 2
3. Virtualized servers from a different company altogether.
So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.
Re:have your own servers by Fulcrum+of+Evil · 2011-04-21 08:49 · Score: 1

One? Lose a raid stack and you're toast. It's always been at least N+1 redundancy for your tier one crap. The cloud stuff is there so you can scale up quickly. Shouldn't be base load or anything.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
Re:have your own servers by DdJ · 2011-04-21 08:50 · Score: 1

This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.
Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.
Re:have your own servers by fermion · 2011-04-21 08:51 · Score: 1

If one can afford that kind of redundancy, then sure. Two independent lines coming in from two independent providers that individually will adequately handle all traffic for an extended period of time. Independent arrays of PCs hooked to independent load balancers that will not fall over if something happens to one line or a large number of computers. One could also have big iron with six-nines reliability hooked to redundant lines. In any case backup power to keep all the equipment up for a long period of time is critical. I knew companies that did these kinds of things back in the day. It was expensive and hard to fund.
If this is one day out of the year that EC2 is offline, then that is probably better reliability than a home-spun server. It is better reliability than I have ever gotten with any of the shared hosting companies I have dealt with. For affordable, or ad-sponsored, high-profile internet services, 3 nines is probably all we are going to get. It is probably good enough. The services that were down were not critical. Real quality of life was not meaningfully affected for a significantly large group of people. What this means to the common person is that one should have a redundant service. Use Gowalla and Foursquare. You may not be able to get to a dashboard, but maybe can get to the services. It was not like Google was down.
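For anyone who wants to check the nines being thrown around in this thread, here is a back-of-the-envelope sketch. The function names are made up for illustration, and it assumes a 365-day year. Worked through, one full day offline per year is roughly 99.73% availability, which is a bit short of three nines (about 8.76 hours per year).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability figure."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_hours: float) -> float:
    """Availability (in percent) implied by a yearly downtime in hours."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.9999):
        print(f"{nines}% allows {downtime_hours_per_year(nines):.3f} h/year")
    print(f"24 h/year down is {availability_percent(24):.2f}% availability")
```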

--
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Re:have your own servers by FishOuttaWater · 2011-04-21 08:57 · Score: 1

...or, have your stuff on the same servers with their Most Important Customer.
xD
Re:have your own servers by Fulcrum+of+Evil · 2011-04-21 09:33 · Score: 1

If one can afford that kind of redundancy, then sure. Two independent lines coming in from two independent providers that individually will adequately handle all traffic for an extended period of time.

Why would you do that? It's enough to do things like run two DCs that can each handle 60% load or three that each handle 40% load. Not that much more expensive, and downtime turns into "the site is slow". There are architectural concerns, especially with data replication, but this is definitely doable, and it doesn't cost a mint.
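The arithmetic behind the 60%/40% figures can be sketched like this. The function name is hypothetical, and the model assumes load can be shifted freely between sites, which glosses over the data replication concerns mentioned above.

```python
def capacity_after_failures(num_dcs: int, per_dc_capacity: float, failed: int = 1) -> float:
    """Fraction of peak load still servable after `failed` datacenters go down.

    per_dc_capacity is the share of peak load one datacenter can carry
    (0.60 means 60% of peak).
    """
    surviving = max(num_dcs - failed, 0)
    return surviving * per_dc_capacity


if __name__ == "__main__":
    # Two DCs at 60% each: 120% total, 60% left after losing one.
    print(capacity_after_failures(2, 0.60))
    # Three DCs at 40% each: 120% total, 80% left after losing one.
    print(capacity_after_failures(3, 0.40))
```

Both layouts buy the same 120% of total capacity, but the three-site layout keeps more of it when a single site drops out.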

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
Re:have your own servers by Artifex · 2011-04-21 09:35 · Score: 1

So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.
I hope that one good server is in a disparate geographical location from its hot backup, using a separate transit provider, each server has redundant power supplies, and your talent has a bus factor of (#servers)+1 or more. You're gonna need backup for any load balancing as well, and whether that should be in yet another location is, well, something to consider.
Cloud services should give you the redundancy you need, as well as being easily scalable. Why are you trying to say the whole concept is bad just because Amazon's implementation is flawed?

--
Get off my launchpad!
Re:have your own servers by vrmlguy · 2011-04-21 09:40 · Score: 2

This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.
Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.
It's probably worth noting that EMC makes a cloud storage product called Atmos with an API essentially identical to Amazon's S3 service. The main difference is that the HTTP headers start with x-emc instead of x-amz, so a properly written application running on non-Amazon servers could switch fairly easily between the two for load balancing or redundancy.
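As a rough illustration of how small that switch could be, here is a sketch that rewrites only the header prefix. It assumes the header names are otherwise identical after the prefix, which is a simplification; the function name is made up, and request signing or any other API differences are not covered.

```python
def convert_header_prefix(headers: dict, old: str = "x-amz", new: str = "x-emc") -> dict:
    """Return a copy of headers with the vendor prefix swapped.

    Assumption: only the prefix differs between the two services.
    Headers without the old prefix are passed through untouched.
    """
    converted = {}
    for name, value in headers.items():
        if name.lower().startswith(old):
            name = new + name[len(old):]
        converted[name] = value
    return converted
```

For example, `convert_header_prefix({"x-amz-date": "...", "Content-Type": "text/plain"})` would rename the first header to `x-emc-date` and leave `Content-Type` alone.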

--
Nothing for 6-digit uids?
Re:have your own servers by dkf · 2011-04-21 09:48 · Score: 1

This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.
Hosting everything yourself? Can we sell you a contract for us to build you a datacenter? Then there's the ongoing costs of actually operating it.
Or were you thinking that a scavenged rack in a old closet previously only used by the janitor was a substitute?

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:have your own servers by hey! · 2011-04-21 10:02 · Score: 2

Nah. It shows that when you buy a product or service you need to understand what you are paying for, not extrapolate from a buzzword like "cloud".
You can't make a blanket statement one way or another about using something like EC2 without considering the user's needs and capabilities. There may be users who'd find the recent outage intolerable; they probably shouldn't be using EC2. But if they have good reasons to consider EC2, chances are they are going to spend more money.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:have your own servers by dhasenan · 2011-04-21 11:36 · Score: 1

I would expect Amazon's marketing to indicate that these units within a region are a way to get fast communication between them without wholly losing redundancy. As such, it's a middle-tier option, not best at anything (you'd have the machines in the same data center if they really needed the bandwidth, and in separate regions if you really needed the redundancy). If I'm wrong about that, then the marketing people who handled that should be dismissed.
Re:have your own servers by asdfghjklqwertyuiop · 2011-04-21 13:25 · Score: 1

just because Amazon's implementation is flawed?
Which provider's implementation is not flawed?
Re:have your own servers by LordLimecat · 2011-04-21 13:40 · Score: 2

Pop quiz:
You're a small company that does software development. You need servers to do deployment testing, basically just Apache and the customized package. Uptime is a must, and your budget is limited.
Do you...
A) Spend tens of thousands on servers, plus backup power, plus racks, plus redundant switches, plus dual WAN links, plus a backup solution (for 10 servers, so far you're looking at ~$35k, plus a thousand a month on WAN links)
or
B) Trust that Amazon will have FAR better uptime than you could EVER dream of architecting on a budget, with far greater convenience, and a lower price tag to boot (the "cloud" is generally billed on CPU usage and bandwidth, which will be low for testing)?
Every time Google or Amazon or Rackspace suffers an outage, people start hollering that the cloud is a menace, a curse, a sham, or whatever. But if you look at the length of, for example, Google's outages over 8 years, their record is head and shoulders above anything that Slashdot armchair engineers could throw together, especially given the load they carry.
Unless I missed a news story, this will be Amazon cloud's first outage, and the beauty here is that none of the "rebuild" or "restore from backup" burden will be on their customers. They have to pay technicians to come out and replace hardware; they have to provide the hardware. The downside, of course, is that their services are unavailable; but of course you would be facing that if your own setup failed, and you would be footing the bill to boot.
The real lesson here, I suppose, is that if you really really really need 100% uptime, you should be prepared to fire up a hot- or cold- standby system of your own, or that you should get a rather large budget approved to build a real redundant system-- but not that you can out-architect Amazon on anything less than a large budget.
NB-- I say this as a technician typically dealing with networks up to 100 users and up to 30 servers. If you have multi-million dollar budgets, certainly go ahead and build out that server room.
Re:have your own servers by Anonymous Coward · 2011-04-21 14:30 · Score: 1

Really, it depends on the financial hit your organization will take by downtime/lost productivity/lost business/lost confidence in the ability of your organization to be able to deliver your product. If being down for 12-15 hours or more will cost you more than $35k then yes it makes sense for you to roll your own solution or to use traditional dedicated hosting providers in a H/A configuration. Every organization needs to perform their own risk analysis. If downtime that is out of your control is acceptable then a cloud provider is for you. If it isn't then keep it in house.
What really gets me is the marketing that goes into the "cloud"/SaaS/PaaS solutions. Sure they can save you a significant amount of money up front. However, to achieve the best uptime I have always subscribed to the KISS philosophy. Moving something to a public cloud solution is inherently adding a significant layer of complexity to delivering your application (which should be your primary objective, you can't make money if people can't use your application/service). Even Google has downtime on their services from time to time (although I don't know of a time when search has been unavailable).
For what it's worth, the services that I host in-house rarely, if ever, have unscheduled downtime. In the few instances of unscheduled downtime we were able to recover in minutes, not hours, because we have complete control over the environment. The one service that we have hosted has been down over 30 hours this year already (obviously not my choice, political decision by management).
Re:have your own servers by outsider007 · 2011-04-21 14:52 · Score: 2

No. If a zone goes down in CA, I can have a new server up in Virginia within minutes. I would rather be on ec2 when I go down. I guarantee I will be back up faster than you.

--
If you mod me down the terrorists will have won
Re:have your own servers by The+Bean · 2011-04-21 18:03 · Score: 2

Reddit's downtime has been a bit of a running joke for a while now, with most (all?) of it being blamed on Amazon.
The way they implemented things is one of the big issues. For example, things like setting up RAID volumes across multiple EBS volumes. They just magnified their exposure to any issues in the cloud. Any one machine goes down the system gets hosed and needs recovery. They also are constrained to a single availability zone in order to get the performance they need from their setup. (This is not intended to be a factual statement. ie, I didn't confirm the details, but I believe it captures the essence of the issue.)
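To put a number on "magnified their exposure": if the RAID set is a plain stripe with no redundancy (an assumption here, not a confirmed detail of their setup), losing any one volume loses the set. Assuming independent failures, which is also optimistic when the volumes share one availability zone, the sketch below shows how the odds stack up. The function name and the 1% figure are illustrative only.

```python
def stripe_failure_probability(per_volume_failure: float, num_volumes: int) -> float:
    """Probability that at least one of num_volumes fails.

    Assumes independent failures and a stripe with no redundancy,
    so a single failed volume takes down the whole set.
    """
    return 1 - (1 - per_volume_failure) ** num_volumes


if __name__ == "__main__":
    # Illustrative numbers only: a 1% chance per volume over some period.
    for n in (1, 4, 8):
        print(n, round(stripe_failure_probability(0.01, n), 4))
```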
To get the most from the "cloud" you need to build your infrastructure accordingly. You can't take old systems and throw them in the cloud and expect it to scale. Neither can you carry over all the old ideas; new tools will require new techniques, which the industry will learn as things mature.
Re:have your own servers by craigbeat · 2011-04-21 18:19 · Score: 1

The company I work for hosts on another very large company (that had a lot of downtime for another reason a few years back), on dedicated servers. Believe me when I say we have as many problems with them. So far, there have been no problems for us using Amazon. I think it depends on your needs. Multiple redundancy is probably a better solution, but nothing is perfect yet.
Re:have your own servers by LordLimecat · 2011-04-22 02:34 · Score: 1

Basically what you're saying is you can't just throw the "cloud" around like it's a magical fix-all; and that's true. But every time one of these "big company goes down" stories arises, people seem to take that as proof that the cloud is not useful for anything, and I would challenge that assertion. There are a number of times where you need to rapidly expand, or where you need good uptime and scalability but don't have a big budget; and for that, the cloud really shines.

Re:For a little extra money... by lgw · 2011-04-21 08:23 · Score: 1

Amazon: where failover meets overfail.

It has to be embarrassing that a single incident brought down multiple "availability zones" (at least for EBS, maybe other parts of EC2), as that's just what they were supposed to be safe from. Hmm, "overfail", I like it.

--
Socialism: a lie told by totalitarians and believed by fools.

philosophical POV by Anonymous Coward · 2011-04-21 08:29 · Score: 2, Insightful

I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.

Re:For a little extra money... by HikingStick · 2011-04-21 08:33 · Score: 1

What, you mean a modern oil tanker?

--
I use irony whenever I can, but my shirts are still wrinkled...

And thus the gullible managers who ignored IT... by gestalt_n_pepper · 2011-04-21 08:41 · Score: 2

and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.

Cheers!

--
Please do not read this sig. Thank you.

Turning lemons into lemonade or...... by i_want_you_to_throw_ · 2011-04-21 08:44 · Score: 1, Funny

maybe the failure was on purpose to promote another revenue stream. Hmmmmmmmmm......................

Re:Turning lemons into lemonade or...... by elohel · 2011-04-21 09:00 · Score: 3, Insightful

Okay, I had to log in simply to comment on the stupidity of this statement. Aside from now being in violation of their own ToS (probably, at least in transgression of up-time guarantees), they're undoubtedly fiscally liable for refunding payment for the period of time in which services were unavailable or degraded. Additionally, this dramatically hurts their brand name - I know if I ever have to host anything on 'the cloud' (I can't believe I said it), this incident will be on my mind when the time comes for me to choose a provider. And before I stop beating this dead horse - think about what kind of liability Amazon would have, fiscally, for intentionally dropping services for revenue producing sites. One would imagine that Amazon would be fiscally liable for revenue losses during that downtime if this outage was intentional. That's no small amount of coin.
Re:Turning lemons into lemonade or...... by klui · 2011-04-21 17:44 · Score: 1

Reddit has been down for approx 24 hours--it's been on RO mode for most of the day. Pretty bad PR for Amazon.

What if Amazon was down instead for Reddit, etc.? by Anonymous Coward · 2011-04-21 08:58 · Score: 1

Do you think Amazon would allow its own sales and services to be impacted for 12 hours (and running) under any circumstances short of the recent disaster in Japan? EC2 customers, on the other hand, appear to be second-class citizens.

Gullible manager doesn't care by JaredOfEuropa · 2011-04-21 08:59 · Score: 2

Outsource IT and you outsource responsibility as well. If your own department fucks up, the top brass will come looking for you. However, if you outsource and the service provider messes up, you can shift the blame to them, especially in case of big disasters like these. As long as you can show that you've managed the SLAs well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLAs were crap to begin with, which is also fine, because it's likely your boss and his boss signed off on the deal as well. Pass the buck...

--
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...

Availability zones must be done PERFECTLY by sirwired · 2011-04-21 09:05 · Score: 1

"Availability Zones", "Failure Domains", etc. must be done with absolute perfection if you do them at all. If your gargantuan application has some single tiny side-feature that is not replicated across domains, your whole app is going down.

True Story: I was doing some consulting work for a large bank after they had a bunch of problems. Their main website had all the super-available trimmings: Oracle RAC, multi-site server clustering, storage mirroring, all the fancy, expensive, highly-available crap you could ask for. This is all well and good except... some dinky stylesheet (or something like that) for the bank's homepage resided on some dinky 1U non-clustered fileserver. When it went down, the pages simply would not display. Whoops! All that grand effort was for nought because there was one "leakage" that killed the whole app.

Re:Availability zones must be done PERFECTLY by sockonafish · 2011-04-21 09:19 · Score: 1

You could code your application to be tolerant of those kinds of outages. The services backing feature X aren't available? Then don't render the controls for feature X on the page.
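A minimal sketch of that idea, assuming each feature comes with some kind of health check the page renderer can call. The function and feature names are hypothetical; a real application would also want timeouts and caching around the checks.

```python
from typing import Callable, Dict, List


def available_features(health_checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the features whose backing service currently reports healthy.

    A check that returns False or raises hides only its own feature;
    the rest of the page still renders.
    """
    available = []
    for feature, check in health_checks.items():
        try:
            if check():
                available.append(feature)
        except Exception:
            # Treat an unreachable or crashing backend as "feature off".
            continue
    return available
```

The renderer would then draw controls only for the names this returns, so a dead side-service costs one widget rather than the whole page.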
Re:Availability zones must be done PERFECTLY by Fulcrum+of+Evil · 2011-04-21 09:34 · Score: 1

you could host the stylesheet somewhere sensible.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"

They're called "FAILOVER" zones for a reason... by numbsafari · 2011-04-21 09:26 · Score: 1

You're supposed to FAILOVER between them, not load balance between them.

You can't hold amazon accountable for your own stupidity.

Beyond that, you have to ask yourself the question: how many outages would you have had with your own facility in the past year compared to this outage? Did you apply the same approach to your use of EC2 as you would to your own facility?

Re:They're called "FAILOVER" zones for a reason... by Slashdot+Parent · 2011-04-22 04:09 · Score: 1

You can't hold Amazon accountable for your own stupidity.
I'm pretty sure you don't really understand what happened.
First, they're called Availability Zones. Not to be pedantic, but I just want you to be able to have the correct terminology if you want to read up on this.
Secondly, a failure in one AZ took out an entire Region. This is NOT supposed to happen. Each AZ is supposed to be considered as a separate datacenter in your application (separate power source, separate facility, separate uplink, etc.). AZs are supposed to be isolated from failures in other AZs.
Like you, I have little sympathy for AWS customers who put all of their eggs in one VM. However, lots of AWS customers who did the Right Thing(TM) got hosed by this. You can't say that they suffered because of their own stupidity.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Reddit's down, guess I'll check out slashdot. by finkployd · 2011-04-21 09:41 · Score: 1, Funny

It's been like 7 years, how's everyone doing? :)

Re:Reddit's down, guess I'll check out slashdot. by Dominic_Mazzoni · 2011-04-21 23:34 · Score: 1

I'm back here because Reddit is down too.
Comment system is not bad.
Stories are good. ...but there just aren't very many of them!

Is this related to the DDoS of Change.Org? by blair1q · 2011-04-21 09:46 · Score: 1

Change.Org says that for the past several days the Chinese have been DDoSing it over a petition they are posting to gather support for Ai WeiWei.

http://blog.change.org/2011/04/chinese-hackers-attack-change-org-platform-in-reaction-to-ai-weiwei-campaign/

But if you go to the Change.Org site to sign the petition, you get a message saying that something is wrong with their servers, which are at Amazon.

http://www.change.org/petitions/call-for-the-release-of-ai-weiwei

http://status.aws.amazon.com/

http://www.computerworld.com/s/article/9216064/Amazon_gets_black_eye_from_cloud_outage

Could Amazon's outage be the result of Chinese hackers?

Re:And thus the gullible managers who ignored IT.. by Tackhead · 2011-04-21 09:48 · Score: 1

And thus the gullible managers who ignored IT... and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

C'mon. All managers love cloud!

What rolls down stairs, fails over in pairs,
Leaks data when it's allowed?
A stupidity tax, it replaces your racks,
It's cloud, cloud, cloud!

It's cloud! It's cloud! It's new, it's shiny, it's cheap!
It's cloud! It's cloud! It's down, and now you'll weep.

Everything's in the cloud! You're gonna love it, cloud!
Outsource it to the cloud! Everyone needs a cloud!

Cloud! It goes blammo!

Re:For a little extra money... by tnk1 · 2011-04-21 10:08 · Score: 1

Fools, I have this: http://en.wikipedia.org/wiki/Caproni_Ca.4

Bow down before the three wings and two engines. Seats are going fast, order a spot today.

Re:And thus the gullible managers who ignored IT.. by xtracto · 2011-04-21 10:17 · Score: 1

It is a sad joke. Even for sites like Reddit whose administrators are supposed to know better, the Amazon shit hit. And the terrible thing is that it is not the first time that Amazon's service has broken, this has happened quite a lot in the last months, and people still *pay* for the service. Crazy.

--
Ubuntu is an African word meaning 'I can't configure Debian'

No need to speculate by jc2brown · 2011-04-21 10:18 · Score: 2

Since apparently no one's actually looked into the issue beyond "ZOMG the cloud is down," here's some info from Amazon:

8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.

I've been saying by clenhart · 2011-04-21 10:30 · Score: 1

Downtime comes from people. The more people involved, the more downtime you'll have.

Change related? by Biggerveggies · 2011-04-21 11:52 · Score: 1

I don't necessarily hate the marketing concept of 'The Cloud', but I am fascinated by the business decisions and risk acceptance that organisations are willing to take, i.e. the typical: "Demanding high availability and hot failover, instantaneous incident resolution, and 'we are your primary customer'... but also a low cost." I think that Amazon and their competitors *may* get there with their offerings, but until there is a bit more maturity, I expect to see more incidents like this.

My wild guess is that a change triggered this, which of course leads to why has the backout plan failed (and who signed off on the risk)? I can't imagine that this is not change related - otherwise there is a serious architectural design flaw here somewhere.

Amazon and Microsoft by kriston · 2011-04-21 15:07 · Score: 2

Amazon and Microsoft have two distinctly different views of "cloud computing."

When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.

Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.

Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.

Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.

--

Kriston

*Still A Happy, Paying EC2 Customer* by CyborgWarrior · 2011-04-21 15:58 · Score: 1

As what I would consider a medium-weight AWS user (our account is about 4 grand a month) I am still quite happy with AWS. We built our system across multiple availability zones, all in us-east, and had zero downtime today as a result. We had a couple of issues where we tried to scale up to meet load levels and couldn't spin up anything in us-east-1a (or if we could, we couldn't attach it successfully to a load balancer because of internal connectivity issues), but we spun up a new instance in us-east-1b, attached it without trouble, and were able to handle the load just fine. The load balancers worked as expected (and hoped for) and the segregation of issues between availability zones was fairly successful.

I think that fixing these issues is just as high a priority for Amazon as it would be for any internal IT infrastructure, so I don't give much credence to the arguments that having your own servers and your own internal IT team would truly solve the problem any more effectively: I think it just gives you more of an illusion of control because you can see that you're working on it, as opposed to trusting that Amazon is working on it.

If there is any AWS lesson to be taken away from this it is that:

1) EBS may not be ready for prime time - most of our servers are instance-store anyway, both for performance reasons and for other reliability problems we have had in the past.

2) You should keep your server templates set up as up-to-date AMIs so you can deploy across any availability zone you want at any time you want. Right now, we have our load balancer attachment configuration all scripted as well, so spinning up new instances to feed a cluster is a single CLI execution with us specifying the availability zones.
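
The zone-fallback behaviour described above (us-east-1a refusing launches, us-east-1b working) can be sketched roughly as follows. This is not the poster's actual script: `launch` and `attach` are hypothetical callables standing in for whatever provisioning CLI or API is in use, and the outage is simulated.

```python
def launch_with_fallback(zones, launch, attach):
    """Try each availability zone in order until an instance launches and attaches.

    `launch(zone)` and `attach(instance)` are hypothetical callables; each is
    assumed to raise an exception on failure.
    Returns (zone, instance) for the first zone that works.
    """
    errors = {}
    for zone in zones:
        try:
            instance = launch(zone)
            # A real script would also terminate an instance that fails to attach.
            attach(instance)
            return zone, instance
        except Exception as exc:
            errors[zone] = exc  # remember why this zone failed, then move on
    raise RuntimeError(f"no zone could serve the request: {errors}")


# Simulated outage: nothing can be launched in us-east-1a.
def fake_launch(zone):
    if zone == "us-east-1a":
        raise ConnectionError("capacity unavailable")
    return f"instance-in-{zone}"

def fake_attach(instance):
    pass  # pretend the load balancer accepted the instance

zone, instance = launch_with_fallback(
    ["us-east-1a", "us-east-1b"], fake_launch, fake_attach
)
print(zone, instance)
```

This only works if the machine image and the attachment step are available in every zone on the list, which is the point of keeping the AMIs up to date.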

Check out http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html for a nice explanation of some of the issues you may come across with EBS and the internals of why.

Overall, I still give Amazon a good rating. This was a major outage and we felt barely a hiccup.

--
If you can't say something nice, make sure you have something heavy to throw.

Re:*Still A Happy, Paying EC2 Customer* by The+Bean · 2011-04-21 18:13 · Score: 1

I'd give you the good rating. You used the service in a sane manner that exploited the strengths of the system and avoided the weaknesses.
I suspect many users of EC2 actually end up with less reliability than they'd get with a server in a closet, as they don't realize the true effort it takes to have an effective solution like you do.

100% availability myth by SpaghettiPattern · 2011-04-21 20:52 · Score: 1

I have studied high availability systems and I have developed some. It's not too difficult to come up with a decent design that should guarantee extremely high availability. However, there always will remain assumptions and external factors which influence your system. The most tedious bits are systems and components that "never" fail (like simple NICs) for which you will not get any attention whatsoever.

100% availability is a myth. I know of a case where IBM's zero downtime operating system (z/OS) went down for a whole day, causing huge losses. The shutdown was allegedly caused by a cleaner accidentally disabling one power supply. Which essentially shouldn't have mattered... But it did. The failover switch was blocked by a relatively small number of disk writes that had not been fully committed to the failover site, and hence the failover site could not take over.
You should know that IBM charges huge amounts of money, claims zero downtime, and in the end doesn't deliver on that. Quite absurd.

But, IMHO, bashing IBM for not delivering zero downtime isn't fair. IBM's customer at hand should have conducted its own HA analysis but instead probably opted for a few words on "IBM" and "zero downtime". The customer really should have known.
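
The blocked failover in that story is the kind of thing an HA analysis has to model. Here is a toy sketch of such a guard, assuming writes carry increasing sequence numbers; it is a hypothetical illustration, not how z/OS or any IBM product actually decides.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    last_committed: int  # sequence number of the last durably committed write

def can_fail_over(primary_acked: int, standby: Site) -> bool:
    """A standby may take over only if it holds every write the primary acknowledged."""
    return standby.last_committed >= primary_acked

# Hypothetical numbers: the standby is missing the last two writes.
standby = Site("standby", last_committed=9_998)
primary_acked = 10_000  # the primary acknowledged writes up to here before dying

print(can_fail_over(primary_acked, standby))
```

A handful of uncommitted writes is enough to make the guard refuse, so the choice is between waiting for an operator and taking over with lost data. That trade-off should be decided before the outage, not during it.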

--

I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)

Re:And thus the gullible managers who ignored IT.. by jim_kaiser · 2011-04-21 21:17 · Score: 1

and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.

Cheers!

As a developer from India, I can tell you India has absolutely nothing to do with the cloud.. that's a US revolution.. we are still years behind..

--
The last person to mod me down is a rotten egg..... there.. that should do it..

Re:Zones in different continents by Neil+Boekend · 2011-04-21 23:21 · Score: 1

In that case the downtime wouldn't be the biggest of my problems.

--
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.

for a little extra cash ? by doperative · 2011-04-22 00:03 · Score: 1

"For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime."

I would have thought the entire raison d'etre of moving to the Cloud was to eliminate downtime. Otherwise, why not rent two boxes in different locations and achieve this near-guaranteed uptime without the extra expense, not to mention without your data totally disappearing when the Cloud goes down ...

centralized cloud computing ?? by doperative · 2011-04-22 00:10 · Score: 1

`Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud."'

I think you hit the nail on the head there: a centralized anything is always vulnerable to this kind of failure. For a business with multiple locations, a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection. If one site goes then the rest can carry on. I do believe this whole cloud computing concept has been oversold.

They don't call it ... by Rambo+Tribble · 2011-04-22 02:09 · Score: 1

... the "Bleeding Edge" for nothing. New technology always comes with teething problems.

Re:For a little extra money... by Hylandr · 2011-04-22 02:09 · Score: 1

Canadian Host?

http://www.youtube.com/watch?v=LAYMJnO9LBQ

- Dan.

--
~ People that think they are better than anyone else for any reason are the cause of all the strife in the world.

Re:And thus the gullible managers who ignored IT.. by pinkwarhol · 2011-04-22 07:19 · Score: 1

this reminds me of pynchon lyrics... it's perfectly in the spirit.
