Researcher: Interdependencies Could Lead To Cloud 'Meltdowns'

Posted by Soulskill on Saturday June 9, 2012 @03:16PM from the we-can-only-hope dept.

alphadogg writes "As the use of cloud computing becomes more and more mainstream, serious operational 'meltdowns' could arise as end-users and vendors mix, match and bundle services for various means, a researcher argues in a new paper set for discussion next week at the USENIX HotCloud '12 conference in Boston. 'As diverse, independently developed cloud services share ever more fluidly and aggressively multiplexed hardware resource pools, unpredictable interactions between load-balancing and other reactive mechanisms could lead to dynamic instabilities or "meltdowns,"' Yale University researcher and assistant computer science professor Bryan Ford wrote in the paper. Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis."

16 of 93 comments

This is why you cloud your cloud... by houstonbofh · 2012-06-09 15:27 · Score: 4, Insightful

If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.

Or, have your entire business totally dependent on someone else. (Sounds kinda scary that way, don't it?)
1. Re:This is why you cloud your cloud... by girlintraining · 2012-06-09 15:43 · Score: 5, Funny
  
  If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.
  While we're at it, we should probably backup the internet too. You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.
  
  --
  #fuckbeta #iamslashdot #dicemustdie
2. Re:This is why you cloud your cloud... by c0lo · 2012-06-09 15:50 · Score: 4, Funny
  
  You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.
  Heh... the real thing crashed long ago; you are now using the backup.
  
  --
  Questions raise, answers kill. Raise questions to stay alive.
3. Re:This is why you cloud your cloud... by flonker · 2012-06-09 15:58 · Score: 4, Informative
  
  http://archive.org/
4. Re:This is why you cloud your cloud... by martin-boundary · 2012-06-09 16:42 · Score: 4, Insightful
  
  There's a limited number of cloud hardware providers on the internet, and the rest are middle men. It's useless to diversify across the middle men, because they will all be affected when the common underlying hardware provider has an issue. Thus there's a limit to the reliability that can be achieved, irrespective of how much mixing and matching is performed at the "business end".
  Diversification only "works" when the alternatives are provably independent. That's not true in a highly interconnected and interdependent world, which is TFA's point, I believe.
5. Re:This is why you cloud your cloud... by im_thatoneguy · 2012-06-09 16:52 · Score: 4, Informative
  
  That's one of the problems though that the researcher is flagging.
  1) If a company has one instance on AWS and one on Azure and AWS fails... Azure suddenly doubles in load (and also fails due to everybody piling on unexpectedly).
  the other being:
  2) Everybody uses Azure for SQL and AWS for hosting and Azure goes down... suddenly SQL dies and the AWS hosts all fail with the database down. Or the converse happens and AWS goes down and the SQL is useless without a head.
  The more services you rely on, the more likely that on any given day one of them will be down. If you have 99% reliability and 20 services that you depend on (without any redundancy), then your failure rate could be up to 20% (about 18% if the failures are independent), since any one of the 1% failures could kill your service.
  It's interesting but it seems like most of the cloud failures have been due to #1 internally so far. One sector fails and in an effort to load balance it starts taking out its peers who then also overload and take out their peers.
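
The 99%-times-20 arithmetic in this comment can be checked with a minimal Python sketch. The function names are hypothetical, and the first calculation assumes the services fail independently, which is exactly the assumption the paper questions; the second is a worst-case bound that needs no such assumption.

```python
def composite_availability(per_service: float, n_services: int) -> float:
    """Availability of a system that needs all n services up at once,
    assuming (optimistically) that they fail independently."""
    return per_service ** n_services


def failure_upper_bound(per_service: float, n_services: int) -> float:
    """Union bound on the failure probability; holds even when the
    failures are correlated."""
    return min(1.0, n_services * (1.0 - per_service))


if __name__ == "__main__":
    down = 1.0 - composite_availability(0.99, 20)
    print(f"independent failures: down {down:.1%} of the time")
    print(f"worst-case bound:     down {failure_upper_bound(0.99, 20):.1%} of the time")
```

Under independence the combined failure rate comes out a little above 18%, and the bound gives the "up to 20%" figure quoted above.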
XKCD by Shadyman · 2012-06-09 15:28 · Score: 3, Funny

XKCD (jokingly) saw this coming a while ago: http://xkcd.com/908/
The analogy the author uses doesn't work. by stephanruby · 2012-06-09 15:34 · Score: 4, Insightful

The analogy the author uses doesn't work.
A better analogy would be the airline industry. The airline industry likes to over-book airplane seats it may not have because it's always trying to optimize its profit-margin.
The same will happen with cloud-services. Cloud-services will always try to optimize their own profit-margins, at the risk of triggering significant outages.
And I don't see what this has to do with the financial crisis at all.
1. Re:The analogy the author uses doesn't work. by pitchpipe · 2012-06-09 16:02 · Score: 4, Insightful
  
  A better analogy would be the airline industry.
  I think a better analogy is the power grid. System hits a peak, one line goes down, others try to compensate becoming overloaded, another can't handle the load and goes down, and behold: cascading failures.
  
  --
  Look where all this talking got us, baby.
2. Re:The analogy the author uses doesn't work. by TubeSteak · 2012-06-09 16:08 · Score: 3, Informative
  
  And I don't see what this has to do with the financial crisis at all.
  FTFA
  
  New cloud services may arise that essentially "resell, trade, or speculate on complex cocktails or 'derivatives' of more basic cloud resources and services, much like the modern financial and energy trading industries operate," he wrote.
  Each of these various cloud components are often maintained and deployed "by a single company that, for reasons of competition, shares as few details as possible about the internal operation of its services," Ford added.
  As a result, the cloud industry could find itself "yielding speculative bubbles and occasional large-scale failures, due to 'overly leveraged' composite cloud services" with weaknesses that don't become known "until the bubble bursts," Ford wrote.
  
  The metaphor more or less fits, except for the part that ignores how a lot of what happened during the financial crisis was outright fraud perpetrated by lenders.
  The potential mess with the cloud is not about fraud, just about excessive dependencies.
  
  --
  [Fuck Beta]
  o0t!
3. Re:The analogy the author uses doesn't work. by plover · 2012-06-09 16:39 · Score: 3, Insightful
  
  I think by "financial crisis" he meant "a minor market crash due to autotrading algorithms", and not the real crisis being caused by thieves running trillion dollar banking, mortgage, and insurance scams.
  The point is "if you use similar automated response strategies as a large set of other similar entities, you could all suffer the same fate from a common cause."
  Supposedly a market crash was triggered by autotrading algorithms that all tended to do exactly the same thing in the same situations. So when the price of oil shot up (or whatever the trigger was) then all those algorithms said "sell". As all the sell orders came in, the market average dropped, and the next set of algorithms said "sell moar". So there was a cascade because so many systems had identical responses to the same negative stimulus. Think of those automated trades as being akin to a "failover" IT system: if host X is failing, automatically shift my service load this way.
  So that's the analogy the author is trying to make with respect to systems that depend on automated recovery machinery like load balancers: if response time is too high at hosting vendor X, my automated strategy is to fail over to hosting vendor Y. And perhaps 500 large sites all have a similar strategy. Now let's say that vendor X suffers a DDoS attack because they host some site that pissed off Anonymous. So now all these customer load balancers see the traffic slowing down at X, and they simultaneously reroute all app traffic to vendor Y in response. Vendor Y then gets hammered due to the new load, and the load balancers shift the traffic elsewhere. Now two main hosting providers are down while they try to clean up the messes, and the several smaller providers are seeing much bigger customers than usual using them as tertiary providers, and they start straining under the load as well, causing their other clients to automatically shift.
  And if that isn't exactly what plays out next year, might not something similar happen with payment gateways, or edge content delivery systems, or advertising providers?
  It's a cascade of failures due to automated responses that's remarkably similar to the electrical grid overloads that caused the northeast coast blackout in 2003. The author's point is "we don't know precisely what bad thing might happen within this particular ecosystem, but there is significant risk because we've seen complex interdependent systems have similar failures before."
  
  --
  John
Low hanging fruit of a research piece by mcrbids · 2012-06-09 15:39 · Score: 4, Interesting

Efficiency normally comes with economies of scale. As a partner in an outsourced vertical software company, we have hundreds of clients running in our highly tuned hosting cluster, and are able to bring economies of scale to an otherwise ridiculously expensive software niche. Yes, that means that if we have an outage, all of our clients experience an outage as well.
However, we have carefully laid plans for multiple recovery points in a disaster scenario, (Plan B, Plan C, Plan D, etc) and have maintained an uptime significantly better than our clients would typically attain if left to their own devices. We easily manage close to 4 nines of uptime in an industry where the average is realistically around 2 nines. (having "the computer is down" a day or two every year or so is typical)
Although the Internet is a "network of ends", the truth is that not all ends are created equal. Having a high quality, high speed (100 Mb), reliable (99.99%+) Internet feed in my small-ish hometown of around 80,000 people is ridiculously expensive. But in a nearby city (500,000 people, 2 hours' drive away) we host our servers in a tier 1 colo at 1/10th the cost of running it all ourselves, with dramatically improved reliability and network performance.
Yes, putting all your eggs in one basket means that if that basket fails, you lose all your eggs. But it also makes it easy to buy just one, really nice basket that won't break and lose your eggs.
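
For a sense of scale, the "nines" quoted above can be converted to allowed downtime per year. This is a small Python sketch with a hypothetical helper name; it assumes a 365-day year and counts all downtime equally.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(nines: int) -> float:
    """Hours of downtime per year permitted by an availability of N nines
    (2 nines = 99%, 4 nines = 99.99%)."""
    unavailability = 10.0 ** -nines
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

Two nines allows roughly three and a half days of downtime a year, while four nines allows under an hour.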

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:impossible idea. by c0lo · 2012-06-09 15:57 · Score: 3, Insightful

we live in an age where information is distributed, even if statistical. (hell I made a fake Facebook account and somehow they found my mom, and she is no where close to me) a meltdown of information can't happen unless there is a world wide melt down of power. we have backups, but also ways of statistically restoring those backups.
Redundancy helps but it is not bullet-proof. A good chunk of it is the "topology" in which this redundancy is engaged in the event of failure (e.g. we had cascading blackouts in the past even though the energy network had enough total power to serve all consumers).
Have a look at cascading failures.

--
Questions raise, answers kill. Raise questions to stay alive.
if they actually do this - they're stupid by Karmashock · 2012-06-09 16:02 · Score: 3, Interesting

Systems need to be compartmentalized or have redundancies built into them.
For example, I have several systems that send automated emails. I've had a problem in the past of given email servers not accepting or sending messages. It's uncommon but it happens and it's not acceptable. These are mission critical systems. They can't fail.
Solution? Redundancy up the wazoo. The way it's set up now so many things would all have to happen at the exact same moment that the only way the system is likely to fail is if we fight world war 3... and lose.
That is how you solve this problem. Don't rely on any one system. Rely on all of them. Once you figure out how to integrate one of them it's typically easier to integrate the rest. The virtues of this approach are manifest. Not just stability, but if the services do processing or data retrieval you can cross reference them to find errors in databases or get a more complete data set than exists in any one source.
I mean, is Google or Bing the best search engine? What about both at the same time?

--
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
just like mainframes by Dan667 · 2012-06-09 16:05 · Score: 4, Insightful

I think it is funny that lessons learned years ago with mainframes are being presented as new by just changing the word mainframe to cloud.
Nightmare scenario has already happened by dbIII · 2012-06-09 16:47 · Score: 3, Insightful

It's a leap year, February 28, and all over the world, completely out of the blue (or azure if you prefer) cloud clusters crash as the local clocks swing around to midnight, then stay down all day.
Still, it's three nines of uptime when it's spread out over a few years :)

A highly interdependent system is only as reliable as the QC on the weakest link. Who would have thought that somebody from a company that had a lot of embarrassing press about a leap year stuffup would make such a stupid and obvious mistake four years later? That's the cloud, where even the biggest names still don't care anywhere near as much as you would about your own systems and so don't pay enough attention to detail.