Slashdot Mirror


Researcher: Interdependencies Could Lead To Cloud 'Meltdowns'

alphadogg writes "As the use of cloud computing becomes more and more mainstream, serious operational 'meltdowns' could arise as end-users and vendors mix, match and bundle services for various means, a researcher argues in a new paper set for discussion next week at the USENIX HotCloud '12 conference in Boston. 'As diverse, independently developed cloud services share ever more fluidly and aggressively multiplexed hardware resource pools, unpredictable interactions between load-balancing and other reactive mechanisms could lead to dynamic instabilities or "meltdowns,"' Yale University researcher and assistant computer science professor Bryan Ford wrote in the paper. Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis."

93 comments

  1. This is why you cloud your cloud... by houstonbofh · · Score: 4, Insightful

    If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.

    Or, have your entire business totally dependent on someone else. (Sounds kinda scary that way, don't it?)

    1. Re:This is why you cloud your cloud... by girlintraining · · Score: 5, Funny

      If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.

      While we're at it, we should probably backup the internet too. You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.

      --
      #fuckbeta #iamslashdot #dicemustdie
    2. Re:This is why you cloud your cloud... by c0lo · · Score: 4, Funny

      You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.

      Heh... the real thing crashed long ago; you are now using the backup.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    3. Re:This is why you cloud your cloud... by flonker · · Score: 4, Informative
    4. Re:This is why you cloud your cloud... by martin-boundary · · Score: 4, Insightful

      There's a limited number of cloud hardware providers on the internet, and the rest are middle men. It's useless to diversify yourself on the middle men, they will all be affected when the common underlying hardware provider has an issue. Thus there's a limit to the reliability that can be achieved, irrespective of how much mixing and matching is performed at the "business end".

      Diversification only "works" when the alternatives are provably independent. That's not true in a highly interconnected and interdependent world, which is TFA's point, I believe.

    5. Re:This is why you cloud your cloud... by im_thatoneguy · · Score: 4, Informative

      That's one of the problems though that the researcher is flagging.

      1) If a company has one instance on AWS and one on Azure and AWS fails... Azure suddenly doubles in load (and also fails due to everybody piling on unexpectedly).

      the other being:

      2) Everybody uses Azure for SQL and AWS for hosting and Azure goes down... suddenly SQL dies and the AWS hosts all fail with the database down. Or the converse happens and AWS goes down and the SQL is useless without a head.

      The more services you rely on, the more likely that on any given day one of them will be down. If you have 99% reliability and 20 services that you depend on (without any redundancy) then your failure rate could be up to 20% (about 18% if the failures are independent), since any one of the 1% failures could kill your service.

      It's interesting but it seems like most of the cloud failures have been due to #1 internally so far. One sector fails and in an effort to load balance it starts taking out its peers who then also overload and take out their peers.
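The arithmetic in the comment above can be checked with a short sketch. It assumes every dependency fails independently and that the service needs all of them up at once; the function names are hypothetical and only illustrate the calculation.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Probability that all n independent dependencies are up at once."""
    return per_service ** n_services


def union_bound_failure(per_service: float, n_services: int) -> float:
    """Worst-case failure rate: the per-service failure rates simply add up."""
    return min(1.0, n_services * (1.0 - per_service))


if __name__ == "__main__":
    up = chain_availability(0.99, 20)
    print(f"all 20 services up: {up:.3f}")
    print(f"failure rate if independent: {1.0 - up:.3f}")
    print(f"upper bound on failure rate: {union_bound_failure(0.99, 20):.3f}")
```

The "up to 20%" figure is the upper bound; under the independence assumption the failure rate comes out a little lower, because some of the 1% outages overlap.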

    6. Re:This is why you cloud your cloud... by Anonymous Coward · · Score: 0
    7. Re:This is why you cloud your cloud... by Anonymous Coward · · Score: 0

      It's time to turn to Cloud Based Raid Array Digital Interface Optimization

      a.k.a. CB Radio

    8. Re:This is why you cloud your cloud... by Gerzel · · Score: 2

      That's why you use a backup other than the cloud. If you can back up across clouds, you can almost certainly back up across real hardware and the cloud, using both at the same time.

      The cloud is great for providing extra power and computing resources cheaply, but real hardware and servers owned by your company also still serve a vital role. One can be a backup to the other, and both can be utilized, sharing the load at normal times.

      Thus you don't have to pay for the full hardware costs of what you use by offloading some of it to the clouds.

    9. Re:This is why you cloud your cloud... by Anonymous Coward · · Score: 0

      Don't be a clown in the clowd. You can run trivial shit in the clowd but not mission critical infrastructure. I don't have critical services, I have critical networks. I can and do swap them out. I can't do that with the clowd clowns. I've already had one clowd provider go dark for 15 minutes and take several hours to get fully back online. Yes, we had backup resources and were lucky one of six worked. Clowds are for clowns.

    10. Re:This is why you cloud your cloud... by siddesu · · Score: 1

      You can always run freenet on your tablet, it will back up the important parts for you ;)

    11. Re:This is why you cloud your cloud... by RKBA · · Score: 1

      While we're at it, we should probably backup the internet too. You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.

      Internet Archive

    12. Re:This is why you cloud your cloud... by sgtrock · · Score: 2

      ...I also noticed that my harddisk on "linux.cs.helsinki.fi" (which is where I keep the primary development sources) seems to be going, so keep your fingers crossed. I thought I'd better upload what I have now, rather than notice that I lost everything when I get back to work on Monday..

      (Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it ;)

      Linus Torvalds, July 20, 1996

    13. Re:This is why you cloud your cloud... by AmberBlackCat · · Score: 1

      That kind of sounds like distributed computing, which is what the cloud was before corporations picked up on the word and applied it to what they wanted to sell.

    14. Re:This is why you cloud your cloud... by k8to · · Score: 1

      It's worse than this.

      Most cloud services are built out using a significant number of other cloud services. That's the "upside" of being in the cloud -- you can use software/platform as a service to reduce the management/overhead costs of building out all that infrastructure yourself. So you can use service X for credit card running, and service Y for user support, and service Z for indexing and search, and so on.

      A modern cloud offering might use 10 or more other services. And those offerings may be using a list of other cloud services. And so on.

      This is where meltdowns become highly plausible.

      --
      -josh
    15. Re:This is why you cloud your cloud... by Anonymous Coward · · Score: 0

      It's worse than this.

      Most cloud services are built out using a significant number of other cloud services. That's the "upside" of being in the cloud -- you can use software/platform as a service to reduce the management/overhead costs of building out all that infrastructure yourself. So you can use service X for credit card running, and service Y for user support, and service Z for indexing and search, and so on.

      A modern cloud offering might use 10 or more other services. And those offerings may be using a list of other cloud services. And so on.

      This is where meltdowns become highly plausible.

      I think that's my cue: The Obligatory XKCD!

    16. Re:This is why you cloud your cloud... by Shagg · · Score: 1

      If you really have a critical service then you're not going to be putting it on "the cloud" anyway.

      --
      Unix is user friendly, it's just selective about who its friends are.
  2. XKCD by Shadyman · · Score: 3, Funny

    XKCD (jokingly) saw this coming a while ago: http://xkcd.com/908/

    1. Re:XKCD by Anonymous Coward · · Score: 1

      XKCD (jokingly) saw this coming a while ago: http://xkcd.com/908/

      It's rather chilling. Imagine for a moment a configuration where Cloud A hosts an online auction website. It automatically converts bids into the user's chosen currency. 'A' also hosts a service which calculates the current value of a few obscure currencies. Add in a Cloud B which hosts that auction site's currency exchange information service and is written to check the exchange rate of, for example, bitcoins, every hour. Also on 'B' is a service that aggregates various auction sites for best deals. Assume any part of this is badly written, and may break when unable to get a pong on a value. If either cloud hosting provider goes down, up to four services stop working, despite each only hosting two. This problem is only magnified and made much more complicated when dealing with a heterogeneous environment, with many cloud solutions and multitudes of apps and services being hosted within each. More and more, we are becoming dependent on CDNs and third party APIs to help streamline and improve the web browsing experience. It's just so easy to link against, say, jquery mobile. They even recommend it, "It’ll likely be the fastest way to include jQuery Mobile in your site."

      I can easily imagine the future possibility of a single service having an issue that cascades into a tidal wave of problems and broken apps/software/websites.
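The two-cloud scenario above can be sketched as a small dependency walk. The service names and the exact dependency edges are one hypothetical reading of the comment (the comment does not spell out every edge), and the sketch assumes each service breaks as soon as anything it depends on is broken.

```python
# Hypothetical layout: which cloud hosts which service.
HOSTED_ON = {
    "auction": "A",
    "obscure_rates": "A",
    "exchange_info": "B",
    "aggregator": "B",
}

# Hypothetical edges: service -> services it needs in order to work.
DEPENDS_ON = {
    "auction": ["exchange_info"],
    "exchange_info": ["obscure_rates"],
    "aggregator": ["auction"],
    "obscure_rates": [],
}


def broken_services(down_cloud: str) -> set:
    """Services that stop working when one cloud provider goes down."""
    broken = {svc for svc, cloud in HOSTED_ON.items() if cloud == down_cloud}
    changed = True
    while changed:  # keep propagating until no new service breaks
        changed = False
        for svc, deps in DEPENDS_ON.items():
            if svc not in broken and any(dep in broken for dep in deps):
                broken.add(svc)
                changed = True
    return broken


if __name__ == "__main__":
    for cloud in ("A", "B"):
        print(cloud, sorted(broken_services(cloud)))
```

With these assumed edges, losing Cloud A takes out all four services and losing Cloud B takes out three, even though each cloud hosts only two.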

  3. impossible idea. by Spinalcold · · Score: 2

    we live in an age where information is distributed, even if statistical. (hell I made a fake Facebook account and somehow they found my mom, and she is nowhere close to me) a meltdown of information can't happen unless there is a worldwide meltdown of power. we have backups, but also ways of statistically restoring those backups.

    1. Re:impossible idea. by c0lo · · Score: 3, Insightful

      we live in an age where information is distributed, even if statistical. (hell I made a fake Facebook account and somehow they found my mom, and she is nowhere close to me) a meltdown of information can't happen unless there is a worldwide meltdown of power. we have backups, but also ways of statistically restoring those backups.

      Redundancy helps but it is not bullet-proof. A good chunk of it is the "topology" in which this redundancy is engaged in the event of a failure (e.g. we have had cascading blackouts in the past even though the energy network had enough total power to serve all consumers).

      Have a look at cascading failures.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    2. Re:impossible idea. by Gerzel · · Score: 1

      No no it CAN happen. It might not be likely at any given moment but it is well within the range of possibility and that possibility grows larger every day steps are not taken to minimize it.

    3. Re:impossible idea. by Spinalcold · · Score: 1

      ok, I agree, it can happen, but the chance is on a logarithmic scale. So, a huge failure is unlikely but it would be...well huge. Data loss is inevitable, why not put things in many areas at once? The 'earthquakes' are easier to deal with for the 'country' but for the individual, they should invest in long term 'earthquake control'. That's probably a HORRIBLE analogy, but it's all I could come up with.

    4. Re:impossible idea. by Gerzel · · Score: 0

      I don't think it is even that small of a chance, certainly not on the logarithmic scale. Remember while the chance of failure for any one instance may be very low you have to take the chances over the whole range of instances.

    5. Re:impossible idea. by Spinalcold · · Score: 1

      But that's the point, it's layers of redundancy. You'd have to have failures across not just one center but ALL centers simultaneously. The chance you get of that is the chance of each one going down multiplied together (.1*.1*.1*n).
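The multiplication above only holds if the centers fail independently. A rough sketch of both cases follows; the common-cause term `c` (one shared event that takes out every center at once) is an assumed toy model, not something from the article, and the function names are hypothetical.

```python
def all_down_independent(p: float, n: int) -> float:
    """Chance that n independent centers are all down at the same time."""
    return p ** n


def all_down_with_common_cause(p: float, n: int, c: float) -> float:
    """Same, but with probability c a shared event takes every center down."""
    return c + (1.0 - c) * p ** n


if __name__ == "__main__":
    print(all_down_independent(0.1, 3))
    print(all_down_with_common_cause(0.1, 3, 0.01))
```

Under this toy model the shared-failure term dominates quickly: adding more centers shrinks the independent part, but the result never drops below `c`.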

    6. Re:impossible idea. by Gerzel · · Score: 1

      I thought the article mentioned that layers of redundancy were NOT being used.

  4. The analogy the author uses doesn't work. by stephanruby · · Score: 4, Insightful

    The analogy the author uses doesn't work.

    A better analogy would be the airline industry. The airline industry likes to over-book airplane seats it may not have because it's always trying to optimize its profit-margin.

    The same will happen with cloud-services. Cloud-services will always try to optimize their own profit-margins, at the risk of triggering significant outages.

    And I don't see what this has to do with the financial crisis at all.

    1. Re:The analogy the author uses doesn't work. by pitchpipe · · Score: 4, Insightful

      A better analogy would be the airline industry.

      I think a better analogy is the power grid. System hits a peak, one line goes down, others try to compensate becoming overloaded, another can't handle the load and goes down, and behold: cascading failures.

      --
      Look where all this talking got us, baby.
    2. Re:The analogy the author uses doesn't work. by c0lo · · Score: 2

      And I don't see what this has to do with the financial crisis at all.

      The insurance/reinsurance and CDO schemes in finance resemble "fail-over redundancy activation" in the cloud. Enough complexity and nobody can predict what can happen until it actually happens - see cascading failure.

      --
      Questions raise, answers kill. Raise questions to stay alive.
    3. Re:The analogy the author uses doesn't work. by TubeSteak · · Score: 3, Informative

      And I don't see what this has to do with the financial crisis at all.

      FTFA

      New cloud services may arise that essentially "resell, trade, or speculate on complex cocktails or 'derivatives' of more basic cloud resources and services, much like the modern financial and energy trading industries operate," he wrote.

      Each of these various cloud components are often maintained and deployed "by a single company that, for reasons of competition, shares as few details as possible about the internal operation of its services," Ford added.

      As a result, the cloud industry could find itself "yielding speculative bubbles and occasional large-scale failures, due to 'overly leveraged' composite cloud services" with weaknesses that don't become known "until the bubble bursts," Ford wrote.

      The metaphor more or less fits, except for the part that ignores how a lot of what happened during the financial crisis was outright fraud perpetrated by lenders.

      The potential mess with the cloud is not about fraud, just about excessive dependencies.

      --
      [Fuck Beta]
      o0t!
    4. Re:The analogy the author uses doesn't work. by plover · · Score: 3, Insightful

      I think by "financial crisis" he meant "a minor market crash due to autotrading algorithms", and not the real crisis being caused by thieves running trillion dollar banking, mortgage, and insurance scams.

      The point is "if you use similar automated response strategies as a large set of other similar entities, you could all suffer the same fate from a common cause."

      Supposedly a market crash was triggered by autotrading algorithms that all tended to do exactly the same thing in the same situations. So when the price of oil shot up (or whatever the trigger was) then all those algorithms said "sell". As all the sell orders came in, the market average dropped, and the next set of algorithms said "sell moar". So there was a cascade because so many systems had identical responses to the same negative stimulus. Think of those automated trades as being akin to a "failover" IT system: if host X is failing, automatically shift my service load this way.

      So that's the analogy the author is trying to make with respect to systems that depend on automated recovery machinery like load balancers: if response time is too high at hosting vendor X, my automated strategy is to failover to hosting vendor Y. And perhaps 500 large sites all have a similar strategy. Now let's say that vendor X suffers a DDoS attack because they host some site that pissed off Anonymous. So now all these customer load balancers see the traffic slowing down at X, and they simultaneously reroute all app traffic to vendor Y in response. Vendor Y then gets hammered due to the new load, and the load balancers shift the traffic elsewhere. Now two main hosting providers are down while they try to clean up the messes, and the several smaller providers are seeing much bigger customers than usual using them as tertiary providers, and they start straining under the load as well, causing their other clients to automatically shift.

      And if that isn't exactly what plays out next year, might not something similar happen with payment gateways, or edge content delivery systems, or advertising providers?

      It's a cascade of failures due to automated responses that's remarkably similar to the electrical grid overloads that caused the northeast coast blackout in 2003. The author's point is "we don't know precisely what bad thing might happen within this particular ecosystem, but there is significant risk because we've seen complex interdependent systems have similar failures before."
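The failover cascade described above can be played out in a toy simulation. It assumes that a failed provider's load is split evenly among the survivors and that any provider pushed past its capacity fails in the next round; the provider names, capacities and loads are made up for illustration.

```python
def simulate_cascade(capacity: dict, load: dict, first_failure: str) -> list:
    """Return provider names in the order they fail."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    alive = set(capacity)
    failed_order = []
    to_fail = [first_failure]
    while to_fail:
        orphaned = 0.0
        for name in to_fail:
            alive.discard(name)
            failed_order.append(name)
            orphaned += load.pop(name)
        if not alive:
            break
        share = orphaned / len(alive)  # every survivor takes an equal share
        for name in alive:
            load[name] += share
        to_fail = sorted(n for n in alive if load[n] > capacity[n])
    return failed_order


if __name__ == "__main__":
    capacity = {"X": 100, "Y": 100, "Z": 80, "W": 80}
    busy = {"X": 80, "Y": 80, "Z": 40, "W": 40}
    quiet = {"X": 50, "Y": 50, "Z": 20, "W": 20}
    print("busy day: ", simulate_cascade(capacity, busy, "X"))
    print("quiet day:", simulate_cascade(capacity, quiet, "X"))
```

In this sketch the same initial failure is harmless on the quiet day and takes out every provider on the busy one; the only difference is how much headroom the survivors had.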

      --
      John
    5. Re:The analogy the author uses doesn't work. by khallow · · Score: 2

      The point is "if you use similar automated response strategies as a large set of other similar entities, you could all suffer the same fate from a common cause."

      Systemic risks like this were also present in the real crisis. I think the primary problem here is simply that the risks aren't well understood and that users and suppliers of cloud services are likely to make unwarranted assumptions (or even warranted assumptions that get invalidated when the infrastructure gets stressed in certain ways). There's also the possibility for tragedy of the commons problems on a global scale. For example, if requests for DNS (it's a bit low level for a cloud service, just using a concrete service for this example) occasionally get bounced, one might be tempted to issue several such requests in order that one gets through. If everyone does that, then the number of requests to DNS has just gone up by an unknown but hefty factor. If DNS was already stressed, then this collective behavior might push it into failing completely.
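The retry feedback loop in the DNS example can be sketched with a toy model. The assumptions are mine, not the commenter's: the service drops whatever exceeds its capacity, every client retries a dropped request up to a fixed number of attempts, and the loop is iterated until the offered load settles.

```python
def offered_load(base_demand: float, capacity: float,
                 max_attempts: int, rounds: int = 50) -> float:
    """Requests per second hitting the service once retries are included."""
    offered = base_demand
    for _ in range(rounds):
        # Fraction of requests dropped at the current offered load.
        drop = max(0.0, 1.0 - capacity / offered)
        # Attempt i is only made if the previous i attempts were all dropped.
        attempts = sum(drop ** i for i in range(max_attempts))
        offered = base_demand * attempts
    return offered


if __name__ == "__main__":
    print(offered_load(900.0, 1000.0, max_attempts=3))   # under capacity
    print(offered_load(1100.0, 1000.0, max_attempts=3))  # 10% over capacity
```

Below capacity nothing is dropped and the retries never kick in. In this model a 10% overload is enough for the retries to inflate the offered load well beyond the original demand, which is the collective behavior the comment warns about.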

    6. Re:The analogy the author uses doesn't work. by Gerzel · · Score: 1

      The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running.

      They don't need full capacity capabilities, but even a small amount of capability can keep their services up, if slowed, rather than a full crash, mitigating costs when things do go wrong.

      Physical hardware also provides a way out of a service provider.

      Physical hardware also requires physical staff, but that is the downside of in-sourcing. The downside of outsourcing (which is what the clouds are) is that you don't have the capabilities, and when your outsourced provider craps out you are eventually SOL.

    7. Re:The analogy the author uses doesn't work. by turbidostato · · Score: 2

      "The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running."

      Regarding services, what's the real difference when using my own hardware? I think Amazon owns its own hardware too.

      "They don't need full capacity capabilities, but even a small amount of capability can keep their services up, if slowed"

      Slashdot effect? For so many services, if you can't go full capacity, you don't serve at lower speed, you just don't work, full stop.

      And it is not even a cloud issue. It's not the first time I advised a customer not to go with an active/active scenario for high availability but active/hot standby instead, and my advice was rejected because they didn't want to invest in a spare "doing nothing". Of course, the first time a failure pushed the full load onto the remaining node, which couldn't stand the overload and failed too, they started thinking otherwise.
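The active/active trap above comes down to one line of arithmetic. A minimal sketch, assuming identical nodes that share the load evenly (the function name is hypothetical):

```python
def survivor_utilization(nodes: int, utilization: float, failed: int = 1) -> float:
    """Utilization of each surviving node after `failed` nodes drop out."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes left to carry the load")
    return utilization * nodes / survivors


if __name__ == "__main__":
    # Two active nodes at 70% each: the survivor would need 140% capacity.
    print(survivor_utilization(2, 0.70))
    # Two active nodes at 45% each: the survivor lands at 90% and copes.
    print(survivor_utilization(2, 0.45))
```

An active/active pair only survives a failure if each node normally runs at 50% or less, which is the same amount of "idle" hardware as a hot standby.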

    8. Re:The analogy the author uses doesn't work. by im_thatoneguy · · Score: 2

      I think by "financial crisis" he meant "a minor market crash due to autotrading algorithms", and not the real crisis being caused by thieves running trillion dollar banking, mortgage, and insurance scams.

      You're right about the cascading failure but wrong about the financial crisis. The larger financial crisis was a crisis because banks had circular loans and insurance on one another. So if one bank failed it would suddenly stress the next bank to the point of failure and bankruptcy which would trigger another bank to fail and so on and so forth. What we had in 2008 was a cascading financial failure because everybody was insuring everybody else assuming that everybody wouldn't fail simultaneously.

    9. Re:The analogy the author uses doesn't work. by Anonymous Coward · · Score: 0

      Also, in the wake of the economic crisis, they discovered that the money they thought was there simply wasn't.

      You can always call Dell for some more CPU if the cloud goes bankrupt and can't lend any more CPU power.

    10. Re:The analogy the author uses doesn't work. by Anonymous Coward · · Score: 1

      Yep. The power grid is a good example -- especially the wet dreams of a 'smart grid'. Particularly since one of the conclusions from the 2003 blackout was that the management complexity of the power grid was a principal cause of the cascade -- i.e. the potential interactions of all the interconnects transcended our understanding. KISS is still the watchword of the day -- as Scotty once said 'the more you complicate the plumbing the easier it is to plug up the works'. Do the math sometime on the likelihood that a connected system, with each part having a low failure rate, will be up -- it's pretty depressing. The miracle, actually, is that anything works at all.

    11. Re:The analogy the author uses doesn't work. by houstonbofh · · Score: 1

      "The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running."

      Regarding services, what's the real difference when using my own hardware? I think Amazon owns its own hardware too.

      Mega-upload also owned its own servers. And they are not the only cloud provider to have hardware inappropriately seized. You do not have control of someone else's hardware.

    12. Re:The analogy the author uses doesn't work. by turbidostato · · Score: 1

      "Mega-upload also owned its own servers. And they are not the only cloud provider to have hardware inappropriately seized. You do not have control of someone else's hardware."

      Maybe, but that's not the point you are making with your example.

      As you already said, megaupload *owned* their servers. If this case shows anything, it is that you can't control your own hardware either.

    13. Re:The analogy the author uses doesn't work. by TubeSteak · · Score: 1

      You and the GP are both incorrect.
      TFA did not mean "a minor market crash due to autotrading algorithms," which the GP would know if they had RTFA.

      And the larger financial crisis was not about circular loans. It was about 5 banks that were wildly overleveraged*
      when the housing bubble popped and their losses were magnified between 30:1 and 40:1 instead of the industry standard 12:1.
      This disaster devalued their housing holdings, which devalued everyone else's housing holdings, which fucked everything else, everywhere else, all at once.

      *which TFA mentions

      --
      [Fuck Beta]
      o0t!
    14. Re:The analogy the author uses doesn't work. by CimmerianX · · Score: 1

      No - you can't control when the government seizes your hardware. But the point here is that if you use the 'cloud' as your storage or sole backup, you're trusting your data to another service.

      Unlike Megaupload, which stored other people's data and allowed 'sharing', if you 'own' your own servers, you control the data, the backups, can write backups to out of state servers you also own....

    15. Re:The analogy the author uses doesn't work. by Anonymous Coward · · Score: 0

      And as usual at the source of every problem is an MBA. That whole department needs an overhaul, e.g. more ethics and more long term planning.

    16. Re:The analogy the author uses doesn't work. by plover · · Score: 1

      The autotrading event was the trigger, but not the cause of the disaster. By itself, the autotrading crash would have been a minor event. The root cause of the disaster was the thieving bank scams, all of them together, including the housing market, the overextended banks, the deregulated investments made by the insurance companies, the insanely complex derivatives that spat out profits but had ultra high risks built in, all of that together was the real cause.

      If you have a barrel full of gunpowder, and you're examining it closely with a lighted candle, it's hardly the candle's fault if it explodes.

      --
      John
  5. Dupe Meltdown by Anonymous Coward · · Score: 0
  6. Low hanging fruit of a research piece by mcrbids · · Score: 4, Interesting

    Efficiency normally comes with economies of scale. As a partner in an outsourced vertical software company, we have hundreds of clients running in our highly tuned hosting cluster, and are able to bring economies of scale to an otherwise ridiculously expensive software niche. Yes, that means that if we have an outage, all of our clients experience an outage as well.

    However, we have carefully laid plans for multiple recovery points in a disaster scenario, (Plan B, Plan C, Plan D, etc) and have maintained an uptime significantly better than our clients would typically attain if left to their own devices. We easily manage close to 4 nines of uptime in an industry where the average is realistically around 2 nines. (having "the computer is down" a day or two every year or so is typical)
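For reference, "nines" translate into yearly downtime as follows. This is a small sketch that assumes a 365-day year and counts all downtime equally; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (2 -> 99%)."""
    unavailability = 10.0 ** (-nines)
    return unavailability * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (2, 3, 4):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Two nines allows roughly three and a half days of downtime a year, while four nines allows a bit under an hour.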

    Although the Internet is a "network of ends" the truth is that not all ends are created equal. Having a high quality, high speed (100 Mb), reliable (99.99%+) Internet feed in my small-ish hometown of around 80,000 people is ridiculously expensive. But in a nearby city (500,000 people 2 hours' drive) we host our servers in a tier 1 colo at 1/10th the cost of running it all ourselves, with dramatically improved reliability and network performance.

    Yes, putting all your eggs in one basket means that if that basket fails, you lose all your eggs. But it also makes it easy to buy just one, really nice basket that won't break and lose your eggs.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Low hanging fruit of a research piece by jtownatpunk.net · · Score: 2

      Yes, putting all your eggs in one basket means that if that basket fails, you lose all your eggs. But it also makes it easy to buy just one, really nice basket that won't break and lose your eggs.

      That's a great analogy until someone throws a weasel in your well-crafted basket.

    2. Re:Low hanging fruit of a research piece by blind+biker · · Score: 1

      Efficiency normally comes with economies of scale. As a partner in an outsourced vertical software company, we have hundreds of clients running in our highly tuned hosting cluster, and are able to bring economies of scale to an otherwise ridiculously expensive software niche. Yes, that means that if we have an outage, all of our clients experience an outage as well.

      Your post implies that you have a single location. How do your customers feel about that - if they're even aware of it?

      --
      "The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
  7. Technology.... by Anonymous Coward · · Score: 1

    Never turns on its makers. Never. This story is bullshit. Technology is a tool. I treat it like a tool. I control it.

    Now, who's up for another drink?

    1. Re:Technology.... by Narnie · · Score: 2

      I'd like to get a drink with you sir, but then I would feel like a tool.

      --
      greed@All_Evils:~#
  8. A cloud meltdown? by Anonymous Coward · · Score: 0

    Sounds to me that you have a mushroom cloud.

  9. if they actually do this - they're stupid by Karmashock · · Score: 3, Interesting

    Systems need to be compartmentalized or have redundancies built into them.

    For example, I have several systems that send automated emails. I've had a problem in the past of given email servers not accepting or sending messages. It's uncommon but it happens and it's not acceptable. These are mission critical systems. They can't fail.

    Solution? Redundancy up the wazoo. The way it's set up now so many things would all have to happen at the exact same moment that the only way the system is likely to fail is if we fight world war 3... and lose.

    That is how you solve this problem. Don't rely on any one system. Rely on all of them. Once you figure out how to integrate one of them it's typically easier to integrate the rest. The virtues of this approach are manifest. Not just stability but if the services do processing or data retrieval you can cross reference them to find errors in databases or get a more complete data set than exists in any one source.

    I mean is google or bing the best search engine? What about both at the same time?

    --
    I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    1. Re:if they actually do this - they're stupid by Anonymous Coward · · Score: 1

      Umm, no. Doing what you say will result in a catastrophic failure. You and X percentage of companies are on CloudA. One of those companies gets a massive DNS attack and CloudA can't handle it along with their normal loads (they oversell just like net providers, both lined and wireless telephone services, and airlines). CloudA goes down (or it forwards to CloudB). You and some of X move your loads to CloudB. CloudB now has way more load than it expected. They've never had so much of CloudA's load before. The other companies on CloudA move to other Clouds, increasing those loads somewhat. One of these smaller Clouds can't take the extra load and goes down. Again the companies redistribute to the remaining Clouds. These Clouds again see their loads increase and again the weakest Cloud dies. This pattern will repeat until every cloud service is offline.

      Can the largest cloud server handle all the cloud computing in the world? Why would the MBAs let so much hardware sit idle (and thus wasting money)? They would require services to be oversold to keep loads at an acceptable profit level. Peak demand be damned. It'll be cheaper to offload our peak demands to another cloud instead of maintaining hardware that only gets used once a year (the exact reason to go to the cloud in the first place).

      Everything will happen much faster if a larger cloud goes down. The smaller ones will be clobbered with massive loads and will all go down, especially the ones who had planned on offloading their extra work onto the downed larger cloud. There's no way to know what cycles the cloud companies have gotten themselves into. CloudA might offload to CloudB, CloudB uses CloudC and CloudD, CloudC uses CloudA, and CloudD uses CloudC (A->B->C/D->C->A...)
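If the offload relationships were actually known, a loop like A->B->C->A could be found mechanically. A minimal sketch using depth-first search over the example above; the graph is just the one from this comment, and in practice the providers do not publish this information.

```python
def find_cycle(graph: dict):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:  # dep is already on the current path: a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    offloads_to = {
        "CloudA": ["CloudB"],
        "CloudB": ["CloudC", "CloudD"],
        "CloudC": ["CloudA"],
        "CloudD": ["CloudC"],
    }
    print(find_cycle(offloads_to))
```

The hard part is not the algorithm but the input: as the comment says, nobody outside each provider knows what the real graph looks like.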

      Keep your data local and be able to maintain minimal operating levels with that data. Feel free to use the cloud for everything above that. The cloud is a silver bullet; aimed right at your company.

      err, I misunderstood some of your post but I already wrote all of the above. I'll post it anyway because it's still correct, just not a great reply. Sorry.

    2. Re:if they actually do this - they're stupid by happyhamster · · Score: 1

      Redundancy in common sense would only solve hardware issues. You can have multiple servers, multiple network connections, multiple power solutions. What about software? If you run the same mailing software on all servers, a bug or vulnerability in the mailer would happily bring down all your servers. Same with bugs and vulnerabilities in operating systems running on the servers, and other software. Unless you run several different mailer systems, operating systems, etc, and they all synchronize between themselves (e.g. no emails get lost, no emails sent more than once etc.), redundancy as it's commonly practiced is not the solution to cloud problems.

    3. Re:if they actually do this - they're stupid by Karmashock · · Score: 1

      Why can't I use several competing cloud systems that do the same thing? What they're talking about here is powerful cloud systems that depend on each other. So if one cloud goes down it causes a chain reaction of failure.

      But if every system can use two or three different sources for everything then it doesn't need any specific cloud to be running so long as most of them are running.

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    4. Re:if they actually do this - they're stupid by NeutronCowboy · · Score: 1

      Because at some point the ROI isn't there. It's a common problem actually. Everybody knows how to make things redundant - triply, quadruply, etc. The problem is that no one is willing to pay for that kind of redundancy. The business doesn't, the clients don't, and you sure as hell aren't paying for it out of your own pocket. So you rely on failover mechanisms that are generally doubly redundant, or at least that rely on a large number of inexpensive machines. On top of that, you craft as clever a process as you can.

      And then you discover that there is a cascade effect you didn't consider, or that you did consider but didn't have the money to build for. And that's when things go to hell.

      --
      Those who can, do. Those who can't, sue.
    5. Re:if they actually do this - they're stupid by ultranova · · Score: 1

      For example, I have several systems that send automated emails.

      Didn't those switch to globally distributed clouds years ago? Ones composed mostly of unpatched Windows machines, if I understood correctly.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    6. Re:if they actually do this - they're stupid by Karmashock · · Score: 1

      Why would a DDoS attack cause a chain reaction?

      First off, the cloud is especially resistant to DDoS attacks. Ask Amazon. They've designed their systems specifically to reduce the effectiveness of that sort of attack. And as systems become larger they become harder to hit with DDoS attacks. You might as well try to DDoS a root DNS server. Have fun with that.

      Furthermore, why would ALL systems route to the same alternate cloud provider? Rather than everyone going from A to B, what you'd actually see is some going to B, some going to C, some going to D-Z. You're not going to see everyone go from one to the next. Which means that rather than load doubling on B you'll see a 5 percent increase in load on B-Z. That's an exaggeration. There will be favorites, so some service might get a 15 percent increase while some would see a 1 percent increase. But you're not going to get some perfect domino effect.

      It all boils down to how many reasonable IF/THEN subroutines you've built into the system. In the emailing system I have, every time an email is sent a check is done on the email server to make sure it is working. If it isn't, the email is routed to another server in a list. Then there is a check to make sure the email got through. If it didn't, there are additional subroutines that go through a trial and error process, working through the most likely reasons for a failure.
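
      Roughly the shape of that check-then-route logic, as a sketch only. The `is_healthy` and `send` callables are hypothetical stand-ins for whatever health check and delivery confirmation a real system would use:

```python
def send_with_failover(message, servers, is_healthy, send):
    """Try each server in order; return the name of the one that delivered."""
    for server in servers:
        if not is_healthy(server):
            continue                # health check failed, try the next one
        try:
            if send(server, message):
                return server       # delivery confirmed
        except OSError:
            pass                    # treat a transport error as a failed send
    raise RuntimeError("no server accepted the message")
```

      The trial and error pass over likely failure causes would hang off the `RuntimeError` at the end; it is left out here.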

      The result is that emails get through. Always. The only thing I don't have control over is the receiver's server. If that stops working there isn't anything I can do about that. But short of that, the email gets through.

      Everything is logged. Everything is cross referenced. Everything is added to spreadsheets and turned into lovely little graphs.

      The mail gets delivered. Period. No excuses.

      And with this cloud computing there are ways to do the same thing. Redundancy and compartmentalization.

      You want to make sure that a failure in one system can't cause a failure anywhere else. And you make sure that if any system fails there are backups upon backups upon backups.

      The system fails if the whole network drops. If any portion of the network starts working again, all jobs can be routed to that portion of the network and everything continues.

      I've seen multiple failures before. We had an issue not long ago where many companies decided to do maintenance at the exact same time. Their systems all went offline for a couple of hours. The system automatically shifted from unresponsive systems to responsive systems and there was no disruption even though upwards of 40 percent of the systems were not responding.

      This is how you do it.

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    7. Re:if they actually do this - they're stupid by Karmashock · · Score: 1

      It's all about how you design it. It isn't just redundancy. It's compartmentalization. Given systems are going to fail. You need to set it up so they can fail totally without it affecting anything else. The redundancy, especially in an enterprise organization, is a requirement.

      A lot of people looked at the cloud as a way to save a lot of money on computers etc. For mission critical applications that's the wrong attitude. Instead, you should look at it as an opportunity to make the system more robust. Take the per system cost savings and put them towards redundancy. A cloud system is a tenth the cost of an in-house system by our calculations. The cost of the system isn't really important to us since it's nominal in the scheme of things. So we took the savings and put them towards making the system more robust.

      We joke that it's military grade at this point though honestly it's probably well beyond anything you see out of the military. We built everything with the assumption that Murphy was not our friend. A cascade failure is only going to happen if you don't have any redundancy. And in that case you deserve the result.

      No pity.

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    8. Re:if they actually do this - they're stupid by Karmashock · · Score: 1

      I can't speak to what remote systems outside my control are doing. But we could tolerate 95 percent failure and still operate at 100 percent efficiency.

      As I said, we have a great deal of redundancy built into it.

      We've seen failure rates as high as 40 percent. Only for an hour or so... It had no effect on us.

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
  10. just like mainframes by Dan667 · · Score: 4, Insightful

    I think it is funny that lessons learned years ago with mainframes are being presented as new by just changing the word mainframe to cloud.

    1. Re:just like mainframes by StormReaver · · Score: 1

      I think it is funny that lessons learned years ago with mainframes are being presented as new by just changing the word mainframe to cloud.

      I wish I could moderate you higher than +5. I got into computing in the mid 80's, when home computers were popular, and never had to deal with mainframes. However, I know enough about computing history to see exactly how absurd this entire "cloud" computing fiasco is becoming. And it is going to exactly follow the same curve that mainframes followed, until "suddenly" the concept of having your own computing resources is going to be "new and exciting" again.

      P.T. Barnum would be proud.

    2. Re:just like mainframes by houstonbofh · · Score: 1

      So you worked at an Application Service Provider (ASP) at the turn of the millennium too?

      I think this is why the 40+ crowd has trouble getting work in some of the "New Internet" businesses. We were at the old Internet businesses last time and remember how the Kool-Aid tasted then.

  11. In other words by Anonymous Coward · · Score: 2, Insightful

    Unmanaged systems are hard to manage.

  12. So by mikes.song · · Score: 1

    Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis.

    Cloud computing is like fractional reserve accounting, with artificially low interest rates?

  13. UNISEX HotCloud? by Anonymous Coward · · Score: 0

    Sounds like a hair salon.

  14. Nightmare scenario has already happened by dbIII · · Score: 3, Insightful

    It's a leap year, February 28, and all over the world, completely out of the blue (or azure if you prefer) cloud clusters crash as the local clocks swing around to midnight, then stay down all day.
    Still, it's three nines of uptime when it's spread out over a few years :)
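
    For the record, the "few years" figure holds up. A back-of-the-envelope sketch, assuming one full day of downtime and otherwise perfect uptime:

```python
def days_to_average_out(outage_days, availability):
    """Days of total service time needed for `outage_days` of downtime
    to still meet the given availability (e.g. 0.999 for three nines)."""
    return outage_days / (1.0 - availability)


days = days_to_average_out(1, 0.999)
print(round(days), "days, about", round(days / 365.25, 1), "years")
```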

    A highly interdependent system is only as reliable as the QC on the weakest link. Who would have thought that somebody from a company that had a lot of embarrassing press about a leap year stuffup would make such a stupid and obvious mistake four years later? That's the cloud, where even the biggest names still don't care anywhere near as much as you would about your own systems and so don't pay enough attention to detail.

    1. Re:Nightmare scenario has already happened by Anonymous Coward · · Score: 0

      That systemic failure is a bitch.

      However, I believe that demonstrated the system wasn't highly independent, and therein lies the issue. I really doubt it's a matter of caring or not. Often, you'll find these guys really do care about what they do or else it would just fall apart. No one gets up at 4am to respond to some godforsaken issue because they like the early morning pager-based wake-up call. More likely, mistakes are made because there are usually not enough hands in the cookie jar. That's typically what I see businesses do: run with a horrible level of safety in the guise of saving a few dollars. Never mind that any reasonably horrible failure will actually cost more than the equipment or software that could have mitigated the disaster. After a few of those incidents, companies tend to either go the way of the dodo or get smart. Suddenly, it's not worth risking issue X, and Y is created to handle it. (Maybe they get smart and start asking the tough questions on mitigating all the Z's too.)

      That's my entirely random and likely inapplicable thoughts.

  15. Jargon by tpstigers · · Score: 0

    Jargon, jargon, jargon, jargon. Jargon.

  16. with a difference by Ralph Spoilsport · · Score: 1

    Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis.

    The difference being, of course, that the global financial crisis was the product of the abyssal greed of speculators and the stupidity of venal governments borrowing from private banks instead of doing the right thing and being directly responsible for the creation of money.

    But other than that, sure it's just like it.

    (/snark)

    --
    Shoes for Industry. Shoes for the Dead.
    1. Re:with a difference by houstonbofh · · Score: 1

      Yeah! No IT company has any greed! They would never skirt the law! Oh, wait...

  17. Who uses the cloud for serious uptime? No one sane by Anonymous Coward · · Score: 1

    Using a public cloud seems sensible for low risk projects, or one off, large scale computations. The security and availability risk would suggest that anyone using the cloud for their entire infrastructure has either read too many brochures, or is about to do something else crazy, like divest their entire original business, and then hike service charges.

  18. The Octel Clusterfuck by Anonymous Coward · · Score: 0

    I was a sysadmin at Octel Communications back in the day. Octel invented voice mail; perhaps you've heard of it.

    When I hired on we had three Sun 3/280 servers. I think these were 68030 boxen, but they might have been '020s. They were primarily used for cross-compiling the homebrew RTOS that Octel's voice mail machines ran, but they were also used for Electronic Design Automation.

    There was a mysterious problem that from time to time would cause one of the servers to go to its knees for an hour or two, but not actually crash. Because all three machines were NFS hard-mounted on each other, as soon as one machine got stuck, they were all stuck. 250 engineers all got to sit on their hands while I contemplated whether I'd be a few inches short of a head by the end of the workday.

    I asked a colleague why we didn't soft-mount the NFS shares. That would allow a client of a hung server to time out. My colleague's reply was that, at the time at least, we couldn't count on our development tools to do the right thing if they got read or write errors during a build. It was felt that soft-mounting might lead to bad machine code generation.

    In the end it turned out that the hung servers were caused by high-capacitance serial cables. When a machine would emit "SunOS Login:", it would receive a capacitively coupled bunch of garbage back, which login would take as the username. Login would then prompt "Password:", and again receive garbage for the password "attempt". Each machine had 32 serial lines, some of them going hundreds of feet. Good thing I studied Physics and not Computer Science!

    The solution was to buy a big, long, expensive spool of serial cable that had lower capacitance per foot, as well as a bunch of RS-232 plug kits, and then to tear out and replace all the cable. That took some convincing to get the management to give me the budget and the time to do the work, but in the end all I required to convince my manager Karen Coates was to hook a glass TTY up to a scope.

    In Other News: I have been doing some study of security, and will have results to announce soon. These results will be digitally signed. Please use a keyserver to download my Public Key into your keyring. Please use nothing other than my key fingerprint; key emails and Key IDs can be spoofed:

    Fingerprint=9B9F 2D03 9996 AF83 9A4F CB26 20E8 0D0B F760 5786

    1. Re:The Octel Clusterfuck by houstonbofh · · Score: 1

      A digitally signed AC post... I am confused...

  19. Alternate headline: by VortexCortex · · Score: 2

    Researcher Observes Cloud Interactions, Predicts Lightning

  20. How is this different from the internet itself? by WOOFYGOOFY · · Score: 1

    Seriously. I don't get why this same description doesn't apply to the internet itself, a thing known to work reliably?

    Don't MAKE me RTFA.

  21. How to solve the debt crisis by Anonymous Coward · · Score: 0

    Host all the debt on the cloud, then pfft, gone!

  22. The focus on infrastructure as a service is flawed by Anonymous Coward · · Score: 0

    As long as you are focusing on infrastructure, or dealing with IaaS providers, you will be stuck thinking of all of the typical IT failure scenarios (systems, not people) but at a much larger scale. The future of cloud computing lies in two areas: Platform as a Service (PaaS), and changing how we write software (in that order).
    I don't work for Microsoft; I am talking about Azure specifically because this is our first implementation, but we plan on using other cloud providers as they mature to catch up with Azure.

    PaaS. http://en.wikipedia.org/wiki/Platform_as_a_service

    Systems like Azure (and, to a much smaller extent, AWS, although nobody uses it that way) move away from the 'myapp==this host' thinking and towards treating the cloud as if it is an OS overlaid on top of a very large compute fabric. In our deployments we have started rewriting all of our critical functions as worker roles within Azure. The worker roles are dispatched using cloud-native functions. We have roles for SQL, processing, BLOB (data) stores, etc. We have some fairly generic communication libraries we use to get them to work together, along with the native Azure functionality. We have several backend management instances which act as coordinating hubs to deploy, monitor, and manage all of the worker roles. This allows us to do several things. One is that any bottleneck can typically be isolated at a much finer level than you would get running a monolithic application stack, which allows us to duplicate roles that are getting overworked. This in turn gives us much finer control over scale, as we can run multiple roles on the same system, on different systems, whatever makes sense. For purposes of backup we have a very small, almost idle mirror setup in each of the different Azure data centers, with only the database being actively migrated (synced). If one data center were to go down, we could basically pick up in another data center and 'right size' the entire thing in a matter of minutes (at worst). All of this is routed to the users through two different CDNs, so there is no direct client-to-process connection.

    Anyhow, that is the route we are taking. Yes, it was a bit of an undertaking to get going, and we have been doing it piece by piece with a long way to go, but we are very satisfied with what we have achieved so far.

  23. Server gone! by Anonymous Coward · · Score: 0

    This is nothing compared to the harm that will be done when governments confiscate cloud servers in the name of gathering terrorist information.

    1. Re:Server gone! by houstonbofh · · Score: 1

      Why post AC? This has happened already, many times, and no "terrorist" buzzword needed. MegaUpload was just the most recent high-profile one.

  24. Re:Who uses the cloud for serious uptime? No one s by Anonymous Coward · · Score: 0

    You mean like Hotmail, eBay, Amazon, Salesforce or.... Apple? (iCloud runs on Azure).

  25. uhm where is the fraud? by decora · · Score: 1

    the 'global financial crisis' was caused directly by massive fraud and profiteering. is there any incentive for cloud companies to create massive quantities of products that are completely worthless and sell them to sucker investors?

    1. Re:uhm where is the fraud? by lennier · · Score: 1

      the 'global financial crisis' was caused directly by massive fraud and profiteering. is there any incentive for cloud companies to create massive quantities of products that are completely worthless and sell them to sucker investors?

      Um, is that a trick question?

      If you're selling something - whether it's investment, insurance, public key certificates or data backup - where the buyer can't directly measure the quality of the product, of course there's incentive for fraud.

      Here, just upload your data to Dev Null Industries quad-cached completely tamperproof server. No, your data isn't encrypted. No, you can't have it back all at once. No, we won't peek, honest, and of course we'd never sell your spreadsheets to your competitors. Seriously. Would this stock photo of a face lie to you?

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
  26. This! by Anonymous Coward · · Score: 0

    This is a perfect description of what will be "the perfect storm" (cloudy pun intended). And when, not if, it happens there will be a massive exodus from the cloud. The question is, where will the exodus go? Will they bring their data centers back in house? Will they colo and build their own private clouds?

    As soon as I can figure out where they will go, I'll be putting my money there.

  27. What this has to do with the financial crisis? by Anonymous Coward · · Score: 0

    Nothing, it's just another attack by bad analogy man. The global financial meltdown was caused by a small number of financial houses who bet against their own CDO debt bubble and then shorted the entire economy. The same people who are currently going through Europe and bankrupting whole countries one-by-one.

  28. reddit by MichaelSmith · · Score: 1

    Just look at the periodic reddit meltdowns.

  29. Not just technology by Sqreater · · Score: 1

    Complexity is rising in all things at a frightening rate, not just technology. Over my lifetime the amount of information required to make any decision has become massive. For instance, can you select the "best" cellphone for you today? Which credit card? Car? Checking account? There is a coming "complexity collapse." What it will look like, or what the consequences will be, is hard to project, but there cannot be an infinite rise in complexity in our lives without something painful happening eventually. Will people retreat from complexity? Will they just start to chuck technology and pull back from activities we now take as normal? Put their money in a mattress at home and use tin cans to communicate? Probably not. But what will they do to protect their sanity when bombarded by too many unmakeable decisions?

    --
    E Proelio Veritas.
  30. It wouldn't be a "meltdown" by kommakazi · · Score: 1

    ...it would be a "storm" in the cloud

  31. Pretty vague by Anonymous Coward · · Score: 0

    Would be nice to read the paper rather than some nearly meaningless story about it.