Slashdot Mirror


City's IT Infrastructure Brought To Its Knees By Data Center Outage

An anonymous reader writes "On July 11th in Calgary, Canada, a fire and explosion were reported at the Shaw Communications headquarters. This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers, local radio stations, emergency 911 services, provincial services such as Alberta Health Services computers, and Alberta Registries. One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well.' No doubt this has been a hard lesson on how NOT to host critical public services."

21 of 102 comments

  1. First post! by Svartormr · · Score: 3, Informative

    I use Telus. >:)

    1. Re:First post! by clarkn0va · · Score: 5, Funny

      So Shaw customers get all their disappointment in one fell swoop, while you suffer subclinical abuse on an ongoing basis. Congrats.

      --
      I am literally 3000 tokens away from the chaotic crossbow --Stephen
  2. Or... by Transdimentia · · Score: 4, Insightful

    ... it just points out what should be practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.

    1. Re:Or... by theshowmecanuck · · Score: 2

      The designers AND their managers and their managers should be made redundant.

      --
      -- I ignore anonymous replies to my comments and postings.
  3. No Site Level Resiliency? by sociocapitalist · · Score: 5, Insightful

    Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e. not in the same flood / fault / fire zone).

    --
    blindly antisocialist = antisocial
    1. Re:No Site Level Resiliency? by Anonymous Coward · · Score: 2, Informative

      The issue is that IBM runs the Alberta Health Services and other infrastructure from the Shaw building, in which IBM has its own datacenter. IBM had no proper backups in place for these services.

      911, being the most critical, was not affected; Shaw VoIP users just couldn't call 911 if their lines were down -- obviously (only ~20k people downtown were affected).

    2. Re:No Site Level Resiliency? by sumdumass · · Score: 3, Insightful

      This is why I do not understand the rush to cloud space. The same types of outages that apply to locally hosting the data apply to the cloud space providers. You still need the backups, disaster plans with the ability to access the servers and such, much of the same stuff, if not more, than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this, where diligence is not necessarily a priority?

    3. Re:No Site Level Resiliency? by Sir_Sri · · Score: 2

      Is everybody safe

      That is, quite literally, someone else's problem. It sounds callous to say, but seriously, /. isn't a site for first responders, it's for IT and CS types. It's not like we're looking at one thing at the expense of another here: your data (and 911 access) should work, people shouldn't die in a fire, and your data shouldn't be hosed if it was housed there.

      As to your point about universities: as tragic as it might be if someone died in a fire tomorrow at the university I graduated from 10 years ago, I still want them to be able to provide me transcripts and a copy of my degree if needed 10 years from now.

      People around here are supposed to worry about preserving data, usually not at the expense of people's lives (although there is a market for that in government secrets). Worrying about how to put out a fire and treat burn victims is someone else's job.

    4. Re:No Site Level Resiliency? by foradoxium · · Score: 3, Interesting

      Imagine if the Library of Alexandria had backup copies of all those books, manuscripts and other treasures? How about Constantinople? I'm sure there were people who tried to protect that data and believed it was worth more than their lives. I hope that brings stuff into perspective.

  4. Maybe the city/provinces should skip on redundancy by Anonymous Coward · · Score: 3, Interesting

    The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.

  5. Fukushima Daiichi - Anyone? by Paleolibertarian · · Score: 2

    Putting all of one's eggs in one basket, all your reactors on one backup generator, or all your data in one place is the reason for these catastrophic failures. Back 30-odd years ago we had mainframes (I used to operate an IBM 360). Then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative, where all your data is housed in one place (datacenter) again.

    The reason the cloud was "invented" was to bring back the more profitable mainframe/dumb terminal business model.

  6. Limitations by phorm · · Score: 2

    There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.

    In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thing that 9/11 showed is that even huge skyscrapers - though unlikely - can be knocked out by a crazy set of circumstances (or just crazy people).

    However, what happens if you're running through gigs of data on a constant basis? If you can't get a fibre connection between two sites, you might not be able to have a live redundant backup.

    Now what if you connect to multiple outside entities? They'll need to have redundant connections to both your sites. You'll want two ISPs, in case one kicks, etc., etc.

    How about power? Both sites will need a big generator or something of the like, plus battery backup to hold things until the generator kicks in. Preferably they'd both be fairly far apart on the grid, so if one site loses power the other doesn't.

    Weather... they'd both better be outside of any major weather considerations (forest fires, floods, quakes, whatever).

    I won't make excuses for Shaw (no I don't work for them, in fact I'm affected negatively by the outage), but for many companies 100% HA/redundancy isn't really possible.

    Luckily for those using services, I believe that this was a case of connection/infrastructure loss rather than all the data, so I hope that Shaw is working their a**es off to get things back.
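    For a rough feel of the trade-off above, here is a minimal sketch of the usual availability arithmetic. It assumes the sites fail independently, which is exactly what co-located backups do not do. The availability figure and the function names are made up for illustration and are not Shaw's or IBM's numbers.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def parallel_availability(site_availabilities):
    """Availability of a service that stays up while at least one site is up.

    Assumes site failures are statistically independent (an idealisation).
    """
    all_down = 1.0
    for availability in site_availabilities:
        all_down *= (1.0 - availability)  # probability this site is down too
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figure: each site alone is up 99.9% of the time.
single_site = 0.999
two_sites = parallel_availability([single_site, single_site])

print(f"one site : {downtime_hours_per_year(single_site):.2f} h/year down")
print(f"two sites: {downtime_hours_per_year(two_sites):.4f} h/year down")
```

    The second site only buys that improvement if nothing (building, generator room, sprinkler system) can take both out at once.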

  7. Re:Shaw is an ISP by tlhIngan · · Score: 2

    All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

    Actually, Shaw's a media company - they do not only internet, but phones as well. Those went down as well (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service too).

    And it really isn't a screwup - they were doing something important - testing the backup generators. It's just the generators blew up, which took out the other backup generators (dual redundant power!), and knocked out power to all the equipment by knocking the utility power offline as well.

  8. Captain Obvious by roc97007 · · Score: 2

    > No doubt this has been a hard lesson on how NOT to host critical public services.

    And no doubt the lesson was not learned.

    --
    Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
  9. Not surprising by Anonymous Coward · · Score: 3, Informative

    There are buildings all over the US that can have a similar effect but worse. In Seattle it would be the Westin Tower, get the two electrical vaults in that building and you'll pretty much take most phone service, internet service and various emergency agency services all over the state offline for a while.

    What I now consider a classic example is the outage of Fisher Plaza. It not only took down credit card processors, Bing Travel and a couple of other big online services, it also took out Verizon's FiOS service for western Washington.
    http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
    (apologies, I don't comment a lot and don't know how to link properly)

    The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin Tower example).

  10. 911 not down by CaptainPuff · · Score: 2

    911 service was not down; only customers using Shaw as their phone service provider were unable to access it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely that's also what is deemed to have redundancy (those ones probably have more than one ISP) while non-essential services don't.

  11. What really happened... by Anonymous Coward · · Score: 4, Interesting

    Shaw had a generator overheat and literally blow up which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers and in turn, the water shut down the backup systems.

    Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building, but even more stupid was the fact that they used a water-based sprinkler system in a bloody telecom room.

    Also, Alberta has this wonderful thing called Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind, but nooo, why use something you have already paid for when you can spend more money and use something different?

  12. It was so bad.. by Megahard · · Score: 5, Funny

    It caused a stampede.

    --
    I eat only the real part of complex carbohydrates.
  13. Re:Maybe the city/provinces should skip on redunda by snowraver1 · · Score: 2

    The problem wasn't necessarily with Shaw. Shaw's problems were relatively minor. Internet and television services were affected over a small geographic area (downtown Calgary). Those affected by the Internet outage who also had Shaw Home Phone couldn't use their phone as the network was down. If they called 911 on a cell phone or a land line, they would have received help.

    The real problem was with the datacenter that is housed in the same building. 20,000 consumer-class Internet outages is nothing compared to an estimated (based on almost nothing) 5,000 servers going down. The Fire Dept was involved, so power is going to get cut whether they like it or not, but there were still other problems. There are reports that the backup generators didn't kick in (whether or not that would have avoided the outage, I don't know). I have received indication that if those backup generators had worked, service could have been restored slightly faster (I'm hearing this all 3rd party, so salt is needed).

    IBM runs the datacenter where the servers live. Who screwed up? IBM? Shaw? Someone else? I don't know. We'll have to wait and see.

    --
    Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.
  14. Re:Maybe the city/provinces should skip on redunda by snowraver1 · · Score: 2

    Oh yeah, I also heard that IBM will be incurring HUGE fines from SLAs. I think I heard some obscene number like 1M/minute.

    --
    Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.
  15. Re:Shaw is an ISP by mysidia · · Score: 2

    Redundant internet connections are no guarantee against single points of failure.

    "Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.

    Most telecommunications infrastructure in any area has some very large aggregation points also... Telco Central Offices; a single point of failure for telecommunications services served by that office.

    What good is working 911 service if nobody can call in, because all their phones are rendered useless by a failure of the Class 5 switch that all the phones in the city are connected to?

    This kind of equipment typically has redundancy built in to survive the failure of any one processing unit or card, and telco facilities may be constructed with steel-reinforced concrete walls and many protections against external events such as tornadoes. But when you consider catastrophic disaster scenarios, where the problem originates inside, you are still faced with single points of failure.

    It's not like the average household is willing to pay for two phone lines, each to a different exchange, and some kind of "automatic failure switching" mechanism to select the working telco exchange office.
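    To illustrate the point about "redundant" connections sharing something upstream, here is a minimal sketch. It assumes you can get each provider to tell you which physical elements its circuit passes through, which in practice is often the hard part. Every segment name below is hypothetical.

```python
def shared_elements(paths):
    """Return the elements that every path traverses, sorted by name.

    Each path is a list of physical elements (conduits, buildings, switches)
    between the customer and the destination, endpoints excluded. Anything
    common to all paths is a single point of failure despite the redundancy.
    """
    if not paths:
        return []
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)  # keep only elements also on this path
    return sorted(common)


# Hypothetical circuits from two different ISPs serving the same office.
isp_a = ["isp_a_pop", "conduit_7", "central_office", "backbone_east"]
isp_b = ["isp_b_pop", "conduit_7", "central_office", "backbone_west"]

single_points_of_failure = shared_elements([isp_a, isp_b])
print(single_points_of_failure)
```

    With the made-up data above, both circuits ride the same conduit into the same central office, so paying two ISPs bought no protection against losing either of those.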