City's IT Infrastructure Brought To Its Knees By Data Center Outage
An anonymous reader writes "On July 11th in Calgary, Canada, a fire and explosion were reported at the Shaw Communications headquarters. This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers, local radio stations, emergency 911 services, provincial services such as Alberta Health Services computers, and Alberta Registries. One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well.' No doubt this has been a hard lesson on how NOT to host critical public services."
I use Telus. >:)
... it just points out what should be a practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.
Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e. not in the same flood / fault / fire zone).
blindly antisocialist = antisocial
The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.
Putting all of one's eggs in one basket, all your reactors on one backup generator, or all your data in one place is the reason for these catastrophic failures. Some 30-odd years ago we had mainframes (I used to operate an IBM 360). Then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative, where all your data is housed in one place (a datacenter) again.
The reason the cloud was "invented" was to bring back the more profitable mainframe/dumb terminal business model.
There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.
In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as the sites get farther apart the connecting infrastructure gets more difficult. Shaw is an ISP (one of the big boys in that part of the country), so you'd expect access to fast connections to be their forte. One thing that 9/11 showed is that even huge skyscrapers can, though it is unlikely, be knocked out by a crazy set of circumstances (or just crazy people).
However, what happens if you're running through gigs of data on a constant basis? If you can't get a fibre connection between two sites, you might not be able to have a live redundant backup.
Now what if you connect to multiple outside entities? They'll need to have redundant connections to both your sites. You'll want two ISPs in case one kicks, etc., etc.
How about power? Both sites will need a big generator or something of the like, plus battery backup to hold things until the generator kicks in. Preferably they'd both be fairly far apart on the grid, so if one site loses power the other doesn't.
Weather... they'd both better be outside of any major weather considerations (forest fires, floods, quakes, whatever).
I won't make excuses for Shaw (no I don't work for them, in fact I'm affected negatively by the outage), but for many companies 100% HA/redundancy isn't really possible.
Luckily for those using services, I believe that this was a case of connection/infrastructure loss rather than all the data, so I hope that Shaw is working their a**es off to get things back.
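To put rough numbers on the trade-off described above, here is a minimal sketch in Python. The 99.9% figures are assumptions picked for illustration, not anyone's real numbers, and the function names are made up for this example. It assumes site failures are independent, which is exactly the assumption that a shared building, generator room, or sprinkler system breaks.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(site_availability: float, sites: int) -> float:
    """Availability of N independent sites when any single site can carry the load."""
    return 1.0 - (1.0 - site_availability) ** sites


def with_common_mode(independent: float, shared_availability: float) -> float:
    """Everything still fails if the one shared component (building, link, power room) fails."""
    return independent * shared_availability


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    one_site = 0.999  # assumed, illustrative only
    two_sites = parallel_availability(one_site, 2)
    # Two "redundant" systems that both depend on one shared thing (assumed 99.9% too).
    two_sites_shared = with_common_mode(two_sites, 0.999)

    for label, a in [
        ("one site", one_site),
        ("two independent sites", two_sites),
        ("two sites, one shared dependency", two_sites_shared),
    ]:
        print(f"{label}: {a:.6f} ({downtime_minutes_per_year(a):.1f} min/year)")
```

The point of the third case is that redundancy behind a shared dependency ends up slightly worse than the shared dependency alone, no matter how many copies sit behind it.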
Actually, Shaw's a media company - they do not only internet, but phones too. Those went down as well (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service).
And it really isn't a screwup - they were doing something important - testing the backup generators. It's just the generators blew up, which took out the other backup generators (dual redundant power!), and knocked out power to all the equipment by knocking the utility power offline as well.
> No doubt this has been a hard lesson on how NOT to host critical public services.
And no doubt the lesson was not learned.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
There are buildings all over the US that can have a similar effect but worse. In Seattle it would be the Westin Tower, get the two electrical vaults in that building and you'll pretty much take most phone service, internet service and various emergency agency services all over the state offline for a while.
What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel and a couple of other big online services, it also took out Verizon's FiOS service for western Washington.
http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
(apologies don't comment a lot and don't know how to properly link)
The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin Tower example).
911 service was not down; only customers using Shaw as their phone service provider were unable to access it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely that's also what is deemed to have redundancy (those ones probably have more than one ISP) while non-essential services don't.
Shaw had a generator overheat and literally blow up which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers and in turn, the water shut down the backup systems.
Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building, but even more stupid was the fact that they used a water-based sprinkler system in a bloody telecom room.
Also, Alberta has this wonderful thing called Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind but nooo, why use something you have already paid for when you can spend more money and use something different.
It caused a stampede.
I eat only the real part of complex carbohydrates.
The problem wasn't necessarily with Shaw. Shaw's problems were relatively minor. Internet and television services were affected over a small geographic area (downtown Calgary). Those affected by the Internet outage who also had Shaw Home Phone couldn't use their phone, as the network was down. If they called 911 on a cell phone or a land line, they would have received help.
The real problem was with the datacenter that is housed in the same building. 20,000 consumer-class Internet outages is nothing compared to (estimate, based on almost nothing) 5,000 servers going down. The Fire Dept was involved, so power is going to get cut whether they like it or not, but there were still other problems. There are reports that the backup generators didn't kick in (whether or not that would have avoided the outage, I don't know). I have received indication that if those backup generators had worked, service could have been restored slightly faster (I'm hearing this all 3rd party, so salt is needed).
IBM runs the datacenter where the servers live. Who screwed up? IBM? Shaw? Someone else? I don't know. We'll have to wait and see.
Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.
Oh yeah, I also heard that IBM will be incurring HUGE fines from SLAs. I think I heard some obscene number like 1M/minute.
Redundant internet connections are no guarantee of no single points of failure.
"Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.
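One partial way to check for this is to compare the router hops seen on each "redundant" path. A minimal sketch in Python follows; the hop names are hypothetical, and the hop lists are assumed to have been collected by hand (for example from traceroute output), with the common destination left out. Note the limitation: this only catches shared routers. Two different routers can still sit on the same physical fiber or in the same conduit, and that only the carriers can tell you.

```python
def shared_hops(path_a: list[str], path_b: list[str]) -> list[str]:
    """Return the hops that appear on both paths, in the order seen on path_a."""
    seen_on_b = set(path_b)
    return [hop for hop in path_a if hop in seen_on_b]


if __name__ == "__main__":
    # Hypothetical intermediate hops for two supposedly independent uplinks.
    via_isp_a = ["gw-a.example", "agg1.carrier-x.example", "core.carrier-x.example"]
    via_isp_b = ["gw-b.example", "agg1.carrier-x.example", "core.carrier-y.example"]

    overlap = shared_hops(via_isp_a, via_isp_b)
    if overlap:
        print("Possible shared points of failure:", ", ".join(overlap))
    else:
        print("No shared router hops seen (physical paths may still overlap).")
```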
Most telecommunications infrastructure in any area also has some very large aggregation points: telco central offices, each a single point of failure for the telecommunications services served by that office.
What good is working 911 service if nobody can call in, because all their phones are rendered useless by a failure of the Class 5 switch that all the phones in the city are connected to?
This kind of equipment typically has redundancy built in to survive the failure of any one processing unit or card, and telco facilities may be constructed with steel-reinforced concrete walls and many protections against external events such as tornadoes. But when you consider catastrophic disaster scenarios where the problem originates inside, you are still faced with single points of failure.
It's not like the average household is willing to pay for two phone lines, each to a different exchange, and some kind of "automatic failure switching" mechanism to select the working telco exchange office.
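For the curious, the "automatic failure switching" part is the easy bit; paying for the second line is the hard bit. A rough sketch in Python, where the exchange names and the health check are placeholders invented for this example (a real check would have to test the line itself, e.g. for dial tone or SIP registration):

```python
from typing import Callable, Optional, Sequence


def pick_exchange(
    exchanges: Sequence[str], is_up: Callable[[str], bool]
) -> Optional[str]:
    """Return the first exchange, in priority order, that passes the health check."""
    for exchange in exchanges:
        if is_up(exchange):
            return exchange
    return None  # every exchange is down; nothing to fail over to


if __name__ == "__main__":
    # Hypothetical state: the primary exchange is offline, the secondary is fine.
    status = {"exchange-downtown": False, "exchange-north": True}
    chosen = pick_exchange(list(status), lambda name: status[name])
    print("Routing calls via:", chosen)
```

This only helps if the two exchanges do not share anything upstream, which is the same problem all over again.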