City's IT Infrastructure Brought To Its Knees By Data Center Outage
An anonymous reader writes "On July 11th in Calgary, Canada, a fire and explosion was reported at the Shaw Communications headquarters. This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers, local radio stations, emergency 911 services, provincial services such as Alberta Health Services computers, and Alberta Registries. One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well.' No doubt this has been a hard lesson on how NOT to host critical public services."
I use Telus. >:)
i don't really see the problem here. after all it's only canadians...
... it just points out what should be practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.
Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e., not in the same flood / fault / fire zone).
blindly antisocialist = antisocial
The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.
All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.
If I were God, wouldn't I protect my churches from acts of me?
Putting all of one's eggs in one basket, all your reactors on one backup generator, or all your data in one place is the reason for these catastrophic failures. Some 30-odd years ago we had mainframes (I used to operate an IBM 360). Then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative, where all your data is housed in one place (a datacenter) again.
The reason the cloud was "invented" was to bring back the more profitable mainframe/dumb terminal business model.
There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.
In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thing that 9-11 showed is that even huge skyscrapers - though unlikely - can be knocked out by a crazy set of circumstances (or just crazy people).
However, what happens if you're running through gigs of data on a constant basis? If you can't get a fibre connection between two sites, you might not be able to have a live redundant backup.
Now what if you connect to multiple outside entities? They'll need to have redundant connections to both your sites. You'll want two ISPs in case one goes down, and so on.
How about power? Both sites will need a big generator or something of the like, plus battery backup to hold things until the generator kicks in. Preferably they'd both be fairly far apart on the grid, so if one site loses power the other doesn't.
Weather... they'd both better be outside of any major weather considerations (forest fires, floods, quakes, whatever).
I won't make excuses for Shaw (no I don't work for them, in fact I'm affected negatively by the outage), but for many companies 100% HA/redundancy isn't really possible.
Luckily for those using services, I believe that this was a case of connection/infrastructure loss rather than all the data, so I hope that Shaw is working their a**es off to get things back.
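To put rough numbers on the "100% HA isn't really possible" point above, here is a minimal sketch of the usual availability arithmetic. The 99% per-site figure and the 99.9% shared-component figure are assumptions picked for illustration, not numbers from the outage, and the function names are hypothetical. It assumes sites fail independently, which is exactly the assumption a shared building or shared upstream breaks.

```python
def parallel_availability(site_availability: float, sites: int) -> float:
    """Availability when any one of `sites` independent sites is enough to stay up."""
    return 1.0 - (1.0 - site_availability) ** sites


def with_shared_dependency(site_availability: float, sites: int,
                           shared_availability: float) -> float:
    """Same as above, but every site also needs one shared component
    (for example a single building, link, or power feed)."""
    return shared_availability * parallel_availability(site_availability, sites)


if __name__ == "__main__":
    # Assumed figure: each site is up 99% of the time.
    for n in (1, 2, 3):
        print(n, "site(s):", f"{parallel_availability(0.99, n):.6f}")
    # Assumed figure: the shared component is up 99.9% of the time.
    print("2 sites + shared component:",
          f"{with_shared_dependency(0.99, 2, 0.999):.6f}")
```

Adding sites pushes the number closer to 1 but never reaches it, and a single shared component caps the whole thing at that component's availability no matter how many sites sit behind it.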
> No doubt this has been a hard lesson on how NOT to host critical public services.
And no doubt the lesson was not learned.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
There are buildings all over the US that can have a similar effect but worse. In Seattle it would be the Westin Tower, get the two electrical vaults in that building and you'll pretty much take most phone service, internet service and various emergency agency services all over the state offline for a while.
What I now consider a classic example is the outage at Fisher Plaza. It took down not only credit card processors, Bing Travel and a couple of other big online services, but also Verizon's FiOS service for Western Washington.
http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
(apologies don't comment a lot and don't know how to properly link)
The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin tower example).
Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.
this is my sig
911 service was not down; only customers using Shaw as their phone service provider were unable to access it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely the critical services are also the ones deemed to need redundancy (those probably have more than one ISP), while non-essential services don't.
Our primary internet connection was Bell and Shaw was our backup. To our surprise Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.
To get back on the internet without having to fail-over to our DR site we came up with a crazy solution of hooking up a Rogers Rocket Hub. The damn thing worked without our ~85 employees and 3 remote users noticing a difference.
Over the next few weeks we will be canceling all of our Shaw services, signing up with Enmax for our primary, and bumping Bell down to our secondary.
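The Bell-on-Shaw surprise above is the kind of thing a simple dependency walk can catch before the outage does. Below is a rough sketch, assuming you can get each provider to tell you what it rides on. Only the "Bell relies on Shaw's backbone" relationship comes from this comment; the node names, the Enmax entry, and the function names are hypothetical placeholders.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: provider -> upstream networks it relies on.
# Only Bell -> Shaw backbone is taken from the comment; the rest is assumed.
UPSTREAMS: Dict[str, List[str]] = {
    "Bell": ["Shaw backbone"],
    "Shaw": ["Shaw backbone"],
    "Enmax": ["Enmax fibre"],  # assumption: independent of Shaw
}


def transitive_deps(provider: str, upstreams: Dict[str, List[str]]) -> Set[str]:
    """Collect everything `provider` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [provider]
    while stack:
        node = stack.pop()
        for dep in upstreams.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_single_points(primary: str, backup: str,
                         upstreams: Dict[str, List[str]]) -> Set[str]:
    """Return the nodes that both the primary and the backup path rely on."""
    a = {primary} | transitive_deps(primary, upstreams)
    b = {backup} | transitive_deps(backup, upstreams)
    return a & b


if __name__ == "__main__":
    print(shared_single_points("Bell", "Shaw", UPSTREAMS))
    print(shared_single_points("Enmax", "Bell", UPSTREAMS))
```

An empty result only means no shared dependency shows up in the map you were given, so the check is only as good as what the providers disclose.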
Sounds like the datacenter heard about this "cloud" thing and decided to give it a try.
Everything is better with chainsaws.
Shaw has other buildings in the city. They should have used those for redundancy.
Shaw had a generator overheat and literally blow up which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers and in turn, the water shut down the backup systems.
Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building but even more stupid was the fact that they used a water based sprinkler system in a bloody telecom room.
Also, Alberta has this wonderful thing called Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind but nooo, why use something you have already paid for when you can spend more money and use something different.
The worst thing about this is that someone designed the fire suppression systems for the DC and electricals with water. Not halon, CO2 or foam... Water. That's just past common sense; it's pure negligence or incompetence.
It caused a stampede.
I eat only the real part of complex carbohydrates.
The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes the building had redundant power, and networks but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.
Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a city with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.
..where everything must be privatized for profit.
It'd be interesting to see Shaw's quarterly profits against the cost of making life and death services geographically redundant.
Why are our telecommunications companies allowed to operate with minimal to no real competition as private entities?
Calgary population is about 1,200,000. Alberta has had a long succession of conservative governments. Spending money on health care infrastructure is not that high on their list of priorities. They're all about oil, pickup trucks, big hats and small government. This kind of thing is the natural result.
There's way too many assumptions going on with this story. There's more than one company in the building and not all issues reflect on Shaw as is mostly being reported. There's also an IBM datacentre located in the building and that's where the Alberta Health Services stuff resides. There's also a lot of shared infrastructure, but when water is everywhere due to the transformer explosion and they cut power to the entire building...well what does one expect.
As with anything, issues like this need to be learned from and not turned into a blame circus.
-Just someone who's had previous experience in the building in question
As already mentioned, health services was not hosted by Shaw.
Yeah, if it was entirely government owned, it'd be rock solid and cheaper, too.
The problem wasn't necessarily with Shaw. Shaw's problems were relatively minor. Internet and television services were affected over a small geographic area (downtown Calgary). Those affected by the Internet outage who also had Shaw Home Phone, couldn't use their phone as the network was down. If they called 911 on a cell phone or a land line, they would have received help.
The real problem was with the datacenter that is housed in the same building. 20,000 consumer-class Internet outages is nothing compared to 5,000 servers going down (an estimate, based on almost nothing). The Fire Dept was involved, so power is going to get cut whether they like it or not, but there were still other problems. There are reports that the backup generators didn't kick in (whether or not that would have avoided the outage, I don't know). I have received indication that if those backup generators had worked, service could have been restored slightly faster (I'm hearing this all 3rd party, so salt is needed).
IBM runs the datacenter where the servers live. Who screwed up? IBM? Shaw? Someone else? I don't know. We'll have to wait and see.
Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.
Oh yeah, I also heard that IBM will be incurring HUGE fines from SLAs. I think I heard some obscene number like 1M/minute.
in some buildings / data centers the fire system can kill most of the power
Which is a *good* thing. Fire and live electrical systems don't mix well.
There is a public building. Inside this building there is a level where no elevator can go, no stair can reach and it has no working network backup. This level is filled with flaws. These flaws lead to many catastrophes. Unpredictable catastrophes. But one flaw is special. One flaw leads to the source of all other problems.
This building is protected by a very secure system. Every alarm triggers da'bomb for public services. But like all systems it has a weakness, the system is based on the rules, regulations and the budget of the building. One system built on another. If one fails, so must the other.
'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.
'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
Grossly overstated. 'Large swath' I can accept as the impact was far reaching, sure. Only 30,000 downtown core subscribers were affected (and service was restored to them quite quickly). In a geographic location with MILLIONS of customers, this is hardly a 'large swath' of customers though - a bit of a stretch.
'local radio stations'
THREE radio stations: one country (I think most ppl were pretty happy Country 105 was off the air), and two talk radio. One of which was just a studio, and affected only one particular show. So, really - TWO radio stations.
'emergency 911 services'
Completely FALSE and over-hyped by media. The Shaw VOIP customers couldn't access 911 - yes. But as long as they had access to a cell phone, or lived outside the affected area, 911 was up the whole time. Some EMS systems were affected, though. However they have a backup analogue radio system should the digital system go down - so, nothing catastrophic here.
'provincial services such as Alberta Health Services computers, and Alberta Registries'
Only SOME AHS computers were affected, in the Calgary area only. The rest of the province had no email or VPN until this morning. Big deal, we can live without email...just phone or fax someone. Alberta Registries was hit pretty hard (as in, completely offline), but is allowing a very generous grace period if anyone needs to renew a license or some such. Not a big deal.
'One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well''
Sigh. Again, way off base. Yes, there were 'backup' systems - comprised of a UPS system that suffered water damage. However, no servers in the data centre suffered any water damage. They were worried about condensation, but that turned out to be a non-issue.
'No doubt this has been a hard lesson on how NOT to host critical public services' ...and that was the biggest mistake: a design flaw. Building design aside (13th floor = mechanical room; a transformer blew, triggering the sprinkler system; the backup generators engaged, but the battery room had already suffered water damage and shorted out under the high load; when the fire marshal and building ppl arrived, they simply cut power to the building entirely, as water was found in the bus ducts, i.e. "wire trays" for non-construction types), the REAL lesson here is what was already mentioned to AHS execs, Shaw managers and IBM Global: too many eggs in one basket. Instead of fire suppression via water, use halon. Instead of the 'backup systems' (what a JOKE) in the SAME BUILDING, configure clustered services across two or (better yet) more sites.
No, this is a lesson for the submitter, if he or she is really interested in clear reporting, to avoid the word 'not'. So, a cleaner sentence might be:
"No doubt, this is a hard lesson on how to host critical public services with clustering across sites, thereby avoiding a single point of failure."
But - we're just geeks, not execs...what do we know?
were you sitting in a Stampede beer tent at the time?
I used to be a City until l took an arrow to the knee.
People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.
You may have a very good internet presence with lots of bandwidth, but it may be all housed in the same building where the same sprinkler system can bring it all down. You may think that ISPs can reroute lots of traffic to other places because it is possible. Yet, there are common failure modes there too.
Cloud computing is often hailed as a very resilient method for infrastructure. Yet, there is a disturbing tendency to focus all the servers in one big glass room of everything. You may get the dynamic pay per clock-cycle performance, but it may all come back to one substation. A single fire in that substation could bring everything down.
This is the problem with SLA deals: You don't know what kind of planning they may use for such infrastructure. Remember, the Internet itself may be resilient, but your cloud and your ISP may not be.
Nearly fifty percent of all graduates come from the bottom half of the class!
Never fear, IBM started flying tape backups to an alternate datacenter (datacentre?), probably in Ontario...
Didn't IBM also host stuff in one of the World Trade Center towers, and had the backups in the second tower?
"Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government."
Ahh, but here in the real world, IT IS. Why is it people think money is always no object? That all redundancy is free, and any lack of said redundancy means someone should be fired?
Go rule your little make believe world in some shitty flash based "Sim IT Manager" game or something, ok?
In some buildings they use nitrogen gas to extinguish fire instead of water. Obviously this requires immediate evacuation of all people/animals, but this is fairly easy in a standalone building dedicated to a datacenter.
You don't need to be at the mercy of the fire department killing your power system in all cases.
Learn something please: http://en.wikipedia.org/wiki/List_of_the_100_largest_metropolitan_areas_in_Canada
People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.
The Cloud
My company relies on that data center to receive all of our EDI data from Union Pacific and Norfolk Southern. So that was all down for like 18 hours. It kinda sucked.
Yes, Shaw had an issue and some local neighbourhood services related to Shaw went down, but more to blame for the large outages is IBM for housing redundant systems in the same building, or the government customers for buying a less than adequate redundancy solution. Always ask for a physical diagram in addition to the logical diagram :)
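For what it's worth, the "physical diagram" check can be as simple as listing where each replica actually lives and flagging anything whose copies all share a building. A minimal sketch follows; the inventory, service names, building names, and function name are all made up for illustration and say nothing about how the real systems were laid out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical inventory: (service, replica role, building). All names are invented.
REPLICAS: List[Tuple[str, str, str]] = [
    ("registry-db", "primary", "building-A"),
    ("registry-db", "standby", "building-A"),
    ("email", "primary", "building-A"),
    ("email", "standby", "building-B"),
]


def colocated_services(replicas: List[Tuple[str, str, str]]) -> List[str]:
    """Return services whose replicas all sit in a single building."""
    buildings: Dict[str, Set[str]] = defaultdict(set)
    for service, _role, building in replicas:
        buildings[service].add(building)
    # One distinct building means the logical redundancy shares one physical fate.
    return sorted(s for s, b in buildings.items() if len(b) == 1)


if __name__ == "__main__":
    print(colocated_services(REPLICAS))
```

On the logical diagram both services look redundant; only the physical column shows that one of them has its primary and standby behind the same sprinkler system.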