City's IT Infrastructure Brought To Its Knees By Data Center Outage

First post! by Svartormr · 2012-07-13 09:36 · Score: 3, Informative

I use Telus. >:)

Re:First post! by clarkn0va · 2012-07-13 09:40 · Score: 5, Funny

So Shaw customers get all their disappointment in one fell swoop, while you suffer subclinical abuse on an ongoing basis. Congrats.

--
I am literally 3000 tokens away from the chaotic crossbow --Stephen
Re:First post! by Anonymous Coward · 2012-07-13 09:57 · Score: 1

Thanks. I nearly got a hernia I laughed so hard at that. Seriously. You could say Telus are a bunch of cunts, but cunts are useful.
Re:First post! by Anonymous Coward · 2012-07-13 09:58 · Score: 1

The unofficial offsite backup (the trunk of a certain station wagon) shall henceforth use the Telus' office parking lot.
Re:First post! by MrNickname · 2012-07-13 09:59 · Score: 1

I am in Calgary and use Shaw as my ISP. I did not have any internet downtime. Perhaps the damage was restricted to certain servers while other parts of their network were not affected?
Re:First post! by Anonymous Coward · 2012-07-13 10:58 · Score: 0

Both the companies are pretty abusive. It's a pretty locked up market.
Re:First post! by snowraver1 · 2012-07-13 11:16 · Score: 1

Shaw Court is an IBM datacenter. Many companies lost critical servers. We had a couple dozen there that are still coming back up. This didn't just affect Shaw, but also big customers that pay big money for uptime.

--
Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.
Re:First post! by Anonymous Coward · 2012-07-13 11:29 · Score: 0

Engineering department good
IT department bad
Re:First post! by Anonymous Coward · 2012-07-13 12:11 · Score: 0

Hear, hear! Telus sucks. It's true.
I just can't say it enough. TELUS SUCKS!!! God, that feels good. TELUS SUCKS DONKEY BALLS!!! Ahhhhhh...
Re:First post! by gen0c1de · 2012-07-13 15:10 · Score: 1

You don't pay big money to IBM for uptime, all IBM does now is resell other companies and take a share of the money. And become a middle man.
Re:First post! by dargon · 2012-07-13 17:34 · Score: 1

Shaw has 2 major locations in Calgary; the only people affected are those that use the downtown site.
Re:First post! by BagOBones · 2012-07-13 18:53 · Score: 1

This is true. We just finished doing evaluations, and IBM's quote included subbing out ALL the work to multiple sub-vendors; the only part with IBM's name on it was the quote.

--
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
Re:First post! by Dr+Caleb · 2012-07-13 18:59 · Score: 1

Incorrect. I know many IBMers that have been restoring service to that datacentre for more than 50 hours, on 2 hours sleep. It sucks when you are doing it, but it's worth much geek cred in my book.

--
"History doesn't repeat itself, but it does rhyme." Mark Twain
Re:First post! by Anonymous Coward · 2012-07-13 19:59 · Score: 0

It was Shaw's fuckup, too many eggs in one basket. The redundant electrical substations should not have been in the same building with the same fire protection system.
Re:First post! by Anonymous Coward · 2012-07-16 16:20 · Score: 0

And downtown is not actually one of them...

is it a problem? by Anonymous Coward · 2012-07-13 09:39 · Score: 0, Troll

i don't really see the problem here. after all it's only canadians...

Or... by Transdimentia · 2012-07-13 09:40 · Score: 4, Insightful

... it just points out what should be practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.

Re:Or... by g0es · 2012-07-13 09:54 · Score: 1

Well, it seems that they had the redundant systems in the same building. When designing redundant systems, it's best to avoid common-mode failure whenever possible.
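To put rough numbers on the common-mode point, here is a minimal Python sketch. The availability figures are invented for illustration, and the model assumes the redundant units fail independently of each other apart from the one dependency they share (the building); none of this comes from the actual Shaw Court design.

```python
def parallel_availability(unit: float, n: int) -> float:
    """Availability of n redundant units that fail independently."""
    return 1.0 - (1.0 - unit) ** n


def with_common_mode(unit: float, n: int, shared: float) -> float:
    """Same group, but every unit also needs one shared dependency
    (same building, same fire suppression) to be up."""
    return shared * parallel_availability(unit, n)


# Hypothetical figures, for illustration only.
UNIT = 0.99       # one power feed or generator
BUILDING = 0.999  # the single building they all sit in

independent = parallel_availability(UNIT, 3)
colocated = with_common_mode(UNIT, 3, BUILDING)

print(f"three independent units:   {independent:.6f}")
print(f"three units, one building: {colocated:.6f}")
```

However many units you add, the co-located figure can never exceed the availability of the shared building itself.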
Re:Or... by theshowmecanuck · 2012-07-13 09:59 · Score: 2

The designers AND their managers and their managers should be made redundant.

--
-- I ignore anonymous replies to my comments and postings.
Re:Or... by Glendale2x · 2012-07-13 10:39 · Score: 1

Irrelevant. The fire department became involved and that typically means you shut it all down if they say so (or they'll do it for you), even the redundant stuff that's still running. The only way around that is separate physical buildings.

--
this is my sig
Re:Or... by Anonymous Coward · 2012-07-13 10:40 · Score: 0

The problem there is that as soon as you split your infrastructure across multiple data centers, you have to start worrying about latency and split-brain scenarios. In most cases, the possible failure modes for multiple data centers are both greater and more likely.
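The usual guard against split-brain is a majority quorum, which also shows why exactly two sites are awkward: when the link between them drops, neither side holds a majority. A sketch in Python, with hypothetical site names and a hypothetical third "witness" vote:

```python
from __future__ import annotations


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may act only if it sees a strict majority of all votes."""
    return reachable_votes > total_votes // 2


def may_accept_writes(site_votes: dict[str, int], reachable: set[str]) -> bool:
    """True if the sites this partition can reach (itself included) form a majority."""
    total = sum(site_votes.values())
    seen = sum(votes for site, votes in site_votes.items() if site in reachable)
    return has_quorum(seen, total)


# Hypothetical layouts: two data centers, with and without a tie-breaker.
two_sites = {"dc-a": 1, "dc-b": 1}
with_witness = {"dc-a": 1, "dc-b": 1, "witness": 1}

# Link between dc-a and dc-b is cut; each side sees only itself.
print(may_accept_writes(two_sites, {"dc-a"}))                # neither side can proceed
print(may_accept_writes(with_witness, {"dc-a", "witness"}))  # the side that reaches the witness can
print(may_accept_writes(with_witness, {"dc-b"}))             # the isolated side stands down
```

The cost is the one the comment describes: the witness is a third location to run, and every write now waits on cross-site agreement.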
Re:Or... by sortius_nod · 2012-07-13 11:09 · Score: 1

Uhh, it is stupidity. Having your DR in the same site as your production servers is monumentally stupid. Most companies I've worked at have rules that state a minimum of 5km distance between production & DR sites in case of catastrophic failure. The Department of Defence here in Australia has 500km between their two production sites & DR. Our biggest service provider has 1000km between production & DR.
The only time I have seen the same building used is for redundancy & then the two comms/server rooms were blast proof bunkers (that was for a newspaper).
So yes, this is stupidity on a grand scale.
Re:Or... by Anonymous Coward · 2012-07-13 11:57 · Score: 0

Do you even know what the Titanic is? Is your DoD fortress built on the same overconfidence principle?
Re:Or... by cusco · 2012-07-13 12:45 · Score: 1

There's a local municipality whose IT department was very proud of its redundant fiber ring. Then a backhoe pointed out the fact that all of the fibers, prod and redundant, were in the same conduit. Oops.

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
Re:Or... by Anonymous Coward · 2012-07-13 12:56 · Score: 1

Yeah, silly Australian Department of Defense with poor network redundancy planning -- if a meteor hits Australia and wipes it off the map there will be no backups available.
Re:Or... by flyingfsck · 2012-07-13 16:29 · Score: 1

Two buildings eh? Like those US companies who had their main and backup systems in the two World Trade Centre Towers in NY. It sure helped them a lot...

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!

No Site Level Resiliency? by sociocapitalist · 2012-07-13 09:40 · Score: 5, Insightful

Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e. not in the same flood / fault / fire zone).

--
blindly antisocialist = antisocial

Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 09:53 · Score: 2, Informative

The issue is that IBM runs Alberta Health Services and other infrastructure from the Shaw building, in which IBM has its own datacenter. IBM had no proper backups in place for these services.
911 being the most critical was also not affected, just Shaw VoIP users couldn't call 911 if their lines were down -- obviously (only ~20k people downtown were affected).
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 09:53 · Score: 0

Smacked in the head? It should be a capital offense to be so stupid in setting up critical systems! Where's the .357 magnum?
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 09:58 · Score: 0

I guess not everyone has as much money as you think they should have.
Re:No Site Level Resiliency? by jtnix · 2012-07-13 10:00 · Score: 1

Add 'tornado zone' to that list.
If you host all your cloud services at Rackspace in Texas and a tornado happens to rip apart their datacenter, well expect a few hours/days downtime. And you better have offsite backups of mission critical data or that's a long bet that is getting shorter every day.

--
She blinded me with science, she tricked me with technology. ~ Thomas Dolby
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 10:07 · Score: 0

What remains a question is... Is everybody safe? Geez, there's an explosion and everyone is complaining because there was no internet.

What pisses me off, looking to the future, is that people are starting to think that data backup is more valuable than human lives. Many universities in the US are contracting disaster recovery sites. So if they are bombed, they can have all the information, who cares if there's any student left alive. I hope that brings stuff into perspective.
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 10:18 · Score: 0

Dude, that costs money reducing profit and bonuses.
Re:No Site Level Resiliency? by sumdumass · 2012-07-13 10:37 · Score: 3, Insightful

This is why I do not understand the rush to cloud space. The same types of outages that apply to locally hosting the data apply to the cloud space providers. You still need the backups, disaster plans with the ability to access the servers and such, much of the same stuff if not more than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this where diligence is not necessarily a priority?
Re:No Site Level Resiliency? by Sir_Sri · 2012-07-13 10:37 · Score: 2

Is everybody safe
That is, quite literally, someone else's problem. It sounds callous to say, but seriously, /. isn't a site for first responders, it's for IT and CS types. It's not like we're looking at one thing at the expense of another here: your data (and 911 access) should work, people shouldn't die in a fire and your data shouldn't be hosed if it was housed there.
As to your point about universities. As tragic as it might be if someone died in a fire tomorrow at the university I graduated from 10 years ago, I still want them to be able to provide me transcripts and a copy of my degree if needed 10 years from now.
People around here are supposed to worry about preserving data, usually not at the expense of people's lives (although there is a market for that in government secrets). Worrying about how to put out a fire and treat burn victims is someone else's job.
Re:No Site Level Resiliency? by Eponymous+Hero · 2012-07-13 10:47 · Score: 1

it's the warm body on the other end of the line who kisses your ass so well you can hardly yell at them for anything

--
insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
Re:No Site Level Resiliency? by Eponymous+Hero · 2012-07-13 10:52 · Score: 1

it does not remain a question, you just didn't RTFA. why does no one care about the safety of people in the explosion?

No one was hurt when a blast in a 13th-floor electrical room on Wednesday brought down Alberta Health Services computers, put three radio stations off the air and affected some banking services.
because no one was hurt, you fucking chicken little. if you gave a shit at all about "the people" you would have RTFA to find out. go ahead and read it. i hope it brings stuff into perspective for you.

--
insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
Re:No Site Level Resiliency? by JustOK · 2012-07-13 10:52 · Score: 1

it's boni

--
rewriting history since 2109
Re:No Site Level Resiliency? by foradoxium · 2012-07-13 11:06 · Score: 3, Interesting

Imagine if the Library of Alexandria had backup copies of all those books, manuscripts and other treasures? How about Constantinople? I'm sure there were people that tried to protect that data who believed it was worth more than their life. I hope that brings stuff into perspective.
Re:No Site Level Resiliency? by englishstudent · 2012-07-13 11:11 · Score: 0

Totally agree.

--
We'll never make it.......oh! we made it! http://www.youtube.com/watch?v=SWf3iJjqYCM&list=FL7kKrE4eTs17mQl7eyvJIOg
Re:No Site Level Resiliency? by ahodgson · 2012-07-13 11:20 · Score: 1, Insightful

The cloud is not cheaper, unless you're doing things really wrong in the first place, like buying tier 1 servers or running Windows.
It does provide economies of scale, can be somewhat cost-competitive with doing it yourself for at least some things, and you don't have to deal with hardware depreciation and the constant refresh cycle.
The big cloud providers also integrate a lot of services that would be a pain to build internally for small and mid-sized clients.
Hype explains the rest. PHBs are always looking for a silver bullet to make things "easy".
Oh, and developers and managers both think that moving to the cloud means they won't need sysadmins. Only to (eventually) find out that running stuff in the cloud needs sysadmins who not only know how to do everything themselves but can also then work around the cloud providers' idiosyncrasies to still build things that work.
Re:No Site Level Resiliency? by Mike+Buddha · 2012-07-13 11:27 · Score: 1

It's IBM's fault that their customers didn't have a DR plan?

--
by Mike Buddha -- Someday the mountain might get him, but the law never will.
Re:No Site Level Resiliency? by 0100010001010011 · 2012-07-13 12:02 · Score: 1

Texas and a tornado happens to rip apart their datacenter
I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-13 13:05 · Score: 0

It's IBM's fault that their customers didn't have a DR plan?
Probably not--unless their customers were paying for it.
Re:No Site Level Resiliency? by drinkypoo · 2012-07-13 13:56 · Score: 1

I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
Mostly these days it's built in whatever they have available, because actually building anything costs too much. That's got to be responsible in large part for the rise of the shipping container as a data center... it's a temporary, soft-set structure. You only need a permit for the electrical connection, and maybe a pad.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:No Site Level Resiliency? by Ol+Olsoc · 2012-07-13 13:59 · Score: 1

I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
The cost is impressive.

--
The shepherds did so well protecting the flock that the sheep no longer believed that wolves existed.
Re:No Site Level Resiliency? by sociocapitalist · 2012-07-13 20:29 · Score: 1

Arguably you could still use two different cloud providers after verifying (and continuing to verify over time) that the infrastructure (and connectivity to it) is actually redundant.

--
blindly antisocialist = antisocial
Re:No Site Level Resiliency? by Anonymous Coward · 2012-07-16 01:47 · Score: 0

The designer isn't necessarily at fault. On multiple occasions in my career I've run into management and/or owners who don't think the cost of site DR is justified by the risk.

Maybe the city/provinces should skip on redundancy by Anonymous Coward · 2012-07-13 09:49 · Score: 3, Interesting

The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes the building had redundant power, and networks but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.

Shaw is an ISP by Capt.DrumkenBum · 2012-07-13 09:53 · Score: 1

All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

--
If I were God, wouldn't I protect my churches from acts of me?

Re:Shaw is an ISP by tlhIngan · 2012-07-13 09:57 · Score: 2

All these other services lost their internet access, that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.
Actually, Shaw's a media company - they do not only internet, but phones as well. Those went down as well (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service too).
And it really isn't a screwup - they were doing something important - testing the backup generators. It's just the generators blew up, which took out the other backup generators (dual redundant power!), and knocked out power to all the equipment by knocking the utility power offline as well.
Re:Shaw is an ISP by Anonymous Coward · 2012-07-13 10:02 · Score: 0

It was a Shaw owned datacenter, major services were hosted there through IBM. For the most part Shaw internet connections didn't have an issue.
Re:Shaw is an ISP by mysidia · 2012-07-13 14:31 · Score: 2

Redundant internet connections are no guarantee of no single points of failure.
"Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.
Most telecommunications infrastructure in any area has some very large aggregation points also... Telco Central Offices; a single point of failure for telecommunications services served by that office.
What good is working 911 service, if nobody can call in, because all their phones are rendered useless by a failure of the Class 5 switch that all the phones in the city are connected to?
This kind of equipment typically has redundancy built in to survive the failure of any one processing unit or card, and telco facilities may be constructed with steel-reinforced concrete walls and many protections against external events such as tornadoes. But when you consider catastrophic disaster scenarios where the problem originates inside, you are still faced with single points of failure.
It's not like the average household is willing to pay for two phone lines, each to a different exchange, and some kind of "automatic failure switching" mechanism to select the working telco exchange office.
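One way to catch "redundant" links that share fiber or a central office upstream is to list the physical elements each path depends on and intersect them. A minimal Python sketch; the path names and element labels are hypothetical, and in practice the hard part is getting carriers to disclose this inventory at all:

```python
from __future__ import annotations

from itertools import combinations


def shared_risks(paths: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """For each pair of paths, return the physical elements both depend on.

    Pairs with nothing in common are left out of the result.
    """
    overlaps = {}
    for a, b in combinations(sorted(paths), 2):
        common = paths[a] & paths[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical inventory of what each "redundant" connection rides on.
paths = {
    "primary": {"carrier-A", "conduit-7", "central-office-downtown"},
    "backup": {"carrier-B", "conduit-7", "central-office-downtown"},
    "wireless": {"carrier-C", "tower-12"},
}

for pair, common in shared_risks(paths).items():
    print(pair, "share", sorted(common))
```

Here the two wired links differ only in the name on the invoice, while the wireless path is the only one with no element in common with the others.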
Re:Shaw is an ISP by Anonymous Coward · 2012-07-13 18:36 · Score: 0

The lesson is to never test anything. If it fails because you didn't test it, it's an unforeseen accident. If it fails because you did test it, it's your fault for testing it instead of trusting it to work.

Fukushima Daiichi - Anyone? by Paleolibertarian · 2012-07-13 09:56 · Score: 2

Putting all of one's eggs in one basket, all your reactors on one backup generator or all your data in one place is the reason for these catastrophic failures. Back 30 odd years ago we had mainframes, I used to operate an IBM 360, then along came the internet and the distributed computing model where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative where all your data is housed in one place (datacenter) again.

The reason the cloud was "invented" was to bring back the more profitable mainframe/dumb terminal business model.

Re:Fukushima Daiichi - Anyone? by Anonymous Coward · 2012-07-14 01:49 · Score: 0

The problem at the nuke plant was not the lack of multiple backup generators. The problem was that the tsunami wave flooded them and rendered them inoperable. Multiple flooded generators wouldn't have done diddly to help. The problem was the plant designers were ultimately driven by bean counters to not build a wall high enough to shield the generators from such a wave based on the perceived unlikeliness of such a high tsunami wave.
Re:Fukushima Daiichi - Anyone? by Paleolibertarian · 2012-07-14 12:52 · Score: 1

Actually the backup generators were located in the basement as per GE's original design. Engineers requested to locate the generators in a more secure (from tsunamis) location when the plants were built but were overruled by upper management.
However the location of the generators at Fukushima is irrelevant to my point that had the power grid been a distributed system with local batteries or what have you then all residents would not have lost power when the plant was flooded. This is not an argument about the design of the centralized power plant but an argument that a centralized power plant need not be used or even exist.

Limitations by phorm · 2012-07-13 09:56 · Score: 2

There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.

In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thing that 9-11 showed is that even huge skyscrapers can, though it's unlikely, be knocked out by a crazy set of circumstances (or just crazy people).

However, what happens if you're running through gigs of data on a constant basis? If you can't get a fibre connection between two sites, you might not be able to have a live redundant backup.
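A back-of-the-envelope check for whether a link can carry live replication, sketched in Python. The change rates, link speeds and the 70% usable-capacity figure are assumptions for illustration, and latency (which matters a lot for synchronous replication) is ignored entirely:

```python
def required_mbps(changed_gb_per_hour: float) -> float:
    """Sustained rate (megabits/s) needed to ship changes as fast as they are made.

    Uses decimal units: 1 GB = 8000 megabits.
    """
    return changed_gb_per_hour * 8 * 1000 / 3600


def can_replicate_live(changed_gb_per_hour: float, link_mbps: float,
                       usable_fraction: float = 0.7) -> bool:
    """True if the link, after headroom for other traffic and protocol
    overhead (usable_fraction is an assumed figure), keeps up with the change rate."""
    return required_mbps(changed_gb_per_hour) <= link_mbps * usable_fraction


# Hypothetical workload: 450 GB of changed data per hour.
print(required_mbps(450))             # sustained megabits per second needed
print(can_replicate_live(450, 100))   # 100 Mbps business line
print(can_replicate_live(450, 1000))  # 1 Gbps link
print(can_replicate_live(450, 10000)) # 10 Gbps fibre
```

With these made-up numbers the workload needs a sustained 1000 Mbps, so even a dedicated 1 Gbps link falls short once headroom is accounted for, which is the situation the comment describes.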

Now what if you connect to multiple outside entities? They'll need to have redundant connections to both your sites. You'll want two ISPs in case one kicks it, etc.

How about power? Both sites will need a big generator or something of the like, plus battery backup to hold things until the generator kicks in. Preferably they'd both be fairly far apart on the grid, so that if one site loses power the other doesn't.

Weather... they'd both better be outside of any major weather considerations (forest fires, floods, quakes, whatever).

I won't make excuses for Shaw (no I don't work for them, in fact I'm affected negatively by the outage), but for many companies 100% HA/redundancy isn't really possible.

Luckily for those using services, I believe that this was a case of connection/infrastructure loss rather than all the data, so I hope that Shaw is working their a**es off to get things back.

Re:Limitations by theshowmecanuck · 2012-07-13 10:42 · Score: 1, Interesting

It's kind of funny for me to hear someone call Shaw one of the "big boys" after working on a number of Telecom projects in the USA. In the scheme of things Shaw would only rate being maybe a tier 3 player. Their maximum customer base is maybe 8 million potential people .... not households (and I'm presupposing they are in Saskatchewan and Manitoba now otherwise subtract a couple million). And they compete with Telus and Manitoba Tel and Sasktel (or whoever it is there). That's definitely tier 3 or smaller. Heck, Bell Canada and Rogers are considered tier 2 and their market is probably 15 million or more. FYI Sprint USA has 50 million plus accounts, AT&T has at least 300 million accounts.
I told a Telus (Shaw's major and bigger competitor) senior manager in a conversation one time that it was ridiculous that they made employees pay for their coffee and had the gall to close the cafeteria at 2:30. He said "do you know how many employees we have?! 20,000!" I said I had worked on campuses of companies that had more people than that... and they bought the coffee... I guess you don't think much of your employees. He didn't believe there were business campuses that big. People in Canada don't get that we're not that big in the scheme of things worldwide, and like it or not it's why we have to work with our neighbor and stop insulting them with our Napoleon syndrome antics. Land area doesn't make up for a relatively small population.

--
-- I ignore anonymous replies to my comments and postings.

Captain Obvious by roc97007 · 2012-07-13 10:01 · Score: 2

> No doubt this has been a hard lesson on how NOT to host critical public services.

And no doubt the lesson was not learned.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

Not surprising by Anonymous Coward · 2012-07-13 10:04 · Score: 3, Informative

There are buildings all over the US that can have a similar effect but worse. In Seattle it would be the Westin Tower, get the two electrical vaults in that building and you'll pretty much take most phone service, internet service and various emergency agency services all over the state offline for a while.

What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel and a couple of other big online services, it also took out Verizon's FiOS service for western Washington.
http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
(apologies don't comment a lot and don't know how to properly link)

The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin Tower example).

Transformer fire? by Glendale2x · 2012-07-13 10:04 · Score: 1

Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.

--
this is my sig

Re:Transformer fire? by Anonymous Coward · 2012-07-13 10:07 · Score: 0

I find keeping Transformers outside is bad for the chrome parts. Besides, do you really want Megatron where any kid could find him?

911 not down by CaptainPuff · 2012-07-13 10:11 · Score: 2

911 service was not down; only customers using Shaw as their phone service provider were unable to access it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely that's also what is deemed to have redundancy (those ones probably have more than one ISP) while non-essential services don't.

Re:911 not down by Anonymous Coward · 2012-07-13 11:37 · Score: 0

only some downtown customers
Re:911 not down by Svartormr · 2012-07-13 14:43 · Score: 1

Well, I was standing in a hospital emergency several hours after the initial service loss, watching the staff fall back on paper systems. And many commonly used services, like finding out what medications a patient was on by checking a shared database used by pharmacists, were unavailable. No single event like this outage should have degraded all these services to uselessness.

Not just Shaw's network was affected by Anonymous Coward · 2012-07-13 10:13 · Score: 1

Our primary internet connection was Bell and Shaw was our backup. To our surprise Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.

To get back on the internet without having to fail-over to our DR site we came up with a crazy solution of hooking up a Rogers Rocket Hub. The damn thing worked without our ~85 employees and 3 remote users noticing a difference.

Over the next few weeks we will be canceling all of our Shaw services, signing up with Enmax for our primary, and bumping Bell down to our secondary.

Re:Not just Shaw's network was affected by gen0c1de · 2012-07-13 15:23 · Score: 1

Have fun with Enmax, their stability record is pretty awesome. Try getting any kind of service on a weekend: their NOC number on the weekends goes to a pager and they will call you back within the hour. I deal with them so often it isn't funny. And watch out, they contract out the last mile in many of their build-outs. They like using Telus and Shaw and, best of all, they won't tell you.

Poof! by Antipater · 2012-07-13 10:15 · Score: 1

Sounds like the datacenter heard about this "cloud" thing and decided to give it a try.

--
Everything is better with chainsaws.

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 10:15 · Score: 0

Shaw has other buildings in the city. They should have used that for redundancy.

What really happened... by Anonymous Coward · 2012-07-13 10:18 · Score: 4, Interesting

Shaw had a generator overheat and literally blow up which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers and in turn, the water shut down the backup systems.

Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building but even more stupid was the fact that they used a water based sprinkler system in a bloody telecom room.

Also, Alberta has this wonderful thing called Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind but nooo, why use something you have already paid for when you can spend more money and use something different.

Re:What really happened... by Anonymous Coward · 2012-07-13 10:40 · Score: 0

+1 Truth, from another AC who was there.
And SuperNet wasn't private-for-profit, so it must have been bad. This is Alberta after all.

Water + equipment = magic smoke escaping by Anonymous Coward · 2012-07-13 10:22 · Score: 0

The worst thing about this is that someone designed the fire suppression systems for the DC and electricals with water. Not halon, CO2 or foam.... Water. That's just past common sense, it's pure negligence or incompetence.

Re:Water + equipment = magic smoke escaping by corychristison · 2012-07-13 10:59 · Score: 1

Halon has been banned for quite some time now. The replacement, Halotron, was just recently restricted.
CO2 would make the most sense.
The problem from what I understand was the generator room. A CO2 system would be ideal in this situation, you're dealing with lubricants as well and foam just makes an awful mess.
What I don't understand is how the sprinkler system was involved at all. When that valve bursts, it only flows at the affected area. It's not like the movies where if one pops the whole system goes off.
I don't know the building in question but I worked in the fire protection industry a short while.
Re:Water + equipment = magic smoke escaping by Anonymous Coward · 2012-07-21 14:13 · Score: 0

Halon replacement is FM200, which IIRC unfortunately produces toxic, corrosive compounds in the presence of flame or red-hot metal. Duh. Consequently they're supposed to be installed along with exhaust ventilation for post-event use. Even so, bug out if your system triggers.
Better choice is Inergen: fully breathable, inert, etc. Downside is cost and space requirements.
I believe data centers which use water actually use a mist (not spray) for sensitive equipment. Cheap, little space required, and you don't wreck everything during an event. I guess the generators got the full sprinkler treatment :-(.

It was so bad.. by Megahard · 2012-07-13 10:23 · Score: 5, Funny

It caused a stampede.

--
I eat only the real part of complex carbohydrates.

Re:Maybe the city/provinces should skip on redunda by sociocapitalist · 2012-07-13 10:33 · Score: 1

The issue with the city/provincial critical services is that they didn't have geographical redundancy due to the cost. Yes the building had redundant power, and networks but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but all the essential servers completely fucked up.

Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a city with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.

--
blindly antisocialist = antisocial

Re:Such lessons are never learn in Alberta.. by Anonymous Coward · 2012-07-13 10:35 · Score: 0

..where everything must be privatized for profit.

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 11:01 · Score: 0

It'd be interesting to see Shaw's quarterly profits against the cost of making life and death services geographically redundant.
Why are our telecommunications companies allowed to operate with minimal to no real competition as private entities?

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 11:08 · Score: 0

Calgary population is about 1,200,000. Alberta has had a long succession of conservative governments. Spending money on health care infrastructure is not that high on their list of priorities. They're all about oil, pickup trucks, big hats and small government. This kind of thing is the natural result.

don't jump to conclusions by Anonymous Coward · 2012-07-13 11:16 · Score: 0

There are way too many assumptions going on with this story. There's more than one company in the building and not all issues reflect on Shaw, as is mostly being reported. There's also an IBM datacentre located in the building and that's where the Alberta Health Services stuff resides. There's also a lot of shared infrastructure, but when water is everywhere due to the transformer explosion and they cut power to the entire building...well, what does one expect?

As with anything, issues like this need to be learned from and not turned into a blame circus.

-Just someone who's had previous experience in the building in question

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 11:33 · Score: 0

As already mentioned, health services were not hosted by Shaw.

Re:Such lessons are never learn in Alberta.. by roc97007 · 2012-07-13 11:39 · Score: 1

Yeah, if it was entirely government owned, it'd be rock solid and cheaper, too.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.

Re:Maybe the city/provinces should skip on redunda by snowraver1 · 2012-07-13 11:44 · Score: 2

The problem wasn't necessarily with Shaw. Shaw's problems were relatively minor. Internet and television services were affected over a small geographic area (downtown Calgary). Those affected by the Internet outage who also had Shaw Home Phone, couldn't use their phone as the network was down. If they called 911 on a cell phone or a land line, they would have received help.

The real problem was with the datacenter that is housed in the same building. 20,000 consumer class Internet outages is nothing compared to (estimate, based on almost nothing) 5,000 servers going down. The Fire Dept was involved, so power is going to get cut whether they like it or not, but there were still other problems. There are reports that the backup generators didn't kick in (whether or not that would have avoided the outage, I don't know). I have received indication that if those backup generators had worked, service could have been restored slightly faster (I'm hearing this all 3rd party, so salt is needed).

IBM runs the datacenter where the servers live. Who screwed up? IBM? Shaw? Someone else? I don't know. We'll have to wait and see.

--
Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.

Re:Maybe the city/provinces should skip on redunda by snowraver1 · 2012-07-13 11:52 · Score: 2

Oh yeah, I also heard that IBM will be incurring HUGE fines from SLAs. I think I heard some obscene number like 1M/minute.

--
Copyright 2010. All rights reserved. This comment may not be copied in any way including, but not limited to caching.

in some buildings / data centers the fire system by Joe_Dragon · 2012-07-13 11:57 · Score: 1

In some buildings / data centers the fire system can kill most of the power.

Re:in some buildings / data centers the fire syste by Chris+Mattern · 2012-07-13 12:11 · Score: 1

Which is a *good* thing. Fire and live electrical systems don't mix well.

Systems on systems by Anonymous Coward · 2012-07-13 12:14 · Score: 0

There is a public building. Inside this building there is a level where no elevator can go, no stair can reach and it has no working network backup. This level is filled with flaws. These flaws lead to many catastrophes. Unpredictable catastrophes. But one flaw is special. One flaw leads to the source of all other problems.
This building is protected by a very secure system. Every alarm triggers da'bomb for public services. But like all systems it has a weakness, the system is based on the rules, regulations and the budget of the building. One system built on another. If one fails, so must the other.

OK, you read the headlines, now some FACTS by Anonymous Coward · 2012-07-13 13:14 · Score: 1

'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.

'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
Grossly overstated. 'Large swath' I can accept as the impact was far reaching, sure. Only 30,000 downtown core subscribers were affected (and service was restored to them quite quickly). In a geographic location with MILLIONS of customers, this is hardly a 'large swath' of customers though - a bit of a stretch.

'local radio stations'
THREE radio stations: one country (I think most ppl were pretty happy Country 105 was off the air), and two talk radio. One of which was just a studio, and affected only one particular show. So, really - TWO radio stations.

'emergency 911 services'
Completely FALSE and over-hyped by media. The Shaw VOIP customers couldn't access 911 - yes. But as long as they had access to a cell phone, or lived outside the affected area, 911 was up the whole time. Some EMS systems were affected, though. However they have a backup analogue radio system should the digital system go down - so, nothing catastrophic here.

'provincial services such Alberta Health Services computers, and Alberta Registries'
Only SOME AHS computers were affected, in the Calgary area only. The rest of the province had no email or VPN until this morning. Big deal, we can live without email...just phone or fax someone. Alberta Registries was hit pretty hard (as in, completely offline) - but is allowing a very generous grace period if anyone needs to renew a license or some such. Not a big deal.

'One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well''
Sigh. Again, way off base. Yes, there were 'backup' systems - comprised of a UPS system that suffered water damage. However, no servers in the data centre suffered any water damage. They were worried about condensation, but that turned out to be a non-issue.

'No doubt this has been a hard lesson on how NOT to host critical public services'
No, this is a lesson for the submitter, if he or she is really interested in clear reporting, to avoid the word 'not'. So, a cleaner sentence might be:
"No doubt, this is a hard lesson on how to host critical public services with clustering across sites, thereby avoiding a single point of failure." ...and that was the biggest mistake - a design flaw. The building design aside (13th floor = mechanical room. A transformer blew, triggering the sprinkler system. The backup generators engaged, but the battery room had already suffered water damage and shorted out with the high load. When the fire marshal and building ppl arrived, they simply cut power to the building entirely, as water in the bus ducts - "wire trays" for non-construction types - was found), the REAL lesson here is what was already mentioned to AHS execs, Shaw managers and IBM Global - too many eggs in one basket. Instead of fire suppression via water, use halon. Instead of the 'backup systems' (what a JOKE) in the SAME BUILDING, configure clustered services across two or (better yet) more sites.

But - we're just geeks, not execs...what do we know?

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 13:16 · Score: 0

were you sitting in a Stampede beer tent at the time?

It has to be said by Phibz · 2012-07-13 13:18 · Score: 0

I used to be a City until I took an arrow to the knee.

Single Points of Failure by AB3A · 2012-07-13 13:35 · Score: 1

People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

You may have a very good internet presence with lots of bandwidth, but it may be all housed in the same building where the same sprinkler system can bring it all down. You may think that ISPs can reroute lots of traffic to other places because it is possible. Yet, there are common failure modes there too.

Cloud computing is often hailed as a very resilient method for infrastructure. Yet, there is a disturbing tendency to focus all the servers in one big glass room of everything. You may get the dynamic pay per clock-cycle performance, but it may all come back to one substation. A single fire in that substation could bring everything down.

This is the problem with SLA deals: You don't know what kind of planning they may use for such infrastructure. Remember, the Internet itself may be resilient, but your cloud and your ISP may not be.
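For what it's worth, the check I have in mind is simple enough to sketch. This is only a rough illustration: the inventory, the names (tower-a, sub-1, isp-x) and the function are all made up, and you would have to fill in the real physical placement from whatever your provider is willing to tell you.

```python
# Hypothetical inventory: each replica of a service and the physical
# things it depends on. All names and values are invented for illustration.
replicas = {
    "primary": {"building": "tower-a", "substation": "sub-1", "isp": "isp-x"},
    "standby": {"building": "tower-a", "substation": "sub-1", "isp": "isp-y"},
}


def shared_failure_domains(replicas):
    """Return the dependencies that every replica has in common.

    Anything returned here is a single point of failure: lose it and
    all replicas go down together.
    """
    placements = list(replicas.values())
    if len(placements) < 2:
        # A lone replica shares everything with itself.
        return dict(placements[0]) if placements else {}
    first = placements[0]
    return {
        kind: value
        for kind, value in first.items()
        if all(p.get(kind) == value for p in placements[1:])
    }


print(shared_failure_domains(replicas))
# prints {'building': 'tower-a', 'substation': 'sub-1'}
```

In this made-up case the two ISPs look redundant on paper, but both replicas sit in one building behind one substation, so one fire takes out both.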

--
Nearly fifty percent of all graduates come from the bottom half of the class!

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 14:19 · Score: 0

Never fear, IBM started flying tape backups to an alternate datacenter (datacentre?), probably in Ontario...

IBM by Anonymous Coward · 2012-07-13 16:57 · Score: 0

Didn't IBM also host stuff in one of the World Trade Center towers, and have the backups in the second tower?

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 18:58 · Score: 0

"Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government."

Ahh, but here in the real world, IT IS. Why is it people think money is always no object? That all redundancy is free, and any lack of said redundancy means someone should be fired.

Go rule your little make believe world in some shitty flash based "Sim IT Manager" game or something, ok?

Re:in some buildings / data centers the fire syste by Anonymous Coward · 2012-07-13 20:01 · Score: 0

In some buildings they use nitrogen gas to extinguish fire instead of water. Obviously this requires immediate evacuation of all people/animals, but this is fairly easy in a standalone building dedicated to a datacenter.

You don't need to be at the mercy of the fire department killing your power system in all cases.

Re:Maybe the city/provinces should skip on redunda by Anonymous Coward · 2012-07-13 20:17 · Score: 0

Learn something please: http://en.wikipedia.org/wiki/List_of_the_100_largest_metropolitan_areas_in_Canada

Oblig xkcd by Anonymous Coward · 2012-07-13 21:02 · Score: 0

People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

The Cloud

this fire by Anonymous Coward · 2012-07-14 03:38 · Score: 0

My company relies on that data center to receive all of our EDI data from Union Pacific and Norfolk Southern. So that was all down for like 18 hours. It kinda sucked.

IBM and their customers fault by phizman · 2012-07-16 16:29 · Score: 1

Yes, Shaw had an issue and some local neighbourhood services related to Shaw went down, but more to blame for the large outages is IBM, for housing redundant systems in the same building, or the government customers, for buying a less than adequate redundancy solution. Always ask for a physical diagram in addition to the logical diagram :)
