Power Problems Force Seattle To Throttle City Data Center For Days
Nerval's Lobster writes with an excerpt from sister site SlashDataCenter: "On Aug. 23, Mayor Mike McGinn of Seattle informed residents that the city would partially shut down its municipal data center for five days including the Labor Day weekend. As a result, city residents will be unable to pay bills, apply for business licenses, or take advantage of other online services. In a Webcast press conference, McGinn isolated the issue as a failure in one of the electrical 'buses' that supplies power to the data center. Because that piece of equipment began overheating, the city had to begin taking servers and applications offline to prevent overloading the system. The maintenance will cost the city $2.1 million of its maintenance budget. A second power bus will remain operational, supplying enough electricity to power redundant systems for critical life and fire safety systems, including 911 services and fire dispatch. The city's Web sites should also be up and running in some capacity."
That should help the situation.
While looking at the prospect of moving to Seattle, I've repeatedly read that the city is in political gridlock and seems totally unable to get any meaningful long-term infrastructure additions put in place despite wide support for them. It seems to me this is the case in many cities out there, but can anyone say what it's like in their city?
I suspect that the first ones to finally get something useful done (rather than just repaving a few highways) will be the ones to reap a lot of growth when this recession finally begins to fade away.
-
Interesting that this is not on the front page of the Seattle Times. In fact, I can't find it at Washington's biggest paper at all.
If you want news from today, you have to come back tomorrow.
Problem is that it contained at least bill-paying and other personal information, and the government has very strict laws about how stuff like that is to be stored and processed. Even if they can get over the red tape just for using a private vendor, there are laws saying that they have to use vendors that meet certain special interest criteria, and then automatically pick the lowest price because of budget laws. In the end they get a data center that is down on weekends, holidays, and all other cruft for a freaking fortune. I remember when one of my relatives told me a similar tale: they were forced by all sorts of stupid laws to buy office chairs (padded, but still pretty much junk) for a government office for hundreds or thousands of dollars apiece, chairs that would cost at the very most $100 at the local OfficeMax.
Feel free to bid out the project, and then see if it's worthwhile. Check out the costs of having something under another entity's control.
In this case, they COULD pay the extra to have this fixed while running, but for them, I'm guessing the temporary shutdown over a 3-day weekend is the more cost-effective option. With two days on either side, it's hardly a gross inconvenience. The city's key operations will go on.
Or just assume that contracting out is the better way, and sprinkle on a little magic fairy dust from the Cloud.
If you lived in podunk nowhere, then no, probably not; if emergency services continue to operate it wouldn't be a big issue. But for such a large municipality to go dark for 5 days... that would definitely have an impact locally, and possibly regionally/nationally to a smaller degree. Emergency services are very important, but the business of government (no matter how I feel about it from time to time) needs to continue and serve its people... I am sure (at least I hope) that they looked into portable power generation, but it seems that this is a poor solution. Just my 2 pennies.
Chief Thinker www.devotedskeptic.com
If bills don't get paid, there better not be any late fees imposed. The banks could make millions on this.
“He’s not deformed, he’s just drunk!”
iCarly will be pissed.
Sounds like Seattle's 911 system is quite fragile.
"If any question why we died, Tell them because our fathers lied."
If power problems are downing the city's datacenter for a holiday weekend, couldn't they just rent a few $100/mo servers and run the city apps on them for the downtime and make the problems transparent to the end user? No single-site setup is ever safe for important apps; we call that a Single Point of Failure around here.
I'm LostCluster but I lost my password to that user. Hey Slashdot, how about helping me get it back!
Seattle? The home of Amazon? Why on earth don't they just move their datacenter to Amazon Web Services? They could probably do it for less than the $2.1 million they're spending on this single part!
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
What I'm trying to figure out is why 911 and emergency services didn't have a separate offsite backup. I mean, how much more mission critical can you get than that? Every time I see one of these articles I think to myself: Why are they mentioning this if there wasn't some risk of failure? And the answer is... because quite obviously, there was some risk.
I don't want my cause of death to be "Your call could not be completed as dialed. Please check the number and try your call again later..."
#fuckbeta #iamslashdot #dicemustdie
Nice, so they're running their mission critical operations on reserve systems. Hope nothing too important happens while they're getting bombed by a /. post.
I've had a similar issue with a private data center. There wasn't a UPS bypass switch because the UPS had an internal bypass switch (installed with the datacenter years before). But the UPS was old, and a new UPS was cheaper than replacing all the batteries (and more powerful with better features). So my coworker planned out the switch, 2 days outage over a weekend. Of course, since I took most of the classes to be an EE, I re-drew the plans and got the project done with half the labor time and two 30-second outages (well, both were about a second, but longer than the time a server could live without power, so it was safer to turn everything off as if it were a longer outage). The problem was caused by a stupid "cost saving" choice on installation.
Sounds like something similar here, where there's an issue with part of the redundancy, but it's not actually capable of running fully redundantly. Otherwise, cut everything over, then fix it. Or just turn it off and fix it (and the power will flow). I've seen it more than once in the corporate world, so it's not an example of governmental oops, just IT oops.
Learn to love Alaska
The cloud doesn't need power?
Learn to love Alaska
Why don't they just fail over the critical life and fire safety systems to the backup datacenter, and keep normal services up at the primary datacenter while they do the work? They do have a second site, right? Surely no one would host a system deemed "critical" and "life safety" at a single site?
Overheating power buses / wires are a fire risk, and that comes from them being undersized for the load.
See The Towering Inferno to see where that can get you.
I had an almost identical situation happen to me this past spring, too. I was the sysadmin at one of the facilities. It happened right after I gave my two weeks, and damn was I busy. :P I ended up having to take all my UPSes off the mains and run them over some two phase at one point to get additional power onto a secondary genset, because the amp load simply was too high (oops, poor planning - someone forgot to figure high load overhead amperage requirements).
Unlike this situation, my situation only had a single power run due to the topographical location of where we were: on top of a hill/small mountain, on the edge of a park. There were 5 fairly sizeable facilities on the hill, some of which have some fairly significant power requirements due to the type of work they perform (lots of sciencey stuff).
Fortunately, all of the buildings had (100 KW+) gensets. Unfortunately, only one of the 5 was NG, and the others were diesel. This gets really costly, really quickly, since it's California, diesel's at something like $4.50/gallon, and the things will burn through a full 500 gallon tank in a day at around 60% utility. So we're talking ~$10k a day just to keep these things fueled (including an extra pulled up due to additional crunch demand).
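As a rough back-of-the-envelope check on that ~$10k figure, here is a minimal Python sketch. The function name is made up for illustration, and it assumes each diesel genset burns exactly one full 500-gallon tank per day at the quoted $4.50/gallon, with four diesel units (five buildings minus the NG one) plus the one extra unit mentioned above.

```python
def daily_fuel_cost(gensets: int, gallons_per_day: float, price_per_gallon: float) -> float:
    """Fuel bill per day for a group of identical diesel gensets (illustrative helper)."""
    return gensets * gallons_per_day * price_per_gallon


PRICE_PER_GALLON = 4.50   # quoted California diesel price
GALLONS_PER_DAY = 500     # one full tank per genset per day (assumption)

# Four of the five buildings run diesel; the fifth is on natural gas.
base_cost = daily_fuel_cost(4, GALLONS_PER_DAY, PRICE_PER_GALLON)

# Same four units plus the extra genset brought up for crunch demand.
cost_with_extra = daily_fuel_cost(5, GALLONS_PER_DAY, PRICE_PER_GALLON)

print(f"4 diesel gensets: ${base_cost:,.0f}/day")
print(f"5 diesel gensets: ${cost_with_extra:,.0f}/day")
```

Under those assumptions the daily bill lands between $9,000 and $11,250, which brackets the ~$10k estimate. A lower fuel price or burn rate scales the result linearly.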
Plant faculty - probably a good 30-60 people in all - were in the conduit going up the hill for a day trying to figure out where the fault was, and then another three days getting new cable run and relay substation. (God, I hate how slow many union workers work.) Turns out the relay fused up pretty solidly, welding itself nicely into the culvert.
I seem to recall talk back and forth that the total damage was going to be over $500,000, so it really doesn't surprise me that a large city's power infrastructure would cost a multiple of that. If cities are like some of the hospitals I've seen, they've got lecherous IT sales people at their door on an almost-daily basis. They also buy a lot of the crap the sales people are peddling, many of which seem to (still) require being run on their own proprietary platform and/or a dedicated piece of hardware. And then, the old systems don't really go away until they die, and there's a cost incurred to recover the lost data - because they're non-profit, they don't really seem to understand cost of maintenance, depreciation, or anything like that. So, I can certainly see the power requirements for some poorly designed cluster for public facing things, a handful or three of interface systems to tie in with the governmenty systems, and so on.
In my mind, it makes sense that they just shut those services down temporarily. "Forced vacation use" for city workers, maybe? They'll save a lot more than 2.5 million that way, if they can do it, I'm sure (funny how government is able to cut costs when there's no alternative :P). I imagine it's too much of a cost and/or risk to try to move essential services (fire/PD/911) to the hot site, and really no reason to do so, especially when they've not yet tested their DR plan.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
The cloud is somewhere else with their own power problems.
now we need to go OSS in diesel cars
The datacenter is on the 26th floor of the municipal tower and the overheating bus runs up to that floor. The power company in question is municipally owned; either way it would be the city's problem.
McGinn had quite a few facts wrong in the press conference. The equipment is working fine now and the overheating only caused a minor amount of downtime. The major issue though was the backup generator never kicked in because, as it turns out, the electric starter for the diesel generator is connected to the same bus. Labor Day weekend was then chosen to fix this majorly obvious design deficiency.
When was the last time a well-run private datacenter was offline for five days barring a natural disaster?
Boeing's datacenter in Bellevue, Washington (about 10 miles east of Seattle). About a decade ago, they had to shut down their entire operation because a purpose-designed and built data center had power problems and didn't have the system redundancy they thought it did. In this case, the problem had to do with the use of incorrectly specified parts in some panelboards (main lug bolts). When a few were discovered to be overheating due to loose connections, the extent of the problem was revealed.
That datacenter was supposed to have been designed with fully redundant systems, including two utility sources. But that turned out not to have been the case and the only solution was to shut down everything over a long weekend and replace the suspect parts.
Having been in the commercial power biz in addition to my time at Boeing, I have become aware of a number of inadequate data center power systems installed in the past few decades. Without pointing any fingers, it seems the Seattle area has suffered from a few engineering firms and contractors that rode the dot-com boom, swept in installing crap and moved on. The City of Seattle's IT infrastructure could be suffering from an additional problem in that they consolidated their city operations into a building that they picked up cheap. And this building was designed prior to the past few decades' growth in IT systems. So its data center power systems were a retrofit to that building, with all the shortcomings that this entails: not enough room to install redundant power risers, switchgear and other assorted equipment in an existing structure.
Have gnu, will travel.
Diesel here in CA is under $4, usually a few cents less than unleaded, and in any case, generators can use the road-tax-free supply (à la home heating oil), which drops the price significantly further still.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
If you've got enough compute resources to have your own data center, it's probably cheaper for you to have your own data center instead of paying someone else to do it for you. So then, you build your own data center, and you decide on compromising on certain things like power bus redundancy, because in any given data center environment there are a million things that can fail, but you have to prioritize the systems you make redundant by looking at their failure probability and expected failure impact. You can't make everything redundant. That would be foolish because you'd spend so much money on redundancy that you'd have no money left for functionality. You probably want redundancy for routers, switches, servers, storage, etc because that's all stuff that's likely to fail, but I'd bet most single tenant datacenters probably don't have power grid redundancy because that's really expensive, and not as likely to fail. You would probably be better served by staying on a single power grid and putting the money to bring another power connection in on a generator instead.
The point I'm trying to make is that there is a level at which you have to say something is "redundant enough". I think they call that the point of diminishing returns. I would say an overheating power bus is probably an acceptable failure because I would've considered that as something that has a pretty low failure risk, so I wouldn't have spent the money to have two of them.
Remember I'm talking about a single tenant, privately owned datacenter for a small entity here, i.e. a municipality that probably has somewhere between 100 and 500 servers. Naturally if you're a company that is in the business of doing business online, or a huge company, then this isn't the right path for you. At the end of the day, a municipality offers online services as a convenience for its residents. When the DC blows up, you can still write a check and drop it in the mailbox to pay your water bill.
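To make the "failure probability times expected impact" reasoning concrete, here is a minimal Python sketch. Every number in it is invented purely for illustration (nothing comes from Seattle's actual budget or failure rates), and the `Component` class and `prioritize` function are hypothetical names. It just compares each component's expected yearly loss against the yearly cost of keeping a spare.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    annual_failure_prob: float   # made-up chance of failing in a given year
    outage_cost: float           # made-up cost of one failure, in dollars
    redundancy_cost: float       # made-up annualized cost of a redundant unit

    @property
    def expected_loss(self) -> float:
        """Expected yearly loss if the component is NOT made redundant."""
        return self.annual_failure_prob * self.outage_cost

    @property
    def worth_duplicating(self) -> bool:
        """Redundancy pays off only when it costs less than the expected loss."""
        return self.expected_loss > self.redundancy_cost


def prioritize(components):
    """Sort by net benefit of redundancy, best candidates first."""
    return sorted(components,
                  key=lambda c: c.expected_loss - c.redundancy_cost,
                  reverse=True)


# Hypothetical inventory: frequent cheap failures vs. one rare expensive failure.
parts = [
    Component("power bus", 0.01, 2_000_000, 150_000),
    Component("server", 0.30, 50_000, 5_000),
    Component("switch", 0.10, 80_000, 4_000),
]

for c in prioritize(parts):
    verdict = "duplicate" if c.worth_duplicating else "accept the risk"
    print(f"{c.name}: expected loss ${c.expected_loss:,.0f}/yr -> {verdict}")
```

With these invented figures the servers and switches justify spares while the second power bus does not. Change the failure probability or outage cost for the bus and the verdict can flip, which is exactly where the argument over "redundant enough" comes from.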
When you figure out your critical services, you separate them and define another SLA which applies only to them. And from the article, it sounds like that is exactly what they did - they kept the critical life safety systems running and took down the convenience systems for an acceptable period of time. So what's all the fuss about?