State of Virginia Technology Centers Down

Posted by Soulskill on Friday August 27, 2010 @04:00AM from the your-tax-dollars-at-work dept.

bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrop Grumman $2.4 billion over 10 years to manage the state's IT infrastructure." Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."

20 of 190 comments

HA fail by Anonymous Coward · 2010-08-27 04:03 · Score: 4, Insightful

How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
1. Re:HA fail by cgenman · 2010-08-27 04:39 · Score: 4, Interesting
  
  Also, this can happen when you hire an external firm to manage something that you should be managing yourself. External managers for projects like this are motivated by extracting as much money as possible from you. Internal departments of technology, by comparison, are motivated by convincing co-workers to not shout at them.
  
  --
  The ______ Agenda
2. Re:HA fail by Wyatt+Earp · 2010-08-27 05:08 · Score: 3, Informative
  
  Sweet Zombie Jesus.
  If the RAM in our 8TB Netgear SAN fries it doesn't blow up my office, what the hell are they and Northrop Grumman doing?
3. Re:HA fail by pnutjam · 2010-08-27 05:35 · Score: 3, Funny
  
  I started working for a city government earlier this year, let's just say I was amazed, I won't qualify it as amazed in a good way or bad way, but, you know...
  
  --
  Cheap storage VM.
4. Re:HA fail by wkcole · 2010-08-27 05:49 · Score: 4, Interesting
  
  How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
  RTFA!
  The problem was a dual (or worse) failure. What the article reveals is that while they may have had all of the right hardware in place and a mechanism for it to handle the most likely failures, they were missing the 'soft' components of a good HA system: routine testing of failover and a rapid repair plan. In the auto industry where failed systems can halt factories and rack up hundreds of thousands of dollars of cost per hour of downtime, it is the norm for HA systems to have frequent failover tests, to have on-site spares for critical components that can be replaced by on-site staff, and to have support arrangements that put a skilled human on-site with replacement hardware in a small amount of time. This is why traditional "enterprise class" systems are so expensive. They are designed for rapid diagnosis and repair, and a well-run enterprise that needs truly HA systems pays for expensive HUMAN support by their own staff and/or from IBM, Sun^WOracle, EMC, HP, etc. and monitoring systems on top of that. If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...) and have the right staff, processes, and support contracts in place, you will find nearly all of the latent failures and have them fixed before a true production failure exposes them.
  The most appalling thing about this to me isn't the failure. Some systems don't have safe times for testing failovers, and I know from personal experience that a component in an HA system that was working perfectly Saturday and has been idle since Sunday can go tits-up when needed on Wednesday. The real problem is the long outage. If the clowns in the VA state government were doing their jobs, they would not have a system like this without vendor support contracts to fix well-defined hardware problems (e.g. "bad memory card" ) within a few hours at most. This was something I always loved about working in a shop with the top-grade EMC contract. The Symmetrix and its associated gadgetry would call EMC about failures and we'd have a tech show up at the DC with parts before we even noticed anything unusual: costly, but nowhere near as expensive as killing all of the SAN-reliant systems for a random day every 3 years. The 4th 9 is not cheap or simple, because it always requires humans.
5. Re:HA fail by ultranova · 2010-08-27 08:15 · Score: 3, Insightful
  
  B-b-but you're saying that the bloated corrupt government that takes money from people at gunpoint and has no incentives for efficiency might have done a better job than a private contractor that works on the God-given free enterprise system that rewards efficiency and punishes waste!
  
  On the contrary, the free market did exactly as it was supposed to: it eliminated the inefficiency of redundant systems and a safety margin. Efficiency or the safety of redundancy, you can have one or the other but not both. That's why any important system should be managed by the government, and free enterprise should be limited to the role of logistical optimization it's actually good at.
  Unfortunately some people nowadays consider the free market their religion, so we got deregulation and the resulting financial crisis. Oh well...
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
6. Re:HA fail by cgenman · 2010-08-27 09:46 · Score: 4, Insightful
  
  If you're big enough that you're not just going to be scaling staff up and immediately down again, hire your people in-house. It's not a question of government vs private companies. It's a question of hiring your best people to be on staff, or outsourcing to someone who doesn't have the same motivations. This is true if you're a government, a corporation, a private entity, or a high school marching band. Plus the markup on external IT services is just obscene.
  Poorly managed projects will be poorly managed internally or externally. But externally poorly managed projects are a lot more expensive, and harder to rein back under control.
  
  --
  The ______ Agenda
Re:card? by snookerhog · 2010-08-27 04:06 · Score: 3, Insightful

sounds like nobody in Virginia knows either
They need a better network admin by Nemesisghost · 2010-08-27 04:07 · Score: 4, Funny

Maybe they should hire Terry Childs, at least he won't let their network go down for something like this.
Re:card? by Culture20 · 2010-08-27 04:08 · Score: 3, Informative

A technically correct term, albeit against normal colloquialism which calls them memory chips. Memory chips are the black things on the cards.
Awful. by boneclinkz · 2010-08-27 04:11 · Score: 4, Insightful

Our primary concern should be a complete audit of World of Warcraft server hardware, to ensure that this vulnerability does not exist in other, more vital networks.
Re:It's always money by Daniel_Staal · 2010-08-27 04:14 · Score: 3, Insightful

Add in politics: Get a couple of representatives arguing over where the money (if any) should be spent, and all possibility of real redundancy and fault-tolerance goes out the window.
It's true in larger government organizations than this. The failures just haven't occurred yet.

--
'Sensible' is a curse word.
Typical liberal overreaction by BitHive · 2010-08-27 04:27 · Score: 5, Funny

Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.
Re:It's always money by cgenman · 2010-08-27 04:49 · Score: 4, Insightful

Everyone seems to think that a network outage is no big deal, until the network goes down. That's when people start thinking of the burn rate of an entire organization sitting on their thumbs while that network of off-the-shelf Linksys routers is replaced by some kid at Best Buy. Or how that 5k dollars per year for a backup external line suddenly pales in comparison to the 5k dollars per hour your organization is wasting because you were a cheap bastard.
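
To put rough numbers on that trade-off, here is a minimal sketch using only the two figures from the comment above (5k dollars per year for the backup line, 5k dollars per hour of idle organization). The function name and the assumption that the backup line avoids the whole outage are illustrative, not taken from any real budget.

```python
def break_even_outage_hours(backup_cost_per_year: float,
                            downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per year at which the backup line pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return backup_cost_per_year / downtime_cost_per_hour


# Figures quoted in the comment: 5k dollars/year vs. 5k dollars/hour.
hours = break_even_outage_hours(5_000, 5_000)
print(f"Backup line breaks even after {hours:.1f} hour(s) of avoided downtime per year")
```

With those figures, a single avoided hour of outage per year already covers the cost of the second line.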

--
The ______ Agenda
Re:Question. by MightyMartian · 2010-08-27 04:58 · Score: 3, Informative

Well, as Sherlock Holmes' greatest axiom goes "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." Using that logic, the answer is simple. They're not using a SAN. Somewhere along the line someone is bullshitting, and my gut tells me it's management. A lot of folks who get government contracts pretty much view them as an opportunity to skim off the top. Why, take what should be a $50,000 solution and mock something up for $10,000, and that's $40,000 profit.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Even funnier by SteveFoerster · 2010-08-27 04:59 · Score: 3, Interesting

As a leftover from when Virginia-headquartered AOL was the king of connectivity, you see license plates here in Virginia touting us as the Internet Capital.

--
Space game using normal deck of cards: http://BattleCards.org
Re:It's always money by geekoid · 2010-08-27 05:09 · Score: 3, Interesting

This is a private sector failure. NG is the culprit here, not the government.
This is why you should be very wary of bidding out work to a 3rd party. They don't care about your city. They are not thinking about how their decisions impact the city in 10-20-50 years.
And while infrastructure is far more complex and expensive than people who don't deal with it realize, 2.4 billion over 10 years? 240 million a year? That is a price where they should have a tested redundancy system. A single point SAN failure? Shame on NG.
I hate to burst your preconceived bubble, but as my years in the private sector and public sector have taught me, most government agencies are far better at keeping their own infrastructure. More reliable and long-standing.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Question. by Darth_brooks · 2010-08-27 05:13 · Score: 3, Interesting

Depends on the SAN. The article (as most tech articles are) is very short on scope & details. So "one chip" went bad. Should that bring everything to a screeching halt? The answer should be "no" but in practice we can all say that it's more often a case of "not usually." From TFA:
It was hailed as being able to suffer a failure to one part but continue uninterrupted service because standby parts or systems would take over. But when the memory card failed Wednesday, a fallback that attempted to shoulder the load began reporting multiple errors, Nixon said.
So Array Alpha shits the bed. You follow your failover procedures and start running on Array Zappa. That immediately starts throwing errors. Ok armchair QBs, let me switch to my Keanu Reeves voice and ask "What do you do?" You built a pretty damned redundant system there and you're still down. Sure, it'd be nice if they had a backup in another DC they could fail to, but they don't. Doesn't matter, eventually you're playing the double / triple / quadruple hulled oil tanker game. Either way, redundant SANs aren't cheap and aren't all that easy (it's not exactly a "the boss's nephew who 'knows all about computers' set it up last weekend" level of complexity.) TFA also has these points:
Full function may not be restored until Monday.
Experts who examined the system determined that no data were lost except for those being keyed into the system at the moment it failed, Nixon said.
Other than the fact that proofreading and the usage of proper grammar are no longer requirements to work for a Virginia newspaper, what do those points tell us? Sounds to me like they hit the last line in the DR procedures: Restore from backup. Depending on what their backup strategy is (maybe they're splitting several terabytes across a tape robot that only supports 200/400gig tapes because that robot is the only device the vendor supports) and how truly important the affected system is (this may be a system where the powers that be said "fsck it, they can process renewals by hand and we'll bring everything back up on Monday after we test on Saturday"), a return to business on Monday might be SOP. But that wouldn't sell newspapers (or make talking points with the voters...) now, would it?
Maybe there was a major screwup here. Maybe they never tested their failovers and maybe that 2nd SAN was bad out of the box. I'm a little more willing to cut some slack and say "man, that sucks. Glad it's not my ass on the line." Karma's a bitch like that. I like to take these stories as an opportunity to rethink what my own single points of failure are rather than point & laugh and tell everyone how I'll never lose any data because I'm running RAID 5......
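
For anyone wondering how much routine failover testing actually buys you, here is a rough sketch of the usual back-of-the-envelope model. It assumes the standby develops hidden faults at a constant rate, that every test finds every fault, and that repair is instant. The fault rate and intervals below are made up for illustration; nothing here comes from TFA or from VITA's setup.

```python
import math


def standby_unavailability(fault_rate_per_day: float,
                           test_interval_days: float) -> float:
    """Average chance the standby is silently broken at the moment it is needed.

    Assumptions (all hypothetical): latent faults arrive at a constant rate,
    each test resets the standby to "known good", and the primary can fail
    at any point in the test interval with equal likelihood.
    """
    x = fault_rate_per_day * test_interval_days
    if x <= 0:
        return 0.0
    # Average of 1 - exp(-rate * age) over ages uniform in [0, interval].
    return 1.0 + math.expm1(-x) / x


# Hypothetical standby that develops a hidden fault about once per 1000 days.
rate = 1 / 1000
for label, interval in [("weekly", 7), ("yearly", 365), ("every 3 years", 3 * 365)]:
    p = standby_unavailability(rate, interval)
    print(f"tested {label:>14}: standby dead when needed ~{p:.2%} of the time")
```

Under these assumptions the risk of a double failure grows roughly in proportion to the gap between tests, which is the whole argument for failing over on a schedule instead of waiting for production to do it for you.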

--
There are some people that if they don't know, you can't tell 'em.
It happens by kilodelta · 2010-08-27 05:30 · Score: 3, Informative

But $2.4 billion over ten years comes out to $240,000,000 per YEAR! With that kind of money they could replace their infrastructure a few times over every year.

This is a clear example of the malfeasance that happens when government gets corrupted by corporate interests. Taxpayers in VA should be up in arms about this one.

Here's my story of state agency screw-ups. Two jobs ago I was working for the Secretary of State's office here. We had the opportunity and funding to get our IT infrastructure in order when the Help America Vote Act (HAVA) became law. We were able to build out a secure and redundant room to house our critical infrastructure.

Physical access by key and alarm code only. Redundant power, which included an APC Symmetra UPS system, backed up by a 125kW natural gas fired generator. Even made sure to extend tendrils from the redundant power out to the MDF so the ISP could use our power system. Also had redundant cooling tied to the generator.

The one Achilles Heel of the operation was DNS. Ours was provided from outside our space. I suggested they build a zone locally so that we'd have DNS services if the state's went down. But they quashed it as being too difficult! Ut si!

Well one day there's a massive power outage in the city. They were still up and running, lights on, air conditioning on but couldn't get in or out of the internal network even though the ISP circuits were still up. Yup, DNS!
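
The general idea of not depending on a single outside resolver can be sketched in a few lines. To be clear, this is an application-level fallback table, not the local DNS zone proposed above, and the host names and addresses are invented placeholders. The only real library call used is `socket.gethostbyname`.

```python
import socket
from typing import Callable, Dict

# Hypothetical internal records; a real site would serve these from a local zone.
LOCAL_ZONE: Dict[str, str] = {
    "intranet.example.internal": "10.0.0.10",
    "fileserver.example.internal": "10.0.0.20",
}


def resolve(name: str,
            upstream: Callable[[str], str] = socket.gethostbyname,
            local_zone: Dict[str, str] = LOCAL_ZONE) -> str:
    """Ask the upstream resolver first; fall back to the local copy if it fails."""
    try:
        return upstream(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        if name in local_zone:
            return local_zone[name]
        raise  # no local record either, so surface the original failure


def upstream_is_down(name: str) -> str:
    """Stand-in for the outside DNS service during the power outage."""
    raise OSError("upstream DNS unreachable")


# Simulated outage: internal names still resolve from the local copy.
print(resolve("intranet.example.internal", upstream=upstream_is_down))
```

The lights, cooling and ISP circuits can all be redundant, but if name resolution has exactly one source, that source is the outage.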
Re:It's always money by cgenman · 2010-08-27 05:52 · Score: 3, Informative

"get 100 units and run them for 6 months..."
Which works if you presume a linear fail rate, which is bonkers. Systems always run better at the beginning of their lifecycle. Static buildup, electrical interference, repeated heating and cooling cycles, etc all take a toll on the electronics. Would you really personally estimate a real-world MTBF of off-the-shelf SATA drives at 70 years? No, because they work perfectly well for the first year, start having trouble the second, and are all dead by the 8th. But if you presume linear dropoff using just that first year of testing, they look pretty damn bomb proof because that's when they work best. It's a stupid system that's only valid if you replace all of your hardware every year.
And all systems have moving parts. Electrons move. The circuit boards expand and contract. Crap builds up on important components. Electroplating can move metals from one part of the design to another. Stuff gets plugged in and unplugged.
I realize that MTBF has a very technical definition that is different from how marketing departments use it. I might agree with you that any engineer worth their salt can extrapolate a proper MTBF. But most of the MTBFs I've seen are just stupidly wrong. If people really believe those published fantasy numbers, no wonder they don't put enough redundancy in their systems.
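
Here is a small sketch of why the "100 units for 6 months" number can look so good. It compares the constant-failure-rate extrapolation against a Weibull wear-out curve fitted to the same short test. The single failure, the shape parameter of 3 and the 8-year horizon are illustrative assumptions, not measurements of any real drive.

```python
import math


def naive_mtbf_years(units: int, test_years: float, failures: int) -> float:
    """MTBF under a constant failure rate: total unit-time divided by failures."""
    return units * test_years / failures


def exponential_survival(t_years: float, mtbf_years: float) -> float:
    """Fraction still alive at t if the failure rate never changes."""
    return math.exp(-t_years / mtbf_years)


def weibull_scale_from_test(test_years: float, surviving_fraction: float,
                            shape: float) -> float:
    """Weibull scale that reproduces the surviving fraction seen at the end of the test."""
    return test_years / (-math.log(surviving_fraction)) ** (1.0 / shape)


def weibull_survival(t_years: float, scale: float, shape: float) -> float:
    """Fraction still alive at t under a Weibull model (shape > 1 means wear-out)."""
    return math.exp(-((t_years / scale) ** shape))


# Hypothetical test: 100 units, half a year, 1 failure (99% survive).
mtbf = naive_mtbf_years(100, 0.5, 1)
scale = weibull_scale_from_test(0.5, 0.99, shape=3.0)  # shape 3 is an assumption

print(f"naive MTBF: {mtbf:.0f} years")
print(f"alive at 8 years, constant rate: {exponential_survival(8, mtbf):.1%}")
print(f"alive at 8 years, wear-out model: {weibull_survival(8, scale, 3.0):.1%}")
```

Both models agree with the six-month test almost exactly, yet they disagree completely about year eight, which is why a short test on new hardware says very little about lifetime.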

--
The ______ Agenda