Slashdot Mirror


State of Virginia Technology Centers Down

bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrup Grumman $2.4 billion over 10 years to manage the state's IT infrastructure." Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."

8 of 190 comments (clear)

  1. HA fail by Anonymous Coward · · Score: 4, Insightful

    How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.

    1. Re:HA fail by cgenman · · Score: 4, Interesting

      Also, this can happen when you hire an external firm to manage something that you should be managing yourself. External managers for projects like this are motivated by extracting as much money as possible from you. Internal departments of technology, by comparison, are motivated by convincing co-workers to not shout at them.

    2. Re:HA fail by wkcole · · Score: 4, Interesting

      How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.

      RTFA!

      The problem was a dual (or worse) failure. What the article reveals is that while they may have had all of the right hardware in place and a mechanism for it to handle the most likely failures, they were missing the 'soft' components of a good HA system: routine testing of failover and a rapid repair plan. In the auto industry where failed systems can halt factories and rack up hundreds of thousands of dollars of cost per hour of downtime, it is the norm for HA systems to have frequent failover tests, to have on-site spares for critical components that can be replaced by on-site staff, and to have support arrangements that put a skilled human on-site with replacement hardware in a small amount of time. This is why traditional "enterprise class" systems are so expensive. They are designed for rapid diagnosis and repair, and a well-run enterprise that needs truly HA systems pays for expensive HUMAN support by their own staff and/or from IBM, Sun^WOracle, EMC, HP, etc. and monitoring systems on top of that. If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...) and have the right staff, processes, and support contracts in place, you will find nearly all of the latent failures and have them fixed before a true production failure exposes them.

      The most appalling thing about this to me isn't the failure. Some systems don't have safe times for testing failovers, and I know from personal experience that a component in an HA system that was working perfectly Saturday and has been idle since Sunday can go tits-up when needed on Wednesday. The real problem is the long outage. If the clowns in the VA state government were doing their jobs, they would not have a system like this without vendor support contracts to fix well-defined hardware problems (e.g. "bad memory card" ) within a few hours at most. This was something I always loved about working in a shop with the top-grade EMC contract. The Symmetrix and its associated gadgetry would call EMC about failures and we'd have a tech show up at the DC with parts before we even noticed anything unusual: costly, but nowhere near as expensive as killing all of the SAN-reliant systems for a random day every 3 years. The 4th 9 is not cheap or simple, because it always requires humans.

    3. Re:HA fail by cgenman · · Score: 4, Insightful

      If you're big enough that you're not just going to be scaling staff up and immediately down again, hire your people in-house. It's not a question of government vs private companies. It's a question of hiring your best people to be on staff, or outsourcing to someone who doesn't have the same motivations. This is true if you're a government, a corporation, a private entity, or a high school marching band. Plus the markup on external IT services is just obscene.

      Poorly managed projects will be poorly managed internally or externally. But externally poorly managed projects are a lot more expensive, and harder to reign back under control.

  2. They need a better network admin by Nemesisghost · · Score: 4, Funny

    Maybe they should hire Terry Childs, at least he won't let their network go down for something like this.

  3. Awful. by boneclinkz · · Score: 4, Insightful

    Our primary concern should be a complete audit of World of Warcraft server hardware, to ensure that this vulnerability does not exist in other, more vital networks.

  4. Typical liberal overreaction by BitHive · · Score: 5, Funny

    Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.

  5. Re:It's always money by cgenman · · Score: 4, Insightful

    Everyone seems to think that a network outage is no big deal, until the network goes down. That's when people start thinking of the burn rate of an entire organization sitting on their thumbs while that network of off-the-shelf Linksys routers is replaced by some kid at Best Buy. Or how that 5k dollars per year for a backup external line suddenly pales in comparison to the 5k dollars per hour your organization is wasting because you were a cheap bastard.