State of Virginia Technology Centers Down
bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrop Grumman $2.4 billion over 10 years to manage the state's IT infrastructure."
Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."
How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" decision and/or an infrastructure design fail.
What is a SAN memory card?
I'll tell you exactly how. Some manager somewhere said that it cost too much to add redundancy. It's happened over and over at my extremely large company, and it will continue to happen as long as money is the prime concern.
Northrop Grumman already runs the U.S. military. Might as well turn over IT to them too.
SJW: Someone who has run out of real oppression, and has to fake it.
Maybe they should hire Terry Childs, at least he won't let their network go down for something like this.
Silly state, expecting to get redundancy for only $2.4 billion. Don't they realize they're going to have to pay a lot more than that to get a reliable network?
Sent from my iPhone
Our primary concern should be a complete audit of World of Warcraft server hardware, to ensure that this vulnerability does not exist in other, more vital networks.
Heh, it shouldn't be about the money, though... they should have specified high availability from the very beginning. They often throw it out during the prototyping stage, saying they need to Keep It Simple Stupid just to get things working, but then all the software is never designed to be able to handle redundancy, and shoehorning it in later becomes pretty much like starting again from scratch.
Also, designing in redundancy is usually worse than having no redundancy at all if it's never tested. There should be a pretty simple test plan, where, say, the CTO comes in and is allowed to pull any single random wire or component out of the rack and see how the system reacts / recovers. But unfortunately people are usually using the system by that time, and it's too much of a hassle to come in off-hours and pay everyone overtime for such a test.
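For what it's worth, the "pull any single component" drill can be rehearsed on paper before anyone touches a rack. A minimal sketch, assuming a made-up topology (every component name below is hypothetical and has nothing to do with VITA's actual hardware): list which components can cover each role, then fail them one at a time and see whether the service survives.

```python
import random

# Hypothetical topology: each role maps to the components that can
# independently provide it. All names are illustrative only.
TOPOLOGY = {
    "storage_controller": {"ctrl-a", "ctrl-b"},
    "fabric_switch": {"switch-1", "switch-2"},
    "memory_board": {"mem-0"},  # deliberately left without a partner
}


def all_components(topology):
    """Return every component in the topology as a sorted list."""
    return sorted(set().union(*topology.values()))


def service_is_up(topology, failed):
    """The service is up if every role still has at least one live component."""
    return all(members - failed for members in topology.values())


def single_points_of_failure(topology):
    """Fail each component in turn; report those whose loss causes an outage."""
    return [c for c in all_components(topology)
            if not service_is_up(topology, {c})]


def random_pull_test(topology, rng):
    """Mimic the drill: pull one random component, report (component, survived)."""
    component = rng.choice(all_components(topology))
    return component, service_is_up(topology, {component})


if __name__ == "__main__":
    print("Single points of failure:", single_points_of_failure(TOPOLOGY))
    print("Random pull:", random_pull_test(TOPOLOGY, random.Random()))
```

This only checks the design as written down. It says nothing about whether failover actually works on the real hardware, which is the part the off-hours test is for.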
I think the id10ts who pulled off this stunt are rather DIMM....
GStreamer - The only way to stream!
Umm, so what's the point of having a SAN if it weren't redundant? Methinks there is more to this story.
HAHAHAHHAHAHHAHHA - stupids
"This is supposed to be the best system you can buy, and it's never supposed to fail, but this one did," he said
And I've got a bridge for sale in San Francisco...
Throw in your city's cisco-powered WAN and I'll take it!
.... rrrr bad, m'kay?
Check your premises.
Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.
not say. The F***Ing NDA stops me from saying anything about the stuff I saw in NGC's IT.
Well, I guess I can say it is BROKE NOW and you have to fix it. Told you so!
When are people going to stop trusting business people for technical decisions?
The moment tech people accept that taking risk of system failure to save cost is an acceptable business decision sometimes. I agree that this story proves that you need reasonable risk assessment to do that.
We don't talk about backups and failovers just to sound cool.
Yes we do. Too.
My company works on a project that N G lost on a re-compete bid. I can not go much into details, but suffice it to say: I am not at all surprised that they screwed up maintenance and management based on what I have had to deal with on the software they developed.
Yes, he was joking. I think there's a whoooosh around here somewhere for you.
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
This is what you get for hiring a military contractor to do a civilian's job. All 2.5 billion gets you in the military is a manager and a toilet seat. You don't start getting functional hardware until the budget reaches 100 billion.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Anyone know what brand of SAN went down? My company had a similar issue where our SAN had a major outage, and the vendor claimed it was "an error that never happens, we swear".
Meanwhile in the 49 other states....
----- You know you have ego issues when you register a domain in your name.
Funny that I should receive an email today inviting me to a Northrop Grumman Information Systems Hiring Event. The event occurs on the 25th of August and I received the email on the afternoon of the 27th. Failed there too!
As a leftover from when Virginia-headquartered AOL was the king of connectivity, you see license plates here in Virginia touting us as the Internet Capital.
Space game using normal deck of cards: http://BattleCards.org
Ok, in this case it probably is the bureaucracy at fault. But it isn't in all cases. In my previous job we had an architect who would take it upon himself to "value engineer" a vendor's solution, with unpredictable results. I'm not sure why -- we had budget. Maybe it was his way of seeming more valuable? This led to "solutions" like a SAN cobbled together from disk arrays, controllers and switches from three different vendors that were not meant to work together, had never been tested in the chosen configuration, and had to be integrated and maintained in-house. Word rapidly got around that if you wanted reliable access to your data, you didn't put it on the corporate SAN.
What I don't fully understand is how NG could get what amounts to a quarter billion dollars a year to manage the state's IT infrastructure and still allow a situation like this to occur. I mean, I understand how it can HAPPEN, I don't understand why it's allowed to. Over and over again I've seen companies who have outsourced their infrastructure enter into a "battered wife" relationship with the vendor, lacking anyone with the authority, cojones and understanding to bring the vendor to heel and get the uptime they've paid for. Instead corporate IT management will often enter into a dark relationship with vendor sales management to spin downtime to the stockholders as teething issues, inadequate documentation, out of scope, or some other hand-waving to explain why the savings from outsourcing has been more than offset by loss of revenue, IT management essentially working for the vendor while drawing a paycheck from the company. But don't get me started...
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
I'll do it for $2.3 billion!
What makes you think Northrop Grumman had a choice? They still work for the state IT department at the end of the day. If the state IT department says "buy this POS because it's cheaper and don't build in redundancy because it's too expensive," then those are the NGC employees' marching orders.
This likely happened because of a perfect storm (I feel dirty even using that term) of government cheapness, government contractors lacking backbone and an event ramming the two together at supercollider speeds. I bet you there are admins on both sides of the contractor/employee divide right now saying "cheap sons of bitches wouldn't do $X" because that is usually how these things work.
There's something rotten in the state of Virginia?
If Pandora's box is destined to be opened, *I* want to be the one to open it.
But $2.4 billion over ten years comes out to $240,000,000 per YEAR! With that kind of money they could replace their infrastructure a few times over every year.
This is a clear example of the malfeasance that happens when government gets corrupted by corporate interests. Taxpayers in VA should be up in arms about this one.
Here's my story of state agency screw-ups. Two jobs ago I was working for the Secretary of State's office here. We had the opportunity and funding to get our IT infrastructure in order when the Help America Vote Act (HAVA) became law. We were able to build out a secure and redundant room to house our critical infrastructure.
Physical access was by key and alarm code only. Redundant power included an APC Symmetra UPS system, backed up by a 125kW natural gas fired generator. We even made sure to extend tendrils from the redundant power out to the MDF so the ISP could use our power system. We also had redundant cooling tied to the generator.
The one Achilles' heel of the operation was DNS. Ours was provided from outside our space. I suggested they build a zone locally so that we'd have DNS services if the state's went down. But they quashed it as being too difficult! Ut si!
Well one day there's a massive power outage in the city. They were still up and running, lights on, air conditioning on but couldn't get in or out of the internal network even though the ISP circuits were still up. Yup, DNS!
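The local-zone idea amounts to "ask the upstream first, fall back to something you control." A minimal sketch of that fallback logic, assuming made-up names and addresses throughout (`intranet.example`, the lookup functions, and the zone dictionary are all hypothetical, and this is not how any particular DNS server is configured):

```python
# Hypothetical local zone holding the handful of names the office
# cannot work without. Names and addresses are made up.
LOCAL_ZONE = {
    "intranet.example": "10.0.0.10",
    "mail.example": "10.0.0.25",
}


def local_zone_lookup(name):
    """Answer from the local zone; raises KeyError (a LookupError) if unknown."""
    return LOCAL_ZONE[name]


def upstream_down(name):
    """Stand-in for the state's resolver during the power outage."""
    raise LookupError("upstream DNS unreachable")


def resolve_with_fallback(name, resolvers):
    """Try each (label, lookup) pair in order; return (label, address).

    Raises LookupError listing every failure if no resolver can answer.
    """
    errors = []
    for label, lookup in resolvers:
        try:
            return label, lookup(name)
        except LookupError as exc:
            errors.append(f"{label}: {exc}")
    raise LookupError("; ".join(errors))


if __name__ == "__main__":
    chain = [("state", upstream_down), ("local", local_zone_lookup)]
    print(resolve_with_fallback("intranet.example", chain))
```

With only the upstream in the chain, every lookup fails even though the servers, the power, and the circuits are all fine, which is exactly what happened.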
In my first job, I changed the boot-up message on the VAX to "If only my girlfriend went down as often as this computer!" I kinda assumed it would scroll up off the terminal and nobody would see it. It, uh, didn't. One of our female programmers, who was famous for overreacting, came into work and threw a hissy fit. We fixed the message and decided to tell everyone we couldn't figure out who put it there. This is why you shouldn't give all developers administrator privileges!
I've abandoned my search for truth; now I'm just looking for some useful delusions.
This is what Politics in VA is all about.
Favors handed out; tax money wasted; public screwed.
Rinse and repeat.
48
Cheap storage VM.
Actually, from the article, you would have seen that the contract was signed back in 2005, when Virginia enjoyed the presence of Mark Warner, Democrat, and now US Senator for Virginia.
Amusingly, Aneesh Chopra, the current CTO of the Obama administration, was the Virginia Secretary of Technology starting shortly after this was signed, and he never added redundancy to the service contract. This was during Warner's tenure and during Tim Kaine's (D-Va) tenure.
Also, counter to your argument, it was actually Bob McDonnell, the current Republican Governor, that renegotiated the contract to include redundancy.
With all of that said, I do not think Northrop Grumman was the best fit for this job and after so many egregious failures, they deserve to have their contract reworked in VA's favor, but bureaucracy being what it is, regardless of party politics, I doubt this will change. I really feel like this kind of contract could have gone to a small-to-medium sized VA business that could have handled it extremely well, and locally, for much less. The real sad thing is that the guy whose largest job was to oversee this contract, and did nothing, is now the CTO for the entire country. I don't care what party you are, that's a scary thought.
That they were going to do "Remediation" of the NAS when the problem started, and they had EMC guys on site, and everything. They must have killed the primary when they attached the secondary. Don't wait for a weekend outage window, let's just do it now on a Tuesday afternoon at 2pm. No one will know... oops....
Whenever I drive thru Virginia (up I-81) there's a sign announcing "Entering Virginia's Technology Corridor" which is followed by hundreds of miles of rolling green pastures.
What, there was a proliferation of cow-tipping?
Can we get a "-1 Wrong" moderation option?
Reminds me of a Classic TheDailyWTF: I'm Sure You Can Deal
To anybody who feels incredulous at the notion of a single point of failure taking down a purportedly redundant system: I suspect you have limited experience with the issues and challenges of managing a very large system infrastructure. The complexity of such systems goes well beyond the knowledge of any individual, so notions of fault tolerance across the enterprise are highly theoretical. Even with extensive planning and testing, the gotcha is in what you don't know. Sometimes, one of those What-You-Don't-Knows reveals itself, and that is when it first becomes known.
The need for continued live operation of production systems typically precludes the opportunity to test them as realistically or extensively as one would wish. In fact, across large organizations and locations and departments and applications, systems managers don't even attempt to assert that they are free of single-points-of-failure, nor do they provide guarantees of non-stop operation. Real attempts at non-stop fail-safe systems are generally limited to narrower, truly mission critical applications such as aeronautical systems where lives or huge measures of capital depend upon system availability. Such criticality can rarely be ascribed to administrative systems, and they therefore rarely get the attention or funding needed to build and assure non-stop operation. And rightfully so...the cost of non-stop operation is not justified by the costs/risks of occasional failures.
So for those of you who assert that Virginia's systems should never go down, or shouldn't go down for more than 24 hours, I ask: How do you justify that assertion? Does it have a cost/benefit basis, or is it perhaps just a "soft" assertion?
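One way to make that question less "soft" is to put rough numbers on it. A minimal sketch, assuming you can estimate availability with and without the extra redundancy and a cost per hour of downtime (every figure in the example is invented for illustration and is not a number from the Virginia contract):

```python
HOURS_PER_YEAR = 24 * 365


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_annual_outage_cost(availability, cost_per_hour):
    """Expected yearly cost of downtime at the given hourly cost."""
    return downtime_hours_per_year(availability) * cost_per_hour


def redundancy_is_justified(avail_before, avail_after,
                            cost_per_hour, annual_redundancy_cost):
    """True if the outage cost avoided exceeds what the redundancy costs."""
    avoided = (expected_annual_outage_cost(avail_before, cost_per_hour)
               - expected_annual_outage_cost(avail_after, cost_per_hour))
    return avoided > annual_redundancy_cost


if __name__ == "__main__":
    # Hypothetical: going from 99% to 99.9% availability costs $500k a year.
    for hourly in (1_000, 10_000):
        verdict = redundancy_is_justified(0.99, 0.999, hourly, 500_000)
        print(f"downtime at ${hourly}/hour -> redundancy justified: {verdict}")
```

The answer flips depending on what an hour of downtime is assumed to cost, which is the whole argument: the hard part is agreeing on that number, not the arithmetic.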
They tried to do a full renumbering of the state's IP address space. This has morphed into an MPLS rollout when it turned out that too much was breaking as they moved various offices around.
State workers hate the whole VITA idea - it has been nothing but disaster and failure since it started.
I love the idea that my tax dollars have been funding this clusterfuck.
The real IT outfits are deeply disadvantaged by feeling the need to actually deliver on the contract. That drives costs up and caps promises.
Not putting all of your eggs in one basket, even a double walled basket with 2 handles and shock absorbers, would be a good start.
And I've got a bridge for sale in San Francisco...
No thanks. I got a sweet deal on one in Brooklyn.
Let's privatize our most important infrastructure!
When a Salt Lake City router went offline, only government telecom contractor Harris knew that the backup card was not immediately available and one technician had access to where it was kept. Meanwhile, hundreds of aircraft and thousands of passengers were thrown off schedule as the lack of an FAA filing system left pilots submitting flight plans manually.
http://www.eweek.com/c/a/Enterprise-Networking/The-Story-Behind-FAAs-FlightPlan-System-Crash-773289/
http://www.eweek.com/c/a/Data-Storage/FAAs-FlightPlan-System-Crashes-Again-Delays-Hundreds-of-US-Flights-199160/
Dave
So tell us a little bit about your education Mr. X.
"Well, I have a certificate in Microsoft Administration and a computer science degree, plus I am Cisco certified."
Oh excellent!!! Thank you for your time.
So tell us a little bit about your education Mr. Y.
"Well, I have a degree in computer science and I ran several storage area networks for several years at my previous employer, Widgets Are Us."
But, do you have any certifications?
"No."
Thanks Mr. Y. It has been nice speaking with you, don't call us, we will call you.
Mr. X gets hired, and promptly tanks the whole storage network for the entire state.
-Hack
Got Geometrodynamics? Awe, too hard to figure out? Too bad.
Having had memory go bad in a "SAN device" (I say that in quotes because nobody in their right mind would actually think a single-pathed, non-redundant disk array is really SAN-grade hardware) from a fruit-flavored vendor before, I can actually have some pity for the guys responsible/working on it. Debugging it is a great time too, because your filesystem rebuild generally works. As does copying small amounts of data. It is only once you try to copy a couple of terabytes that things go to hell.
Filesystem data and inode corruption, both coming and going. The best part, of course, is that fsck just makes things worse, as it detects both the real errors and the fake errors induced by reads of the bad RAM.
Luckily we had backups.
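For anyone stuck in the same spot: a checksum-verified copy will at least tell you when a big transfer came out different from what went in. A minimal sketch, assuming plain files on a local path (the function names are mine, and this is not what any vendor tool does):

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # read in 1 MiB blocks so large files never sit in memory


def sha256_of(path):
    """Hash a file block by block and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def verified_copy(src, dst):
    """Copy src to dst, hashing the source as it streams past, then re-read
    dst and compare. Raises IOError on mismatch; returns the digest."""
    source_digest = hashlib.sha256()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for block in iter(lambda: fin.read(CHUNK), b""):
            source_digest.update(block)
            fout.write(block)
    expected = source_digest.hexdigest()
    if sha256_of(dst) != expected:
        raise IOError(f"checksum mismatch copying {src} to {dst}")
    return expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(3 * CHUNK + 17))
        print(verified_copy(src, dst))
```

Caveats: the re-read of the destination may be served from the OS cache rather than the disk, and if the bad RAM corrupts the data before the first hash is taken, both sides will agree on the wrong bytes. It catches mismatches between passes, nothing more.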
Slashdot Patriotism: We Support our Dupes!
S.R. Hadden: First rule in government spending: why build one when you can have two at twice the price?
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
Redundant Array of Inexpensive Disks. RAID. Ok, maybe that scared them. Redundant Array of RAID Controllers - RARC? Nope, sounds Chinese. How about Redundant Infrastructure Array Audits? Nope, that definitely will not do...
Do not mock my vision of impractical footwear
Well, they just proved it:
The difference between a system that can fail, and a system that can not possibly fail is, that when a system that can not possibly fail fails, the fault will be at a place impossible to get at and fix.
The Kansas Department of Health has had their systems offline for nearly a month due to a hard drive failure. As a result nobody can get birth certificates.
http://www.google.com/hostednews/ap/article/ALeqM5iWdp8MfL7qrxjB8X8UWvKYC8Jw-AD9HRDAI00
Uh, 47
46