State of Virginia Technology Centers Down
bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrop Grumman $2.4 billion over 10 years to manage the state's IT infrastructure."
Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."
How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
What is a SAN memory card?
I'll tell you exactly how. Some manager somewhere said that it cost too much to add redundancy. It's happened over and over at my extremely large company, and it will continue to happen as long as money is the prime concern.
Northrop Grumman already runs the U.S. military. Might as well turn over IT to them too.
SJW: Someone who has run out of real oppression, and has to fake it.
Maybe they should hire Terry Childs, at least he won't let their network go down for something like this.
HAHAHAHHAHAHHAHHA - stupids
"This is supposed to be the best system you can buy, and it's never supposed to fail, but this one did," he said
And I've got a bridge for sale in San Francisco...
Silly state, expecting to get redundancy for only $2.4 billion. Don't they realize they're going to have to pay a lot more than that to get a reliable network?
Sent from my iPhone
Our primary concern should be a complete audit of World of Warcraft server hardware, to ensure that this vulnerability does not exist in other, more vital networks.
Heh, it shouldn't be about the money, though... they should have specified high availability from the very beginning. They often throw it out during the prototyping stage, saying they need to Keep It Simple Stupid just to get things working, but then all the software is never designed to be able to handle redundancy, and shoehorning it in later becomes pretty much like starting again from scratch.
Also, designing in redundancy is usually worse than having no redundancy at all if it's never tested. There should be a pretty simple test plan, where, say, the CTO comes in and is allowed to pull any single random wire or component out of the rack and see how the system reacts / recovers. But unfortunately people are usually using the system by that time, and it's too much of a hassle to come in off-hours and pay everyone overtime for such a test.
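A minimal sketch of that "pull any single component" test, done on paper before anyone touches a rack. It assumes the storage path can be modeled as a directed graph; the component names and both topologies below are hypothetical and are not Virginia's actual layout. The check removes each component in turn and reports the ones whose loss cuts the application off from its disks.

```python
from collections import deque


def reachable(links, src, dst, removed=None):
    """Breadth-first search from src to dst, skipping the removed component."""
    if removed in (src, dst):
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, src, dst):
    """Return every intermediate component whose removal disconnects src from dst."""
    components = set(links) | {n for targets in links.values() for n in targets}
    components -= {src, dst}
    return sorted(c for c in components if not reachable(links, src, dst, removed=c))


# Hypothetical topology: two switches, but everything funnels through one controller.
TOPOLOGY = {
    "app": ["switch_a", "switch_b"],
    "switch_a": ["controller_1"],
    "switch_b": ["controller_1"],
    "controller_1": ["disks"],
}

# Same hypothetical layout with a second controller added.
REDUNDANT = {
    "app": ["switch_a", "switch_b"],
    "switch_a": ["controller_1", "controller_2"],
    "switch_b": ["controller_1", "controller_2"],
    "controller_1": ["disks"],
    "controller_2": ["disks"],
}

if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY, "app", "disks"))
    print(single_points_of_failure(REDUNDANT, "app", "disks"))
```

A model like this only catches the failures you thought to draw; it is no substitute for actually pulling the wire.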
I think the id10ts who pulled off this stunt are rather DIMM....
GStreamer - The only way to stream!
Umm, so what's the point of having a SAN if it weren't redundant? Methinks there is more to this story.
Excellent, I guess this means I'll be able to safely travel through Virginia without risking getting picked up on all my outstanding warrants.
The only thing I can think of is that they decided it cost too much money. This is the problem with letting penny-counters make these decisions. "Oh, this one costs a fraction as much, and they're pretty much all the same. Right???"
When are people going to stop trusting business people for technical decisions? When are they going to figure out that they hired us for our knowledge, and not to just push buttons? We don't talk about backups and failovers just to sound cool. We're trying to save their butts from a meltdown like this. My advice is that if you're in a position that you have a penny-counter telling you what to buy, then just point at this story to give your opinion more weight -- especially if you've been trying to tell them for years or something.
I guess I'm lucky that I have a boss who used to work in IT, and so she gives my opinion a lot more weight than most supervisors do. We have several redundant backups, and we have two servers that can each pick up the slack of the other at a moment's notice (it's not that big a network). Not the best solution, but far better than the State of Virginia, apparently. We've already had a couple of hiccups that this arrangement worked great through. The users didn't even notice.
I'm not saying this to brag. We have a non-profit-sized budget (read: shoestring budget). If we can do it on our budget, then so should a US state.
.... rrrr bad, m'kay?
Check your premises.
Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.
I work in the DMV with each jurisdiction. It is sad, but Virginia is head and shoulders above Maryland and DC. Maryland's access to criminal records goes down weekly for extended periods. DC has been working to update their system to NCIC 2000 standards for 10 years. Virginia has put in more money than either jurisdiction, and usually they are the most coordinated.
> Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired
> and not receive any more contracts once word of this gets out. This will put pressure on
> them to provide better services, or be out-competed by other entrepreneurs. Our free market
> system works, you just need to expect this kind of thing when it's government doing the hiring.
What? Are you joking? Do you even know who these people are?
At worst they will get a pat on the back after this. They are
an incestuous government contractor. That's why they got this
job and someone else didn't to begin with. The real IT outfits
can't because of the great advantage that legacy players have here.
A Pirate and a Puritan look the same on a balance sheet.
not say. The F***Ing NDA stops me from saying anything about the stuff I saw in NGC's IT.
Well, I guess I can say it is BROKE NOW and you have to fix it. Told you so!
My company works on a project that NG lost on a re-compete bid. I cannot go much into details, but suffice it to say: I am not at all surprised that they screwed up maintenance and management based on what I have had to deal with on the software they developed.
This is what you get for hiring a military contractor to do a civilian person's job. All 2.5 billion gets you in the military is a manger and toilet seat. You don't start getting functional hardware until the budget reaches 100 billion.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Anyone know what brand of SAN went down? My company had a similar issue where our SAN had a major outage, and the vendor claimed it was "an error that never happens, we swear".
Funny that I should receive an email today inviting me to a Northrop Grumman Information Systems Hiring Event. The event occurs on the 25th of August and I received the email on the afternoon of the 27th. Failed there too!
I wonder if they are using 3PAR.
As a leftover from when Virginia-headquartered AOL was the king of connectivity, you see license plates here in Virginia touting us as the Internet Capital.
Space game using normal deck of cards: http://BattleCards.org
Ok, in this case it probably is the bureaucracy at fault. But it isn't in all cases. In my previous job we had an architect who would take it upon himself to "value engineer" a vendor's solution, with unpredictable results. I'm not sure why -- we had budget. Maybe it was his way of seeming more valuable? This led to "solutions" like a SAN cobbled together from disk arrays, controllers and switches from three different vendors that were not meant to work together, had never been tested in the chosen configuration, and had to be integrated and maintained in-house. Word rapidly got around that if you wanted reliable access to your data, you didn't put it on the corporate SAN.
What I don't fully understand is how NG could get what amounts to a quarter billion dollars a year to manage the state's IT infrastructure and still allow a situation like this to occur. I mean, I understand how it can HAPPEN, I don't understand why it's allowed to. Over and over again I've seen companies who have outsourced their infrastructure enter into a "battered wife" relationship with the vendor, lacking anyone with the authority, cojones and understanding to bring the vendor to heel and get the uptime they've paid for. Instead corporate IT management will often enter into a dark relationship with vendor sales management to spin downtime to the stockholders as teething issues, inadequate documentation, out of scope, or some other hand-waving to explain why the savings from outsourcing has been more than offset by loss of revenue, IT management essentially working for the vendor while drawing a paycheck from the company. But don't get me started...
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
I'll do it for $2.3 billion!
While there is more to this story than meets the eye, there is no excuse for not having redundancy if you are a state body. It could be a case that there is a backup of the data and maybe that the needed parts to fix the issue are not available yet or haven't been delivered. Nevertheless, without details of the infrastructure, we cannot jump to conclusions about why this happened. Most contracting companies like that clearly state that they will design the system to run and build it, but backup management is the responsibility of the customer. That is the normal CYA tactic. We don't know that NG was even tasked with the backup or redundancy. I will be looking for more of this to come to light in regards to the actual cause and resolution.
What makes you think Northrop Grumman had a choice? They still work for the state IT department at the end of the day. If the state IT department says "buy this POS because it's cheaper and don't build in redundancy because it's too expensive," then those are the NGC employees' marching orders.
This likely happened because of a perfect storm (I feel dirty even using that term) of government cheapness, government contractors lacking backbone and an event ramming the two together at supercollider speeds. I bet you there are admins on both sides of the contractor/employee divide right now saying "cheap sons of bitches wouldn't do $X" because that is usually how these things work.
There's something rotten in the state of Virginia?
If Pandora's box is destined to be opened, *I* want to be the one to open it.
But $2.4 billion over ten years comes out to $240,000,000 per YEAR! With that kind of money they could replace their infrastructure a few times over every year.
This is a clear example of the malfeasance that happens when government gets corrupted by corporate interests. Taxpayers in VA should be up in arms about this one.
Here's my story of state agency screw-ups. Two jobs ago I was working for the Secretary of State's office here. We had the opportunity and funding to get our IT infrastructure in order when the Help America Vote Act (HAVA) became law. We were able to build out a secure and redundant room to house our critical infrastructure.
Physical access by key and alarm code only. Redundant power, which included an APC Symmetra UPS system, backed up by a 125kW natural gas fired generator. Even made sure to extend tendrils from the redundant power out to the MDF so the ISP could use our power system. Also had redundant cooling tied to the generator.
The one Achilles Heel of the operation was DNS. Ours was provided from outside our space. Suggested they build a zone locally; that way we'd have DNS services if the state's went down. But they quashed it as being too difficult! Ut si!
Well one day there's a massive power outage in the city. They were still up and running, lights on, air conditioning on but couldn't get in or out of the internal network even though the ISP circuits were still up. Yup, DNS!
In my first job, I changed the boot-up message on the VAX to "If only my girlfriend went down as often as this computer!" I kinda assumed it would scroll up off the terminal and nobody would see it. It, uh, didn't. One of our female programmers, who was famous for overreacting, came into work and threw a hissy fit. We fixed the message and decided to tell everyone we couldn't figure out who put it there. This is why you shouldn't give all developers administrator privileges!
I've abandoned my search for truth; now I'm just looking for some useful delusions.
This is what Politics in VA is all about.
Favors handed out; tax money wasted; public screwed.
Rinse and repeat.
Which is why service level agreements are so important. You never have to fire them. When their profit margin on the project hits zero, they'll quit.
That they were going to do "Remediation" of the NAS when the problem started, and they had EMC guys on site, and everything. They must have killed the primary when they attached the secondary. Don't wait for a weekend outage window, let's just do it now on a Tuesday afternoon at 2pm. No one will know... oops....
Whenever I drive thru Virginia (up I-81) there's a sign announcing "Entering Virginia's Technology Corridor" which is followed by hundreds of miles of rolling green pastures.
What, there was a proliferation of cow-tipping?
Can we get a "-1 Wrong" moderation option?
Reminds me of a Classic TheDailyWTF: I'm Sure You Can Deal
To anybody who feels incredulous at the notion of a single point of failure taking down a purportedly redundant system: I suspect you have limited experience with the issues and challenges of managing a very large system infrastructure. The complexity of such systems goes well beyond the knowledge of any individual, so notions of fault tolerance across the enterprise are highly theoretical. Even with extensive planning and testing, the gotcha is in what you don't know. Sometimes, one of those What-You-Don't-Knows reveals itself, and that is when it first becomes known.
The need for continued live operation of production systems typically precludes the opportunity to test them as realistically or extensively as one would wish. In fact, across large organizations and locations and departments and applications, systems managers don't even attempt to assert that they are free of single-points-of-failure, nor do they provide guarantees of non-stop operation. Real attempts at non-stop fail-safe systems are generally limited to narrower, truly mission critical applications such as aeronautical systems where lives or huge measures of capital depend upon system availability. Such criticality can rarely be ascribed to administrative systems, and they therefore rarely get the attention or funding needed to build and assure non-stop operation. And rightfully so...the cost of non-stop operation is not justified by the costs/risks of occasional failures.
So for those of you who assert that Virginia's systems should never go down, or shouldn't go down for more than 24 hours, I ask: How do you justify that assertion? Does it have a cost/benefit basis, or is it perhaps just a "soft" assertion?
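One way to put that question on a cost/benefit footing is a back-of-the-envelope expected-cost comparison. This is only a sketch: every figure below (outage rate, hours down, cost per hour, price of redundancy) is a made-up placeholder, not a number from Virginia or from this thread, and the function names are hypothetical.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from downtime: frequency x duration x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(baseline_cost, redundant_cost, annual_redundancy_cost):
    """Return (worth_it, savings): is the avoided downtime worth more than the spend?"""
    savings = baseline_cost - redundant_cost
    return savings > annual_redundancy_cost, savings


# Placeholder assumptions, for illustration only.
BASELINE = expected_annual_outage_cost(
    outages_per_year=0.5,     # assumed: one major outage every two years
    hours_per_outage=24,      # assumed: a full day offline each time
    cost_per_hour=100_000,    # assumed: cost of agencies being unable to work
)
WITH_REDUNDANCY = expected_annual_outage_cost(
    outages_per_year=0.05,    # assumed: failover absorbs most faults
    hours_per_outage=4,       # assumed: much faster recovery
    cost_per_hour=100_000,
)

if __name__ == "__main__":
    worth_it, savings = redundancy_pays_off(
        BASELINE, WITH_REDUNDANCY, annual_redundancy_cost=500_000  # assumed price tag
    )
    print(f"expected savings per year: ${savings:,.0f}; worth it: {worth_it}")
```

Plug in your own numbers; the assertion that the systems "should never go down" stops being soft once someone commits to a cost per hour.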
Northrop Grumman outsources part of their own IT as well. They don't own any of their own hardware; no, they lease it, and the leasing firm has its own tech guys as well.
I'll manage the system for you. It will only cost you $0.37, to be transferred to my account in India.
They tried to do a full renumbering of the state's IP address space. This has morphed into a MPLS rollout when it turned out that too much was breaking as they moved various offices around.
State workers hate the whole VITA idea - it has been nothing but disaster and failure since it started.
I love the idea that my tax dollars have been funding this clusterfuck.
Not putting all of your eggs in one basket, even a double walled basket with 2 handles and shock absorbers, would be a good start.
Let's privatize our most important infrastructure!
When a Salt Lake City router went offline, only government telecom contractor Harris knew that the backup card was not immediately available and one technician had access to where it was kept. Meanwhile, hundreds of aircraft and thousands of passengers were thrown off schedule as the lack of an FAA filing system left pilots submitting flight plans manually.
http://www.eweek.com/c/a/Enterprise-Networking/The-Story-Behind-FAAs-FlightPlan-System-Crash-773289/
http://www.eweek.com/c/a/Data-Storage/FAAs-FlightPlan-System-Crashes-Again-Delays-Hundreds-of-US-Flights-199160/
Dave
So tell us a little bit about your education Mr. X.
"Well, I have a certificate in Microsoft Administration and a computer science degree, plus I am Cisco certified."
Oh excellent!!! Thank you for your time.
So tell us a little bit about your education Mr. Y.
"Well, I have a degree in computer science and I ran several storage area networks for several years at my previous employer, Widgets Are Us."
But, do you have any certifications?
"No."
Thanks Mr. Y. It has been nice speaking with you, don't call us, we will call you.
Mr. X gets hired, and promptly tanks the whole storage network for the entire state.
-Hack
Got Geometrodynamics? Awe, too hard to figure out? Too bad.
Memory go bad in a "san device" (I say in quotes because nobody in their right mind would actually think a single-pathed non-redundant disk array is really SAN-grade hardware) from a fruit-flavored vendor before, I can actually have some pity for the guys responsible/working on it. Debugging it is a great time too, because your filesystem rebuild generally works. As does copying small amounts of data. It is only once you try to copy a couple terabytes that things go to hell.
Filesystem data and inode corruption both coming and going. Best part is fsck of course just makes things worse as it detects the real errors and the fake errors induced by reads of the bad RAM.
Luckily we had backups.
Slashdot Patriotism: We Support our Dupes!
S.R. Hadden: First rule in government spending: why build one when you can have two at twice the price?
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
If you think that's bad, Virginia Tech's main educational portal, Scholar, has been down for like two hours now!
Redundant Array of Inexpensive Disks. RAID. Ok, maybe that scared them. Redundant Array of Raid Controllers - RARC? Nope, sounds Chinese. How about Redundant Infrastructure Array Audits? Nope, that definitely will not do...
Do not mock my vision of impractical footwear
The Kansas Department of Health has had their systems offline for nearly a month due to a hard drive failure. As a result nobody can get birth certificates.
http://www.google.com/hostednews/ap/article/ALeqM5iWdp8MfL7qrxjB8X8UWvKYC8Jw-AD9HRDAI00