Anatomy of the VA's IT Meltdown
Lucas123 writes "According to a Computerworld story, a relatively simple breakdown in communications led to a day-long systems outage within the VA's medical centers. The ultimate result of the outage: the cancellation of a project to centralize IT systems at more than 150 medical facilities into four regional data processing centers. The shutdown 'left months of work to recover data to update the medical records of thousands of veterans. The procedural failure also exposed a common problem in IT transformation efforts: Fault lines appear when management reporting shifts from local to regional.'"
Business as usual for the VA.
Once again, the VA shows its true colors and mucks up another project funded by taxpayers for the well-being of our nation's veterans. A more screwed up organization one will not find.
1)RTFA
2)simple conventions:
VA = Virginia
The VA = The US Department of Veterans Affairs
(of course, it would be a first for 'em... even if it's the "wrong" Vista we're talking here).
Quo usque tandem abutere, Nimbus, patientia nostra?
There clearly is just not enough synergy..
So basically, -1 troll/offtopic is really Slashdot's way of saying "I hate that you thought of something before me."
The article said the project was pulled back and will be looked at - that doesn't necessarily mean cancellation
No organization that I know of has EVER had good luck with the name VISTA.
Volpp assumed that the data center in Sacramento would move into the first level of backup -- switching over to the Denver data center. It didn't happen.
DOH! Looks like it was all just due to someone's assumption that someone else would do their job.
From my experience, you can assume things happened, but if you don't verify that they actually happened - you are DOOMED.
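To make the "verify, don't assume" point concrete, here is a minimal sketch. Everything in it is hypothetical: the `confirm_failover` helper, the probe function, and the retry numbers are illustrations, not anything the VA or the article describes. The idea is only that the failover counts as done when the standby site actually answers, not when someone believes it should have.

```python
import time


def confirm_failover(is_serving, attempts=3, delay_s=1.0, sleep=time.sleep):
    """Poll the standby until it answers; return True only on a confirmed response.

    is_serving is a hypothetical probe (for example, a health check against
    the backup data center) that returns True when the standby is live.
    """
    for attempt in range(attempts):
        if is_serving():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next probe
    return False


# Hypothetical probe results: the standby answers on the third try.
responses = iter([False, False, True])
ok = confirm_failover(lambda: next(responses), attempts=3, sleep=lambda s: None)
print("standby confirmed" if ok else "failover NOT confirmed, escalate")
```

The point of returning False instead of raising is that "not confirmed" is an expected outcome that should trigger an escalation path, not a surprise.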
He who knows best knows how little he knows. - Thomas Jefferson
Why didn't they build the second "centralized" system in parallel to the one that already existed? This way, when the new system failed miserably, just flip the switch (or DNS record) back to the old servers and retool the "solution" that you were testing.....
that brings another point to mind...
DIDN'T THEY TEST THE FREAKING THING!?
Unfortunately, one of the best ways to learn how well your disaster recovery system works is to have a disaster. The problem with scheduled drills is that the scenarios themselves are planned out and typically not run system-wide, i.e. you test this part of the system, then that part, and so on. From the article, it seems much of the breakdown occurred because too many people assumed. There was also no central decision-maker with access to all the information. Viewed from each individual's perspective, every decision seemed to be the right one. However, when implementing a global recovery plan, sometimes one system may have to be sacrificed for another.
Awesome, sorry if someone already posted but I just couldn't resist the following quote:
Instantly, technicians present began to troubleshoot the problem. "There was a lot of attention on the signs and symptoms of the problem and very little attention on what is very often the first step you have in triaging an IT incident, which is, 'What was the last thing that got changed in this environment?'" Raffin said.
p.s. I am shocked at how many junior cowboy IT people remain employed, given the supposed glut of hire-able and knowledgeable folks.
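For what it's worth, the triage question in that quote ("What was the last thing that got changed in this environment?") is easy to answer if changes are logged somewhere queryable. A minimal sketch, assuming a hypothetical in-memory change log; the `Change` record, the system names, and the dates are all made up for illustration and are not taken from the article:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    when: datetime
    system: str
    who: str
    what: str


def last_changes_before(changes, incident_time, limit=3):
    """Return the most recent changes made at or before the incident, newest first."""
    prior = [c for c in changes if c.when <= incident_time]
    return sorted(prior, key=lambda c: c.when, reverse=True)[:limit]


# Hypothetical change log entries.
CHANGES = [
    Change(datetime(2000, 1, 1, 9, 0), "hub-switch", "alice", "firmware update"),
    Change(datetime(2000, 1, 2, 7, 15), "db-cluster", "bob", "changed network port setting"),
    Change(datetime(2000, 1, 2, 10, 0), "web-frontend", "carol", "restarted service"),
]

incident = datetime(2000, 1, 2, 7, 30)
suspects = last_changes_before(CHANGES, incident)
for c in suspects:
    print(f"{c.when:%Y-%m-%d %H:%M} {c.system}: {c.what} (by {c.who})")
```

Changes made after the incident started are filtered out, so the first line printed is the first thing to look at.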
and sourceforge, too?
"Flyin' in just a sweet place,
Never been known to fail..."
I'm sure I'll get modded to -5, Flamebait, but fucking A, Zonk, Slashdot isn't a newspaper. You don't need to be so economical in your headlines. When I saw the headline, I first thought of VA Linux--you know, the guys who kinda sorta own you. "Medical centers" threw me, so I thought for a second that it might mean the state of Virginia. Then it dawned on me that you probably meant the United States Department of Veterans Affairs. I'm sure I'm not the only one.
Please, God, isn't there some kind of Editing 101 correspondence-school course we can send all these guys to? I mean, I love Slashdot to death, but please God, can you give the staff just one ounce of basic editorial skills: spelling, grammar, etc? Teach them to write for clarity, not just brevity? Maybe go for broke and touch on dupe-checking, fact-checking, changing links so they point to the original article instead of some guy's AdSense-laden blog page that says nothing more than "here's the story"?
You're EDITORS, for God's sake (even if in name only), you are indeed allowed to EDIT submissions.
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Isn't it obvious that the acronym "VA" isn't good to use in a title? FYI, it stands for the U.S. Department of Veterans Affairs (formerly the Veterans Administration).
I wonder why higher management always wants to centralize their resources. The Internet Protocol and many subsequent IT applications were built to be efficient in small and decentralized environments.
1) Trying to centralize gives us large, expensive computers that are made out of the same components as smaller ones and thus fail just as the smaller ones do. However, cramming ever more crap onto the same machine means everything goes down at once whenever it fails.
2) Trying to centralize has the ultimate goal of eliminating jobs, but they need those people since they know all the little details and hiccups their systems have. If people know a project is going to eliminate their job, they won't be cooperative. IT not being cooperative is very bad in this world where everything is computerized.
3) Eventually the same number of people is going to have to work in the centralized system just because you also centralize the problems and more problems will bring more people, more people will bring more overhead and inefficiency, more inefficiency will bring more people (at least that's the default in today's business world, throwing more people at an IT problem doesn't make it disappear faster)
4) More people in a project that was designed to be more cost efficient means the managers will have to cut expenses. Cut expenses brings underpaid people, underpaid people bring less or no experience and higher turnover, higher turnover means more cutting expenses.
Therefore: keep your local IT guy(s) and infrastructure although you can't squeeze 100% of work/day and it will bring a little more expense. The end-users have a better relationship with the guy(s) and that makes happier people. Centralizing brings more overhead, less customer-interaction with IT and thus more inefficiency throughout the business.
It's another one of these monstrous systems integration projects that will never work, and every hospital/med center is doing them. They want everything to talk to everything and the only reason behind it is really more big brother Total Information Awareness BS. I worked for a hospital in the 90's. It was started before I got there and was still going on years after I left. I'm sure it's still going on. It is a monstrous bureaucracy that costs millions (billions?) and you can expect problems of this scale to increase as they continue to centralize their vulnerabilities.
And I've never heard of anyone running even a piece of a datacenter on Vista. Everyone complains that outsourcing companies are too expensive, but honestly, we're a LOT smarter than the fools who implemented this. We would never have made this mistake.
UbuntuDupe seems to have a major attitude about Ubuntu. For anyone who doesn't know the story here's why. Basically, UbuntuDupe ran into problems installing Ubuntu and, when he asked for help on Ubuntu Forums, immediately started attacking the people that were sincerely trying to help him. Even with his major attitude the Ubuntu folks still tried their best to help him until they just couldn't put up with him any more. Read it for yourself and you'll see UbuntuDupe's Slashdot postings on Ubuntu in a new light.
I had a real fun time parsing this article.
... ... crown? The Queen? Perhaps they mean *our* overlords, VA Linux? Or is VA Linux a monarchist organisation now? ... medical? Why are th... oh HANG ON WAIT A MINUTE ... government! Crown, government, get it? So, VA Linux screwed up a government's medical system? That makes ... ... sense, but ... something's out of place, something's ... just ... not ... quite ... ... ... carefully ... the VA, why the VA, shouldn't it be ... Vir..ginia?!
...
1. Looks at title: omg! Slashdot's parent company had an IT meltdown! ha-ha! But waitaminute
2. Looks at icon: a
3. Looks at summary: and
4. Looks at icon: I remember that! It means
5. Looks into the inner recesses of my mind:
7. Looks at lightbulb over head: of course! There *is* no VA Linux! It's Sourceforge, Inc now! But that must mean
6. Looks at summary:
Gee thanks, Zonk, just what I needed before going to sleep. Now I'll dream of the Queen in Virginia melting down medical computers for Slashdot's open source overlords. Again.
Last thing I needed
Yeah, time to fire your IT organization's management. And a few of their leads, too. And maybe some of the techs.
Couple of reasons: First, they're running Vista. I'm not trying to be all "You must only run Linux or ur a n00b" here -- you can run Windows servers just fine, but no reasonable IT planner should ever, *ever* consider using an OS that new for a mission-critical enterprise application. If it doesn't have two or three years in the field, don't even consider it.
Second, their failover plan sucked. Live data syncs are good for physical disasters (fires, earthquakes, zombie attacks) but, as the VA discovered, they leave you shitting your pants when you run into an issue that may or may not be data-related. The solution to this, of course, is to keep a day or week-old copy someplace along with an up-to-date (but not implemented!) transaction log that you can go through and update with once you've sanity-checked it.
Third, letting the vendor run "tests" on your production system. Nobody, and I mean nobody, should ever get to touch any production system unless they're implementing a specific change that's been tested in an identical environment, passed QA and review by folks who know the system and then only with a published implementation, testing and backout plan. If a system needs "tests", you pull it out of production before you start messing with it.
Finally, their "virtualized team" approach (read: our people are scattered all over the place) is moronic -- you see this sort of thing, and without fail it's the result of political pressures rather than sane management. In this case, I'll bet my hat it was a situation where a bunch of middle managers were allowed to maneuver to keep their fingers in the pie when centralization took place, so instead of having everyone you need on hand and in one group you're busy setting up conference calls.
Plus, now their solution is to bring in a bunch of consultants. Yeah, that always works. Good luck, guys! You're gonna need it.
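The "older copy plus an unapplied transaction log" idea from the second point above can be sketched in a few lines. This is only an illustration under loud assumptions: the dict-based snapshot, the log record shape, and the `sane` check are all hypothetical, and a real database would use its own log format and tooling. The point is that replay stops at the first entry that fails the sanity check, so a bad change never reaches the recovered copy.

```python
def sane(txn):
    """Hypothetical sanity check: a known operation and a non-empty key."""
    return txn.get("op") in {"set", "delete"} and bool(txn.get("key"))


def replay(snapshot, log):
    """Apply log entries to a copy of the snapshot, stopping at the first bad entry.

    Returns (recovered_state, number_of_entries_applied).
    """
    state = dict(snapshot)  # never modify the stale copy itself
    applied = 0
    for txn in log:
        if not sane(txn):
            break  # stop here and let a human look at the rest
        if txn["op"] == "set":
            state[txn["key"]] = txn["value"]
        else:
            state.pop(txn["key"], None)
        applied += 1
    return state, applied


# Hypothetical day-old snapshot and the log written since then.
snapshot = {"patient:1": "allergy: none"}
log = [
    {"op": "set", "key": "patient:1", "value": "allergy: penicillin"},
    {"op": "set", "key": "patient:2", "value": "new record"},
    {"op": "corrupt", "key": "", "value": None},
    {"op": "delete", "key": "patient:2"},
]

recovered, applied = replay(snapshot, log)
print(f"applied {applied} of {len(log)} entries")
```

Stopping at the first bad entry, rather than skipping it, is a deliberate choice here: later entries may depend on the one that failed.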
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
In my experience with federal govt IT jobs, you usually have to FORCE others to do their jobs if your job depends on someone else doing their job to completion. This involves lots of whistleblowing and reporting them up the chain of command when they're slacking, and generally playing cop over them. You must always save all emails, memos and all forms of correspondence with them, take plenty of detailed notes at all meetings, and I've found that secretly recording your conversations and meetings with a portable voice recorder slipped into your pocket works wonders to help get a project done. Federal agencies have pretty much all de-evolved into a perpetual surveillance society (that's why they're doing it to the citizens now too), and surveillance is a language they understand clearly. You have to think and act like a prosecuting attorney to keep your project on track.
This is funny to me. I was hired by the VA in St. Petersburg, Florida a few years ago when Windows 2003 first came out to train all of the NT administrators on the migration to 2003. Of the 60 or so NT administrators, all but three of them were losing their title and becoming helpdesk for their site and "physical hands" for the few remaining administrators.
A lot of the admins were unhappy about that, as I would have been. I am just curious if the failure to complete the project had to do with the lack of respect for the older employees with NT experience and essentially downgrading those employees.
Vista != Microsoft Windows Vista
Within the Department of Veterans Affairs, they have a computerized medical records system. That medical records system has been called Vista for decades. It's an unfortunate source of confusion that Microsoft chose the same word for their new OS release, but the two have nothing to do with each other.
Does anybody else get the impression that they created an Ethernet loop and couldn't figure it out for a whole day ?
staffers from Hewlett-Packard Co. conducting a review of the center's HP AlphaServer system running on Virtual Memory System and testing its performance.
We hardly knew ye.
--- I do not moderate.
What they were doing was a major change to their IT infrastructure. That's massive. Things happen. The fact that they were down at 17 of 128+3 (131) data centers because some IT staffer changed a port # at one of their hub data centers without following proper procedure -- that's minor.
Seems to me that the fact that things otherwise worked well is a major accomplishment. They are still on the old system, entering data back into it and migrating it into the new system. But it seems things went well otherwise.
Anytime you do a major shift like this, it's hard. The users hate it because they can do their job very quickly on the system they are used to, but now have to learn a new system and slow down.
Things happen.
And these are the clowns the dems want to put in charge of healthcare...
1st off... VISTA is not Windows VISTA. It's the "Veterans Health Information Systems and Technology Architecture". Do a google search on that.
.:Thinks:. Too bad they don't know about everything we've short changed to make such an obscene deadline!
VISTA runs on HP's VMS, and on top of that it runs Cache from Intersystems. (And yes it costs the tax payers a lot! But a lot less since we've been centralizing it over the last 3 or 4 years.)
It is a HUGE system.
The centralization that we're currently undergoing is massive; this problem was (IMHO) scapegoated onto a poor change control process.
I know what was changed, I know who changed it, and I know when they changed it. However, this 'melt down' has happened three times... (Not to the same drastic outcome.) It comes down to VMS locking out logons because locks aren't being released properly. (Now you could argue that the reason locks got behind was this change... But I don't think that is the real reason because of our previous problems.)
It's that simple. Ask the VISTA manager over lunch sometime. They weren't afraid of data corruption. They were afraid if they moved the systems, the other system would lock up too with too much user load.
There goes "VISTA". Everyone logged in is fine. Everyone not on... Isn't getting on.
Now comes the bad part... No procedures!
We take 32 medical centers, and throw their IT into a data center. You 'had' clear lines of who owns what, and what happens when they go down. Now you centralize all that... Who raises the flag when something bad happens? Is it the site that has the problem? Is it someone who now controls the system at the data center? Who is responsible for what?
Oh wait... OI&T only has a dozen staff... And almost NONE of those people are technical. Everyone's pay was simply moved from one appropriation to another. But what about the IT systems?!?! We moved those too, but didn't hire any permanent staff to take care of it? We just rubber banded a bunch of people together that work across the whole west coast and hand them a pager and say good luck?
Suffice it to say, we have some REALLY REALLY hard working people... And some really bad management. (Congress forcing us to do things on a time table is really annoying. Especially since they expect results, but don't expect any documentation... What do you think is going to get skipped?)
Congress: How is that data center move going!
Howard: We've moved 28 sites!
Congress: Good Job!
Howard:
Then again... Howard doesn't even know everything we skip to get things done.
Bah
Cowboy IT people remain employed because they're cheap!
First thing I learned in the military: your weapon was made by the lowest bidder.
668: Neighbour of the Beast
I work for a company that uses the Intersystems Cache database, and I have to say that I imagine Cache is a large part of the problem. The amount of good documentation for Cache lies between very little and none, and my company has been on a nationwide search for people experienced with Cache; they too seem few and far between. Of course, I don't know that Cache really is a "worse" or "better" database than Oracle, SQL Server or MySQL for that matter. What I do know is that when it comes to experience, common tasks, documentation, examples and just getting things done, Cache lags far behind the others, not to mention that universities are still teaching relational DB theory, not object DB theory, at least when I graduated from Rutgers a few short years ago.
I suspect that given the task of merging databases, even large databases, there are plenty of experienced and knowledgeable SQL Server, Oracle and MySQL guys out on Monster or some other job site who know how to get the job done, efficiently and correctly, and have done the job a few times before. Based on our current and past searches for people capable of even easier tasks within Cache, there aren't many people out there with any Cache experience, never mind good people with Cache experience, and it's easy to fudge a task when you aren't given much good documentation, examples or experience.
In a past career, I worked for a healthcare company that used SQL Server for electronic medical records (EMRs), and the system worked rather well. There might have been better ways to design the database, stored procs or application code; however, we never had a problem hiring good staff who understood the database design, SQL queries and T-SQL/stored procs. As I said, I can't say the same about trying to hire good people who know and understand MUMPS ("M" the language, not the disease) or Cache ObjectScript, or who find the Cache tools easy and intuitive.
Just my $.02, and I don't mean to start a DB debate, just stating that it might also be time for the VA to move on from MUMPS/Cache to a more widely used and documented database and programming language, and find some new blood.
I can't wait until all of our health care is provided by the government, seeing as how it's being done so well on a smaller scale.
Welcome to the modern enlightened world. You see way too many people feel that patriotism is stupid. If you join the military you are patriotic and so you are stupid. Why should they care?
I have seen people refuse to stand for the National Anthem on Veterans Day at an airshow. Did you miss the people complaining about Google having a banner for Veterans Day?
If people will not stand, and will actively complain about Google's Veterans Day banner, why should they want to fund or fix the VA? That actually costs real money.
Yeah, we should fix the VA.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
I read the title and immediately thought of this VA IT meltdown.
Blind patriotism is stupid.
I have a security clearance and work with the DoD and IMO really smart people should not join the military. At this point we are so far into overkill mode that cutting the DoD's spending in 1/2 could make little difference to our overall safety if done in a reasonable fashion. (Ok, that level of change would probably create a mess, but slowly trimming the fat is a good idea IMO.) If you really want to make the US a better place go into private industry and start the next Google. Fighting in some pointless war is of limited value.
However, I feel the VA should be better funded and revamped. When we put a person's life in danger we should respect that and pay them back for their service.
If the incident actually occurred it made squat difference to treatment. I'm under fairly constant care at two related VA facilities and my treatment wasn't affected by any such thing. Sounds like it's just IT's problem.
I've gone to VA hospitals since 1989. I got insurance when I started teaching and started going to local doctors and hospitals. Before a year was up I was going back to the VA. Treatment that the VA doesn't provide is treatment the vet didn't request. To be fair, at the VA you need to request harder than elsewhere because elsewhere is going to get money for more treatment and the VA gets nothing different for more or less treatment, except in the largest sense in requesting federal funding after stats show what's been done easily and what took too long to accomplish.
In 1989 it took months to get an appointment for most things. The excuse was "we don't have enough money". Now if it takes a week, you get the same excuse. If I need to see my primary care physician I can get in same day or next day depending on what time of day I call.
I am presently getting the best care I've received to date, this at the hospital the VA said had the lowest marks just 3 years ago.
Sorry you have a problem, IT. You can rest a bit easier knowing it didn't affect the providers.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
They have basically no investment in hiring and personnel systems. I know because I'm a recent MPA grad and have seen their postings for personnel. They're worse than any other federal system (and federal hiring is generally inane).
Two months ago the VA was posting on usajobs and requiring all documents be mailed in -- no fax, no email, no online systems. And some of these positions had a 5 day application window with no leeway for stamped post. Even more bizarre, one of the postings stated flat out that "alternate delivery methods" should be used because the post office wasn't reliable (Read: HAND DELIVERY). It even gave the room number to hand deliver the application to...and this was for an entry level career position (college/MA grads). Clearly they were planning to hire someone internally in that case--at least I hope that corruption was the issue and not incompetence.
With such a high amount of hassle to simply apply for work at the VA, such poorly managed hiring systems, and disinterest in bringing in outsiders, why would anyone competent want to work for the VA? I know I wouldn't want to...I got my MPA to help improve systems, but that requires at least a marginally competent culture and active management...and the VA simply doesn't have either.
So have you written or called your rep about it? When voting, are you going to look at their voting record on VA funding?
But wouldn't you agree the made-up support team's response was about what I got from the forums?
-Ask to do stuff I already tried? Check.
-Pretend like a download/burn failure could cause this specific problem? Check.
-Give inconsistent story about which CD is needed to fix boot errors? Check.
-Ignore information about error message? Check.
-Focus on irrelevant Windows usage? Check.
-Feigning surprise that someone would run Ubuntu in a completely anticipated, common environment? Check.
I know a lot of what I've said about Ubuntu has cost me a lot of fans and gained me some freaks. (How mature!) Comments on other issues have been well-respected. Even after the revenge modding I still have excellent karma.
The reason I keep bringing this up is that people make my exact same design criticisms (that I made on Ubuntu) in many, many other contexts, and then get modded to 5. You'd almost think there was a sacred cow here...
Apology to Ubuntu forum.
* Being so emotionally wrapped up in something, you desperately try to put yourself in a good light ("I followed all the steps") while marking down those you feel have wronged you with clever checklists: Check
To put your issues in perspective, people on slashdot have LOST THEIR JOBS due to some IT issue, yet they managed to get over it. Meanwhile, you are CONSTANTLY churning out whines about a BOOT LOADER FAILURE that happened TWO YEARS AGO. My fucking server crashed with full loss-of-data three months ago, and I'm not whining about it here.
What will make you happy? What will soothe this wound that runs oh so deep? Want Mark Shuttleworth to suck your dick? Want a free computer? Your quest for justice in this matter is obviously THE LARGEST PROBLEM HUMANKIND HAS EVER FACED; what measures should I ask my elected representatives to support? Let me know real quick, I want to put this NUMBER ONE PRIORITY issue to rest in a bipartisan manner so we can go back to other more menial issues, like world hunger.
UbuntuDupe, do yourself a favor this holiday season: Volunteer at a soup kitchen, help out those that are less fortunate than you. There are people who have nothing. No computer to install Ubuntu on, no roof over their heads, no GRUB in their tummies. That should put some things in perspective.
Urgh!
Walter Reed ARMY Medical Center is Army/DoD health care for active-duty personnel.
WRAMC may or may not exhibit issues similar to those of the VA hospital system. That remains to be seen.
However, confusion between the two systems does not help your credibility.
You are right that the Walter Reed scandal was a travesty. However you are missing one key detail. It ISN'T a VA facility! Walter Reed is an army hospital, meaning it is run by the Department of Defense. The VA and DoD are separate departments, each with their own cabinet secretary.
Put another way, would it be reasonable or appropriate to blame NIH (Dept of Health and Human Services) for security breaches at Los Alamos (Dept of Energy)? I mean, they both do basic science research, so they must be the same, right?
-Ask to do stuff I already tried? Check.
What, they're supposed to be mind readers? Yes, I know that some people will gloss over (and did gloss over) what you wrote but you can't blame everyone that's trying to help. There's nothing wrong with pointing out their oversight, but still, there's a line between pointing something out and being a dick. The first person was even apologetic about the oversight. There were also more detailed questions about the information that you included but you counted them as the same thing and just got annoyed at them for asking instead of providing the extra detail.
-Pretend like a download/burn failure could cause this specific problem? Check.
Maybe it could. I've seen bad burns do *very* weird things like cause some programs to not run (with a seemingly successful installation). Also, just because you didn't get any errors doesn't mean you don't have a bad burn. Since you dismiss a possible cause as impossible, here are a few personal anecdotes. I used to build computers and I once tried to get a customer to follow my instructions because I knew that the answer to the boot problem was simply a reversed floppy cable. The customer refused to listen to me and he brought his computer in. I had him watch me reverse the cable and the problem went away. Just grit your teeth and try their suggestions, even if it sounds stupid (especially if you've already tried everything else). I've seen problems on Dells go away because I reseated the CPU at the suggestion of one of their desktop techs. It sounded stupid at the time but it worked. My boss (who is not very technical) sometimes has very stupid sounding recommendations. I've learned to attempt them (if I don't think that it will cause more problems) because he's been right more than once in the past. You are not special. You are not above having strange problems that have seemingly strange or absurd solutions.
-Give inconsistent story about which CD is needed to fix boot errors? Check.
Ah, well they're volunteers with varying levels of experience. That may be a valid complaint, but still no reason to be a dick.
-Ignore information about error message? Check.
Other than not being provided with an exact solution that worked or one that you actually felt was worthy of your time, I don't see where everyone ignored your error message. In fact, I see a few posts that attempted to address that error specifically.
-Focus on irrelevant Windows usage? Check.
Actually, Windows has a boot loader too so it's pretty relevant if trying to get you back into at least one OS. Even if that wasn't the goal, the boot strap process is different between Windows 98 and Windows XP (as examples). You could use the Windows XP boot loader to boot Linux but that can't be done under Windows 98. Just answer the damned question about what version of Windows it is. Sure, the original person asking the question might not be able to do anything with that information, but someone else might. Just because you think it's irrelevant doesn't mean that it is.
-Feigning surprise that someone would run Ubuntu in a completely anticipated, common environment? Check.
That's just a hindsight is 20/20 kind of thing. Yes, some people are going to say things that are not helpful at the given moment, even if it's sound general advice. The fact that some mentioned that you should've done a test install first shouldn't count against them and I hope that it's something that you now make a conscious decision about whether or not to do. For the record, having three hard drives is not a common environment.. at least, not if the literally hundreds of (probably around a thousand) PCs that I've directly installed or serviced are representative of "common." (Certainly not unheard of, but not common as you claim.)
'I should never have believed all that crap about "providing access to all".'
Be a dick on the second post?- Check.
"You also have refused to answer the question of how to edit the boot loader."
Assume that you alre
There has never been a successful failover test of the RI Department of State's Central Voter Registration System. Most of that is because of the obstinacy of the RI Department of Administration.
And State recently suffered a MAJOR web outage. Press says it was hacked, I know better. I used to manage that web server before I was summarily laid off. The MySQL database would start going haywire because it was an ancient version. All you had to do was kill the MySQL slave and restart MySQL and all would be fine.