IT Infrastructure As a House of Cards


Posted by Soulskill on Monday May 24, 2010 @10:20AM from the if-it-ain't-broke-it-will-be-soon-enough dept.

snydeq writes "Deep End's Paul Venezia takes up a topic many IT pros face: 'When you've attached enough Band-Aids to the corpus that it's more bandage than not, isn't it time to start over?' The constant need to apply temporary fixes that end up becoming permanent is fast pushing many IT infrastructures beyond repair. Much of the blame falls on the products IT has to deal with. 'As processors have become faster and RAM cheaper, the software vendors have opted to dress up new versions in eye candy and limited-use features rather than concentrate on the foundation of the application. To their credit, code that was written to run on a Pentium-II 300MHz CPU will fly on modern hardware, but that code was also written to interact with a completely different set of OS dependencies, problems, and libraries. Yes, it might function on modern hardware, but not without more than a few Band-Aids to attach it to modern operating systems,' Venezia writes. And yet breaking this 'vicious cycle of bad ideas and worse implementations' by wiping the slate clean is no easy task. Especially when the need for kludges isn't apparent until the software is in the process of being implemented. 'Generally it's too late to change course at that point.'"


All comes down to budget by Admodieus · 2010-05-24 10:23 · Score: 5, Informative

In most organizations, the IT department is treated as pure cost instead of something that provides strategic value. These IT departments have no chance of getting a budget approved that will allow them to "start over" on any part of their implementation; hence the constant onslaught of temporary fixes and patches.

--
"It's a reverse vampire...they....they crave the sun!"
1. Re:All comes down to budget by eln · 2010-05-24 10:42 · Score: 5, Informative
  
  The problem is not with kludges themselves, but with the fact that IT management does not stress documentation and proper change control procedures enough. If a kludge works, is documented, was implemented with proper change controls, and can be repeated, is it really a kludge anymore? IT has to screw around with stuff to make it work, that's what they (we) get paid for. If all we ever had to do was click on an install button and have everything work perfectly from there, what would be the purpose of an IT department at all? Off-the-shelf software and hardware can never be made to work perfectly for everyone's requirements. IT folks are paid to get non-unique components to work for unique requirements.
  
  The problem is not with these fixes, it's that nobody ever documents what they did, and documentation is not readily available when needed. So, these kludges become tribal knowledge, and people only know about them because they were around when they were implemented or they've heard stories. When this happens, these wacky fixes can come back and bite you in the ass later when something mysteriously crashes and no one can get it to work like it did because nobody remembers what was done to make it work before. As people come and go, and institutional knowledge of older systems slowly erodes, we end up in a situation where everyone thinks the current system is crap, nobody knows why it was built that way, and everyone figures the only way out is to nuke the site from orbit and start over. The trick is keeping it from getting to that point.
  
  Of course, nobody likes jumping through all these hoops like filing change control requests or writing (and especially maintaining!) documentation, so it gets dropped. IT management is more worried about getting things done quickly than documenting things properly, so there's no incentive for anyone to do any of it. Before long, you get a mass of crap that some people know parts of, but nobody knows all of, and nobody knows how or where to get information about any of it except by knowing that John Geek is the "network guru" and Jane Nerd is the "linux guru".
  
  We will never get hardware and software that works together exactly the way we want them to. We will always have to tweak things to get them to work right for us. Citing lack of budgets or bug-ridden software may be perfectly valid, but those problems are never really going to be solved. Having our own house in order does not mean fixing all the bugs or being able to refresh our technology every 6 months. Having our own house in order means we know exactly what we did to make each system work right, we can repeat what we did, and everyone knows how to find information on what we did and why.
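One way to picture "we know exactly what we did, we can repeat it, and everyone can find it" is a workaround log that refuses entries without a reason or repeatable steps. The following is only a minimal Python sketch of that idea: the names `Workaround` and `WorkaroundLog`, the `CHG-0000` ticket ID, and the `legacy-billing` system are all hypothetical and do not come from the thread.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Workaround:
    """One documented kludge: what was done, why, and how to repeat it."""
    system: str
    summary: str
    reason: str
    steps: List[str]      # ordered, repeatable steps
    change_ticket: str    # hypothetical change-control reference
    applied_on: date
    owner: str


class WorkaroundLog:
    """In-memory log; a real one would live somewhere everyone can search."""

    def __init__(self) -> None:
        self._entries: List[Workaround] = []

    def record(self, entry: Workaround) -> None:
        # Reject tribal knowledge: no reason or no steps means no entry.
        if not entry.reason.strip() or not entry.steps:
            raise ValueError("a workaround needs a reason and repeatable steps")
        self._entries.append(entry)

    def for_system(self, system: str) -> List[Workaround]:
        # Everything that was ever done to make this system work.
        return [e for e in self._entries if e.system == system]


log = WorkaroundLog()
log.record(Workaround(
    system="legacy-billing",  # hypothetical system name
    summary="Run the nightly export under a compatibility wrapper",
    reason="Vendor binary fails on the current OS release",
    steps=[
        "Install the wrapper script on the batch host",
        "Point the nightly job at the wrapper instead of the vendor binary",
    ],
    change_ticket="CHG-0000",  # placeholder ID
    applied_on=date(2010, 5, 24),
    owner="jnerd",
))

for entry in log.for_system("legacy-billing"):
    print(entry.applied_on, entry.change_ticket, entry.summary)
```

The point of the validation in `record` is that the fix and its explanation get captured together, so the next person does not have to go looking for John Geek.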
Take responsibility and stop the magical thinking by DragonWriter · 2010-05-24 10:36 · Score: 3, Informative

The constant need to apply temporary fixes that end up becoming permanent is fast pushing many IT infrastructures beyond repair. Much of the blame falls on the products IT has to deal with.

Well, sure, IT departments place the blame there. The problem, though, is not so much with the products that IT "has to deal with" as with the fact that IT departments either actively choose the penny-wise-but-pound-foolish course of applying band-aids rather than dealing with problems properly in the first place, or -- when the decision is not theirs -- simply fail to properly advise the units that are making decisions of the cost and consequence of such a short-sighted approach.
When IT units don't take responsibility for assuring the quality of the IT infrastructure, surprisingly enough, the IT infrastructure, over time, becomes an unstable house of cards, with the IT unit pointing fingers everywhere else.

And yet breaking this 'vicious cycle of bad ideas and worse implementations' by wiping the slate clean is no easy task. Especially when the need for kludges isn't apparent until the software is in the process of being implemented. 'Generally it's too late to change course at that point.'
If your process -- whether it's for development or procurement -- doesn't discover holes before it is too late to do anything but apply "temporary" workarounds, then your process is broken, and you need to fix it so you catch problems when you can more effectively address them.
If your process leaves those interim workarounds in place once they are established without initiating and following through on a permanent resolution, then, again, your process is broken and needs to be fixed.
You don't fix the problems with your infrastructure that have resulted from your broken processes by "wiping the slate clean" on your infrastructure and starting over. You fix the problems by, first, improving your processes so your attempts to address the holes you've built into your infrastructure don't create two more holes for every one you fix, then by attacking the holes themselves.
If you try to throw the whole thing out because it's junk -- blaming the situation on the environment and the infrastructure without addressing your process -- then:
(a) you'll waste time redoing work that has already been done, and
(b) you'll probably make just as many mistakes rebuilding the infrastructure from scratch as you made building it the first time, whether they are the same or different mistakes.
Magical thinking like "wipe the slate clean" doesn't fix problems. Problems are fixed by identifying them and attacking them directly.
pay off your credit cards? by Matthew+Weigel · 2010-05-24 10:45 · Score: 5, Informative

This is the essence of technical debt. Whether you're programming or deploying IT infrastructure, it's inescapable that sometimes you're going to have to include kludges to work around edge conditions, a vocal 1% of your users, or whatever. These kludges are eyesores, and fragile, but they're also as far as you could go with the time and budget you had.
Sometimes, accruing debt like this enhances your liquidity and ability to respond to change, so avoiding all kludges introduces other more obvious costs that slow you down and make you seem unresponsive to users or customers. But you can't just go on letting your debt grow all the time and not eventually come up technically bankrupt. Let it grow when you have to, but just as importantly make time to pay it down. A lot of this stuff can be paid down a little at a time, as you come across it a few months later. The pay-off if you're vigilant is that the next ridiculously urgent fix to that system can often be handled much more easily, without dipping down further... with patience and attention to maintaining this balance, you can reduce your technical debt and make the whole system hum.
The downside is that there isn't a quick fix when you find yourself deep in technical debt. You can't just spend all your time reducing it; your highest aspiration at that point should be maintaining the level of technical debt, rather than letting it grow, but it's generally been my experience that altering the curve of debt growth even a little can set you on the right path.
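To put rough numbers on "altering the curve of debt growth even a little", here is a toy Python model. It is a sketch under loud assumptions: the backlog is counted in open workarounds, it changes by a fixed amount each month, and every figure (40 to start, 3 added per month, the pay-down rates) is made up for illustration.

```python
def project_debt(start, added_per_month, paid_per_month, months):
    """Return the backlog of open workarounds at the end of each month."""
    backlog = start
    history = []
    for _ in range(months):
        # New kludges come in, some old ones get cleaned up; never below zero.
        backlog = max(0, backlog + added_per_month - paid_per_month)
        history.append(backlog)
    return history


# Hypothetical figures: 40 open workarounds today, 3 new ones a month.
ignored = project_debt(start=40, added_per_month=3, paid_per_month=0, months=12)
steady = project_debt(start=40, added_per_month=3, paid_per_month=3, months=12)
paying = project_debt(start=40, added_per_month=3, paid_per_month=4, months=12)

print("ignore it:     ", ignored[-1])
print("hold the line: ", steady[-1])
print("pay a bit more:", paying[-1])
```

Under these assumed numbers, cleaning up just one more workaround a month than you create is the difference between a backlog that shrinks and one that nearly doubles in a year.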

--
--Matthew
Re:Software = untouchable mentality by Ichijo · 2010-05-24 12:28 · Score: 3, Informative

There's this belief (often held all the way up the management chain to the top) that software, even bad software, represents some kind of massive, utterly permanent investment that must never be thrown away and re-written.
Ah yes, the sunk cost fallacy.

--
Any sufficiently unpopular but cohesive argument is indistinguishable from trolling.
Re:As a non-developer, this is what I see by Em+Emalb · 2010-05-24 12:33 · Score: 2, Informative

The network it was running was not a small network. Not at all. It was a travesty that this poor switch was running the network. Well over 200 devices plugged into other 2548s all bridged back to the poor "core" switch.

--
Sent from your iPad.
Re:like bubblegum under a desk... by Tridus · 2010-05-24 13:05 · Score: 3, Informative

Yeah, I saw that line and immediately thought about some of the "temporary solutions" people have proposed over the years. The statement is an oxymoron. It's either not a solution to the problem, or it's not temporary.
We've got fewer of those being made now, because I've taken to listing the previous "temporary solutions" every time someone proposes a new one.

--
-- "So they told me that using the download page to download something was not something they anticipated." - Bill Gates
Re:I was torn between modding this up and commenti by Animats · 2010-05-25 03:51 · Score: 2, Informative

Some of the concurrency stuff needs a complete rewrite - acquiring synchronization primitives is painful, the new 'amazingly fast' locking that they use for GCD is marginally better than a FreeBSD mutex, and between one and three orders of magnitude (depending on load) faster than a Darwin mutex. Part of this is a userspace problem (not optimising for the uncontended case, which is the most common in good code), but a lot of it comes from the route down through the myriad kernel layers when sleeping a thread.
That problem in Mach is part of what gave microkernels a bad name. QNX, which is a real microkernel (about 65K of code) does thread dispatching, locking, and message passing very fast, in constant time, and without long interrupt lockouts. Those are the functions which must go fast in a microkernel, because they're used so much. In QNX, locking a mutex in the uncontended case is about three instructions in-line, with no system call. Those three functions are most of what the QNX kernel really does. In Mach, they were an afterthought, written on top of BSD.
This really belongs in the "when is it time to rewrite" thread.