IT Infrastructure As a House of Cards
snydeq writes "Deep End's Paul Venezia takes up a topic many IT pros face: 'When you've attached enough Band-Aids to the corpus that it's more bandage than not, isn't it time to start over?' The constant need to apply temporary fixes that end up becoming permanent are fast pushing many IT infrastructures beyond repair. Much of the blame falls on the products IT has to deal with. 'As processors have become faster and RAM cheaper, the software vendors have opted to dress up new versions in eye candy and limited-use features rather than concentrate on the foundation of the application. To their credit, code that was written to run on a Pentium-II 300MHz CPU will fly on modern hardware, but that code was also written to interact with a completely different set of OS dependencies, problems, and libraries. Yes, it might function on modern hardware, but not without more than a few Band-Aids to attach it to modern operating systems,' Venezia writes. And yet breaking this 'vicious cycle of bad ideas and worse implementations' by wiping the slate clean is no easy task. Especially when the need for kludges isn't apparent until the software is in the process of being implemented. 'Generally it's too late to change course at that point.'"
In most organizations, the IT department is treated as pure cost instead of something that provides strategic value. These IT departments have no chance of getting a budget approved that will allow them to "start over" on any part of their implementation; hence the constant onslaught of temporary fixes and patches.
"It's a reverse vampire...they....they crave the sun!"
Well, sure, IT departments place the blame there. The problem, though, is not so much with the products that IT "has to deal with" as with the fact that IT departments either actively choose the penny-wise-but-pount-foolish course of action of applying band-aids rather than dealing with problems properly in the first place, or because -- when the decision is not theirs -- they simply fail to properly advise the units that are making decisions of the cost and consequence of such a short-sighted approach.
When IT units don't take responsibility for assuring the quality of the IT infrastructure, surprisingly enough, the IT infrastructure, over time, becomes an unstable house of cards, with the IT unit pointing fingers everywhere else.
If your process -- whether its for development or procurement -- doesn't discover holes before it is too late to do anything but apply "temporary" workarounds, then your process is broken, and you need to fix it so you catch problems when you can more effectively address them.
If your process leaves those interim workarounds fixes in place once they are established without initiating and following through on a permanent resolution, then, again, your process is broken and needs fixed.
You don't fix the problems with your infrastructure that have resulted from your broken processes by "wiping the slate clean" on your infrastructure and starting over. You fix the problems by, first, improving your processes so your attempts to address the holes you've built into your infrastructure don't create two more holes for every one you fix, then by attacking the holes themselves.
If you try to through the whole thing out because its junk -- blaming the situation on the environment and the infrastructure without addressing your process -- then:
(a) you'll waste time redoing work that has already been done, and
(b) you'll probably make just as many mistakes rebuilding the infrastructure from scratch as you made building it the first time, whether they are the same or different mistakes.
Magical thinking like "wipe the slate clean" doesn't fix problems. Problems are fixed by identifying them and attacking them directly.
This the essence of technical debt. Whether you're programming or deploying IT infrastructure, it's inescapable that sometimes you're going to have to include kludges to work around edge conditions, a vocal 1% of your users, or whatever. These kludges are eyesores, and fragile, but they're also as far as you could go with the time and budget you had.
Sometimes, accruing debt like this enhances your liquidity and ability to respond to change, so avoiding all kludges introduces other more obvious costs that slow you down and make you seem unresponsive to users or customers. But you can't just go on letting your debt grow all the time and not eventually come up technically bankrupt. Let it grow when you have to, but just as importantly make time to pay it down. A lot of this stuff can be paid down a little at a time, as you come across it a few months later. The pay-off if you're vigilant is that the next ridiculously urgent fix to that system can often be handled much more easily, without dipping down further... with patience and attention to maintaining this balance, you can reduce your technical debt and make the whole system hum.
The downside is that there isn't a quick fix when you find yourself deep in technical debt. You can't just spend all your time reducing it; your highest aspiration at that point should be maintaining the level of technical debt, rather than letting it grow, but it's generally been my experience that altering the curve of debt growth even a little can set you on the right path.
--Matthew
Ah yes, the sunk cost fallacy.
Any sufficiently unpopular but cohesive argument is indistinguishable from trolling.
The network it was running was not a small network. Not at all. It was a travesty that this poor switch was running the network. Well over 200 devices plugged into other 2548s all bridged back to the poor "core" switch.
Sent from your iPad.
Yeah, I saw that line and immediately thought about some of the "temporary solutions" people have proposed over the years. The statement is an oxymoron. It's either not a solution to the problem, or it's not temporary.
We've got less of those being made now, because I've taken to listing the previous "temporary solutions" every time someone proposes a new one.
-- "So they told me that using the download page to download something was not something they anticipated." - Bill Gates
Some of the concurrency stuff needs a complete rewrite - acquiring synchronization primitives is painful, the new 'amazingly fast' locking that they use for GCD is marginally better than a FreeBSD mutex, and between one and three orders of magnitude (depending on load) faster than a Darwin mutex. Part of this is a userspace problem (not optimising for the uncontended case, which is the most common in good code), but a lot of it comes from the route down through the myriad kernel layers when sleeping a thread.
That problem in Mach is part of what gave microkernels a bad name. QNX, which is a real microkernel (about 65K of code) does thread dispatching, locking, and message passing very fast, in constant time, and without long interrupt lockouts. Those are the functions which must go fast in a microkernel, because they're used so much. In QNX, locking a mutex in the uncontested case is about three instructions in-line, with no system call. Those three functions are most of what the QNX kernel really does. In Mach, they were an afterthought, written on top of BSD.
This really belongs in the "when is it time to rewrite" thread.