IT Infrastructure As a House of Cards
snydeq writes "Deep End's Paul Venezia takes up a topic many IT pros face: 'When you've attached enough Band-Aids to the corpus that it's more bandage than not, isn't it time to start over?' The constant need to apply temporary fixes that end up becoming permanent are fast pushing many IT infrastructures beyond repair. Much of the blame falls on the products IT has to deal with. 'As processors have become faster and RAM cheaper, the software vendors have opted to dress up new versions in eye candy and limited-use features rather than concentrate on the foundation of the application. To their credit, code that was written to run on a Pentium-II 300MHz CPU will fly on modern hardware, but that code was also written to interact with a completely different set of OS dependencies, problems, and libraries. Yes, it might function on modern hardware, but not without more than a few Band-Aids to attach it to modern operating systems,' Venezia writes. And yet breaking this 'vicious cycle of bad ideas and worse implementations' by wiping the slate clean is no easy task. Especially when the need for kludges isn't apparent until the software is in the process of being implemented. 'Generally it's too late to change course at that point.'"
In most organizations, the IT department is treated as pure cost instead of something that provides strategic value. These IT departments have no chance of getting a budget approved that will allow them to "start over" on any part of their implementation; hence the constant onslaught of temporary fixes and patches.
"It's a reverse vampire...they....they crave the sun!"
...but I believe in Duct Tape.
As long as your backup and tertiary machines have different kludges keeping them running, there's no problem...
Maintaining code is boring.
Everyone wants to work on the latest and greatest stuff, no one wants to maintain or even release patches.
It sucks, especially since it isn't limited to just software development.
I've seen companies where their "core switch" was a Cisco 2548. This wasn't 10 years ago, this was last year! Unreal.
Sent from your iPad.
There are a lot of people writing shitty software. Film at 11.
... and then they built the supercollider.
Well, sure, IT departments place the blame there. The problem, though, is not so much with the products that IT "has to deal with" as with the fact that IT departments either actively choose the penny-wise-but-pount-foolish course of action of applying band-aids rather than dealing with problems properly in the first place, or because -- when the decision is not theirs -- they simply fail to properly advise the units that are making decisions of the cost and consequence of such a short-sighted approach.
When IT units don't take responsibility for assuring the quality of the IT infrastructure, surprisingly enough, the IT infrastructure, over time, becomes an unstable house of cards, with the IT unit pointing fingers everywhere else.
If your process -- whether its for development or procurement -- doesn't discover holes before it is too late to do anything but apply "temporary" workarounds, then your process is broken, and you need to fix it so you catch problems when you can more effectively address them.
If your process leaves those interim workarounds fixes in place once they are established without initiating and following through on a permanent resolution, then, again, your process is broken and needs fixed.
You don't fix the problems with your infrastructure that have resulted from your broken processes by "wiping the slate clean" on your infrastructure and starting over. You fix the problems by, first, improving your processes so your attempts to address the holes you've built into your infrastructure don't create two more holes for every one you fix, then by attacking the holes themselves.
If you try to through the whole thing out because its junk -- blaming the situation on the environment and the infrastructure without addressing your process -- then:
(a) you'll waste time redoing work that has already been done, and
(b) you'll probably make just as many mistakes rebuilding the infrastructure from scratch as you made building it the first time, whether they are the same or different mistakes.
Magical thinking like "wipe the slate clean" doesn't fix problems. Problems are fixed by identifying them and attacking them directly.
Kernighan & Plauger
The Elements of Programming Style
2nd edition, 1974 (exemplified in FORTRAN and PL/1!)
i guess its ok that the sysadminds coopted the work 'implemented' where one would normally
say 'installed'
but that kind of leaves the actual implementors without a word now
and in this particular usage, its kind of odd, because usually the best time to
find and fix these problems is exactly when its being implemented, rather than
when its being installed
Wait, you mean there have been newer and faster processors released since then? So Mordac really has been hiding something from me...
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
In the defense of IT, those people they're trying to advise aren't always the best at taking advice. (But then again, neither are IT admins always the best at giving it.)
The World Wide Web is dying. Soon, we shall have only the Internet.
This the essence of technical debt. Whether you're programming or deploying IT infrastructure, it's inescapable that sometimes you're going to have to include kludges to work around edge conditions, a vocal 1% of your users, or whatever. These kludges are eyesores, and fragile, but they're also as far as you could go with the time and budget you had.
Sometimes, accruing debt like this enhances your liquidity and ability to respond to change, so avoiding all kludges introduces other more obvious costs that slow you down and make you seem unresponsive to users or customers. But you can't just go on letting your debt grow all the time and not eventually come up technically bankrupt. Let it grow when you have to, but just as importantly make time to pay it down. A lot of this stuff can be paid down a little at a time, as you come across it a few months later. The pay-off if you're vigilant is that the next ridiculously urgent fix to that system can often be handled much more easily, without dipping down further... with patience and attention to maintaining this balance, you can reduce your technical debt and make the whole system hum.
The downside is that there isn't a quick fix when you find yourself deep in technical debt. You can't just spend all your time reducing it; your highest aspiration at that point should be maintaining the level of technical debt, rather than letting it grow, but it's generally been my experience that altering the curve of debt growth even a little can set you on the right path.
--Matthew
There's nothing more permanent than a temporary fix.
the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff
From the original message we read that the "code was also written to interact with a completely different set of OS dependencies, problems, and libraries." This seems to imply that the IT organizations are allowing outside interests to dictate the rules of the game. If there were a stable set of operating system calls and libraries to rely on, then the software vendors would have an easier time maintaining software. I recognize that Linux changes, but the operating system calls work well and API is quite stable. I have used UNIX for a long time and I have compiled programs from 25 years ago under Linux. There have been some additions since then, but the basics of Linux work like the basics of UNIX from 25 years ago.
At present there are some applications available only on Windows and some only on Windows/Mac OSX. This might be difficult to change, but going along with someone's plan for computing which is based on continued obsolescence seems inappropriate. At least those who are more or less forced by software availability to use Windows should investigate Linux and negotiate with their vendors to supply Linux solutions.
Computers are hard to manage and hard to program. It is not helpful to undergo regular major overhauls in operating systems.
Ray Seyfarth, ray.seyfarth@gmail.com, http://rayseyfarth.blogspot.com
This happens in commercial software development, too. There's this belief (often held all the way up the management chain to the top) that software, even bad software, represents some kind of massive, utterly permanent investment that must never be thrown away and re-written.
I've worked with managers who would think nothing of throwing away a million dollar manufacturing machine to replace it because it's old, yet cling with all their might to ancient software code that represents a similar level of investment.
It doesn't look like "doing it right the first time" is an option here. RTFA. They're talking about vendor applications being crappy and crufty, and IT departments being required to support them. The IT department didn't pick the app, and isn't allowed to not support it. They can't switch to another app (usually apps like this have little or no competition, and they're probably locked-in anyway).
So there's really nothing they can do but complain as long as they're required to support some shitty application on the latest version of Windows, as these are the requirements set down by upper management.
I'm going to tackle some of the conceptual problems that are hinted at above, which is usually where the difficulties lie, usually in trying to use the wrong software and expecting to somehow "make everything better" if you just make it work "my way" - the true "Magical Thinking".
I tend to agree with your conclusions, "wipe the slate clean" is a drastic action. I disagree with some of the approach you use to arrive at them:
a.) Problems are solved by people being invested in solving them, not process. This requires the antithesis of "Units" - Ownership; Ownership in the company, Ownership of the mission, and a direct heart felt connection to the success of the company. Until you have staff, from the CEO down, that own problems, from the mess in the coffee room to server down time, you will have a "business house of cards" no matter how good the process. In fact, most of the time, fixing things involves re-writing and/or reconsidering process - usually starting with asking the question - "Do we really need that?"
b.) Sometimes you really do have a train wreck on your hands. If you have mastered a.) b follows almost effortlessly, because now, you can *talk* about this behemoth that is eating your company and everybody sees the discussion for what it is, not empire building or managerial fingerprinting.
when you run into a train wreck - assess your tech problem - is the fix easily found? Are your processes using the software at cross purposes? if so, which is cheaper to fix? No amount of bug fixing will repair using the wrong software. It won't even fix using the right software in the wrong way.
In the end, re-asses often, be frugal, not cheap, if it truly is a requirement to run your business, buy the most appropriate. If you've made the mistake of buying a Kenworth long hauler when you needed 3 old UPS trucks - admit it, sell it back, take your loss and get what you really need.
Thats not "magical thinking" it's just common sense.
More than any other type, businesses are run by salesmen. These are people whose strongest attributes are the ability to build relationships, to communicate value, and a strong inclination to increase their personal wealth.
Increasingly, the stuff salesmen sell is based on complex technologies that, really, are beyond the reach of their comprehension. They kind of understand the products they sell, but really, they don't. If the world only had salesmen, there wouldn't be any sophisticated products.
Say hello to the engineer...a person who builds products. His strongest attributes are a desire to solve problems, a willingness to absorb the tedious but essential details needed to build a complex system, and a personality that derives gratification from doing so.
We now begin the business cycle. The salesman says, "Build me something I can sell."
The engineers says, "I will build you something that works well."
And therein begins a lifetime of the two, symbiotically, talking past each other. The engineer serves the salesman, and the salesman serves himself. But make no mistake about it: the salesman is in control.
For a salesman, QUALITY means it works well enough for him to sell more, and most importantly, to make more money for himself. For an engineer, QUALITY means it works reliably and efficiently. To be sure, QUALITY is an abstract and moving target that varies according to the eyes of the beholder. But to understand why we have the predicament described in this article, we need only understand the SIGNIFICANCE OF QUALITY TO A SALESMAN.
I would continue to expound, but then, most readers here need only reflect on their already frustrated pasts to understand the mechanics of this convenient but often vacuous relationship.
Both are, IMO, essential, which is while while I pointed at particular areas of process, my big picture message was about IT shops taking "responsibility for assuring the quality of the IT infrastructure."
Neglect of process is a symptom of people not being invested in solving problems that leads to bad results on its own, but even a good (nominal) process isn't going to work well if people aren't invested.
I prefer "responsibility"; "ownership" is, IMO, misapplied here. (Though, arguably, one of the reasons people do not take responsibility is because they don't, in fact, have ownership -- but ownership is a material relationship, and responsibility is the relevant attitude.)
But I think in substance we generally agree.
You kind of contradict yourself there: if fixing things usually requires changing the process, then "how good the process" is obviously has fairly direct bearing on success. The key thing is that processes aren't good (or bad) in a vacuum, they are good or bad based on the effects they have in your organization, in acheiving your mission; the same nominal process that is good for a group of people when considered against one mission is going to be bad for the same group of people when considered against different goals, and the same process that is good for one group of people with a given mission will suck for another group of people with the same mission, because people matter.
I was torn between modding this up and commenting.
I picked commenting.
This statement:
Everyone wants to work on the latest and greatest stuff, no one wants to maintain or even release patches.
is very, very true. We (Apple) have a hard time getting applicants who want to do anything other than work on the next iPhone/iPad/whatever. Mainline kernel people are difficult to hire, even though the same kernel is being used on the iDevices as is being used on the regular Macs. Everyone wants to work on the new sexy. For some positions, that works, but for most of them, you have to prove yourself elsewhere before you get your shot.
I think that, for the most part, we see the same thing in marketing for higher education (with the exception of one track, one of the universities I went to has become a diploma mill for Flash game programmers; sadly, I would not hire recent graduates from there unless they have an experience track record). There are video game classes at most universities, but while it might be sexy, you are most likely not going to be getting a job doing video games, 3D modeling for video games, or anything video game related, really, unless you get together with some friends and start your own company, and even then it's a 1 out of 100 chance of staying in business.
I don't really know how to address this, except by the people who think they are going to be the next great video game designer remaining unemployed.
-- Terry
There is constant pressure to re-implement existing architecture. Most of the time, the people who want to do this do not have a clear understanding of the business process involved, don't realize that the existing frameworks represent years of bug fixes and are at least stable for that reason. They only think "Wow this sucks, a new one would HAVE to be better."
I'm not saying that you should never rebuild something from the ground up, but the scope of the project should be limited and the entire endeavor should be well documented and well understood from the beginning. And if the guy who's pushing for a rewrite can't demonstrate a deep and fundamental understanding of the business flows being automated, he should be taken out and shot (Or at least pummeled soundly.)
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Been there, done that.
If you've got even a small or medium sized enterprise application (whatever that buzz word means) at a larger company (Boeing, for example), it might have its hooks into a dozen or more peer systems/hosts/databases/whatever. They are all 'owned' by different depatments, installed and upgraded over many years. Each on their own schedule and budget. When one group gets the funds together to address their legacy ball of duct tape and rubber bands, they roll the shiney new hardware in and install the spiffy new app. But everyone else is a few years away from affording new systems. And so the inter-system duct tape is simply re-wrapped.
The IT department tried selling everyone on architecture standardization. But due to the gradual pace of system upgrading, the plan was out of date before everyone got caught up to the old one. And today's 'standard' architecture wouldn't play nicely with what was state of the art a few years ago (thanks Microsoft). The whole architecture standard ploy is a salesman's pitch to get management locked into their system. Unless you've got a small enough shop that you can change out everyone's desktop and the entire contents of the server room over a holiday weekend (another salesman's wet dream), it ain't gonna work.
The solution is to bite the bullet and admit that your systems are always going to be duct-taped together. And then make a plan for maintaining the bits of duct tape. There's nothing wrong with some inter-system glue as long as you treat it with the same sort of respect and attention to detail that one would use for the individual applications.
Have gnu, will travel.
The problem with the whole idea of "if we only had enough documentation and change control" is that it becomes a non-trivial event to actually read through the documentation. Let's take an imaginary system that's been in production for 5 years...assume every last drib and drab of change has been documented...now you've got a 2000 page document and several hundred change records that tell you *everything*. Except, when it comes right down to it, mastering that 2000 pages of documentation and all the changes made afterwards is a months if not years long project - hardly effective for dealing with production problems that need to be solved in minutes or hours.
The illusion being perpetrated here is that people are interchangeable, and if you just have enough documentation, you can replace Mr. Jones with 20 years of hands on experience with the system with Mr. Vishnu living in Bangalore (or even Mr. Smith in the next cube, for that matter), with a net cost savings.
Now, I'm not saying documentation is a bad thing -> lord knows, it helps to have a knowledge base you can search...but knowing what to search for is knowledge you only get by real world experience with maintaining a production system. This is not digging ditches, boys and girls, this is skilled, if not essentially artistic labor.
Simply put, people matter more than process.
We (Apple)...
...
...has become a diploma mill for Flash game programmers; sadly, I would not hire recent graduates from there...
Sounds about right to me!
Odds of +1 funny over flamebait/troll/offtopic: slim to none. I just hope your 2548 dies before you can mod me down!
The problem, though, is not so much with the products that IT "has to deal with" as with the fact that IT departments either actively choose the penny-wise-but-pount-foolish course of action of applying band-aids rather than dealing with problems properly in the first place, or because -- when the decision is not theirs -- they simply fail to properly advise the units that are making decisions of the cost and consequence of such a short-sighted approach.
I've found the problem to almost always be the last thing listed. It's the contractor syndrome. "If you give me $1,000,000 now, I'll save your $500,000 a year for the rest of the time you'd have used that." Well, they think you are lying. They think that you wouldn't actually save the $500,000 a year, but would take the $1,000,000 this year and add it to your budget as a permanent line item, costing them $1,500,000 a year, rather than saving $500,000.
You can blame the IT director/manager/CIO/whatever for not being convincing enough, but there seems to be a pattern where people bid low then have massive overruns where the highest bidder would have been cheaper. As such, the people the IT person is talking to are often so jaded they don't trust anyone with price estimates.
When IT units don't take responsibility for assuring the quality of the IT infrastructure, surprisingly enough, the IT infrastructure, over time, becomes an unstable house of cards, with the IT unit pointing fingers everywhere else.
And when the IT units have the responsibility, but not the authority to fix things, what then? Most all places tie the hands of IT then complain when the solution isn't perfect.
Learn to love Alaska
I am shocked! Shocked I tell you. Apple applicants are only attracted to shiny new things? And all along I thought it was just the customers.
Everyone wants to work on the latest and greatest stuff, no one wants to maintain or even release patches.
I don't really know how to address this, except by the people who think they are going to be the next great video game designer remaining unemployed.
Here's how you address it: you hire one of those 9 out of 10 CS graduates who "Just got in it for the money". Had you offices in the Midwest, you'd have no problem finding programmers whose only ambition is to crunch out brain-dead code until they can move into management. Trust me, I work with these people and they're even worse than the primadonnas interested only in the "cool" things. Naturally, not everyone can be the next game programmer, or work on cool things, but you probably don't want to hire those whose only ambition is to do the grunt work.
Typically, the primadonna has to have his ego coaxed into doing the grunt work. But you can usually count on him to do it fast, and not to make a total mess of things. Granted, some people have a higher estimation of their abilities than their peers. But at least someone passionate about coding can be inspired to improve their code; they'll actually accept coding standards once reasonably explained. But here's a short list of problems with the typical "career type":
It's easier to convince a rock-star programmer that documentation is necessary than it is to convince the career-track political programmer that a race condition is a problem, that architecture matters, that maintainability and scalability are important. Just the other day, I had a department manager question the value of writing reusable code - in fact, he was so hostile as to suggest that it wasn't worth our time to make code reusable... (And not only that, but reported to my boss that my suggestion otherwise was "distracting to what we're trying to accomplish here"...)
I know the starry-eyed programmers can be a handful at times, but those indifferent to technical issues will lay a minefield in your company. Suddenly, years after they've moved on, you'll find your new hires telling you the projects they built aren't worth salvaging, that you'll have to start over, etc... I've seen these types move into management and turn an otherwise fun profession into a death march. You don't want the stupid, or the political, types of people writing code. They'll set your company up for failure every time.
The society for a thought-free internet welcomes you.
Some of the concurrency stuff needs a complete rewrite - acquiring synchronization primitives is painful, the new 'amazingly fast' locking that they use for GCD is marginally better than a FreeBSD mutex, and between one and three orders of magnitude (depending on load) faster than a Darwin mutex. Part of this is a userspace problem (not optimising for the uncontended case, which is the most common in good code), but a lot of it comes from the route down through the myriad kernel layers when sleeping a thread.
That problem in Mach is part of what gave microkernels a bad name. QNX, which is a real microkernel (about 65K of code) does thread dispatching, locking, and message passing very fast, in constant time, and without long interrupt lockouts. Those are the functions which must go fast in a microkernel, because they're used so much. In QNX, locking a mutex in the uncontested case is about three instructions in-line, with no system call. Those three functions are most of what the QNX kernel really does. In Mach, they were an afterthought, written on top of BSD.
This really belongs in the "when is it time to rewrite" thread.