The Four Fallacies of IT Metrics
snydeq writes "Advice Line's Bob Lewis discusses an all-too-familiar IT mistake: the use of incidents resolved per analyst per week as a metric for assessing help-desk performance. 'If you managed the help desk in question or worked on it as an analyst, would you resist the temptation to ask every friend you had in the business to call in on a regular basis with easy-to-fix problems? Maybe you would. I'm guessing that if you resisted the temptation, not only would you be the exception, but you'd be the exception most likely to be included in the next round of layoffs,' Lewis writes. 'The fact of the matter is it's a lot easier to get metrics wrong than right, and the damage done from getting them wrong usually exceeds the potential benefit from getting them right.' In other words, when it comes to IT metrics, you get what you measure — that's the risk you take."
Metrics are great for some things. For making sure that your employees are working, they are terrible. I used to work in a metric-free environment and there was a great team atmosphere. Then metrics came along and it all went to hell. Now everyone is so focussed on making their numbers look good that the whole organisation is suffering from a weird sense of internal competitiveness. People no longer collaborate on difficult problems because nothing in the metrics system reflects that the collaboration occurred. People who used to be innovative are no longer so, because they are not rewarded for spending time innovating. It has achieved nothing good that I can see.
Good. Glad to see that some VP did the smart thing for once and cut the middle managers instead of the people who actually get the work done.
Stats by themselves will only ever be an indicator of what is happening. You really need trustworthy managers on the ground to give you feedback on how things are actually going.
Taking humans out of the loop when rating other humans is always a mistake
It said "windows 98 or better" so I installed Linux
I work for an MSP (Managed Service Provider). We account for time every 15 minutes: inactive, internal department active, billable active, and non-billable active. All of this logged time gets calculated out as metrics that define our bonus. So the outcome is pretty much as you've stated. But that's ok, we know how the metrics get calculated and thus we game the system of metrics without cheating our clients out of money. Naturally, doing otherwise would be dishonest. But I'll be damned if I sit back and be judged and taken advantage of by some MBA who can't even interpret what those numbers are supposed to mean in the first place. The clients only need to know two things: is the work billable to them, and how much. They're free to speak to a manager if they wish to contest the hours performed and/or quality of work. The point is, we want their business, so there is no point in doing anything that would lose us clients.
It will get worse, I hear. Rumor has it we will be timed every 5 minutes with a USB activity button. Sort of like a chess timer or some such. Also, our keyboards will be logged for activity and application fields will track mouse movement and other activity. It's absolutely nuts. At this rate, they'll need to hire me a secretary just to do the logging for me while I focus on actual work. Hey, now that's cost effective, right? I bet they didn't think of that, did they? Doh!
Life is not for the lazy.
Be warned: my example is way off topic, but a pet statistic I keep track of.
There is no such thing as bad statistics, only bad layman statisticians who don't understand what the numbers actually measure.
Take lines of code, for example. Some people hate it because you can bloat the numbers by adding comments, neglecting to consider how useful those comments are for future maintenance, and thereby a useful application of a developer's time. If you use a consistent formatting style for two projects, you can get a fair grasp of their complexity from the line count, though that will gloss over details about how the code actually works.
The most interesting pattern I've noticed in line counts over the years is that the use of templates and other code abstraction facilities really hasn't decreased the size of code much at all, though it's improved readability, maintainability, and programmer API usability substantially. So line counts only give you an approximation of complexity with a language like Java, but do nothing to measure the quality of the code.
One other thing I've found is that complex code looks fat and heavy from its sheer size, but often compiles to a very reasonable executable size and runs rings around supposedly "tight" code that makes heavy use of dynamic techniques like introspection. As only one image of an executable is loaded by a reasonably competent OS, a fat binary does not mean a fat application at runtime.
Big code is only scary if it's not following recognizable patterns and is instead a mishmash of different developers' pet syntax, algorithms, style conventions, naming conventions, and even preferred APIs. If you manufacture it predictably, fat source code becomes a joy to maintain, enhance, and use.
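To make the line-count point above concrete, here is a minimal sketch of a counter that reports code, comment-only, and blank lines separately, so comments neither inflate the number nor vanish from it. The function name `count_lines` and the single-prefix comment rule are assumptions for illustration; it ignores block comments and trailing comments entirely.

```python
def count_lines(source: str, comment_prefix: str = "#") -> dict:
    """Classify each line as code, comment-only, or blank (hypothetical helper)."""
    counts = {"code": 0, "comment": 0, "blank": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            # Lines with a trailing comment still count as code.
            counts["code"] += 1
    return counts


sample = """# add two numbers
def add(a, b):
    # plain addition
    return a + b

print(add(1, 2))
"""

print(count_lines(sample))
```

Comparing two projects this way only means something if both use the same formatting style, which is the caveat already made above.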
But back to the core topic: help desk performance.
The only help desk stat I care about is a low number on customer complaint reports about the quality of information and assistance provided by the tech team. If it's my company and my budget, I'd rather hire more technicians to handle the load and produce happy customers in the end than I would saving money by overworking and burning them out by even thinking about useless numbers like "calls handled per week."
In the end, if you care about your business, the only thing that truly matters is happy customers who want more services or products in the future, and who will gladly tell others about their good experiences in dealing with you.
There is no substitute for a good word-of-mouth reputation and repeat business. No one ever got fired for buying IBM, not because they're perfect, but because their people will go the extra mile to make things work.
I do not fail; I succeed at finding out what does not work.
Support... ... also means 'helping you set things up right', 'helping you optimize your configuration', 'helping you figure out what tool you need for the job at hand', and so on.
Worked at a support center... I was a "talk to them until they understand" guy, playing the long game... I figured while it might not take every time, if I got people to understand, they could get back to work and not break things for just a little bit longer. You know, it costs two people money if they have to talk to me while I help them.
One of my coworkers got huge amounts of management praise for processing lots and lots of cases... My management was too dumb to run the numbers on how many callbacks he generated, which the rest of us were fixing...
Yeah, sure I was spending too much time with each person, but half of my time was fixing this jerk's mistakes. There's probably some of that at every support center. It takes 10 minutes to fix a problem, but 5 minutes to get them to go away. You can look very busy by making them go away, if management isn't clever enough.
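The numbers management never ran could look something like the sketch below: count each analyst's closed cases alongside how many of them came back as callbacks. The tuple layout, the `reopened` flag, and the analyst names are all assumptions for illustration, not any real ticketing system's schema.

```python
from collections import defaultdict


def callback_adjusted(cases):
    """cases: iterable of (analyst, case_id, reopened) tuples (hypothetical layout).

    Returns {analyst: (closed, callbacks, first_time_fix_rate)}.
    """
    closed = defaultdict(int)
    callbacks = defaultdict(int)
    for analyst, _case_id, reopened in cases:
        closed[analyst] += 1
        if reopened:
            callbacks[analyst] += 1
    return {
        a: (closed[a], callbacks[a], (closed[a] - callbacks[a]) / closed[a])
        for a in closed
    }


# Made-up data: "fast" closes ten cases but six come back; "slow" closes four, none come back.
cases = [("fast", i, i < 6) for i in range(10)] + [("slow", 100 + i, False) for i in range(4)]
report = callback_adjusted(cases)
for analyst, (n_closed, n_callbacks, rate) in sorted(report.items()):
    print(analyst, n_closed, n_callbacks, round(rate, 2))
```

In this made-up data both analysts end up with four cases that stayed fixed, even though one of them closed two and a half times as many.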
I'm rather happy with my new position... I get to review other people. And I do it fairly.
I worked in a helpdesk many years ago where we were all measured on the number of calls per week we closed. There was no consideration towards the complexity of the call given.
Our boss at the time started giving a $100 incentive to whoever closed the most calls. One of the guys in there consistently got the prize. One day, while looking up a call, I fat-fingered a digit and found myself looking at one of his tickets... it was a ticket, opened and closed, about receiving a phone call from X. $ticketnum +1 was the actual ticket for X.
In a nutshell, with some sorting/filtering I saw that the guy was not only gaming the system, but hiding the fact that he was grossly incompetent. I wrote everything up and showed it to our boss. Needless to say, he was less than happy not only with this guy, but with me. He was being pushed by his boss to generate metrics and basically was complicit.
Long story short, I went to his boss's boss, i.e. the CIO, and voiced my frustration. I pointed out the fallacy of this metric: me imaging a laptop (which back then took hours) and me answering the phone both counted as the same amount of productivity, which made the metric useless. Not to mention the fact that it provided zero incentive to provide better support, just incentive to close tickets.
Obviously, this caused some huge changes. Not the least of which was a much more comprehensive analysis of what people were actually doing. This made quite a few people unhappy because it exposed them for being the incompetent hacks they were. Not the least of whom were my boss at the time and that employee.
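The sorting/filtering described above could be sketched roughly as follows: look for a ticket closed almost immediately whose next ticket number belongs to the same analyst and the same caller, and total up time spent rather than tickets closed. The `Ticket` fields, the two-minute threshold, and the sample data are assumptions for illustration; a real ticketing system will store this differently.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    number: int
    analyst: str
    caller: str
    minutes_open: float


def suspicious_pairs(tickets, max_minutes=2.0):
    """Find quick-closed tickets followed by ticket number+1 for the same analyst and caller."""
    by_number = {t.number: t for t in tickets}
    pairs = []
    for t in tickets:
        nxt = by_number.get(t.number + 1)
        if (nxt is not None and nxt.analyst == t.analyst
                and nxt.caller == t.caller and t.minutes_open <= max_minutes):
            pairs.append((t.number, nxt.number))
    return sorted(pairs)


def effort_minutes(tickets):
    """Total minutes per analyst, a crude stand-in for work done instead of tickets closed."""
    totals = {}
    for t in tickets:
        totals[t.analyst] = totals.get(t.analyst, 0.0) + t.minutes_open
    return totals


tickets = [
    Ticket(100, "bob", "X", 1),    # "received a phone call from X"
    Ticket(101, "bob", "X", 45),   # the actual ticket for X
    Ticket(102, "amy", "Y", 180),  # imaging a laptop
    Ticket(103, "bob", "Z", 30),
]
print(suspicious_pairs(tickets))
print(effort_minutes(tickets))
```

On raw ticket counts bob wins three to one; on minutes of work the picture reverses, and the padded pair stands out. Minutes can be gamed too, so this is a prompt for a human to look, not a replacement metric.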
Yes Francis, the world has gone crazy.
The major fallacy many big companies fall into is that some of these systems have been running flawlessly for years, because they hired a competent IT staff. They look at the price of those paychecks and shiver. Why are we paying so many high priced engineers when we've never had a problem, they think.
So they reduce staff and start to rely on support contracts instead of on-site gurus. The gurus are still there to solve any oh-shit moments. But that back investment in good engineers has produced a stable infrastructure that runs with few problems for years. So they reduce staff more, pay for more support contracts, and eventually the system critical mass is greater than the engineers who can support it. It's no problem until it's a problem.
Eventually something minor goes wrong, but nobody notices, or if they do, it's not really their field of expertise, so they don't understand that it's minor now but could escalate. When it does, something else goes wrong, and a cascade effect takes out more and more systems. With a full staff, you have enough guys that when the critical mass is reached, they can start defensive measures and get things back in working order in no time. With support staff only, things are going wrong faster than they can deal with them.
"Call on our support contracts," shout the bosses! So now your on-site staff are all on hold instead of troubleshooting. When they get through to someone, they have to spend the first hour or two describing their infrastructure to the technician on the other end, who starts making random suggestions that maybe help, but probably don't.
My anecdote on this front is a company I used to work for. It's a long read, but it demonstrates failures at several levels that are the direct result of this kind of thinking. The Oracle transaction log disk was getting full. Some warnings came in, but disks running low on space was an everyday occurrence: we'll send an email to the person on record as being responsible for those servers, and troubleshoot why the "Executive Dashboard" is responding a bit slow today (it's for the execs, so it's automatically high priority). Except that person is currently aboard an airplane on his way to help reduce staff in east Asia; he'll be incommunicado for the next 19 hours or more.
It seems like an innocent enough problem, it's just a log disk, the worst thing that could happen is we lose some logs, right? Whoops, transaction logs are pretty important for Oracle. The fact that the disk is filling up at all is itself an indicator that something bigger is wrong; this shouldn't happen. But critically once the disk does fill up, Oracle will enter read-only mode. Or it should. This time it doesn't, it shuts down. BOOM, offline. So down goes SAP. With SAP down, our entire business is offline. We can't take orders, we can't ship orders, we can't pay bills, we can't pay paychecks, the hourly workers whose shift is starting can't even clock in. Some buildings with tighter badge access can't even be entered unless someone inside opens an emergency door to let someone in.
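The observation that a filling disk is itself a signal suggests alerting on how fast space is disappearing, not only on how much is left. Below is a minimal sketch that extrapolates linearly from usage samples to an estimated time until full. The function name, the sample format, and the numbers are assumptions for illustration and have nothing to do with how Oracle or any particular monitoring product works.

```python
def hours_until_full(samples, capacity):
    """Estimate hours until a disk fills, by linear extrapolation.

    samples: list of (hours_since_start, used) pairs, oldest first (hypothetical format).
    capacity: total size in the same unit as `used`.
    Returns None if usage is flat or shrinking.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    rate = (u1 - u0) / (t1 - t0)  # units consumed per hour
    if rate <= 0:
        return None
    return (capacity - u1) / rate


# Made-up numbers: a 100 GB log disk that went from 60 GB to 80 GB used in 4 hours.
eta = hours_until_full([(0, 60), (4, 80)], capacity=100)
print(eta)

# If the only person who gets the email is unreachable for longer than the ETA, escalate.
OWNER_UNREACHABLE_HOURS = 19
print("escalate" if eta is not None and eta < OWNER_UNREACHABLE_HOURS else "ok")
```

In practice the `used` values could come from something like Python's `shutil.disk_usage(path).used`, sampled on a schedule; the extrapolation is deliberately crude and only meant to separate "low but stable" from "low and falling fast".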
Once the transaction log disk is full, Oracle will no longer start up; it needs some space on the log disk to log startup-related transactions. After two hours on hold with Oracle Gold Pressed Latinum level support, they finally get an engineer. Wow, this is something he's never seen before; Oracle should have gone into read-only mode before this happened! The only solution anyone can seem to think of is to get some bigger disks for the transaction logs, clone the data over to these new disks and give the startup another go. We have hot spares on a shelf, but nobody knows this. Finding disks requires a different support contract; they can have disks out to us tomorrow. Yeah, that's not going to cut it. Someone literally drives out to a distribution warehouse. Two more hours down (they actually send two different guys in different cars with instructions to take different routes in case one runs into traffic or gets in an acciden
Slay a dragon... over lunch!