The Four Fallacies of IT Metrics
snydeq writes "Advice Line's Bob Lewis discusses an all-too-familiar IT mistake: the use of incidents resolved per analyst per week as a metric for assessing help-desk performance. 'If you managed the help desk in question or worked on it as an analyst, would you resist the temptation to ask every friend you had in the business to call in on a regular basis with easy-to-fix problems? Maybe you would. I'm guessing that if you resisted the temptation, not only would you be the exception, but you'd be the exception most likely to be included in the next round of layoffs,' Lewis writes. 'The fact of the matter is it's a lot easier to get metrics wrong than right, and the damage done from getting them wrong usually exceeds the potential benefit from getting them right.' In other words, when it comes to IT metrics, you get what you measure — that's the risk you take."
Metrics are great for some things. For making sure that your employees are working, they are terrible. I used to work in a metric-free environment and there was a great team atmosphere. Then metrics came along and it all went to hell. Now everyone is so focussed on making their numbers look good that the whole organisation is suffering from a weird sense of internal competitiveness. People no longer collaborate on difficult problems because there is no measure within the metrics system to reflect that this occurred. People who used to be innovative are no longer so, because they are not rewarded for spending time innovating. It has achieved nothing good that I can see.
Good. Glad to see that some VP did the smart thing for once and cut the middle managers instead of the people who actually get the work done.
It's not just a problem in IT. It's a problem anywhere that managers are out of their depth. When they are very badly out of their depth, poor attempts to observe what is going on can have very bad effects on an operation.
Some years ago I worked for a few months at a steelworks, and it was managed as badly as the most rabid Libertarians imagine that the worst of government is run. Management never saw the operation or the city it was in. They just saw numbers, and the most important number graphed on noticeboards everywhere around the plant was "tonnes of steel per man hour".
Now the hours were only counted for permanent employees, so contractors were shuffled in and out by the thousands to skew that number. Since they were not employees and were theoretically off the books, there was no training for those that came in from outside of the industry, which of course, given the number of people involved and the nature of the workplace, resulted in some very serious accidents with multiple deaths. Quality suffered from untrained staff and a desire to increase the tonnage above all else. Revenue went down because a lot of material had to be sold as a lower grade of steel, in addition to increased amounts of scrap which still counted in that magical number even though it had to be remelted. Nobody on site had the authority to make any major changes and any reports beyond the interesting numbers were ignored.
It went from being a profitable operation to almost completely shut down within two years - of over 16,000 employees only 300 remained to operate a small rod rolling mill that could get steel shipped in from elsewhere. The losses exposed the company to a takeover bid and they are now owned by Swiss bankers that have some odd remote-control management quirks of their own (which has created a billionaire that picked up one of their discarded operations for just about nothing).
Performance metrics are just a simple model and you have to make sure that model actually fits the situation. Trying to change reality to fit an inappropriate model can result in the opposite to what is intended.
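To make the "tonnes of steel per man hour" distortion concrete, here is a minimal sketch. Every figure in it is hypothetical, invented purely for illustration and not taken from the plant described above. It compares the number as it was reported (all tonnage, permanent-employee hours only) with a version that counts contractor hours and leaves out scrap that has to be remelted.

```python
def tonnes_per_hour(tonnes, hours):
    """Plain ratio: tonnes of steel divided by hours worked."""
    return tonnes / hours

# Hypothetical figures for one period (assumptions, not real plant data).
tonnes_poured = 1000      # everything that came out, scrap included
tonnes_scrapped = 200     # material that had to be remelted
employee_hours = 500      # permanent staff: the only hours on the books
contractor_hours = 1500   # contractor hours, left out of the reported number

# The number on the noticeboard: all tonnage over employee hours only.
reported = tonnes_per_hour(tonnes_poured, employee_hours)

# A version that fits reality better: saleable tonnage over all hours worked.
honest = tonnes_per_hour(tonnes_poured - tonnes_scrapped,
                         employee_hours + contractor_hours)

print(f"reported: {reported:.1f} t/h, honest: {honest:.1f} t/h")
```

With these made-up inputs the reported figure comes out five times higher than the one that counts everybody's hours and only saleable steel, which is the kind of gap that stays invisible when nobody looks past the graph.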
Actually, I'd recommend something outside the business field. Someone's going to cry BASTARD LIBERAL ARTS MAJOR on this, but what about Foucault's Surveiller et punir: he traces the development of the modern concept of uniform regimentation -- and the assessment process that accompanies it -- in the prisons and schools that shape or re-shape populations since the nineteenth century. I'm not sure he gets it quite right, since he focuses on France and tries hard to pretend that Prussia doesn't exist, and the Prussians were really the ones who pushed "objective" assessments into fields that were a bad fit for numerical metrics and regimentation. There are fields that are good fits for Prussian assessment: unthinking factory line workers (the kind best replaced by robots), prisons, and the cannon-fodder parts of armies have benefitted enormously from basing rewards and punishment on metrics.
Stats by themselves will only ever be an indicator of what is happening. You really need trustworthy managers on the ground to give you feedback on how things are actually going.
Taking humans out of the loop when rating other humans is always a mistake
It said "windows 98 or better" so I installed Linux
I work for an MSP (Managed Service Provider). We account for time every 15 minutes. Inactive, internal department active, billable active, and non-billable active. All of this logging of time gets calculated out as metrics that define our bonus. So the outcome is pretty much as you've stated. But that's ok, we know how the metrics get calculated and thus we game the system of metrics without cheating our clients out of money. Naturally, it would be dishonest to do otherwise. But I'll be damned if I sit back and be judged and taken advantage of by some MBA that can't even interpret what those numbers are supposed to mean in the first place. They only need to know two things. Is the work billable to the client, and how much. They're free to speak to a manager if they wish to contest the hours performed and/or quality of work. The point is, we want their business. So it serves no point to lose clients for us.
It will get worse I hear. Rumor has it we will be timed every 5 minutes with a USB activity button. Sort of like a chess timer or some such. Also, our keyboards will be logged for activity and application fields will track mouse movement and other activity. It's absolutely nuts. At this rate, they'll need to hire me a secretary just to do the logging for me while I focus on actual work. Hey, now that's cost effective right? I bet they didn't think of that, did they. Doh!
Life is not for the lazy.
because there are no metrics used.
New Economic Perspectives
Be warned: my example is way off topic, but a pet statistic I keep track of.
There is no such thing as bad statistics, only bad layman statisticians who don't understand what the numbers actually measure.
Take lines of code, for example. Some people hate it because you can bloat the numbers by adding comments, neglecting to consider how useful those comments are for future maintenance, and thereby a useful application of a developer's time. If you use a consistent formatting style for two projects, you can get a fair grasp of their complexity from the line count, though that will gloss over details about how the code actually works.
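As a rough illustration of why a raw line count needs interpreting, here is a minimal sketch that splits a count into blank, comment-only, and code lines. The function name, the sample snippet, and the choice of `//` as the comment marker are all assumptions made for the example; it ignores block comments entirely, so treat it as a toy rather than a real line counter.

```python
def count_lines(source: str, comment_prefix: str = "//") -> dict:
    """Classify each line as blank, comment-only, or code.

    Assumption: only single-line comments starting with comment_prefix
    are recognised; block comments are not handled.
    """
    counts = {"blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            # A line with code plus a trailing comment counts as code.
            counts["code"] += 1
    return counts

# Hypothetical four-line snippet in a C-like language.
sample = """\
// Compute the total.
int total = 0;

for (int x : xs) { total += x; }  // trailing comment
"""

print(count_lines(sample))
```

Reporting the three numbers separately at least shows whether a jump in the total came from code or from comments, without pretending to say whether either was a good use of the developer's time.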
The most interesting pattern I've noticed in line counts over the years is that the use of templates and other code abstraction facilities really hasn't decreased the size of code much at all, though it's improved readability, maintainability, and programmer API usability substantially. So line counts only give you an approximation of complexity with a language like Java, but do nothing to measure the quality of the code.
One other thing I've found is that complex code looks fat and heavy from its sheer size, but often compiles to very reasonable executable size and runs rings around supposedly "tight" code that makes heavy use of dynamic techniques like introspection. As only one image of an executable is loaded by a reasonably competent OS, a fat binary does not mean a fat application at runtime.
Big code is only scary if it's not following recognizable patterns and is instead a mishmash of different developers' pet syntax, algorithms, style conventions, naming conventions, and even preferred APIs. If you manufacture it predictably, fat source code becomes a joy to maintain, enhance, and use.
But back to the core topic: help desk performance.
The only help desk stat I care about is a low number of customer complaint reports about the quality of information and assistance provided by the tech team. If it's my company and my budget, I'd rather hire more technicians to handle the load and produce happy customers in the end than save money by overworking them and burning them out over useless numbers like "calls handled per week."
In the end, if you care about your business, the only thing that truly matters is happy customers who want more services or products in the future, and who will gladly tell others about their good experiences in dealing with you.
There is no substitute for a good word-of-mouth reputation and repeat business. No one ever got fired for buying IBM not because they're perfect, but because their people will go the extra mile to make things work.
I do not fail; I succeed at finding out what does not work.
Support... ... also means 'helping you set things up right', 'helping you optimize your configuration', 'helping you figure out what tool you need for the job at hand', and so on.
Worked at a support center... I was a "talk to them until they understand" guy, playing the long game... I figured while it might not take every time, if I got people to understand, they could get back to work and not break things for just a little bit longer. You know, it costs two people money if they have to talk to me while I help them.
One of my coworkers got huge amounts of management praise for processing lots and lots of cases... My management was too dumb to run numbers on how many callbacks he had, that the rest of us were fixing...
Yeah, sure I was spending too much time with each person, but half of my time was fixing this jerk's mistakes. There's probably some of that at every support center. It takes 10 minutes to fix a problem, but 5 minutes to get them to go away. You can look very busy by making them go away, if management isn't clever enough.
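Running the numbers on callbacks does not take much. Here is a minimal sketch, assuming each closed case can be tagged with whether the customer had to call back about the same problem; the record layout, the analyst names, and the counts are all hypothetical.

```python
from collections import defaultdict

def analyst_stats(cases):
    """Summarise closed cases per analyst.

    cases: iterable of (analyst, was_callback) pairs, where was_callback
    is True if the customer had to call back about the same problem
    after the case was closed (an assumed field, for illustration).
    """
    closed = defaultdict(int)
    callbacks = defaultdict(int)
    for analyst, was_callback in cases:
        closed[analyst] += 1
        if was_callback:
            callbacks[analyst] += 1
    return {
        analyst: {
            "closed": closed[analyst],
            "callback_rate": callbacks[analyst] / closed[analyst],
            "resolved": closed[analyst] - callbacks[analyst],
        }
        for analyst in closed
    }

# Hypothetical week: one analyst closes cases fast, the other fixes them.
cases = (
    [("fast", True)] * 10 + [("fast", False)] * 10
    + [("thorough", True)] * 1 + [("thorough", False)] * 9
)

stats = analyst_stats(cases)
for name, row in stats.items():
    print(name, row)
```

On cases closed alone, "fast" looks twice as productive. Once callbacks are subtracted the two are nearly level, and that is before counting the time somebody else spent cleaning up.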
I'm rather happy with my new position... I get to review other people. And I do it fairly.
Anyone worth their salt will look at downtime, stability, and resolutions before they look at resolution time.
Ahh so number of resolutions are better than resolution time are they?
My recent experience with our IT call centre at work (the company is top 10 in the Fortune 500): I needed access to a shared drive. We've been asked to email the call centre with specific, detailed requests if we know exactly what the problem is, to save the phone support for problem identification sessions.
Anyway my email went along the lines of: "I have recently for some unknown reason lost access to network share with no explanation. Access to the drive is necessary for "
I got not 1 but 3 replies from the service centre:
Email 1 (autogenerated): Your Incident INC xxxxxxx1 has been raised for access to a network share.
Email 2 (typed): Dear User, Requests for access to network shares need to go through .
Email 3 (autogenerated): Your Incident INC xxxxxxx1 has been closed with successful resolution.
Errr no it hasn't. They generated a case number and closed it successfully but I still have no access to the network folder. Anyway off I go to the other system and request access through it. I get 3 emails again:
Email 1 (autogenerated): Your request has been received and has been forwarded to the service centre.
Wait for real? The same schmucks who I just requested this through and been sent away get the request?
Email 2 (autogenerated): Incident INC xxxxxxx2 has been raised for access to a network share.
Email 3 (autogenerated): Incident INC xxxxxxx2 has been closed with successful resolution, please wait up to 1 hour for permissions to propagate.
Well, there you go: two incidents were raised and closed, and in neither case was the end user asked if it actually worked. I wonder what would have happened if they had given me read-only access instead of read-write as per my original email, given that the system we were supposed to use didn't specify. I guess that would have raised a 3rd case.
It should be remembered that efficiency and effectiveness generally are unrelated.
Efficiency is something that can be measured: responses to calls, forms processed, etc., the sort of thing you can count. It's pretty easy to do this sort of thing, and often the PHBs will take some metric and use it as a measure of activity. Because of this, one often sees things like performance indicators, and the process, and often salary, becomes connected to the indicator. The industry stops being what it is and starts producing 'red beans' for the bean counters. The indicator changes, and one produces blue beans.
Effect is about getting the right job done, both for the customer and for the system. It's not even about what the customer wants, since this supposes that it is the role of the customer to diagnose the problem and the solution, and simply ask for the solution to happen. One need only think of the system that responded to Hurricane Katrina in New Orleans, where the response was based on customer wants rather than on pre-assessment by those who should have done it. A call for help is an indicator of a problem, not a proposed solution.
Of course, even though an indicator might be proportional to effect in the wild, when it is proportional to money, the indicator becomes more important than the effect. A doctor, who might have an indicator on consultations, will split several illnesses into several consultations. On a help desk, one is more intent on creating calls than on providing effect. A call that seeks help with three problems would be terminated at the first, and new calls needed for the second and third. Also, the process might be extended to several calls to create extra indicator traffic.
In the main, help desk traffic is not a really good indicator of effect, since there are things that affect it. Response time, time to fix, etc. all serve to alter traffic; in some cases, a problem might be better served by the section guru rather than the help desk. The effectiveness of the guru's solutions may well impede the help desk's overall issues, since it might make matters worse.
One should also note that recording the help calls is also an impediment. It serves no effect, and in many cases might take as much time as the call itself. One might answer, say, 90% of the calls first up, yet spend more than 50% of the time making the necessary beans for the counter. A good deal of issues can be condensed into a few batch files (yes, I did this: system configuration is a good candidate for script files), so that while the call is terminated relatively fast, the actual recording might be tedious.
My experience of help desk is that Microsoft programs in particular (e.g. Word, Access, Windows) use common names, which makes them very hard to grep for in the system. This reduces the effectiveness of any sort of 'search the job tables' for help. To this end, I used Wart, Abcess, Windoze, much to the annoyance of the PHBs.
OS/2 - because choice is a terrible thing to waste.
I worked in a helpdesk many years ago where we were all measured on the number of calls per week we closed. There was no consideration towards the complexity of the call given.
Our boss at the time started giving a $100 incentive to whoever closed the most calls. One of the guys in there consistently got the prize. One day, while looking up a call, I fat-fingered a digit and found myself looking at one of his tickets... it was a ticket, opened and closed, about receiving a phone call from X. $ticketnum +1 was the actual ticket for X.
In a nutshell, with some sorting/filtering I saw that the guy was not only gaming the system, but hiding the fact that he was grossly incompetent. I wrote everything up and showed it to our boss. Needless to say, he was less than happy, not only with this guy but with me. He was being pushed by his boss to generate metrics and was basically complicit.
Long story short, I went to his boss's boss, i.e. the CIO, and voiced my frustration. I pointed out the fallacy of this metric: me imaging a laptop (which back then took hours) and answering the phone both counted as the same measure of productivity, which made the metric useless. Not to mention the fact that it provided zero incentive to provide better support, just incentive to close tickets.
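A minimal sketch of that fallacy, assuming each ticket carries a category and each category has a rough effort estimate. The categories, the minute values, and the analyst names are hypothetical, and any real weighting scheme could of course be gamed in its own way.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    analyst: str
    kind: str

# Hypothetical effort estimates in minutes per kind of ticket.
EFFORT_MINUTES = {
    "phone_call_logged": 3,   # a ticket that only records "X called"
    "password_reset": 6,
    "laptop_imaging": 180,    # took hours back then
}

def summarize(tickets):
    """Return tickets closed and estimated effort per analyst."""
    summary = {}
    for t in tickets:
        row = summary.setdefault(t.analyst, {"closed": 0, "effort_minutes": 0})
        row["closed"] += 1
        row["effort_minutes"] += EFFORT_MINUTES[t.kind]
    return summary

# Hypothetical week for two analysts.
tickets = (
    [Ticket("prize_winner", "phone_call_logged")] * 10
    + [Ticket("prize_winner", "password_reset")] * 10
    + [Ticket("imager", "laptop_imaging")] * 2
)

summary = summarize(tickets)
for analyst, row in summary.items():
    print(analyst, row)
```

Ranked by tickets closed, the prize winner is ahead twenty to two. Ranked by estimated effort, the order flips, which is roughly what the more comprehensive analysis ended up showing.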
Obviously, this caused some huge changes. Not the least of which was a much more comprehensive analysis of what people were actually doing. This made quite a few people unhappy because it exposed them for being the incompetent hacks they were. Not the least of which were my boss at the time and that employee.
Yes Francis, the world has gone crazy.
The major fallacy many big companies fall into is that some of these systems have been running flawlessly for years, because they hired a competent IT staff. They look at the price of those paychecks and shiver. Why are we paying so many high priced engineers when we've never had a problem, they think.
So they reduce staff and start to rely on support contracts instead of on-site gurus. The gurus are still there to solve any oh-shit moments. But that back investment in good engineers has produced a stable infrastructure that runs with few problems for years. So they reduce staff more, pay for more support contracts, and eventually the system critical mass is greater than the engineers who can support it. It's no problem until it's a problem.
Eventually something minor goes wrong, but nobody notices or if they do it's not really their field of expertise so they don't understand it's minor now but could escalate. When it does, something else goes wrong, and a cascade effect takes out more and more systems. With a full staff, you have enough guys that when the critical mass is reached, they can start defensive measures and get things back in working order in no time. With support staff only, things are going wrong faster than they can deal with it.
"Call on our support contracts," shout the bosses! So now your on-site staff are all on hold instead of troubleshooting. When they get through to someone, they have to spend the first hour or two describing their infrastructure to the technician on the other end, who starts making random suggestions that maybe help, but probably don't.
My anecdote on this front is a company I used to work for. It's a long read, but demonstrates the failures at several levels which are the direct result of this kind of thinking. The Oracle transaction log disk was getting full. Some warnings came in, but disks running low on space was an everyday occurrence: we'll send an email to the person on record as being responsible for those servers, and troubleshoot why the "Executive Dashboard" is responding a bit slow today (it's for the execs, it's automatically high priority). Except that person is currently aboard an airplane on his way to help reduce staff in east Asia; he'll be incommunicado for the next 19 hours or more.
It seems like an innocent enough problem, it's just a log disk, the worst thing that could happen is we lose some logs, right? Whoops, transaction logs are pretty important for Oracle. The fact that the disk is filling up at all is itself an indicator that something bigger is wrong; this shouldn't happen. But critically once the disk does fill up, Oracle will enter read-only mode. Or it should. This time it doesn't, it shuts down. BOOM, offline. So down goes SAP. With SAP down, our entire business is offline. We can't take orders, we can't ship orders, we can't pay bills, we can't pay paychecks, the hourly workers whose shift is starting can't even clock in. Some buildings with tighter badge access can't even be entered unless someone inside opens an emergency door to let someone in.
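The point that a disk filling up at all is the real warning can be turned into something slightly more useful than a "low on space" alert: a projection of when it will be full. This is a minimal sketch under stated assumptions. The function name and the sample readings are hypothetical, growth is assumed to be roughly linear, and it has nothing to do with how Oracle itself monitors its logs. Real readings could come from something like Python's `shutil.disk_usage`.

```python
def hours_until_full(samples, capacity_bytes):
    """Project when a disk fills up, from usage samples.

    samples: list of (hours_elapsed, used_bytes), oldest first.
    Uses a straight line through the first and last sample.
    Returns None if usage is not growing over the sampled period.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0 or used1 <= used0:
        return None
    growth_per_hour = (used1 - used0) / (t1 - t0)
    return (capacity_bytes - used1) / growth_per_hour

GB = 10**9

# Hypothetical readings: a 100 GB log disk that went from 60 GB to 80 GB
# used over four hours.
readings = [(0, 60 * GB), (2, 70 * GB), (4, 80 * GB)]

remaining = hours_until_full(readings, capacity_bytes=100 * GB)
print(f"projected hours until full: {remaining}")
```

A projected time to full that is shorter than the 19 hours the responsible person will be unreachable is a different class of alert from "disk is low again", and arguably the one that should page somebody else.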
Once the transaction log disk was full, Oracle will no longer start up, it needs some space on the log disk to log startup-related transactions. Two hours on hold with Oracle Gold Pressed Latinum level support they finally get an engineer. Wow, this is something he's never seen before, Oracle should have gone into read-only mode before this happened! The only solution anyone can seem to think of is to get some bigger disks for the transaction logs, clone the data over to these new disks and give the startup another go. We have hot spares on a shelf, but nobody knows this. Finding disks requires a different support contract, they can have disks out to us tomorrow. Yeah, that's not going to cut it. Someone literally drives out to a distribution warehouse. Two more hours down (they actually send two different guys in different cars with instructions to take different routes in case one runs into traffic or gets in an acciden
Slay a dragon... over lunch!