How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
Simple enough.
Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.
Fix it..do whatever it takes, even if it means working over a weekend.
Write a good post mortem, explaining how th e fix is different from the original problem.
And hope to god that your management is understanding enough to keep you on.
This is comong from a guy, who in 1997 blew a $100,000 test weekend by kicking off the systems tests by loading the wrong generation of tapes.
I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.
Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend....sure I could f*k up, but I work hard to keep things right for the company.
wbs.
Huh?
True story. I was working an assignment as a tester for Microsoft. I apologize for the use of variables, rather than names, but I don't want to get sued for breaking NDL. There was a deadline on the release, and if we missed it, there was a penalty of $1 per copy shipped. 20 million copies were due to be shipped on date X. The day of date "X", we realize there's a fatal bug that causes Product "Y" to crash after running any segment that lasts longer than "Z" minutes. Somehow, I'd completely missed this bug. I have no idea how, don't ask, but I completely missed it. We even checked back 3 months worth of revs...the bug was sill there in each one. Of course, the product was late, costing Microsoft a whopping $20 million. What did I do?
I was "allowed" to resigned gracefully, quietly, and have learned a valuable lesson about software testing: It's not whether you miss something, it's whether or not someone else will find it in time to cost you your job. (nods sagely)
-The Libra
"Please be patient--The future will begin momentarily."
Funny you should mention that. According to the Chicago Tribune(subscribtion required),
...technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines. To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.
So was it Sun or Microsoft?? Or maybe Apple?
Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
Viv
Gmail invites for ip
Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc.... And this too was attributed to human error, but had far more drastic repurcusions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.
It's better to burn out than to fade away
Most project managers (especially ones with no technical experience... who shouldn't be let near a technical project) plan their projects with timelines with rose colour glasses. They assume there will be no coding issues discoverered in testing. Or worse, they do, but then let scope creap come into it, and borrow time from testing for the new items introduced in the scope creep. Bye bye testing time.
Mind you, I have also seen QA managers who believe that the testers only need to understand the software, and not the business where the software is to be used. This has sometimes leads to problems in end use. In any case, I tend to blame poor management before I blame the little guy. Projects like this are big enough that the process should have been able to catch things like this... unless the process was flawed.
My opinion... ready, set, slag away!
-- I ignore anonymous replies to my comments and postings.
I believe there are apocrophal stories about the guy who made a $27M and told his boss, "I guess I'm fired, huh?" and the response was, "No, I just spent $27M to educate you."
That, and the story from one of Tom Peters' books about the guy who rented a helicopter on the fly (intended pun) to get up to the top of a mountain to restore clientele service. I consider these to be things we'll never see, only hear about.
As you pointed out, QA should have caught something this basic. There had to be a lot of careless decisions made here, and none of them are necessarily any one coder's fault. Blaming a "coding error" is simple, and makes people forget that a manager didn't do their job correctly. I've seen this particular scenario played out a dozen times before:
Last Monday Suzy Manager shouted at her team, "The schedule says we install on July 18th, so this damned product damned well better be installed on July 18th, you all got that?!"
But the vendor's ship dates slipped, and testing dates got pushed back, even though there was nothing particularily important about July 18th; except for Suzy Manager's promise to the CIO that she'd get WhizBang 2.0 installed by July 18th. And she would, too -- she had 25 points on her review riding on that very promise.
By the 14th, when a new patched version arrived that fixed the bug they discovered on the 10th, Suzy was visibly distressed. "They damn well better have that transmit bug fixed, they've been dragging their feet long enough."
Perhaps the testers just kept testing the version from the 10th instead of upgrading to the version of the 14th. It was beautiful on Saturday, so maybe the tester called in with a bad case of 'weekend flu.' Perhaps they got the patch late Friday afternoon, and the vendor swore up and down that it was just one little bug, our guy knows it's fixed, don't worry, it's better now. Whatever -- Suzy was under the gun, so she simply said "ship it."
Regardless, some nameless coder is flapping in the breeze today. Suzy is probably running around the IT department at the Tribune screaming, "we'll never buy code from those bastards again, I swear!" in a vain attempt to deflect criticism from her department.
But the CIO usually knows better, and Suzy knows the CIO knows better, and she's already sent out her interview suit to the cleaners. Even so, she'll feign total surprise to her department as she boxes up the little wooden carving she picked up during a drinking cruise to Mazatlan a couple years ago. A couple of tears later, she's interviewing over at Microsoft Consulting Services.
Or, maybe I'm completely off the mark. Perhaps they've been testing the code for a month and it's worked fine, but they installed the new code with the old libraries, or the new libraries with the old code, or the destinations were SP2 with some new security turned on. Of course, the QA department should be testing the installation packages as well, but we all know that in hindsight, right? As Yogi Berra might once have said (were he an IT manager,) "In theory, there's no difference between the lab and production, but in production there is."
John
I write software for a company that handles $45,000,000+ of client cash every week.
A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.
Did I get stomped?
No.
A bottleneck had been identified, repaired, and eliminated!
Behold the power of positive thinking.
Writers imply. Readers infer.
I'm a programmer for a large, (US) national newspaper chain and screwing up the publication cycle is somewhat more common that you might think
We had a reporter screw up and drag a folder into the trash instead of the volume it was in (MacOS is absofuckinglutely retarded for having you unmount volumes by dragging them to the trash).
He went on with his business, and then around 5pm he emptied the trash. He suspected something was wrong when it was taking over 5 minutes to empty the trash.
Turns out the folder he trashed contained *all* the quark documents for the paper (the next day's stories and advance stories).
While there were backups, some people had to scramble to rewrite their stories. Paper was a little light the next day.
That's the problem with OS9 and OSX. The users need permission to delete stories in order to have permission to modify stories.
Wasn't that a Nietzsche quote? Sort of:
Money lost is money best spent, since it directly pays off into wisdom.
Just because I can imagine doing a hippopotamus, doesn't mean I'd like to do it.
exactly.
I can not count they number of battles I have fought just to get some time to design an emergency rollback plan.
I wish I had more balls to jump up in a emergency meeting and sream "I TOLD YOU TO GIVE ME A FEW DAYS SO I COULD DESGIN A ROLLBACK PLAN, ASSHOLE. BUT NOW ALL THE DATA CORRUPTED, AND WE CAN'T DO ANYTHING ABOUT IT BECASUE OF YOU!!"
Instead, I just keep a copy of the emails where I made the request and was denied, and then forward them to the CTO.
The Kruger Dunning explains most post on
This is pretty much what happened with the first launch of the shuttle. Remember when the Columbia was to first time lift off, and it was just around the final 10 count when they abandoned the mission due to a software error. The problem was then searched by many programmers to find what happened, and it was finally found by the guy who made the mistake! Of course this guy got a huge bonus for finding it, although no one seemed to care that he was the one that made it. But that's the life of a programmer :-)
Steven Rostedt
-- Nevermind
"The poor schmuck" will, in my experience, have spent the last 18 months hearing phrases like:
"Time / Quality / Functionality: Choose Two"
"You can't test quality into a system"
"Measure twice, cut once"
"We need to parallel run the UT system"
"Engineers shouldn't be testing their own code!"
"I wouldn't be using NT for that, mate"
and so on.
These are the words technical people use to warn management of impending doom. Managers on the other hand have other things to worry about like delivery dates, sales, penalty ratchets and so on. When the "go" decision was made it will have been made by senior managers who get paid the big bucks to take the big decisions and the big sh*t when it all goes pear shaped.
The question is how the management handled mitigation by way of backups to manual processing, rollbacks to the old system or risk analysis during project planning.
Automation of an entire printing plant is a big job and it is probable they planned for a failure as a worst case scenario and will just put the 1M loss down to experience.
I wish at was Friday, but I dont want to wish my life away. So I wish it was last Friday.
Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)
:)
:)
Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.
And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.
I've probably had a couple 1+ hour breakdowns, but this doesn't even compare to the time my buddies plant went down for three days x 2 shifts per day ($14M).
They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up and it got to the point where the guys would just open the gates when he drove up
My most recent one was when we changed the line speed of the skillet line and the thumbwheel switch messed up and opened up the 8's bit in the ten's digit (faulty thumbwheel switch) so that instead of running at 42 jobs an hour it was trying to run at 80 JPH (it would have tried to run at 122 but it's limited in the software to 80 JPH)
Zoom zoom.
Oh wait, that's the other guys
John
I dream in binary.
Not true - plenty of jobs where people on the ground are working with kit worth more than that. Easy for a forklift or truck driver to cause a lot of damage when moving stuff around.
Or say this incident - blamed on technicians...
Or say you were an air-traffic-controller... - how big a mistake do you want to make.
I work as a system administrator for a newspaper since 7 years back. 5 Years ago we were out-sourced to another company, my job stayed the same (save for extra work needed) but the decision paths and cost terms has changed a lot. -- More management, less money, cutting corners, less contact with customers has actually led to an increase in costs by 25% for the newspaper.
:)
For 5 years we have worked on cutting costs instead of doing what we originally did; produce a newspaper. This has led to a lot of cut corners, patchy systems and above all stupid decisions. Now we have to spend most of our time with our hands tied behind our backs because there's no way to prove a _direct_ profit we can put on the price-tag we show to a (non-technical) customer when we are suggesting a change. It's always cost > functionality.
Companies that only sell services to customers has no goal, does not work. There has to be something you produce, something to live for instead of just being a money making machine.
Management cannot be just management to be management. A good manager is someone involved working with something they have a passion for. My boss didn't create this newspaper, nor did the boss of the actual newspaper and they probably don't have a special interest in media, it's just a career pushing money making machine for them.
Oh, I guess this turned into a rant