How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
Simple enough.
Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.
Fix it. Do whatever it takes, even if it means working over a weekend.
Write a good post mortem, explaining how the fix is different from the original problem.
And hope to god that your management is understanding enough to keep you on.
This is coming from a guy who, in 1997, blew a $100,000 test weekend by kicking off the systems tests with the wrong generation of tapes loaded.
I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.
Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend... sure, I could f*k up, but I work hard to keep things right for the company.
wbs.
Huh?
Funny you should mention that. According to the Chicago Tribune (subscription required),
...technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines. To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.
So was it Sun or Microsoft?? Or maybe Apple?
Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
Viv
Gmail invites for ip
Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc. And this too was attributed to human error, but it had far more drastic repercussions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.
It's better to burn out than to fade away
Most project managers (especially ones with no technical experience... who shouldn't be let near a technical project) plan their project timelines through rose-coloured glasses. They assume no coding issues will be discovered in testing. Or worse, they do allow for them, but then let scope creep come into it, and borrow time from testing for the new items the scope creep introduced. Bye bye testing time.
Mind you, I have also seen QA managers who believe that the testers only need to understand the software, and not the business where the software is to be used. This sometimes leads to problems in end use. In any case, I tend to blame poor management before I blame the little guy. Projects like this are big enough that the process should have been able to catch things like this... unless the process was flawed.
My opinion... ready, set, slag away!
-- I ignore anonymous replies to my comments and postings.
I believe there are apocryphal stories about the guy who made a $27M mistake and told his boss, "I guess I'm fired, huh?" and the response was, "No, I just spent $27M to educate you."
That, and the story from one of Tom Peters' books about the guy who rented a helicopter on the fly (intended pun) to get up to the top of a mountain to restore clientele service. I consider these to be things we'll never see, only hear about.
I write software for a company that handles $45,000,000+ of client cash every week.
A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.
Did I get stomped?
No.
A bottleneck had been identified, repaired, and eliminated!
Behold the power of positive thinking.
Writers imply. Readers infer.
Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)
:)
:)
Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.
And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.
I've probably had a couple of 1+ hour breakdowns, but that doesn't even compare to the time my buddy's plant went down for three days x 2 shifts per day ($14M).
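For anyone who wants to check the $14M figure, here is a rough back-of-the-envelope sketch. The $5000 a minute, the three days and the two shifts per day come from the post; the 8-hour shift length is my assumption, since the post doesn't state it, and the names are made up for illustration.

```python
# Figures taken from the post: $5000 per minute, 3 days, 2 shifts per day.
VALUE_PER_MINUTE = 5000
DAYS = 3
SHIFTS_PER_DAY = 2
# Assumption (not stated in the post): each shift is 8 hours long.
HOURS_PER_SHIFT = 8


def downtime_cost(minutes, value_per_minute=VALUE_PER_MINUTE):
    """Value of production lost while the line is stopped."""
    return minutes * value_per_minute


outage_minutes = DAYS * SHIFTS_PER_DAY * HOURS_PER_SHIFT * 60
print(outage_minutes)                 # 2880
print(downtime_cost(outage_minutes))  # 14400000
print(downtime_cost(60))              # 300000 for a one-hour breakdown
```

Under that assumption the outage comes to $14.4M, which lines up with the roughly $14M quoted, and a single one-hour breakdown is about $300,000.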
They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up, and it got to the point where the guys would just open the gates when he drove up.
My most recent one was when we changed the line speed of the skillet line and a faulty thumbwheel switch opened up the 8's bit in the ten's digit, so instead of running at 42 jobs an hour it was trying to run at 80 JPH. (It would have tried to run at 122, but it's limited in the software to 80 JPH.)
Zoom zoom.
Oh wait, that's the other guys
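The thumbwheel fault above can be sketched roughly like this. It assumes each thumbwheel digit is read as four BCD bits weighted 8-4-2-1 and that the software simply clamps the result at 80 JPH; the function names are hypothetical and this is not the actual line-control code.

```python
MAX_JPH = 80  # software speed limit mentioned in the post


def decode_thumbwheel(tens_bits, ones_bits):
    """Naively decode two 4-bit thumbwheel digits (weights 8-4-2-1)."""
    return 10 * (tens_bits & 0x0F) + (ones_bits & 0x0F)


def commanded_speed(tens_bits, ones_bits, limit=MAX_JPH):
    """Decoded setpoint, clamped to the software limit."""
    return min(decode_thumbwheel(tens_bits, ones_bits), limit)


def is_valid_bcd(bits):
    """A healthy BCD digit never reads above 9."""
    return (bits & 0x0F) <= 9


good_tens = 0b0100            # tens digit 4
ones = 0b0010                 # ones digit 2
bad_tens = good_tens | 0b1000 # same digit with the 8's bit faulted on

print(decode_thumbwheel(good_tens, ones))  # 42
print(decode_thumbwheel(bad_tens, ones))   # 122
print(commanded_speed(bad_tens, ones))     # 80
print(is_valid_bcd(bad_tens))              # False
```

The clamp is what kept it at 80 instead of 122. A validity check like the last one would have flagged the bad digit outright, since 12 is not a legal BCD value.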
John
I dream in binary.