How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
Check out this link. Sorry, dude. Any of us could have done it.
> How Would You Handle a $1,000,000 Coding Error?
I would have to follow Dogbert's Top Secret Management Handbook, and take full responsibility for the bungle. That way when the next job comes up two or three rungs above me, I'll be at the top of the list of people with actual experience with massive projects, and it won't matter that it was a colossal screw-up because I will have jumped two or three pay-grades. Corporate fall-guys, if they take it right, always end up better off than quiet behind the scenes types.
So my advice is that you should take full responsibility and sharpen that resume, but be sure to make it known that you have learned from your mistakes and you worked hard to correct them. Nobody gets anywhere without making big blunders along the way. Be a good sport and you'll jump at least two pay grades for this blunder.
The dangers of knowledge trigger emotional distress in human beings.
... and blame it on Microsoft.
[SIG] It's like putting a moose in the blender -- a recipe for disaster!
Well, ok so that might not fly, but hey, it works when it's true if you work for a modestly forgiving employer...
;-)
Now if the cause was insufficient testing, well then QA has to answer for it.
And if there's no QA, well that's management's fault...
Now if it all comes down to dumb circumstances, it's poor planning on the paper's part for not testing it themselves.
That said, fess up. If worse comes to worst, you now have national infamy, and any fame is good fame, right??
-- (appended to the end of comments you post, 120 chars)
Where was the pre-install testing?
A good test should have identified some errors, especially if it blew up IMMEDIATELY.
Management frequently makes mistakes which cost much more. The difference is that their mistakes are not as easily identified or attributed to a single person.
The culprit should just admit it. Shit happens, it's unavoidable even if you take all precautions. Don't make the same mistake again, though.
Change your name, and switch to a "skills" based resume rather than an experience based one...
Karma: 0 (But I wield a mean +10 Vorpal Apathy)
Down, not across. (motto of alt.sysadmin.recovery referring to best method of slashing one's wrists).
Or the journalists that work at the outfit the link went to. Did you notice it took 3 of them to write that article? Talk about overstaffed.
With any large roll out, if only one person is at fault for a fiasco like this, then the project was mismanaged. They should have had a plan in place to back out the change.
Simple enough.
Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.
Fix it..do whatever it takes, even if it means working over a weekend.
Write a good post mortem, explaining how the fix is different from the original problem.
And hope to god that your management is understanding enough to keep you on.
This is coming from a guy who, in 1997, blew a $100,000 test weekend by kicking off the systems tests after loading the wrong generation of tapes.
I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.
Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend....sure I could f*k up, but I work hard to keep things right for the company.
wbs.
Huh?
Where was the phased or parallel deployment?
You don't just change a system like this in a weekend. There WILL be problems, so you have to have ways of dealing with them. Maybe that means flicking the switch back to the old system if it fails, or maybe it means running with degraded capacity for a while, but whatever it is, dead-in-the-water is not your Plan B.
Forget thrust, drag, lift and weight. Airplanes fly because of money.
Good planning would have had an abort procedure, so the show would go on. Everything changed should be undone if it did not work. They could figure it out after the paper was printed.
Errors are inevitable. Good planning and implementation keep you from falling on your face even when you publish seven days a week. It's not the coder's fault.
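For what it's worth, the abort procedure described above can be sketched in a few lines. This is only an illustration, not what the Tribune or its vendor actually ran: `run_upgrade`, the step names, and the `smoke_test` callback are all made up here. Each step carries its own undo, and if a step or the final end-to-end check fails, everything already applied is undone in reverse order.

```python
from typing import Callable, List, Tuple

# Hypothetical step format: (name, apply, undo). None of this comes from a real tool.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_upgrade(steps: List[Step], smoke_test: Callable[[], bool]) -> bool:
    """Apply steps in order; if anything fails, undo completed steps in reverse.

    Returns True if the upgrade stays in place, False if it was rolled back.
    A step that raises partway through is assumed to clean up after itself.
    """
    done: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, apply, undo in steps:
            apply()
            done.append((name, undo))
        # The end-to-end check, e.g. "can we actually send pages to the plant?"
        if not smoke_test():
            raise RuntimeError("smoke test failed")
        return True
    except Exception:
        for name, undo in reversed(done):
            undo()
        return False


# Toy demo: two upgrade steps, then an end-to-end check that fails.
state = {"servers": "old", "software": "old"}
steps = [
    ("servers", lambda: state.update(servers="new"), lambda: state.update(servers="old")),
    ("software", lambda: state.update(software="new"), lambda: state.update(software="old")),
]
ok = run_upgrade(steps, smoke_test=lambda: False)
print(ok, state)
```

In the demo the final check fails, so both steps are undone and `state` ends up back where it started. The paper goes out on the old system and the debugging waits until morning.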
Friends don't help friends install M$ junk.
How Would You Handle a $1,000,000 Coding Error?
Frankly, I can't believe anyone would pay $1M for a coding error. Hell, the guys I work with make coding errors all the time, and practically for free!
(That's free, as in beer.)
I'm a programmer for a large (US) national newspaper chain, and screwing up the publication cycle is somewhat more common than you might think.
Most daily newspapers produce various editions, between two and four, and I've seen a couple of times where only one edition was printed due to "coding errors" (like the 1 billion seconds from the epoch thing - my personal favorite).
Of course the vendor had to be called at the $500/hour emergency rate to fix their own error.
Once I saw a print pre-processor go off line because /dev/null was deleted and the backup system had been down for 6 mos., and it took out $50,000 - $100,000 in advertising.
They call daily newspapers "the daily miracle" and when you look at some of the computer band-aids they have producing them, you can see why.
You insufferable ass -- you just slashdotted Illinois.
May we never see th
As long as I keep checking in my code as someone else, I won't have to.
Show me on the doll where his noodly appendage touched you.
Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc.... And this too was attributed to human error, but had far more drastic repercussions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.
It's better to burn out than to fade away
Answer Key:
X = Will accept any date 1975-Present.
Z = *.*
Y = Will accept any product made in the history of Microsoft. The Fabric of Space-Time is also an acceptable answer.
Anyone else notice that there is a little footer with the "recycled" symbol and the phrase "printed on recycled paper"? It's a PDF. What happens if I print it out on non-recycled paper?
Bad news: We missed printing half of our papers.
Good news: Rainforest saved.
paintball
Software testing is boring boring boring. You have to try things out again and again after each change. Modules that haven't changed gain confidence in the face of changes and might not be tested, but omitting tests can end up being the Achilles heel. There can be an overwhelming desire when a project nears completion to just get things done and over with. After all the hard problems may well be solved and it's all down to seemingly inconsequential details.
These days programmers have a Sword of Damocles hanging over them. Once they finish a major piece of code they may have a hard time finding new work. The economy has not lived up to forecasts of more jobs. Outsourcing has reduced computer opportunities. Management of many companies do not see new uses for computers. Off-the-shelf programs abound for almost every aspect of computerized work.
Stress may distract software engineers enough that someone will make a major mistake.
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
Most project managers (especially ones with no technical experience... who shouldn't be let near a technical project) plan their project timelines through rose-coloured glasses. They assume there will be no coding issues discovered in testing. Or worse, they do, but then let scope creep come into it, and borrow time from testing for the new items introduced in the scope creep. Bye bye testing time.
Mind you, I have also seen QA managers who believe that the testers only need to understand the software, and not the business where the software is to be used. This sometimes leads to problems in end use. In any case, I tend to blame poor management before I blame the little guy. Projects like this are big enough that the process should have been able to catch things like this... unless the process was flawed.
My opinion... ready, set, slag away!
-- I ignore anonymous replies to my comments and postings.
I write software for a company that handles $45,000,000+ of client cash every week.
A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.
Did I get stomped?
No.
A bottleneck had been identified, repaired, and eliminated!
Behold the power of positive thinking.
Writers imply. Readers infer.
Sure, but what happens when you screw up placing the lid on a cup of coffee?
Here is the full text of the article in the Tribune:
A story we never thought we'd print
By James Coates
Tribune computer columnist
Published July 19, 2004, 6:40 PM CDT
Nothing built by humans can go wrong in as many ways or with as nasty an outcome as a computer system.
The people who create the Chicago Tribune started relearning that fact about 4 p.m. Sunday when they noticed that nothing was getting through as they attempted to beam the stories, artwork and ads from Tribune Tower to the Freedom Center printing plant.
About 13 hours later, they finally started printing a 24-page version of Monday's Tribune that should have already been landing on their readers' porches.
It was a misfortune that most people in the news business don't ever expect to experience. Newspapers do not miss days -- and Monday was close.
The only time the Tribune failed to print was during the Great Chicago Fire of 1871. That time, the lesson was that nature can be fickle and dangerous.
Now, the paper has learned that the same goes for the computer technology that has graced the industry with unparalleled productivity since the 1990s.
Business computer systems are cobbled together as row upon row of workstations, each running an operating system based on an estimated 50 million lines of instructions. In turn, the worker bee desktop computers connect to the queen machines with their own millions of lines of code in a different language.
An endless nest of wires, cables and even radio signals move instructions at light speed between the central computer and the workstations. The main computer also talks to all the peripheral devices needed to accomplish the mission.
The peripherals can be banks of hard drives, storage bays, printers, scanners, cameras and specialty devices as diverse as a pager or a printing press several stories tall.
The certainty that each and every one of these massively complex systems will crash haunts the people charged with keeping this thoroughly digital world up and running.
Those people are engineers, and so they often reduce it to numbers.
An often quoted study by Carnegie Mellon University computer scientists studied 30,000 software programs and found five to six defects per 1,000 lines of code.
And this is for finished software sent to customers.
When writing new programs, there is typically a defect in every 10 lines of code. About a half dozen defects per 1,000 lines remain after a process of checking, rechecking, cross checking, testing, retesting and finger crossing.
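A rough sketch of what those rates imply. It assumes the study's per-1,000-line rate can be applied to the 50-million-line operating system figure quoted earlier, a calculation the article itself does not make; the function name is only illustrative.

```python
def latent_defects(lines_of_code: int, defects_per_1000_lines: int) -> int:
    """Estimated defect count at a given rate per 1,000 lines of code."""
    return lines_of_code * defects_per_1000_lines // 1000


OS_LINES = 50_000_000  # the article's estimate for one operating system

# Shipped software: five to six defects per 1,000 lines.
shipped_low = latent_defects(OS_LINES, 5)   # 250,000
shipped_high = latent_defects(OS_LINES, 6)  # 300,000

# Freshly written code: one defect per 10 lines, i.e. 100 per 1,000 lines.
fresh_per_1000 = latent_defects(1000, 100)  # 100

print(shipped_low, shipped_high, fresh_per_1000)
```

On those assumptions, a single workstation's operating system would carry something like a quarter of a million defects, before counting the servers or the publishing software.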
The hubris of computing becomes clear as one realizes that each of these errors in code branch out with instructions to millions of other lines of code. Quite often, they find pathways never before taken by that particular program.
Collisions occur on these pathways and trouble is spotted. Maybe it can be fixed or maybe technicians can only perform a "workaround" that can't be guaranteed.
Dick Malone, the Tribune's senior vice president and general manager, said that around 9:30 a.m. on Sunday technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines.
To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.
Malone noted that they checked and rechecked, tested and retested all day. Everything seemed to be working without a hitch. Then, they punched the button that was supposed to send all of the content for the newspaper to the printing plant.
Nothing arrived.
Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)
Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.
And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.
I've probably had a couple of 1+ hour breakdowns, but this doesn't even compare to the time my buddy's plant went down for three days x 2 shifts per day ($14M).
They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up and it got to the point where the guys would just open the gates when he drove up.
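The $14M number roughly checks out if you assume 8-hour shifts (the shift length is a guess, nothing above says so) and the same $5000 a minute at the other plant:

```python
def downtime_cost(minutes: int, dollars_per_minute: int = 5000) -> int:
    """Value of product not built while the line is stopped."""
    return minutes * dollars_per_minute


HOURS_PER_SHIFT = 8  # assumption, not stated anywhere above

one_hour_breakdown = downtime_cost(60)  # 300,000
three_day_outage = downtime_cost(3 * 2 * HOURS_PER_SHIFT * 60)  # 14,400,000

print(one_hour_breakdown, three_day_outage)
```

So a 1+ hour breakdown is on the order of $300K, and three days of two shifts lands right around $14M.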
My most recent one was when we changed the line speed of the skillet line and the thumbwheel switch messed up and opened up the 8's bit in the ten's digit (faulty thumbwheel switch) so that instead of running at 42 jobs an hour it was trying to run at 80 JPH (it would have tried to run at 122 but it's limited in the software to 80 JPH)
Zoom zoom.
Oh wait, that's the other guys
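For anyone wondering where 122 comes from, here is a sketch. It assumes the thumbwheel is plain BCD (four bits per digit) and that the decoder doesn't reject a "digit" bigger than 9. The function names are made up and this is not the actual PLC logic.

```python
MAX_JPH = 80  # the software limit mentioned above


def decode_bcd(tens_bits: int, ones_bits: int) -> int:
    """Naive BCD decode: no check that each digit is 0-9."""
    return tens_bits * 10 + ones_bits


def line_speed(tens_bits: int, ones_bits: int, limit: int = MAX_JPH) -> int:
    """Requested jobs per hour, clamped to the software limit."""
    return min(decode_bcd(tens_bits, ones_bits), limit)


intended = decode_bcd(0b0100, 0b0010)         # 4 and 2 -> 42
faulty_tens = 0b0100 | 0b1000                 # 8's bit stuck on: 4 becomes 12
requested = decode_bcd(faulty_tens, 0b0010)   # 12 * 10 + 2 -> 122
actual = line_speed(faulty_tens, 0b0010)      # clamped -> 80

print(intended, requested, actual)
```

The clamp is the only thing that stood between the line and 122 JPH. A decoder that refused any digit over 9 would have caught the bad switch outright.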
John
I dream in binary.
I bet I know why....
WTF? Over?