How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
Simple enough.
Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.
Fix it. Do whatever it takes, even if it means working over a weekend.
Write a good post mortem, explaining how the fix is different from the original problem.
And hope to god that your management is understanding enough to keep you on.
This is coming from a guy who in 1997 blew a $100,000 test weekend by kicking off the systems tests by loading the wrong generation of tapes.
I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.
Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend....sure I could f*k up, but I work hard to keep things right for the company.
wbs.
Huh?
True story. I was working an assignment as a tester for Microsoft. I apologize for the use of variables, rather than names, but I don't want to get sued for breaking NDA. There was a deadline on the release, and if we missed it, there was a penalty of $1 per copy shipped. 20 million copies were due to be shipped on date X. On the day of date "X", we realized there was a fatal bug that caused Product "Y" to crash after running any segment that lasted longer than "Z" minutes. Somehow, I'd completely missed this bug. I have no idea how, don't ask, but I completely missed it. We even checked back 3 months' worth of revs...the bug was still there in each one. Of course, the product was late, costing Microsoft a whopping $20 million. What did I do?
I was "allowed" to resign gracefully, quietly, and have learned a valuable lesson about software testing: It's not whether you miss something, it's whether or not someone else will find it in time to cost you your job. (nods sagely)
-The Libra
"Please be patient--The future will begin momentarily."
The article said that the problem was in transmitting the pages from the newsroom to the printing facility across town. I wonder if they could have used a removable hd and a motorcycle as a backup plan.
Funny you should mention that. According to the Chicago Tribune (subscription required),
...technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines. To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.
So was it Sun or Microsoft?? Or maybe Apple?
Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
Viv
Gmail invites for ip
Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc.... And this too was attributed to human error, but had far more drastic repercussions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.
It's better to burn out than to fade away
Most project managers (especially ones with no technical experience... who shouldn't be let near a technical project) plan their projects with timelines seen through rose-coloured glasses. They assume there will be no coding issues discovered in testing. Or worse, they do, but then let scope creep come into it, and borrow time from testing for the new items introduced in the scope creep. Bye bye testing time.
Mind you, I have also seen QA managers who believe that the testers only need to understand the software, and not the business where the software is to be used. This sometimes leads to problems in end use. In any case, I tend to blame poor management before I blame the little guy. Projects like this are big enough that the process should have been able to catch things like this... unless the process was flawed.
My opinion... ready, set, slag away!
-- I ignore anonymous replies to my comments and postings.
I believe there are apocryphal stories about the guy who made a $27M mistake and told his boss, "I guess I'm fired, huh?" and the response was, "No, I just spent $27M to educate you."
That, and the story from one of Tom Peters' books about the guy who rented a helicopter on the fly (intended pun) to get up to the top of a mountain to restore clientele service. I consider these to be things we'll never see, only hear about.
Absolutely.
A few years ago I worked for a publishing company that sold software to newspapers and magazines for publishing (mostly ad layout stuff). We became the re-seller of a piece of content management software that was being customized by us and installed (for the first time ever anywhere) at one of the larger magazines published by one of the largest mega-media companies.
We didn't just rush in headlong and try to install and run the software in production the first time. For a while the system ran in parallel with the production system as a proof of concept (just a few of the pages at the time). Then, when it was deemed ready, those few pages were published live out of the system (we still had other sources if it went bad).
The system worked as designed and we were able to publish the pages out of it. Unfortunately the software wasn't very useful or cost-effective, so the project was ultimately scrapped. Still, this is obviously the way to handle something like this: don't just rush headlong and detach your old software and systems for the new ones. Run them in parallel in a production environment... it's really the only way to be sure.
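For anyone who hasn't done a parallel run, here is a minimal sketch of the idea in Python. It is not what we actually ran; `parallel_run`, `safe_to_cut_over`, and the two render functions are hypothetical stand-ins. The old system stays authoritative, the new one gets the same input, and you only cut over when the new one has stopped disagreeing.

```python
def parallel_run(pages, legacy_render, candidate_render):
    """Render every page on both systems and collect the disagreements.

    The legacy output is what actually gets published; the candidate
    output is only compared against it.
    """
    mismatches = []
    for page_id, content in pages.items():
        expected = legacy_render(content)
        try:
            actual = candidate_render(content)
        except Exception as exc:
            # A crash in the new system is recorded, not fatal:
            # the legacy output is still available for printing.
            mismatches.append((page_id, f"candidate crashed: {exc!r}"))
            continue
        if actual != expected:
            mismatches.append((page_id, "output differs"))
    return mismatches


def safe_to_cut_over(mismatches):
    """Only retire the old system once a run comes back clean."""
    return not mismatches


# Hypothetical stand-ins for the two publishing systems.
def legacy_render(text):
    return text.upper()


def candidate_render(text):
    if "sports" in text:
        raise RuntimeError("transmit failed")
    return text.upper()


pages = {"A1": "front page", "C3": "sports scores"}
report = parallel_run(pages, legacy_render, candidate_render)
print(report, safe_to_cut_over(report))
```

The point of the design is that a failure in the candidate costs you a line in a report, not a day's edition.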
"In America, first you get the sugar, then you get the power, then you get the women..." -H. Simpson
As you pointed out, QA should have caught something this basic. There had to be a lot of careless decisions made here, and none of them are necessarily any one coder's fault. Blaming a "coding error" is simple, and makes people forget that a manager didn't do their job correctly. I've seen this particular scenario played out a dozen times before:
Last Monday Suzy Manager shouted at her team, "The schedule says we install on July 18th, so this damned product damned well better be installed on July 18th, you all got that?!"
But the vendor's ship dates slipped, and testing dates got pushed back, even though there was nothing particularly important about July 18th; except for Suzy Manager's promise to the CIO that she'd get WhizBang 2.0 installed by July 18th. And she would, too -- she had 25 points on her review riding on that very promise.
By the 14th, when a new patched version arrived that fixed the bug they discovered on the 10th, Suzy was visibly distressed. "They damn well better have that transmit bug fixed, they've been dragging their feet long enough."
Perhaps the testers just kept testing the version from the 10th instead of upgrading to the version of the 14th. It was beautiful on Saturday, so maybe the tester called in with a bad case of 'weekend flu.' Perhaps they got the patch late Friday afternoon, and the vendor swore up and down that it was just one little bug, our guy knows it's fixed, don't worry, it's better now. Whatever -- Suzy was under the gun, so she simply said "ship it."
Regardless, some nameless coder is flapping in the breeze today. Suzy is probably running around the IT department at the Tribune screaming, "we'll never buy code from those bastards again, I swear!" in a vain attempt to deflect criticism from her department.
But the CIO usually knows better, and Suzy knows the CIO knows better, and she's already sent out her interview suit to the cleaners. Even so, she'll feign total surprise to her department as she boxes up the little wooden carving she picked up during a drinking cruise to Mazatlan a couple years ago. A couple of tears later, she's interviewing over at Microsoft Consulting Services.
Or, maybe I'm completely off the mark. Perhaps they've been testing the code for a month and it's worked fine, but they installed the new code with the old libraries, or the new libraries with the old code, or the destinations were SP2 with some new security turned on. Of course, the QA department should be testing the installation packages as well, but we all know that in hindsight, right? As Yogi Berra might once have said (were he an IT manager,) "In theory, there's no difference between the lab and production, but in production there is."
John
I write software for a company that handles $45,000,000+ of client cash every week.
A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.
Did I get stomped?
No.
A bottleneck had been identified, repaired, and eliminated!
Behold the power of positive thinking.
Writers imply. Readers infer.
I'm a programmer for a large (US) national newspaper chain, and screwing up the publication cycle is somewhat more common than you might think.
We had a reporter screw up and drag a folder into the trash instead of the volume it was in (MacOS is absofuckinglutely retarded for having you unmount volumes by dragging them to the trash).
He went on with his business, and then around 5pm he emptied the trash. He suspected something was wrong when it was taking over 5 minutes to empty the trash.
Turns out the folder he trashed contained *all* the quark documents for the paper (the next day's stories and advance stories).
While there were backups, some people had to scramble to rewrite their stories. Paper was a little light the next day.
That's the problem with OS9 and OSX. The users need permission to delete stories in order to have permission to modify stories.
Wasn't that a Nietzsche quote? Sort of:
Money lost is money best spent, since it directly pays off into wisdom.
Just because I can imagine doing a hippopotamus, doesn't mean I'd like to do it.
exactly.
I cannot count the number of battles I have fought just to get some time to design an emergency rollback plan.
I wish I had more balls to jump up in an emergency meeting and scream "I TOLD YOU TO GIVE ME A FEW DAYS SO I COULD DESIGN A ROLLBACK PLAN, ASSHOLE. BUT NOW ALL THE DATA IS CORRUPTED, AND WE CAN'T DO ANYTHING ABOUT IT BECAUSE OF YOU!!"
Instead, I just keep a copy of the emails where I made the request and was denied, and then forward them to the CTO.
The Kruger Dunning explains most post on
Well, although can't say the guy did a great job... if the DB was so important, why was there not a regular backup?
You are pointing out two problems taking place simultaneously.
One is a minor human error, but it is obviously an unintended act.
NOT having a recent-enough backup IS a serious issue. This issue has been pending for, as you say, 6 weeks, and it is a critical issue (if the data is valuable as you seem to imply).
You do not go around deleting all entries in your DB for fun, but you know some software is going to go bananas on you one day and start messing up with your DB, whether it is in such an obvious form as deleting all the records or simply altering them all in a subtle way that takes a while to notice... (change all prices from euros to dollars?).
A successful project or business is much more than the sum of little individual acts. There is such a thing as planning for things going wrong. And in this day and age, a database backup is no longer a problem.
This is pretty much what happened with the first launch of the shuttle. Remember when the Columbia was due to lift off for the first time, and it was just around the final 10 count when they abandoned the mission due to a software error? Many programmers searched for the problem, and it was finally found by the guy who made the mistake! Of course this guy got a huge bonus for finding it, although no one seemed to care that he was the one that made it. But that's the life of a programmer :-)
Steven Rostedt
-- Nevermind
"The poor schmuck" will, in my experience, have spent the last 18 months hearing phrases like:
"Time / Quality / Functionality: Choose Two"
"You can't test quality into a system"
"Measure twice, cut once"
"We need to parallel run the UT system"
"Engineers shouldn't be testing their own code!"
"I wouldn't be using NT for that, mate"
and so on.
These are the words technical people use to warn management of impending doom. Managers on the other hand have other things to worry about like delivery dates, sales, penalty ratchets and so on. When the "go" decision was made it will have been made by senior managers who get paid the big bucks to take the big decisions and the big sh*t when it all goes pear shaped.
The question is how the management handled mitigation by way of backups to manual processing, rollbacks to the old system or risk analysis during project planning.
Automation of an entire printing plant is a big job and it is probable they planned for a failure as a worst case scenario and will just put the 1M loss down to experience.
I wish it was Friday, but I don't want to wish my life away. So I wish it was last Friday.
Take a look at some K code (there are examples in the user manual) and then come back and say that. If K is too exotic, then try looking at some macro-heavy LISP code -- it has the same problem just slightly less so.
Code density can be good when you're trying to see the big picture (fewer screenfuls of code is a good thing in this case), but it can work against you when you're trying to understand the little details.
Regular expressions are nothing more than a hack to make up for the fact that generalized LR parsers were quite inefficient up until a few years ago. Just compare a reasonably complex regular expression to the BNF form of a grammar for parsing the same input to see how much easier GLR is to use -- you can see some examples of just how easy GLR parsing is to use here. And it can actually handle more general patterns with nesting, etc. I think regexes are really just a question of premature optimization -- with GLR you just start out with an incredibly readable and simple grammar, and if it proves to be slow (i.e. if there are lots of points of ambiguity along certain parse trees) you can optimize it towards a purely LR(k) grammar.
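To make the nesting point concrete, here is a small Python sketch. Note that it is not a GLR parser: it is a hand-written recursive-descent parser, swapped in only to keep the example self-contained, and the toy grammar and function names are made up for illustration. What it shows is that the code follows the BNF almost line for line, and that arbitrarily deep nesting falls out of the recursion, which a classical regular expression cannot express.

```python
# Toy grammar (BNF), invented for this example:
#   list  ::= "(" items ")"
#   items ::= item items | <empty>
#   item  ::= list | WORD


def tokenize(text):
    """Split the input into "(", ")" and bare words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_list(tokens, i):
    """Parse one `list` starting at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens) or tokens[i] != "(":
        raise ValueError(f"expected '(' at token {i}")
    i += 1
    items = []
    while i < len(tokens) and tokens[i] != ")":
        if tokens[i] == "(":
            # item ::= list  (the recursion is what handles nesting)
            subtree, i = parse_list(tokens, i)
            items.append(subtree)
        else:
            # item ::= WORD
            items.append(tokens[i])
            i += 1
    if i >= len(tokens):
        raise ValueError("unbalanced input: missing ')'")
    return items, i + 1  # skip the closing ")"


def parse(text):
    """Parse a whole string as exactly one `list`."""
    tokens = tokenize(text)
    tree, end = parse_list(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing input after the closing ')'")
    return tree


tree = parse("(a (b c) (d (e)))")
print(tree)
```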
HAND.
Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)
Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.
And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.
I've probably had a couple 1+ hour breakdowns, but this doesn't even compare to the time my buddy's plant went down for three days x 2 shifts per day ($14M).
They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up and it got to the point where the guys would just open the gates when he drove up.
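The $14M figure checks out if you make two assumptions that are not stated above: 8-hour shifts, and the same roughly $5000-a-minute line value at the other plant. A quick Python sketch of the arithmetic (the function name is just for illustration):

```python
def downtime_cost(days, shifts_per_day, hours_per_shift, dollars_per_minute):
    """Value of production lost while the line is stopped."""
    minutes_down = days * shifts_per_day * hours_per_shift * 60
    return minutes_down * dollars_per_minute


# Assumed: 8-hour shifts and $5000/minute (neither is given for that plant).
cost = downtime_cost(days=3, shifts_per_day=2, hours_per_shift=8,
                     dollars_per_minute=5000)
print(f"${cost:,}")  # 3 * 2 * 8 * 60 * 5000 = 14,400,000
```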
My most recent one was when we changed the line speed of the skillet line and a faulty thumbwheel switch opened up the 8's bit in the ten's digit, so that instead of running at 42 jobs an hour it was trying to run at 80 JPH (it would have tried to run at 122, but it's limited in the software to 80 JPH).
Zoom zoom.
Oh wait, that's the other guys
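For the curious, the thumbwheel numbers above work out like this. A thumbwheel switch presents each decimal digit as four BCD bits (8, 4, 2, 1). The sketch below is in Python rather than ladder logic, and it assumes the faulty bit reads as a 1 and that the controller computes 10 x tens + ones without checking that each digit is 0-9; the function name is made up.

```python
def commanded_jph(tens_bits, ones_bits, limit=80):
    """Decode two BCD digits into a line speed, then apply the software limit.

    Returns (raw_value, clamped_value). No check that each digit is 0-9,
    which is the assumed weakness that lets a stuck bit through.
    """
    raw = 10 * tens_bits + ones_bits
    return raw, min(raw, limit)


# Intended setting: 42 JPH -> tens digit 4 (0100), ones digit 2 (0010).
good = commanded_jph(0b0100, 0b0010)

# Fault: the 8's bit of the tens digit also reads as set -> 1100 = 12.
bad = commanded_jph(0b0100 | 0b1000, 0b0010)

print(good)  # (42, 42)
print(bad)   # (122, 80): 12 * 10 + 2 = 122, clamped to the 80 JPH limit
```

The clamp is the only thing that stood between a bad switch and 122 JPH; rejecting any digit above 9 would have caught the fault outright.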
John
I dream in binary.
Not true - plenty of jobs where people on the ground are working with kit worth more than that. Easy for a forklift or truck driver to cause a lot of damage when moving stuff around.
Or say this incident - blamed on technicians...
Or say you were an air-traffic-controller... - how big a mistake do you want to make.
I have worked as a system administrator for a newspaper for 7 years. 5 years ago we were out-sourced to another company; my job stayed the same (save for extra work needed) but the decision paths and cost terms have changed a lot. -- More management, less money, cutting corners, less contact with customers has actually led to an increase in costs by 25% for the newspaper.
:)
For 5 years we have worked on cutting costs instead of doing what we originally did; produce a newspaper. This has led to a lot of cut corners, patchy systems and above all stupid decisions. Now we have to spend most of our time with our hands tied behind our backs because there's no way to prove a _direct_ profit we can put on the price-tag we show to a (non-technical) customer when we are suggesting a change. It's always cost > functionality.
A company that only sells services to customers has no goal; it does not work. There has to be something you produce, something to live for instead of just being a money-making machine.
Management cannot be just management to be management. A good manager is someone involved in working with something they have a passion for. My boss didn't create this newspaper, nor did the boss of the actual newspaper, and they probably don't have a special interest in media; it's just a career-pushing, money-making machine for them.
Oh, I guess this turned into a rant.