Slashdot Mirror


How Would You Handle a $1,000,000 Coding Error?

theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"

39 of 878 comments

  1. Just one by kalidasa · · Score: 5, Funny

    Check out this link. Sorry, dude. Any of us could have done it.

    1. Re:Just one by wo1verin3 · · Score: 5, Informative

      Google Cache as per your request.

    2. Re:Just one by TastyWords · · Score: 5, Interesting

      I believe there are apocryphal stories about the guy who made a $27M mistake and told his boss, "I guess I'm fired, huh?" and the response was, "No, I just spent $27M to educate you."

      That, and the story from one of Tom Peters' books about the guy who rented a helicopter on the fly (intended pun) to get up to the top of a mountain to restore clientele service. I consider these to be things we'll never see, only hear about.

    3. Re:Just one by kegwell · · Score: 5, Funny

      Ehh..hopefully he lives on the side of town that the paper will get delivered on because he will definitely need the classifieds to look for a new job.

    4. Re:Just one by Nefarious+Wheel · · Score: 5, Informative

      The book was "Big Blues", a NYT columnist's documentation of IBM's travails around the days of the rise of Microsoft. Speaker was TJ Watson Jr. I think.

      --
      Do not mock my vision of impractical footwear
  2. Dogbert Strategy by mfh · · Score: 5, Funny

    > How Would You Handle a $1,000,000 Coding Error?

    I would have to follow Dogbert's Top Secret Management Handbook, and take full responsibility for the bungle. That way when the next job comes up two or three rungs above me, I'll be at the top of the list of people with actual experience with massive projects, and it won't matter that it was a colossal screw-up because I will have jumped two or three pay-grades. Corporate fall-guys, if they take it right, always end up better off than quiet behind the scenes types.

    So my advice is that you should take full responsibility and sharpen that resume, but be sure to make it known that you have learned from your mistakes and you worked hard to correct them. Nobody gets anywhere without making big blunders along the way. Be a good sport and you'll jump at least two pay grades for this blunder.

    --
    The dangers of knowledge trigger emotional distress in human beings.
    1. Re:Dogbert Strategy by Pulse_Instance · · Score: 5, Insightful

      In my experience being honest about your mistakes and having the willingness to learn from them always pays off.

    2. Re:Dogbert Strategy by MrDelSarto · · Score: 5, Funny

      Reminds me of that often quoted story about Thomas Watson, head of IBM, when some executive made a bad decision that ended up costing $10 million. The guy comes in and says "I suppose you'll want my resignation now" and Watson replies something like "Are you crazy! I just spent $10 million educating you!"

  3. Do as any knee-jerk slashdotter would... by Jad+LaFields · · Score: 5, Funny

    ... and blame it on Microsoft.

    --
    [SIG] It's like putting a moose in the blender -- a recipe for disaster!
    1. Re:Do as any knee-jerk slashdotter would... by VivianC · · Score: 5, Interesting

      Funny you should mention that. According to the Chicago Tribune (subscription required),

      ...technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines. To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.

      So was it Sun or Microsoft?? Or maybe Apple?

      Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.

      --
      Viv

      Gmail invites for ip
  4. It's my first week! by Fubar420 · · Score: 5, Insightful

    Well, ok so that might not fly, but hey, it works when it's true if you work for a modestly forgiving employer...

    Now if the cause was insufficient testing, well then QA has to answer for it.

    And if there's no QA, well that's management's fault...

    Now if it all comes down to dumb circumstances, it's poor planning on the paper's part for not testing themselves ;-)

    That said, fess up. If worst comes to worst, you now have national infamy, and any fame is good fame, right??

    --
    -- (appended to the end of comments you post, 120 chars)
    1. Re:It's my first week! by Soko · · Score: 5, Insightful

      I'm giving up moderation on this story to post this, so listen the fuck up.

      I work in newspapers, and have for the past 7 years. The blame for this fiasco should be pinned directly on the project manager. Not the coders, not the people trying to get the thing running, but the project manager. Right in the middle of his fucking forehead.

      I've torn the guts out of many newspaper networks upgrading or improving them, but never have I ever put anyone in the position of "If the new system doesn't work, we're fucked." I've always made ab-so-fucking-lutely certain there was a fall back position where the paper would hit the press. I actually had this conversation before:

      <Management weenie> What happens if this new server fails?
      <me> I haven't touched the old server. If the new one hiccups one whit, we fire up the old box and produce product.
      <Management weenie> I don't like that - we've spent a million bucks on the new gear. Delays make me look bad.
      <me> Well, if you're willing to man the phones when the advertisers call demanding re-prints of their ads because of human error somewhere, I have no problem with it.
      <Management weenie> You're an asshole. I could have you fired.
      <me> In this instance, I'm paid to be an asshole. You can't fire me for doing my job.
      <Management weenie> Heh. OK, we'll go with your plan.

      Not planning some way to get the paper on the press is dereliction of duty, and deserves your professional head to be lopped off.

      Is there _no_ professionalism anymore? Fuck, I should be paid more. Morons like that burn me - when you blow up a critical system with no backup, it's not just your livelihood, but that of everyone who depends on that system functioning as needed - it's their livelihood as well. Fucking morons.

      Soko

      --
      "Depression is merely anger without enthusiasm." - Anonymous
  5. Testing? by buff_pilot · · Score: 5, Insightful

    Where was the pre-install testing?

    A good test should have identified some errors, especially if it blew up IMMEDIATELY.

    1. Re:Testing? by ryen · · Score: 5, Insightful

      I agree.
      Blame the project manager (hopefully there was one) who should have led thorough testing of the services before deployment. Individual coders shouldn't be held to any legal liability.
      Any legal action should be directed towards the 'outside provider' (as noted in the article).

  6. 1 million is not that much by Anonymous Coward · · Score: 5, Insightful

    Management frequently makes mistakes which cost much more. The difference is that their mistakes are not as easily identified or attributed to a single person.

    The culprit should just admit it. Shit happens, it's unavoidable even if you take all precautions. Don't make the same mistake again, though.

  7. Only one thing to do now... by C60 · · Score: 5, Funny

    Change your name, and switch to a "skills" based resume rather than an experience based one...

    --
    Karma: 0 (But I wield a mean +10 Vorpal Apathy)
  8. advice to hapless code monkey by Jayfar · · Score: 5, Funny
    Any advice for the poor schmuck who's going to get the blame?

    Down, not across. (motto of alt.sysadmin.recovery referring to best method of slashing one's wrists).

  9. Or South Florida by LoztInSpace · · Score: 5, Funny

    Or the journalists that work at the outfit the link went to. Did you notice it took 3 of them to write that article? Talk about overstaffed.

  10. No one person should be at fault by David+Frankenstein · · Score: 5, Insightful

    With any large roll out, if only one person is at fault for a fiasco like this, then the project was mismanaged. They should have had a plan in place to back out the change.

  11. Fix it. by wideBlueSkies · · Score: 5, Interesting

    Simple enough.

    Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.

    Fix it..do whatever it takes, even if it means working over a weekend.

    Write a good post mortem, explaining how the fix is different from the original problem.

    And hope to god that your management is understanding enough to keep you on.

    This is coming from a guy who in 1997 blew a $100,000 test weekend by kicking off the systems tests by loading the wrong generation of tapes.

    I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.

    Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend....sure I could f*k up, but I work hard to keep things right for the company.

    wbs.

    --
    Huh?
  12. Deployment? by BiggerIsBetter · · Score: 5, Insightful

    Where was the phased or parallel deployment?

    You don't just change a system like this in a weekend. There WILL be problems, so you have to have ways of dealing with them. Maybe that means flicking the switch back to the old system if it fails, or maybe it means running with degraded capacity a while, but whatever it is, dead-in-the-water is not your Plan B.

    --
    Forget thrust, drag, lift and weight. Airplanes fly because of money.
  13. planning? by twitter · · Score: 5, Insightful
    A good test should have identified some errors, especially if it blew up IMMEDIATELY.

    Good planning would have had an abort procedure, so the show would go on. Everything changed should be undone if it did not work. They could figure it out after the paper was printed.
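
    An abort procedure of this kind can be sketched in C as a list of reversible steps: if any step fails, the completed ones are undone in reverse order. This is only an illustration. The struct, the function names and the demo steps are all hypothetical, and nothing here reflects how the Tribune's upgrade was actually scripted.

```c
#include <stddef.h>

/* One reversible step of a hypothetical upgrade plan. */
struct step {
    int  (*apply)(void);   /* returns 0 on success, nonzero on failure */
    void (*undo)(void);    /* restores the state from before apply() */
};

/* Applies the steps in order. On the first failure, undoes the steps
 * that already succeeded, newest first, and returns -1. The failing
 * step is assumed to have left nothing behind. Returns 0 if every
 * step succeeds. */
int run_upgrade(const struct step *steps, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (steps[i].apply() != 0) {
            while (i > 0)
                steps[--i].undo();
            return -1;
        }
    }
    return 0;
}

/* Demo state: how many steps are currently in effect. */
static int installed = 0;

static int  install_ok(void)   { installed++; return 0; }
static int  install_fail(void) { return 1; }
static void uninstall(void)    { installed--; }

/* Runs a plan whose third step fails. Returns 1 if the upgrade was
 * reported as failed and the first two steps were rolled back. */
int demo_failed_upgrade(void)
{
    const struct step plan[] = {
        { install_ok,   uninstall },
        { install_ok,   uninstall },
        { install_fail, uninstall },
    };
    installed = 0;
    int rc = run_upgrade(plan, sizeof plan / sizeof plan[0]);
    return rc == -1 && installed == 0;
}
```

    The point of the design is that the old state is reachable at every moment, so the paper can still go out while the failure is investigated.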

    Errors are inevitable. Good planning and implementation keep you from falling on your face even when you publish seven days a week. It's not the coder's fault.

    --

    Friends don't help friends install M$ junk.

  14. Very carefully! by YouHaveSnail · · Score: 5, Funny

    How Would You Handle a $1,000,000 Coding Error?

    Frankly, I can't believe anyone would pay $1M for a coding error. Hell, the guys I work with make coding errors all the time, and practically for free!

    (That's free, as in beer.)

  15. More common than you think... by John+Whorfin · · Score: 5, Informative

    I'm a programmer for a large, (US) national newspaper chain and screwing up the publication cycle is somewhat more common than you might think.

    Most daily newspapers produce various editions, between two and four, and I've seen a couple of times where only one edition is printed due to "coding errors" (like the 1 billion seconds from the epoch thing - my personal favorite).
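
    For anyone who missed it: Unix time reached 1,000,000,000 seconds on September 9, 2001 (01:46:40 UTC), going from nine digits to ten. The post doesn't say how the vendor's code broke, so the following C sketch only assumes one common failure mode, comparing timestamps as text. Both function names are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buggy comparison: treats decimal timestamps as text.
 * It gives the right answer only while both strings have the same
 * number of digits. */
int is_later_text(const char *a, const char *b)
{
    return strcmp(a, b) > 0;
}

/* Numeric comparison: parses the digits before comparing. */
int is_later_numeric(const char *a, const char *b)
{
    return strtoll(a, NULL, 10) > strtoll(b, NULL, 10);
}

/*
 * "999999999"  is one second before the rollover,
 * "1000000000" is the rollover itself.
 *
 * is_later_text("1000000000", "999999999")    yields 0, because '1'
 *                                             sorts before '9'
 * is_later_numeric("1000000000", "999999999") yields 1
 */
```

    Under that assumption, anything sorted or expired by timestamp would suddenly see new items as older than everything already in the system.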

    Of course the vendor had to be called at the $500/hour emergency rate to fix their own error.

    Once I saw a print pre-processor go offline because /dev/null was deleted and the backup system had been down for 6 months; it took out $50,000 - $100,000 in advertising.

    They call daily newspapers "the daily miracle" and when you look at some of the computer band-aids they have producing them, you can see why.

  16. You Slashdotted Illinois by 0x0d0a · · Score: 5, Funny

    You insufferable ass -- you just slashdotted Illinois.

  17. I don't worry about it by pyrrhonist · · Score: 5, Funny
    How Would You Handle a $1,000,000 Coding Error?

    As long as I keep checking in my code as someone else, I won't have to.

    --
    Show me on the doll where his noodly appendage touched you.
  18. Bah by Sandman1971 · · Score: 5, Interesting

    Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc.... And this too was attributed to human error, but had far more drastic repercussions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.

    --
    It's better to burn out than to fade away
  19. Re:1 Million? That's nothing! by ebob9 · · Score: 5, Funny

    Answer Key:

    X = Will accept any date 1975-Present.
    Z = *.*
    Y = Will accept any product made in the history of Microsoft. The Fabric of Space-Time is also an acceptable answer.

  20. Re:My advice. by sTavvy · · Score: 5, Funny

    Anyone else notice that there is a little footer with the "recycled" symbol and the phrase "printed on recycled paper"? It's a PDF. What happens if I print it out on non-recycled paper?

  21. Bad News, Good News..... by raehl · · Score: 5, Funny

    Bad news: We missed printing half of our papers.

    Good news: Rainforest saved.

    1. Re:Bad News, Good News..... by killjoe · · Score: 5, Informative

      Actually that's not quite true. The big paper companies do have large forests that they try to manage but they cut trees much faster than they are being replenished. This is why there is relentless pressure to log the national forests. If the harvest from private acreage was sustainable they would never need to log the national forests.

      These days companies like Champion and Plum Creek are finding that it's more profitable to sell the logged areas than to replant them. For example in Maine and Montana.

      It's more profitable to sell land (especially waterfront land) and then log the federally subsidized national forests.

      Your tax dollars at work!

      --
      evil is as evil does
  22. Testing is Boring by PingPongBoy · · Score: 5, Insightful

    Software testing is boring boring boring. You have to try things out again and again after each change. Modules that haven't changed gain confidence in the face of changes and might not be tested, but omitting tests can end up being the Achilles heel. There can be an overwhelming desire when a project nears completion to just get things done and over with. After all the hard problems may well be solved and it's all down to seemingly inconsequential details.

    These days programmers have a Sword of Damocles hanging over them. Once they finish a major piece of code they may have a hard time finding new work. The economy has not lived up to forecasts of more jobs. Outsourcing has reduced computer opportunities. Management of many companies do not see new uses for computers. Off-the-shelf programs abound for almost every aspect of computerized work.

    Stress may distract software engineers enough that someone will make a major mistake.

    --
    Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
  23. Re:The Coder? Nothing... by theshowmecanuck · · Score: 5, Interesting
    In my experience, it probably was not totally the QA department's... or the coder's... fault either. It was probably shitty managers paying too little attention to the need to allocate sufficient time for QA and realistic testing environments.

    Most project managers (especially ones with no technical experience... who shouldn't be let near a technical project) plan their project timelines through rose-coloured glasses. They assume there will be no coding issues discovered in testing. Or worse, they do, but then let scope creep come into it, and borrow time from testing for the new items introduced in the scope creep. Bye bye testing time.

    Mind you, I have also seen QA managers who believe that the testers only need to understand the software, and not the business where the software is to be used. This sometimes leads to problems in end use. In any case, I tend to blame poor management before I blame the little guy. Projects like this are big enough that the process should have been able to catch things like this... unless the process was flawed.

    My opinion... ready, set, slag away!

    --
    -- I ignore anonymous replies to my comments and postings.
  24. Been there by Inthewire · · Score: 5, Interesting

    I write software for a company that handles $45,000,000+ of client cash every week.
    A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.

    Did I get stomped?
    No.

    A bottleneck had been identified, repaired, and eliminated!
    Behold the power of positive thinking.

    --


    Writers imply. Readers infer.
  25. Re:McDonald's by yiantsbro · · Score: 5, Funny

    Sure, but what happens when you screw up placing the lid on a cup of coffee?

  26. Tribune's version by Anonymous Coward · · Score: 5, Informative

    Here is the full text of the article in the Tribune:

    A story we never thought we'd print

    By James Coates
    Tribune computer columnist
    Published July 19, 2004, 6:40 PM CDT

    Nothing built by humans can go wrong in as many ways or with as nasty an outcome as a computer system.

    The people who create the Chicago Tribune started relearning that fact about 4 p.m. Sunday when they noticed that nothing was getting through as they attempted to beam the stories, artwork and ads from Tribune Tower to the Freedom Center printing plant.

    About 13 hours later, they finally started printing a 24-page version of Monday's Tribune that should have already been landing on their readers' porches.

    It was a misfortune that most people in the news business don't ever expect to experience. Newspapers do not miss days -- and Monday was close.

    The only time the Tribune failed to print was during the Great Chicago Fire of 1871. That time, the lesson was that nature can be fickle and dangerous.

    Now, the paper has learned that the same goes for the computer technology that has graced the industry with unparalleled productivity since the 1990s.

    Business computer systems are cobbled together as row upon row of workstations, each running an operating system based on an estimated 50 million lines of instructions. In turn, the worker bee desktop computers connect to the queen machines with their own millions of lines of code in a different language.

    An endless nest of wires, cables and even radio signals move instructions at light speed between the central computer and the workstations. The main computer also talks to all the peripheral devices needed to accomplish the mission.

    The peripherals can be banks of hard drives, storage bays, printers, scanners, cameras and specialty devices as diverse as a pager or a printing press several stories tall.

    The certainty that each and every one of these massively complex systems will crash haunts the people charged with keeping this thoroughly digital world up and running.

    Those people are engineers, and so they often reduce it to numbers.

    An often quoted study by Carnegie Mellon University computer scientists studied 30,000 software programs and found five to six defects per 1,000 lines of code.

    And this is for finished software sent to customers.

    When writing new programs, there is typically a defect in every 10 lines of code. About a half dozen defects per 1,000 lines remain after a process of checking, rechecking, cross checking, testing, retesting and finger crossing.

    The hubris of computing becomes clear as one realizes that each of these errors in code branch out with instructions to millions of other lines of code. Quite often, they find pathways never before taken by that particular program.

    Collisions occur on these pathways and trouble is spotted. Maybe it can be fixed or maybe technicians can only perform a "workaround" that can't be guaranteed.

    Dick Malone, the Tribune's senior vice president and general manager, said that around 9:30 a.m. on Sunday technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines.

    To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.

    Malone noted that they checked and rechecked, tested and retested all day. Everything seemed to be working without a hitch. Then, they punched the button that was supposed to send all of the content for the newspaper to the printing plant.

    Nothing arrived.

    Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.

  27. One-line CODE ERROR $60 million - AT&T phone c by mdrejhon · · Score: 5, Informative
    History....one line coding error cost $60 million!

    AT&T Failure of January 15, 1990

    Link 1, Link 2, Link 3

    On January 15, 1990, 114 switching nodes of the AT&T long distance system went down. The published cause of the crash was a bug in the failure recovery code of the switches. When a node crashed, it sent an "out of service" message to the neighboring nodes, which were supposed to re-route traffic around it. However, the bug (a misplaced "break" statement in C code) caused the neighboring nodes to crash themselves upon receiving the "out of service" message, and further propagate the fault by sending an "out of service" message to nodes further out in the network.
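
    The actual switch code isn't reproduced here, so the following is only a sketch of the general class of bug: in C, a "break" inside an "if" that sits inside a "switch" leaves the whole switch, not just the "if". The message types, the log counter and both function names are hypothetical.

```c
enum msg_type { MSG_OUT_OF_SERVICE, MSG_OTHER };

/* Hypothetical bookkeeping: number of messages written to a log. */
static int log_count = 0;

/* Buggy handler. Returns 1 if traffic was re-routed. */
int handle_buggy(enum msg_type type, int log_enabled, int log_full)
{
    int rerouted = 0;

    switch (type) {
    case MSG_OUT_OF_SERVICE:
        if (log_enabled) {
            if (log_full)
                break;      /* meant to skip logging only; this
                               actually exits the whole switch */
            log_count++;
        }
        rerouted = 1;       /* never reached when the log is full */
        break;
    default:
        break;
    }
    return rerouted;
}

/* Fixed handler: the condition is folded into the if, so no break is
 * needed to skip the logging. */
int handle_fixed(enum msg_type type, int log_enabled, int log_full)
{
    int rerouted = 0;

    switch (type) {
    case MSG_OUT_OF_SERVICE:
        if (log_enabled && !log_full)
            log_count++;
        rerouted = 1;
        break;
    default:
        break;
    }
    return rerouted;
}
```

    With logging enabled and the log full, handle_buggy returns 0 and the re-routing is silently skipped, while handle_fixed returns 1. The compiler accepts both without complaint, which is part of why this kind of mistake survives on a rarely exercised path.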

    The crash lasted 9 hours, while programmers searched for the cause of the bug. An estimated 60 thousand people were left without telephone service, and 70 million phone calls went uncompleted. AT&T estimates at least $60 million in lost revenue and damage to its reputation; reliability was a central point in AT&T's marketing campaign against other long distance providers at the time. The incidental damage to businesses that were unable to operate due to lack of telephone service is hard to estimate, but is presumably much larger. The public safety and national security implications of such a large telephone system outage are distressing as well.

    This fault happened despite fault-tolerant design principles which were present in the phone system's design. The nodes failed fast, reporting their outage to neighboring nodes, and there was enough redundancy in the system to route around the failures. The crashed nodes recovered quickly, rebooting themselves and coming back up; however, they would immediately crash because of the messages received from neighboring nodes. The failure happened on an error-recovery path, which is poorly tested. The presence of decentralized distributed control, necessary for scaling, allowed this failure to propagate. The outage demonstrates that a bug in the software can cause a widely correlated failure.

    The possibility of a malicious attack on the system was seriously investigated as a cause for the crash. The investigation came up dry, but most sources acknowledge that this accidental fault could have just as easily been activated on purpose by a knowledgeable attacker. The social implications are investigated in detail in Bruce Sterling's The Hacker Crackdown.
  28. Re: How Would You Handle a $1,000,000 Coding Error by Scud · · Score: 5, Interesting

    Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)

    Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.

    And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.

    I've probably had a couple 1+ hour breakdowns, but this doesn't even compare to the time my buddy's plant went down for three days x 2 shifts per day ($14M).

    They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up and it got to the point where the guys would just open the gates when he drove up :)

    My most recent one was when we changed the line speed of the skillet line and a faulty thumbwheel switch opened up the 8's bit in the ten's digit, so that instead of running at 42 jobs an hour it was trying to run at 80 JPH (it would have tried to run at 122, but it's limited in the software to 80 JPH).
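
    The arithmetic works out if the thumbwheel is read as two BCD digits, which is an assumption on my part, as are the function names: the tens digit 4 is binary 0100, and with the 8's bit also reading as set it becomes 1100, i.e. 12, giving 12 x 10 + 2 = 122 before the software limit kicks in. A small C sketch:

```c
/* Software speed limit mentioned in the post, in jobs per hour. */
#define MAX_JPH 80

/* Decodes a two-digit thumbwheel reading, one 4-bit value per digit.
 * The nibbles are not validated, so a tens nibble above 9 (as with a
 * stuck bit) simply produces a number above 99. */
int decode_thumbwheel(unsigned tens_bits, unsigned ones_bits)
{
    return (int)(tens_bits & 0xFu) * 10 + (int)(ones_bits & 0xFu);
}

/* Requested line speed after applying the software limit. */
int line_speed_jph(unsigned tens_bits, unsigned ones_bits)
{
    int jph = decode_thumbwheel(tens_bits, ones_bits);
    return jph > MAX_JPH ? MAX_JPH : jph;
}

/*
 * Healthy switch set to 42:      tens = 0x4, ones = 0x2
 * Faulty switch, 8's bit set:    tens = 0x4 | 0x8 = 0xC, ones = 0x2
 */
```

    Here decode_thumbwheel(0x4, 0x2) gives 42, decode_thumbwheel(0xC, 0x2) gives 122, and line_speed_jph(0xC, 0x2) gives 80, matching the numbers in the story.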

    Zoom zoom.

    Oh wait, that's the other guys :)

    John

    --
    I dream in binary.
  29. Re:You forgot... by fataugie · · Score: 5, Funny
    I'm sure I'm forgetting some.

    I bet I know why....

    --

    WTF? Over?