How Would You Handle a $1,000,000 Coding Error?


Posted by timothy on Monday July 19, 2004 @03:38PM from the wasn't-me dept.

theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"

More common than you think... by John Whorfin · 2004-07-19 15:57 · Score: 5, Informative

I'm a programmer for a large (US) national newspaper chain, and screwing up the publication cycle is somewhat more common than you might think.

Most daily newspapers produce several editions, between two and four, and a couple of times I've seen only one edition printed due to "coding errors" (like the 1 billion seconds from the epoch thing, my personal favorite).
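For anyone who missed the "1 billion seconds" thing: Unix time went from 9 digits to 10 in September 2001. The comment doesn't say what actually broke at the paper, so the snippet below is only a sketch of one plausible failure mode, namely timestamps compared as text instead of as numbers. All the names in it are made up for illustration.

```python
from datetime import datetime, timezone

# First Unix timestamp that needs ten decimal digits.
ROLLOVER = 1_000_000_000

# The wall-clock moment (UTC) at which the counter gained a digit.
rollover_utc = datetime.fromtimestamp(ROLLOVER, tz=timezone.utc)
print(rollover_utc.isoformat())

# Hypothetical bug: timestamps stored and compared as strings.
before = str(ROLLOVER - 1)  # "999999999"
after = str(ROLLOVER)       # "1000000000"

# Text comparison goes character by character, so "9..." sorts after "1...".
text_order_ok = before < after
# Numeric comparison gives the order you actually wanted.
int_order_ok = int(before) < int(after)

print("text order correct:", text_order_ok)
print("int order correct:", int_order_ok)
```

Anything that sorted or range-checked timestamps as text would suddenly see "new" items as older than everything already in the queue.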

Of course the vendor had to be called at the $500/hour emergency rate to fix their own error.

Once I saw a print pre-processor go offline because /dev/null was deleted and the backup system had been down for 6 months, taking out $50,000 - $100,000 in advertising.

They call daily newspapers "the daily miracle," and when you look at some of the computer band-aids they have producing them, you can see why.

Re:Just one by wo1verin3 · 2004-07-19 15:59 · Score: 5, Informative

Google Cache as per your request.

Re:Just one by Nefarious Wheel · 2004-07-19 17:22 · Score: 5, Informative

The book was "Big Blues", a NYT columnist's documentation of IBM's travails around the days of the rise of Microsoft. Speaker was TJ Watson Jr. I think.

--
Do not mock my vision of impractical footwear

Tribune's version by Anonymous Coward · 2004-07-19 17:46 · Score: 5, Informative

Here is the full text of the article in the Tribune:

A story we never thought we'd print

By James Coates
Tribune computer columnist
Published July 19, 2004, 6:40 PM CDT

Nothing built by humans can go wrong in as many ways or with as nasty an outcome as a computer system.

The people who create the Chicago Tribune started relearning that fact about 4 p.m. Sunday when they noticed that nothing was getting through as they attempted to beam the stories, artwork and ads from Tribune Tower to the Freedom Center printing plant.

About 13 hours later, they finally started printing a 24-page version of Monday's Tribune that should have already been landing on their readers' porches.

It was a misfortune that most people in the news business don't ever expect to experience. Newspapers do not miss days -- and Monday was close.

The only time the Tribune failed to print was during the Great Chicago Fire of 1871. That time, the lesson was that nature can be fickle and dangerous.

Now, the paper has learned that the same goes for the computer technology that has graced the industry with unparalleled productivity since the 1990s.

Business computer systems are cobbled together as row upon row of workstations, each running an operating system based on an estimated 50 million lines of instructions. In turn, the worker bee desktop computers connect to the queen machines with their own millions of lines of code in a different language.

An endless nest of wires, cables and even radio signals moves instructions at light speed between the central computer and the workstations. The main computer also talks to all the peripheral devices needed to accomplish the mission.

The peripherals can be banks of hard drives, storage bays, printers, scanners, cameras and specialty devices as diverse as a pager or a printing press several stories tall.

The certainty that each and every one of these massively complex systems will crash haunts the people charged with keeping this thoroughly digital world up and running.

Those people are engineers, and so they often reduce it to numbers.

An often quoted study by Carnegie Mellon University computer scientists studied 30,000 software programs and found five to six defects per 1,000 lines of code.

And this is for finished software sent to customers.

When writing new programs, there is typically a defect in every 10 lines of code. About a half dozen defects per 1,000 lines remain after a process of checking, rechecking, cross checking, testing, retesting and finger crossing.
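To put those rates in perspective, here is a back-of-the-envelope calculation. It applies the article's per-line defect figures to its 50-million-line operating system estimate, which is an extrapolation the article itself does not make, so treat the results as illustrative only. The function and constant names are hypothetical.

```python
# Figures quoted in the article.
OS_LINES = 50_000_000        # estimated lines in a workstation operating system
NEW_CODE_PER_1000 = 100      # "a defect in every 10 lines" = 100 per 1,000
SHIPPED_LOW_PER_1000 = 5     # "five to six defects per 1,000 lines"
SHIPPED_HIGH_PER_1000 = 6


def expected_defects(lines: int, defects_per_1000: int) -> int:
    """Scale a per-1,000-line defect rate up to a whole code base."""
    return lines * defects_per_1000 // 1000


fresh = expected_defects(OS_LINES, NEW_CODE_PER_1000)
shipped_low = expected_defects(OS_LINES, SHIPPED_LOW_PER_1000)
shipped_high = expected_defects(OS_LINES, SHIPPED_HIGH_PER_1000)

print(f"freshly written: {fresh:,} defects")
print(f"after testing:   {shipped_low:,} to {shipped_high:,} defects")
```

Under those assumptions, testing removes roughly 94 to 95 percent of the defects and still leaves a few hundred thousand behind.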

The hubris of computing becomes clear as one realizes that each of these errors in code branches out with instructions to millions of other lines of code. Quite often, they find pathways never before taken by that particular program.

Collisions occur on these pathways and trouble is spotted. Maybe it can be fixed or maybe technicians can only perform a "workaround" that can't be guaranteed.

Dick Malone, the Tribune's senior vice president and general manager, said that around 9:30 a.m. on Sunday technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines.

To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.

Malone noted that they checked and rechecked, tested and retested all day. Everything seemed to be working without a hitch. Then, they punched the button that was supposed to send all of the content for the newspaper to the printing plant.

Nothing arrived.

Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.

One-line CODE ERROR $60 million - AT&T phone c by mdrejhon · 2004-07-19 17:55 · Score: 5, Informative

History: a one-line coding error cost $60 million!

AT&T Failure of January 15, 1990

On January 15, 1990, 114 switching nodes of the AT&T long distance system went down. The published cause of the crash was a bug in the failure recovery code of the switches. When a node crashed, it sent an "out of service" message to the neighboring nodes, which were supposed to re-route traffic around it. However, the bug (a misplaced "break" statement in C code) caused the neighboring nodes to crash themselves upon receiving the "out of service" message, and to further propagate the fault by sending an "out of service" message to nodes further out in the network.

The crash lasted 9 hours, while programmers searched for the cause of the bug. An estimated 60 thousand people were left without telephone service, and 70 million phone calls went uncompleted. AT&T estimates at least $60 million in lost revenue and damage to its reputation; reliability was a central point in AT&T's marketing campaign against other long distance providers at the time. The incidental damage to businesses that were unable to operate due to lack of telephone service is hard to estimate, but is presumably much larger. The public safety and national security implications of such a large telephone system outage are distressing as well.

This fault happened despite fault-tolerant design principles which were present in the phone system's design. The nodes failed fast, reporting their outage to neighboring nodes, and there was enough redundancy in the system to route around the failures. The crashed nodes recovered quickly, rebooting themselves and coming back up; however, they would immediately crash because of the messages received from neighboring nodes. The failure happened on an error-recovery path, which is poorly tested. The presence of decentralized distributed control, necessary for scaling, allowed this failure to propagate. The outage demonstrates that a bug in the software can cause a widely correlated failure.
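The propagation described above can be sketched as a toy simulation. This is not the switch software (that was C, and the real bug was a misplaced "break"); it only models the reported behavior, where a neighbor that should have re-routed crashed instead and passed the message on. The six-node ring topology and all names are assumptions made up for the example, and the reboot-then-crash-again loop is not modeled.

```python
from collections import deque


def simulate(neighbors, first_failure, buggy):
    """Return the set of nodes that crash at least once."""
    crashed = set()
    pending = deque([first_failure])
    while pending:
        node = pending.popleft()
        if node in crashed:
            continue
        crashed.add(node)
        # A crashing node sends "out of service" to each neighbor.
        for peer in neighbors[node]:
            # Correct handler: the peer re-routes traffic and stays up.
            # Buggy handler: the peer crashes on receipt, then notifies
            # its own neighbors in turn.
            if buggy and peer not in crashed:
                pending.append(peer)
    return crashed


# Hypothetical topology: six switches in a ring.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

healthy = simulate(ring, first_failure=0, buggy=False)
cascade = simulate(ring, first_failure=0, buggy=True)
print("correct handler:", sorted(healthy))
print("buggy handler:  ", sorted(cascade))
```

With the correct handler a single failure stays a single failure; with the buggy one, the same single failure reaches every node the network can reach, which is the "widely correlated failure" the write-up is talking about.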

The possibility of a malicious attack on the system was seriously investigated as a cause for the crash. The investigation came up dry, but most sources acknowledge that this accidental fault could have just as easily been activated on purpose by a knowledgeable attacker. The social implications are investigated in detail in Bruce Sterling's The Hacker Crackdown.

Re:Bad News, Good News..... by killjoe · 2004-07-19 18:10 · Score: 5, Informative

Actually that's not quite true. The big paper companies do have large forests that they try to manage, but they cut trees much faster than they are being replenished. This is why there is relentless pressure to log the national forests. If the harvest from private acreage was sustainable they would never need to log the national forests.

These days companies like Champion and Plum Creek are finding that it's more profitable to sell the logged areas than to replant them, for example in Maine and Montana.

It's more profitable to sell land (especially waterfront land) and then log the federally subsidized national forests.

Your tax dollars at work!

--
evil is as evil does