How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
I'm a programmer for a large (US) national newspaper chain, and screwing up the publication cycle is somewhat more common than you might think.
Most daily newspapers produce several editions, between two and four, and a couple of times I've seen only one edition printed due to "coding errors" (like the 1 billion seconds from the epoch thing, my personal favorite), which takes out $50,000 - $100,000 in advertising.
Once I saw a print pre-processor go offline because /dev/null was deleted and the backup system had been down for 6 months.
Of course the vendor had to be called at the $500/hour emergency rate to fix their own error.
They call daily newspapers "the daily miracle" and when you look at some of the computer band-aids they have producing them, you can see why.
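For anyone wondering about the "1 billion seconds from the epoch" thing: Unix time went from nine digits to ten on September 9, 2001. The post doesn't say what actually broke, so the sketch below is only an assumption about one common way it bit people, namely code that compared or sorted timestamps as text. The variable names are made up for illustration.

```python
from datetime import datetime, timezone

BILLION = 1_000_000_000

# The moment Unix time reached ten digits.
rollover = datetime.fromtimestamp(BILLION, tz=timezone.utc)
print(rollover.isoformat())  # 2001-09-09T01:46:40+00:00

# Hypothetical failure mode: timestamps stored and sorted as strings.
before = str(BILLION - 1)   # "999999999"
after = str(BILLION)        # "1000000000"

# Text comparison looks at the first character, so "1..." sorts before "9...".
text_order = sorted([after, before])
# Numeric comparison keeps the real chronological order.
numeric_order = sorted([after, before], key=int)

print(text_order)     # ['1000000000', '999999999']
print(numeric_order)  # ['999999999', '1000000000']
```

With text ordering, the newest item suddenly looks like the oldest, which is enough to make a "send the latest version" job pick the wrong file.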
Google Cache as per your request.
The book was "Big Blues", a NYT columnist's documentation of IBM's travails around the days of the rise of Microsoft. Speaker was TJ Watson Jr. I think.
Do not mock my vision of impractical footwear
Here is the full text of the article in the Tribune:
A story we never thought we'd print
By James Coates
Tribune computer columnist
Published July 19, 2004, 6:40 PM CDT
Nothing built by humans can go wrong in as many ways or with as nasty an outcome as a computer system.
The people who create the Chicago Tribune started relearning that fact about 4 p.m. Sunday when they noticed that nothing was getting through as they attempted to beam the stories, artwork and ads from Tribune Tower to the Freedom Center printing plant.
About 13 hours later, they finally started printing a 24-page version of Monday's Tribune that should have already been landing on their readers' porches.
It was a misfortune that most people in the news business don't ever expect to experience. Newspapers do not miss days -- and Monday was close.
The only time the Tribune failed to print was during the Great Chicago Fire of 1871. That time, the lesson was that nature can be fickle and dangerous.
Now, the paper has learned that the same goes for the computer technology that has graced the industry with unparalleled productivity since the 1990s.
Business computer systems are cobbled together as row upon row of workstations, each running an operating system based on an estimated 50 million lines of instructions. In turn, the worker bee desktop computers connect to the queen machines with their own millions of lines of code in a different language.
An endless nest of wires, cables and even radio signals move instructions at light speed between the central computer and the workstations. The main computer also talks to all the peripheral devices needed to accomplish the mission.
The peripherals can be banks of hard drives, storage bays, printers, scanners, cameras and specialty devices as diverse as a pager or a printing press several stories tall.
The certainty that each and every one of these massively complex systems will crash haunts the people charged with keeping this thoroughly digital world up and running.
Those people are engineers, and so they often reduce it to numbers.
An often quoted study by Carnegie Mellon University computer scientists studied 30,000 software programs and found five to six defects per 1,000 lines of code.
And this is for finished software sent to customers.
When writing new programs, there is typically a defect in every 10 lines of code. About a half dozen defects per 1,000 lines remain after a process of checking, rechecking, cross checking, testing, retesting and finger crossing.
The hubris of computing becomes clear as one realizes that each of these errors in code branch out with instructions to millions of other lines of code. Quite often, they find pathways never before taken by that particular program.
Collisions occur on these pathways and trouble is spotted. Maybe it can be fixed or maybe technicians can only perform a "workaround" that can't be guaranteed.
Dick Malone, the Tribune's senior vice president and general manager, said that around 9:30 a.m. on Sunday technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines.
To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.
Malone noted that they checked and rechecked, tested and retested all day. Everything seemed to be working without a hitch. Then, they punched the button that was supposed to send all of the content for the newspaper to the printing plant.
Nothing arrived.
Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
Actually that's not quite true. The big paper companies do have large forests that they try to manage, but they cut trees much faster than they are being replenished. This is why there is relentless pressure to log the national forests. If the harvest from private acreage were sustainable, they would never need to log the national forests.
These days companies like Champion and Plum Creek are finding that it's more profitable to sell the logged areas than to replant them, for example in Maine and Montana.
It's more profitable to sell land (especially waterfront land) and then log the federally subsidized national forests.
Your tax dollars at work!
evil is as evil does