Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?
NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!
But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
I was in charge of ordering a leak correlation system for a water utility that I work for. The system I chose was not quite what we needed, but it worked. One week after the warranty expired, I dropped the correlator unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.
I unplugged the wrong thing in a datacenter once, which took 20k domains offline. Traced the cable from the machine to the wall two or three times before pulling, too..
They didn't have any cable management and only one border router..
Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.
I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.
I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.
Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.
I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transitioned over, the final rsync would be faster.
So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.
Next day everyone is working happily, everything is working smoothly, no worries.
Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.
So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.
Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).
In the land of the blind, the one-eyed man is kinky.
In 1993, I failed to file the US Patent on "A means of accessing a relational database via the Internet." If we'd known we could do it, CompuServe might still be around.
When I was 12 years old and hanging out on BBSs in 1989, I didn't realize dialing Gilroy from San Jose was long distance (Both were 408 area code). My parents were not pleased at the nearly $500 phone bill.
I made a calculation error that cost $10k per day. Took 9 months to straighten things out.
I later won an award for outstanding work.
Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
- rounding error when programming a timer in an embedded system, resulting in a baud rate that was 10% off and causing problems with several units shipped to customers
- overflow of an 8-bit counter, resulting in a serial protocol failing
Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
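For anyone who hasn't been bitten by these two yet, here is a minimal sketch of both bug classes. The 16 MHz clock, the 16-samples-per-bit UART model, and the function names are assumptions chosen for illustration, not details of the systems described above.

```python
def baud_divisor(clock_hz: int, baud: int, rounded: bool) -> int:
    """Divisor for a hypothetical UART that samples 16 times per bit."""
    exact = clock_hz / (16 * baud)
    # int() truncates toward zero; round() picks the nearest integer.
    return round(exact) if rounded else int(exact)


def baud_error_percent(clock_hz: int, baud: int, divisor: int) -> float:
    """Relative error (in percent) of the baud rate the divisor actually produces."""
    actual = clock_hz / (16 * divisor)
    return (actual - baud) / baud * 100.0


def next_count_8bit(count: int) -> int:
    """Increment a counter stored in 8 bits: 255 wraps around to 0."""
    return (count + 1) & 0xFF


if __name__ == "__main__":
    clock, baud = 16_000_000, 115_200  # assumed example values
    for rounded in (False, True):
        div = baud_divisor(clock, baud, rounded)
        err = baud_error_percent(clock, baud, div)
        print(f"rounded={rounded}: divisor={div}, error={err:+.1f}%")
    print("255 + 1 in 8 bits ->", next_count_8bit(255))
```

With these assumed numbers the exact divisor is about 8.68, so truncating and rounding land on different integers and produce noticeably different baud errors. The counter function shows the wrap from 255 to 0 that a protocol has to expect if it keeps sequence numbers in a single byte.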
Lost a slide for a 3rd-party client that was to be featured in a skateboarding magazine.
I think one of the coworkers stole it as I did not get along with them.
Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
Was fired not long after.
Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly, however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.
So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).
Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
Life, the Universe, and Everything... in my image.
digital signal processing chip from TI. The $750 (in 1986 dollars) wasn't the big deal. That the parts had serial numbers hand-lettered on them and I had to go back on the waiting list to get a replacement was.
I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.
I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary), and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine, basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell, performance dropped through the floor and customers started calling in about trades timing out. Long story short, turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet"
Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at a (also doomed) dot-com online supermarket.
On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.
I do not deploy Linux. Ever.
obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC card disk drives in the switches that were the backbone of the Internet when we kept the vendor engineering on the phone all day for a failed switch... and read the duty cycle of the drives to them, like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.
but I got a haircut indeed when we had to get our stuff out of a colocate that was shutting down. built a mirror data system for that in the new place, had the trunks up, costed over the traffic. then it was time to demanage and power down the old shelf. telcordia assigned a code to the new unit that was one letter different than the old one.
the good news is I got the new one back up in 20 minutes and they didn't stake me out over an anthill.
if this is supposed to be a new economy, how come they still want my old fashioned money?
We were writing a Unix program to parse transactions from some specialized terminals that read customer invoices and the checks that accompanied them, writing the transactions to digital tape to carry over to the mainframe system. During testing our tapes were compared to tapes generated by the legacy IBM system. Our team lead got a call from the customer liaison *early* one morning saying "Do you realize one of your batches was 5 MILLION DOLLARS SHORT?" Yes, she was shouting. Turns out that the $5 million transaction was the largest we'd ever tested with so far. All the others were less than $999,999. It was my bug - I'd put the sign nibble (half a byte) on top of the most-significant digit of the packed-decimal payment-amount field on the test tape, dropping that digit from the field. Trivial fix - I had just been auditing the relevant code the previous day.
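A rough sketch of why only the big transaction tripped it. Packed decimal stores two digits per byte with the sign in the final nibble (0xC for positive, 0xD for negative), so a 4-byte field holds 7 digits. The 4-byte whole-dollar field and the `clobber_top_digit` helper are assumptions for illustration; the real record layout isn't given above.

```python
def pack_decimal(value: int, num_bytes: int) -> bytes:
    """Encode an integer as packed decimal: two digits per byte, sign in the last nibble."""
    sign = 0xC if value >= 0 else 0xD
    num_digits = num_bytes * 2 - 1
    digits = str(abs(value))
    if len(digits) > num_digits:
        raise OverflowError(f"{value} needs more than {num_digits} digits")
    nibbles = [int(d) for d in digits.zfill(num_digits)] + [sign]
    # Pair up nibbles: even positions are high halves, odd positions are low halves.
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_decimal(field: bytes) -> int:
    """Decode a packed-decimal field back into an integer."""
    nibbles = [n for b in field for n in (b >> 4, b & 0x0F)]
    *digits, sign = nibbles
    value = int("".join(str(d) for d in digits))
    return -value if sign == 0xD else value


def clobber_top_digit(field: bytes) -> bytes:
    """Hypothetical stand-in for the bug's effect: the most-significant digit is lost."""
    return bytes([field[0] & 0x0F]) + field[1:]


if __name__ == "__main__":
    for amount in (999_999, 5_000_000):
        field = pack_decimal(amount, 4)
        damaged = clobber_top_digit(field)
        print(amount, field.hex(), "->", unpack_decimal(damaged))
```

Under these assumptions, any amount below a million has a zero in the top digit anyway, so losing that digit changes nothing and the test tapes match. The first 7-digit amount is the first one that comes out short.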
How many people will refrain from posting because the statute of limitations hasn't run out yet?
My worst IT disaster was suffering from a hard drive failure, click of death. I had warning of a few days of it, and I deliberately kept the pc on 24/7 instead of normal switch on/off, to make sure the drive stayed alive until its replacement arrived.
Obviously I had to turn the pc off to change the drive, it was not hot-swappable. When I powered the pc up, the old hard drive failed, didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.
Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, and heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my pc, and powered up.
The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.
Out of interest, the failed drive failed about three months before I would have done a forced drive change as a backup / failure prevention measure. I got lucky.
Take Nobody's Word For It.
I used to work as an SDH/DWDM admin. In the early 2000s, my colleague screwed up a major firmware update on an STM1/4 ADM, and I, as the senior admin (haha - I was in the first half of my twenties), had to drive up to the site, since the affected node was unresponsive to the management system. After many unsuccessful attempts to recover it, at about 3 am I decided to hard reboot the node, which caused it to boot from the corrupt firmware bank (it had two of those), which in turn erased all the configuration, including traffic connections (which are built very robustly, btw). Since the site was on a (relatively small) island and had only 2 ADMs at the time, I more or less cut off all communication with the mainland. By morning, I had managed to get my colleagues to ferry me another, fully fitted ADM (our last-resort backup scenario was to replace the entire node), but as it turned out, it had been fitted in a hurry with cards running different firmware (the entire network was in the middle of an upgrade process), which resulted in the same kind of useless "brick" I already had at hand. It was very cool, though, to fly at ~200km/h to the port and back in my sporty car to pick up the spare (not many police on the island, and I had a very good excuse). By the afternoon, my higher-up manager had mobilized a helicopter to personally deliver me a fully functional ADM, which we promptly swapped in, and we restored the configuration from backup. I still have a copy of the local newspaper's front page, praising how our company heroically saved the day by restoring the connection with the outer world.
At that time I was already able to make up excuses that would have made BOFH proud, which saved my ass.
I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.
I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.
Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.
Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
Training mission aborted. Much sheet metal work needed.
Actual repair cost? Unknown, but easily 5 figures if not more.
Working for a desktop publishing house in IT. Spent just under $4000 on 36 inch flat panel displays. Accidentally plugged in printer power cable. Immediately fried monitor. My boss was not happy. The internship did not go well the rest of the summer.
I was working as a Jr. Network admin, helping to install some new Cisco PoE switches to facilitate our building's move to VoIP phones. I aligned a brand new 48-port PoE switch slightly off when inserting it into the chassis, and bent the insanely-complex connector at the back of the card, rendering it unusable. Fortunately, we had a ridiculous service agreement with Cisco, and a new card arrived at our office within 4 hours. I distinctly remember buying burritos and beer for me and the Sr. admin to help make up for the fact that neither of us got to sleep that night.
Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.
The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.
2 months in, the new provider vanished, along with all of my data. I wasn't very concerned about the month's worth of money I had lost by not getting the 3 months I had paid for, I think it was only about $15. "Okay," I thought, "I'll just pull my data out of my nightly backups and move on." It turns out I forgot to adjust my local cron script that pulled the data over rsync to the new IP address. My backups had not been pulled in over 2 months.
Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.
I learned my lesson, though.
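One cheap guard against this kind of silent failure is a staleness check that runs separately from the backup job itself. The sketch below is only an illustration: it assumes the pull script touches a marker file after each successful run, and the marker path, the 36-hour threshold, and the function names are all made up for the example.

```python
import time
from pathlib import Path
from typing import Optional


def record_success(marker: Path) -> None:
    """Call this from the backup script only after the pull completed successfully."""
    marker.touch()


def backup_is_stale(marker: Path, max_age_hours: float = 36.0,
                    now: Optional[float] = None) -> bool:
    """True if the marker is missing or older than max_age_hours."""
    if not marker.exists():
        return True
    current = time.time() if now is None else now
    return (current - marker.stat().st_mtime) > max_age_hours * 3600


if __name__ == "__main__":
    marker = Path("last_backup_success")  # hypothetical location
    if backup_is_stale(marker):
        print("WARNING: no successful backup recorded recently")
```

Checking a marker rather than the backed-up files matters because rsync can preserve the source's modification times, so file timestamps alone don't say when the last pull happened. The check is only useful if something actually reads the warning, e.g. a separate cron entry that mails its output.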
The total cost was actually sweet FA in numbers terms, but I think I put the final nail in the company's coffin.
My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.
I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.
Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.
I will say, the Server was the problem.
It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as networked file storage. Inside the case were four HDDs.... A pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted, though the keys were long lost, along with a floppy drive and a CD Drive with no write ability. Or ability to read DVDs.
It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.
By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.
So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.
The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous because there just wasn't the spare space to hold two (No money to pay for it)
I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.
Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.
I just did what I thought I could to keep the Titanic afloat.
So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.
But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....
Unfortunately - that only works if the damaged disk decides to drop out of the array.
It didn't.
I find th
So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
Not selling the company for $250M because he wanted $300M during the dot-com boom. My boss personally owned about 30% of the company at this point.
The real "Libtards" are the Libertarians!
Pretty much all modern Intel CPUs from the past many years.
Now the programmers in the audience could probably think of like 10 different specific things that could be coded into the system to prevent that from happening, but this company didn't. Which really isn't too surprising. I asked one of the devs on the ground systems team if the ground systems were using GMT or UTC. His answer was "What's the difference?" I was able to infer from his answer that it was most likely GMT, and that did appear to be the case. Somewhere deep in the bowels of the system there was presumably some piece of code written by an Indian contractor with a math degree adjusting times for leap seconds, but it wasn't in any code that anyone knew about.
The early history of that company read like a Monty Python sketch. The first satellite exploded on the launch pad. The second satellite fell over and then exploded. The third satellite burned down, fell over, exploded and then sank into the swamp. The fourth satellite got into orbit and was promptly bricked by sending the wrong version of Windows(!) to it. To be fair they only had to do that because they launched it with the wrong version of Windows(!!) in the first place. One would think that ANY version of Windows would be the wrong version of Windows to shoot into space, but that's why you're not the head of a billion dollar satellite company.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
... too bad it was here :)
On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
But it's worth repeating in this context. Thankfully, it wasn't me.
When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.
Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help 4+ PM.
The techs had to be found, gathered and flown out from CA to disassemble said machine and reassemble. No wires until 1 or 2 PM Monday. Much money loss for all customers.
To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.
One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load testing function. I came up with a small design involving a contractor and some minor wiring changes and we were part way through implementing it on every UPS at this site.
This UPS was part of a redundant pair that fed an emergency shutdown system at an oil refinery. In between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue then started the process for the second one. After calling the control room to let them know they would receive an alarm, I switched off the UPS and was suddenly met with a stream of profanities over the radio.
We lost power to 80 field instruments, which triggered a fail-safe action on the shutdown system, tripping 4 units at the refinery. One of them was the FCCU, which is core to a lot of refinery processes. To add insult to injury, the unit was unable to be hot restarted because of a stuck valve, and it then thermally contracted, breaking off large chunks of coke from the overhead line which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days. I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.
Total cost of the outage was about $8 million. Fortunately only partially my fault.
Wow. Just, wow.
I knew a guy who did support for a multi million pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers because he did all the support remotely and it would be a 100 mile trip up to their office if the machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.
So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone actually got fired because they infected the server with a drive-by. Their mail server had a dodgy network card, but it took nearly a year to diagnose because he was terrified of updating the driver in case it didn't come back up, so that was just intermittently not responding or dropping incoming connections for over a year. The driver update fixed it in the end.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
I talked to someone recently who lost a day of science data from a UAV because the Windows system driving the instrument decided to auto update while in the air with something like a 56kbps data rate.
I recently built a field instrument and made it Linux based specifically to prevent things like that, as well as to keep power and latency down by being able to kill unnecessary background tasks.
Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).
I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".
Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.
Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all datasets in question. I lost very little work (only scripts were executed against these datasets and I still have all the scripts).
My real screw up?
Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.
I had become accustomed to my home system being deterministic in the order it listed things. My bad.
This is back at the very beginnings of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).
Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.
Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.
In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.
I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.
I nearly cost my employer several million by fixing a bug.
The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line. The parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, door panels, etc.) and when a stillage was full, or when a certain amount of time had passed since the first part was picked, a label was printed, applied to the stillage, and it was dispatched over the road to the factory.
Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.
I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.
Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.
The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!
I looked back over the code, and found that there were actually two very similar bugs in the code, one of which happened fairly regularly, and one which happened much more infrequently, but the same fix worked for both of them.
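For what it's worth, the roll-over itself is tiny once it is written down explicitly. This is only a sketch of the 9999-to-0001 wrap described above, not the original code: the constant and function names are made up, and it assumes 0000 is never a valid serial.

```python
SERIAL_MIN = 1      # assumed: 0000 is never issued
SERIAL_MAX = 9999   # four-digit serial, as described above


def next_serial(current: int) -> int:
    """Return the serial after `current`, wrapping 9999 back to 0001."""
    if not SERIAL_MIN <= current <= SERIAL_MAX:
        raise ValueError(f"serial out of range: {current}")
    return SERIAL_MIN if current == SERIAL_MAX else current + 1


def format_serial(serial: int) -> str:
    """Render the serial as a zero-padded four-digit string for the label."""
    return f"{serial:04d}"


if __name__ == "__main__":
    print(format_serial(next_serial(9998)))
    print(format_serial(next_serial(9999)))
```

Keeping the wrap in one function makes it easier to find every other place that assumes serials only ever increase, which is where a second, rarer variant of the same bug tends to hide.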
Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."
During a panel discussion with very senior technical leads, the question came up: "How many of you have made a $1,000,000 mistake?"
Every single one raised their hand. This was a very large semiconductor company, and everyone had been involved in at least one instance where bad masks were made because a check was skipped or a step was botched in the design flow.
I worked on a chip design where it took six design revs to get clean masks. All five of the prior revs had avoidable (human) errors during the design and build process.
Pay me now (in time running checks) or pay me later (in NRE: non-recurring engineering expense) for bad hardware.
Back in the 70's when I was still a junior electrical design engineer working for a distribution transformer company, we used algorithms loaded into TI calculators to compute the electrical, heat, and mechanical stresses. I later got the task of modernizing those codes and merging them with a FORTRAN code that another engineer had written and abandoned because it was too expensive to run. Things went well at first, we saved a lot of time and used that as any good engineer would to optimize our designs using different parameters to reduce cost and improve efficiency, both very important to my company and its customers. Then one day we got a limiting case which we didn't recognize at the time. As usual, one of our engineering assistants used the computer generated design and the old methods to validate the design. The engineer always takes responsibility for the design. After the build, the unit, a 3 phase unit that had 76,000 volt inputs, was tested in our "hi pot" chamber - a voltage pulse of the rated voltage but with reduced current and only for a short pulse. The center core winding turned into shards of copper spaghetti in the 8 foot tall tank. It cost $25,000 to repair, and delayed delivery for 3 weeks. My heart rate hit about 200 when the engineering manager called me and my supervisor into his office. Then he explained that he had run the calculations also, and discovered that our methods had a flaw in the prediction of the axial forces on the center coil. It was a very subtle mistake, and he said it could have been much worse. We were able to revise the code within a few hours, and that incident led to further improvements in methods and automation. It also taught me my most important lesson about computers - human error is the greatest risk. Real tests of your code sometimes do "blow up".
Every change is not progress, but there is no progress without change.
Underestimating time needed happens all the time in the software industry. It probably is worse in the gaming industry where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my job now where I've put in ~100 hour weeks to fulfill (telecommuting many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160 hour weeks in the office for a game release crunch (and no, that isn't all work - I slept on beanbag chairs in the testing room and they catered in meals, but at some point you're just so burned out and stinking of feet that you need a night sleeping at home and a long shower).
I can't think of any instance where I've cost a project, but I'm sure they exist. OTOH, I did have a workaround for a $5 million contract where the customer was going to reject our Linux port due to a bug I found and reported. The developer and pubs person assigned to the defect were laid off after 9/11, so the defect slipped through to the customer. Fortunately, I overheard a sales person talking about it and supplied the workaround, saving the contract.