Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?
NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!
But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
$32,000 in paper and postage
I was in charge of ordering a leak correlation system for the water utility I work for. The system I chose was not quite what we needed, but it worked. One week after the warranty expired, I dropped the correlator unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.
I unplugged the wrong thing in a datacenter once, which took 20k domains offline. Traced the cable from the machine to the wall two or three times before pulling, too..
They didn't have any cable management and only one border router..
Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.
Was a bit shy to speak to my doctor about my ED, so.... Yeah...
Biggest mistake was derping out and going sideways with the CPU while installing it. A bunch of pins in the CPU socket got bent. On a $300 motherboard.
I cost our Asian office a day's work after I failed to verify that a deployment completed successfully.
The deployment was done on Friday evening US time, which would have been around 1 or 2am UK time. I couldn't be bothered to stay up for that so figured that I'd check in the morning.
Naturally I forgot to do that.
Throughout the weekend whenever I was out, I'd suddenly remember and think "I'd better check that when I get back in."
Naturally, I forgot to do that.
On Monday morning, I received a lot of phone calls and emails asking where I was and to get into the office ASAP. When I got in, I found out that the deployment had failed and the rollback scripts that I'd asked the team to run had not been run.
After a lot of frantic phone calls, we found a DBA in the Asia office who still had database access to the Production servers and he rolled the changes back.
By then however, Asia had lost a whole day of work and I was given a written warning by my manager.
It's still a running joke amongst my friends that I "took out all of Asia for a day". And if I ever interview and I can see it's going badly, I tell this story in response to the "What's your weakest asset" question, just to see the look on their faces.
Broke an SLA: shut down the wrong mainframe. Still have the job.
Heh - would have to total all that up... sigh... but it still works!
Mark
I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.
uncountable losses, real and opportunity
I dropped a dime into an old AT one time and it hit a proprietary SCSI controller. It all worked out though. We replaced it with a 100 meg IDE and everyone was happier.
I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.
Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.
I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transitioned over, the final rsync would be faster.
So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.
Next day everyone is working happily, everything is working smoothly, no worries.
Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.
So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.
Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).
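One way to keep a nightly sync like this from firing after cutover is to have the job check for a "retired" marker before it does anything. A minimal sketch in Python, assuming a hypothetical marker file path and hypothetical source/destination arguments; the original cron entry and rsync options aren't given in the post, so everything named here is illustrative.

```python
import os
import subprocess

# Hypothetical marker: create this file on the old server at cutover time.
RETIRED_MARKER = "/etc/sync-retired"

def build_rsync_command(src, dest):
    # -a preserves permissions and timestamps; --delete is deliberately left out.
    return ["rsync", "-a", src, dest]

def nightly_sync(src, dest, marker=RETIRED_MARKER, run=subprocess.run):
    """Run the sync unless the marker exists. Returns True if rsync was invoked."""
    if os.path.exists(marker):
        # The old box was retired; refuse to push stale data anywhere.
        return False
    run(build_rsync_command(src, dest), check=True)
    return True
```

With this in place, powering the old machine back on to grab a few files would not restart the overwrite, because the marker survives the reboot. Removing the cron entry outright is still the simpler fix.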
In the land of the blind, the one-eyed man is kinky.
In 1993, I failed to file the US Patent on "A means of accessing a relational database via the Internet." If we'd known we could do it, CompuServe might still be around.
About $2 Trillion.
I worked for the Florida Electoral Commission back around 2000.
I have been part of a large mistake costing hundreds of thousands of dollars.
However, most mistakes are part of a chain of little mistakes that combine into a big one. For example, say someone trips over a cable and unplugs a production server. Then come the questions: why was the cable out where it could be tripped over, and who decided it wasn't worth the time and money to get a better system of cable management...
Normally a person will get fired for a mistake only if it was due to intentional misconduct, or if it gets political and someone is needed to take the blame. If it happens to you, be sure to put the blame back on the system (not an individual), and then follow up to fix the system so it doesn't happen again.
Most of the most expensive mistakes are due to a long chain of events. A good system should be in place to stop a simple mistake from escalating into a big one.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
When I was 12 years old and hanging out on BBSs in 1989, I didn't realize dialing Gilroy from San Jose was long distance (Both were 408 area code). My parents were not pleased at the nearly $500 phone bill.
As High Proctor of Fahz, I once led my whole species into unrelenting suicidal despair when during the Chinz-Rahl celebration I passed our Ultron onto Chief Groo, who was not prepared to hold such a heavy object and dropped it.
My Mask of Ultimate Embarrassment and Shame is not enough to express the deep chasm of depression into which I sink.
I maneuvered downward the left button of the mouse attached to the computer I was working on which pointer was right on a small gif saying "Send" that technically sent a message I should never have sent. Cost me a lot.
Slashdot, fix the reply notifications... You won't get away with it...
Not me, but a friend. In high school the best computer in the school was a 386SX. They decided to upgrade it to a DX by adding a maths co-processor to the main board. So they ordered one, and when it arrived, they gave it to my friend to install for some reason. Now, the chip had one corner cut, which you are supposed to line up with the cut corner on the socket, so you know it's seated the right way. Of course, my friend put it in completely backwards (because it fit in any direction). So he tries to boot up the computer and nothing happens. So he looks at it again, and realizes the chip is in backwards. So he turns the box off, pulls out the co-processor, rotates it 180 degrees and puts it back in the socket. Unfortunately, powering it up in the wrong orientation had toasted the chip completely, and when he put it into the socket in the correct orientation, the socket locked itself shut, as it's supposed to do. But, since the chip was fried, this effectively locked the motherboard in an unbootable configuration with a dead chip. Sigh.
- In Soviet Korea, only old people loose all their bases to Natalie Portman's petrified hot grits overlords.
Dropped and broke a $40k USD Symantec Gateway Security Appliance
I made a calculation error that cost $10k per day. Took 9 months to straighten things out.
I later won an award for outstanding work.
Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
- rounding error when programming a timer in an embedded system, resulting in a baud rate that was 10% off, causing problems with several units shipped to customers
- overflow of an 8-bit counter, resulting in a serial protocol failing
Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
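Both bug classes are easy to reproduce on paper. A rough sketch in Python, assuming a UART-style timer where baud = clock / (16 * divisor); the clock frequency, target baud rate and function names are made up for illustration and are not from the actual product.

```python
def baud_divisor(clock_hz, baud, oversample=16, rounded=True):
    """Integer timer divisor for a target baud rate."""
    ideal = clock_hz / (oversample * baud)
    # Truncating instead of rounding to nearest is the classic mistake.
    return round(ideal) if rounded else int(ideal)

def baud_error(clock_hz, baud, divisor, oversample=16):
    """Relative error of the baud rate actually produced by this divisor."""
    actual = clock_hz / (oversample * divisor)
    return (actual - baud) / baud

def increment_u8(counter):
    """Increment an 8-bit counter the way the hardware does: wrap at 256."""
    return (counter + 1) & 0xFF

CLOCK_HZ = 16_000_000      # assumed clock
TARGET_BAUD = 115_200      # assumed target rate

truncated = baud_divisor(CLOCK_HZ, TARGET_BAUD, rounded=False)  # 8
nearest = baud_divisor(CLOCK_HZ, TARGET_BAUD)                   # 9

print(f"truncated: {baud_error(CLOCK_HZ, TARGET_BAUD, truncated):+.1%}")
print(f"nearest:   {baud_error(CLOCK_HZ, TARGET_BAUD, nearest):+.1%}")
print(increment_u8(255))  # wraps to 0
```

With these assumed numbers the truncated divisor is about 8.5% fast and the rounded one about 3.5% slow. Many UARTs tolerate only a few percent of total mismatch, which is why an error of this size shows up as intermittent trouble in the field rather than an outright failure on the bench.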
I failed to found Facebook before Zuckerberg did. Cost me billions.
Lost a slide for 3rd party client that was to be featured in a skateboarding magazine.
I think one of the coworkers stole it as I did not get along with them.
Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
Was fired not long after.
Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly, however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.
So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).
Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
Life, the Universe, and Everything... in my image.
First one, I was lucky... there wasn't a switchover to a new database yet, and I made sure to schedule a large downtime window, because I try to do like Scotty... take the time I think will fix something at the worst, then double it. If the PHB gripes, start into detail. A side effect is that users tend to be happy when stuff is back up earlier than planned.
Well, this was a two node HA cluster back in the day where a certain vendor had a passive node and an active node configuration selling for an insane amount. They were connected via serial connections for heartbeats.
Well, it was time to do a simple update of the machines. I staked out 24 hours, just because I wanted to do backups first.
Well, I did the sysbacks, so I had two tapes of the entire boxes.
Ran one set of updates on both machines, rebooted... all fine. Noticed there was a drive array microcode update... just a 0.0.x update. Well, I tossed that on and rebooted... Well, both boxes blew their kernels. All the data on their drives was gone, because the microcode patch got the array in such a state that one machine started writing garbage to all drives.
At least I was able to restore both machines and rebuild the shared data from the tapes.
The second one would have been just as bad. I was cleaning a source code tree of .o files and executables... came to find one dev had libraries that were only present in binary-only format, and whose only backup was in the tree (where the backup program excluded all binaries for space's sake). Thankfully, the tree was on a NetApp, and a simple copy from a snapshot fixed everything. Were it on another server, I'd have had Hell to pay.
digital signal processing chip from TI. The $750 (in 1986 dollars) wasn't the big deal. That the parts had serial numbers hand-lettered on them and I had to go back on the waiting list to get a replacement was.
A long time ago on mainframes. IBM 3083's and VAX's. I was running analysis on some waveform data, took probably about 20 reels of mag tape. Fucking marine seismic data. I sent the big deck of cards down to the floor on a Friday. 1st thing Monday, I had to go the VP's office. He explained that Monday morning, the fucking job was still running. Turns out, instead of sampling the data every 4ms, I accidentally sampled it every 2ms. Back then, you didn't own your mainframes, IBM leased it to you. The VP explained that I cost the company anywhere from $40-60k. Nice guy actually. Texas engineer, cowboy boots and a suit. He politely asked me, "Son, you probably won't be making this mistake again, will you?" I stuck around for another couple of years. Goddamn it took an army to process data back then.
"He's using a quantum encryption scheme! That'll take hours to break!"
...a VERY large (but nameless here) grocery chain here in the US after an EMC engineer decided it was perfectly fine to stick his hand inside the array that supported ALL the chain's warehouses WITHOUT an anti-static wristband.
One 36 hour conference call later and we were all finally back online. I've no clue what the overall cost was, but it was measured not only in hardware and manpower, but in lost sales, as NONE of the 2,000+ stores could be resupplied while the requisite warehouses were down.
And yes, this was a MAJOR chain here but many many years ago.
Long before Amazon was ever more than a bookseller in the mid 1990s, a friend and I had this idea of a website that would allow for comparison shopping pulling data from other sites allowing folk to buy the cheapest electrical items possible
We never progressed because we couldn't see any way for it to make money. We had no idea that was the absolute last thing we should have cared about.
So now I'm here, an anonymous coward posting about our total lack of foresight and imagination, and not some rich fecker who owns real-estate like /Slashdot
I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.
I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary), and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine, basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell, performance dropped through the floor and customers started calling in about trades timing out. Long story short, turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet"
Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again, so I saw the writing on the wall and quit to do consulting work at an (also doomed) dot-com online supermarket.
On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.
I do not deploy Linux. Ever.
i used to insert the cartridges too hard and broke it to the point where i had to spend 15 minutes playing with it every time i wanted to play a game
obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC card disk drives in the switches that were the backbone of the Internet when we kept the vendor engineering on the phone all day for a failed switch... and read the duty cycle of the drives to them, like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.
but I got a haircut indeed when we had to get our stuff out of a colocate that was shutting down. built a mirror data system for that in the new place, had the trunks up, cut over the traffic. then it was time to demanage and power down the old shelf. telcordia assigned a code to the new unit that was one letter different than the old one.
the good news is I got the new one back up in 20 minutes and they didn't stake me out over an anthill.
if this is supposed to be a new economy, how come they still want my old fashioned money?
We were writing a Unix program to parse transactions from some specialized terminals that read customer invoices and the checks that accompanied them, writing the transactions to digital tape to carry over to the mainframe system. During testing our tapes were compared to tapes generated by the legacy IBM system. Our team lead got a call from the customer liaison *early* one morning saying "Do you realize one of your batches was 5 MILLION DOLLARS SHORT?" Yes, she was shouting. Turns out that the $5 million transaction was the largest we'd ever tested with so far. All others were less than $999,999. It was my bug: I'd put the sign nybble (half a byte) on top of the most-significant digit of the packed-decimal payment-amount field on the test tape, dropping that digit from the field. Trivial fix, since I had just been auditing the relevant code the previous day.
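For anyone who hasn't met packed decimal: each byte holds two decimal digits, and the last nibble holds the sign (0xC for positive, 0xD for negative). A small Python sketch of why only the largest test amount exposed the bug, assuming a 5-byte field holding the amount in cents. The field width, the cents unit and the `drop_top_digit` helper are assumptions made for illustration; the helper only mimics the effect described (the top digit being lost), not the original code.

```python
def pack_decimal(value, num_bytes):
    """Pack an integer as packed decimal: two digits per byte, sign in the last nibble."""
    digits_available = num_bytes * 2 - 1
    digits = str(abs(value)).rjust(digits_available, "0")
    if len(digits) > digits_available:
        raise ValueError("value does not fit in field")
    sign = 0xD if value < 0 else 0xC
    nibbles = [int(d) for d in digits] + [sign]
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))

def unpack_decimal(field):
    """Inverse of pack_decimal."""
    nibbles = []
    for byte in field:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    value = int("".join(str(n) for n in nibbles))
    return -value if sign == 0xD else value

def drop_top_digit(field):
    """Hypothetical stand-in for the bug: the most-significant digit nibble is lost."""
    return bytes([field[0] & 0x0F]) + field[1:]

five_million = pack_decimal(500_000_000, 5)   # $5,000,000.00 in cents
typical = pack_decimal(99_999_999, 5)         # $999,999.99 in cents

print(five_million.hex())                            # 500000000c
print(unpack_decimal(drop_top_digit(five_million)))  # 0
print(unpack_decimal(drop_top_digit(typical)))       # 99999999
```

Under these assumptions every amount below ten million cents has a zero in the top digit, so losing that digit changes nothing, and test data that never exceeds that size passes cleanly.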
How many people will refrain from posting because the statute of limitations hasn't run out yet?
My biggest mistake was buying a hp 4020i cd burner that was so flawed i understand americans received compensation for it. Traditionally a big fuck you went to europeans
My worst IT disaster was suffering from a hard drive failure, click of death. I had warning of a few days of it, and I deliberately kept the pc on 24/7 instead of normal switch on/off, to make sure the drive stayed alive until its replacement arrived.
Obviously I had to turn the pc off to change the drive, it was not hot-swappable. When I powered the pc up, the old hard drive failed, didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.
Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my pc and powered up.
The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.
Out of interest, the drive failed about three months before I would have done a forced drive change as a backup / failure prevention measure. I got lucky.
Take Nobody's Word For It.
I used to work as an SDH/DWDM admin. In the early 2000s, my colleague screwed up a major firmware update on an STM1/4 ADM, and I, as the senior admin (haha, I was in the first half of my 20s), had to drive up to the site, since the affected node was unresponsive to the management system. After many unsuccessful attempts to recover it, at about 3 am I decided to hard reboot the node, which caused it to boot from the corrupt firmware bank (it had two of those), which in turn just erased all the configuration, including traffic connections (which is built very robust, btw). Since the site was on a (relatively small) island and had only 2 ADMs at the time, I more or less cut off all communication with the mainland. By morning, I had managed to get my colleagues to ferry me another, fully fitted ADM (our last-resort backup scenario was to replace the entire node), but as it turned out, it had been fitted in a hurry with cards running different firmware (the entire network was in the middle of the upgrade process), which resulted in the same kind of useless "brick" I already had at hand. It was very cool, though, to fly ~200km/h to the port and back in my sporty car to pick up the spare (not many police on the island, and I had a very good excuse). By the afternoon, my higher-up manager had mobilized a helicopter to personally deliver me a fully functional ADM, which we promptly swapped in, restoring the configuration from backup. I still have a copy of the local newspaper's front page, praising how our company heroically saved the day to restore the connection with the outer world.
At that time I was already able to make up excuses that would have made BOFH proud, which saved my ass.
I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.
I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.
Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.
Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
Training mission aborted. much sheet metal work needed.
Actual repair cost? Unknown, but easily 5 figures if not more.
Working for a desktop publishing house in IT. Spent just under $4000 on 36 inch flat panel displays. Accidentally plugged in the printer's power cable. Immediately fried a monitor. My boss was not happy. The internship did not go well the rest of the summer.
McAfee on a mass spectrometer data acquisition system. System control would be periodically lost. Cost over $12.6K in lost instrument time and labour to determine that McAfee was blocking serial comms to the instrument (but only when it felt like it).
Lesson learned: never run McAfee or Norton on a mission-critical data system.
I let an upgrade bug slip by me during a software upgrade for the accounting software. In retrospect it should have been caught before it got out of hand. It got out of hand in about 3-4 seconds and had a cascading effect, bringing down the whole datacenter for the company.
It happened when a "guaranteed" bid was due for a 2 million dollar job. We had nothing. Not so guaranteed...
Fortunately (?) I had an ownership stake in the company, so I screwed myself too. Figuring ~12% profit on the job was typical and 10% of that was mine ... it cost me personally over $20K on that mistake.
Ooops.
I was working as a Jr. Network admin, helping to install some new cisco PoE switches to facilitate our building's move to VoIP phones. I aligned a brand new 48-port poe switch slightly off when inserting it into the chassis, and bent the insanely-complex connector at the back of the card, rendering it unusable. Fortunately, we had a ridiculous service agreement with cisco, and a new card arrived at our office within 4 hours. I distinctly remember buying burritos and beer for me and the Sr. admin to help make up for the fact that neither of us got to sleep that night.
Two wall warts used on this computer had the same size and shape plug, but very different voltages. I did not know this, and I put the wrong one into the powered USB hub. The computer had a ton of USB items on it: slide scanners, wireless keyboards (early days), flatbed scanners, a really nice giant ink jet printer, an 11x17 laser printer, serial interfaces, I could go on. Shorted the hub, and fried everything plugged into it. Thankfully I did this before I plugged it all into the computer; the smell of smoke was the dead giveaway. About $1500 worth of damage.
I was on the NASA Genesis probe team. Only a few hundred million lost on that one when it crashed into Earth...
A year spent in artificial intelligence is enough to make one believe in God.
It was publishing it on Slashdot and costing the company a lot of canceled orders the next month as word went around.
I printed out an older draft of an IF-RF board spec. We were developing a high data rate fixed point to multi-point RF comms system. We hooked up the bench power supplies, set the voltages right and the IF CPU did stupid stuff. 8 engineers spent 7 days at 200 UK pounds per hour, trying to fix it, combing over every detail.
I printed out the up-to-date spec and saw the CPU rail spec had changed to be 0.5 volts higher. We (myself and one other engineer) turned the knob and the system sprang to life. We agreed to claim victory and not go into too many details about what was wrong.
Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.
The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.
2 months in, the new provider vanished, along with all of my data. I wasn't very concerned about the month's worth of money I had lost by not getting the 3 months I had paid for; I think it was only about $15. "Okay," I thought, "I'll just pull my data out of my nightly backups and move on." It turns out I forgot to adjust my local cron script that pulled the data over rsync to the new IP address. My backups had not been pulled in over 2 months.
Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.
I learned my lesson, though.
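The failure mode here is a backup job that silently stops producing anything new. A cheap guard is a separate check that complains when the newest file in the backup directory is too old. A minimal sketch in Python, assuming backups land as plain files in one local directory; the directory layout, the 36-hour threshold and the function names are assumptions, not details from the setup described above.

```python
import os
import time
from typing import Optional

def newest_mtime(backup_dir: str) -> Optional[float]:
    """Modification time of the most recent file in backup_dir, or None if empty."""
    mtimes = [entry.stat().st_mtime for entry in os.scandir(backup_dir) if entry.is_file()]
    return max(mtimes) if mtimes else None

def backups_are_stale(backup_dir: str, max_age_hours: float = 36.0,
                      now: Optional[float] = None) -> bool:
    """True if there is no backup at all, or the newest one is older than the limit."""
    newest = newest_mtime(backup_dir)
    if newest is None:
        return True
    current = time.time() if now is None else now
    return (current - newest) > max_age_hours * 3600
```

Run from its own cron entry and wired to an email or a non-zero exit code, a check like this would flag a pull that has been pointing at a dead IP address after a day or two instead of after two months.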
Not me, but my thesis adviser became the Technical Director for JSIMS, which ran through +/- $1B before the pentagon pulled the plug. He is not shy about mentioning that fact.
http://www.nationaldefensemaga...
In 1985 I bought a "Fat" Mac for NZ$10K
(The kiwi $ was worth about 44cents US at the time)
Oh yea, the "HME lock speed and duplex to full scripts". Knew some admins at a financial services company that didn't remember to run that when building the servers. Servers made it through testing, got turned on in production. The next day was ugly until we looked at the change management book (was really a paper book) and saw the new servers. 5 ethernet cable disconnects later we were back up to our original capacity until they sorted it out.
The total cost was actually sweet FA in numbers terms, but I think I put the final nail in the company's coffin.
My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.
I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.
Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.
I will say, the Server was the problem.
It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as networked file storage. Inside the case were four HDDs.... A pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted, though the keys were long lost, along with a floppy drive and a CD Drive with no write ability. Or ability to read DVDs.
It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.
By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.
So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.
The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous because there just wasn't the spare space to hold two (No money to pay for it)
I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.
Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.
I just did what I thought I could to keep the Titanic afloat.
So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.
But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....
Unfortunately - that only works if the damaged disk decides to drop out of the array.
It didn't.
I find th
So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
Not selling the company for $250M because he wanted $300M during the dot-com boom. My boss personally owned about 30% of the company at this point.
The real "Libtards" are the Libertarians!
Two totally incompetent twits from a populous south Asia country. Cost about $32k in salary and 4 months of schedule slippage. Another contractor, who is competent, said she suspected they gave 'ghost' interviews, a common practice in her country. I heard managers say the same thing, that the two who showed up for work were not the ones they phone interviewed. They did not know command line basics in either bash or Windows, how to use remote desktop, Java, unit tests, and other things we required.
Oddly enough of the 4 foreign contractors we used recently the two women have been competent, the two men useless.
putting the 'B' in LGBTQ+
Haven't caused errors with a quantifiable dollar-amount loss. But have been involved with several errors in various systems, as I suspect is the case for developers who write code that actually goes to production ;)
For an embedded hardware/firmware module for use by a backend application, I made a bug causing the module to reboot if a given parameter passed from the application was missing in certain circumstances where it was supposed to be present. The application wasn't supposed to call with this combination of parameters, and unfortunately the test harness didn't test for this case either. And in fact the application didn't usually call with the wrong parameters. But due to a database crash and associated data integrity error (which turned out to be a bug in the DB software itself, which was later fixed), the column corresponding to the parameter in question actually became NULL for a few users in the database. And since the application didn't check the validity of parameters but just passed on whatever it got from the DB, this resulted in the firmware receiving the illegal NULL value, thus causing a reboot whenever one of these users logged in. The module brought itself up quickly after each reboot and there was redundancy, so there wasn't any user impact, but a lot of warnings and alarms went off every time and it took some time to figure out how the error could happen.
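The missing link in that chain was a validity check at the boundary between the application and the module. A rough sketch in Python of the kind of check that would have turned the reboot into an ordinary rejected request; the field names and the exception type are hypothetical, since the real interface isn't described.

```python
class InvalidRequest(ValueError):
    """Raised when a request must not be forwarded to the module."""

# Hypothetical list of parameters the module requires for this call.
REQUIRED_FIELDS = ("user_id", "mode")

def validate_request(params, required=REQUIRED_FIELDS):
    """Return params unchanged if every required field is present and not None."""
    missing = [name for name in required if params.get(name) is None]
    if missing:
        raise InvalidRequest("missing or NULL parameters: " + ", ".join(missing))
    return params
```

A row whose column came back as NULL from the database would be stopped here and logged against the affected user, rather than being handed on to the firmware.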
I was brought onto a small web startup project as a co-lead. By this time the project was already 2.5 years old and had been rewritten at least three times by progressively less lousy developers. The final iteration was built on CodeIgniter (MVC framework), a decent choice in 2013.
My first day I'm browsing the codebase to see what's what, and a grep finds something like "UPDATE my_table set foo=" . $_POST['bar']. Not in a controller... not in a model... in a view.
So I immediately told the other leads that we needed to do a security audit on the entire codebase; it took a few days for the owners to consent. The audit revealed three different mechanisms for database queries (the standard CI driver and two other crude home-grown libraries, all used inconsistently) and that one of the devs, who not coincidentally had resisted the audit, was actually AFK for 20%-50% of the hours he billed every week. It took two months to do the audit and resolve the redundant code (no one was full time, mind you). Finally the owners told us "give us two weeks to decide whether or not we want to proceed". After six weeks of silence they pulled the plug and abandoned it entirely.
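For anyone wondering what was wrong with that query: it pastes user input straight into the SQL text. The usual fix is a parameterized query. The sketch below uses Python and sqlite3, not PHP or CodeIgniter, purely for illustration; the `id` column and the `update_foo` helper are assumptions, only `my_table` and `foo` come from the grep result.

```python
import sqlite3


def update_foo(conn, new_value, row_id):
    """Update one row using placeholders instead of string concatenation."""
    # The ? placeholders keep user input as data; it is never parsed as SQL.
    cur = conn.execute(
        "UPDATE my_table SET foo = ? WHERE id = ?",
        (new_value, row_id),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, foo TEXT)")
    conn.execute("INSERT INTO my_table (id, foo) VALUES (1, 'a')")
    # A hostile value is stored verbatim and does no damage.
    update_foo(conn, "x'; DROP TABLE my_table; --", 1)
    print(conn.execute("SELECT foo FROM my_table").fetchall())
```

And wherever the query ends up, it should live in a model, not a view.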
I had a friend whose job it was to find a way to break satellites. She said she was quite often successful.
(Hey, the OP didn't say it had to be an accident.)
We get big discounts that way.
“He’s not deformed, he’s just drunk!”
I developed a system at work to transfer specifications from the customer to the software engineers that bypassed me.
I left the cover off of a $40,000 stabilized vsat antenna in a rainstorm once. That did about 10k in damage to the electronics inside. That's nothing compared to what our customers do though. Let's just say communications systems don't belong IN the ocean.
sorry for my comments, I'm drunk
Not my own personal screw-up, but I did watch a coworker torch the version control system that about 1000 people were depending on. It was such an epic torching that it was down for 3 days; the bug was also deleting the backups the moment they were restored. It turns out that 1000 engineers playing solitaire for 24hrs is a huge bill.
I prepared a PowerPoint presentation, where we could see small black dots. These were dirt marks on the lens of the camera.
But I thought they were missiles with nuclear warheads or chemical weapons, and presented that theory to a bunch of idiots. Next thing I knew, we were invading Iraq!
- Colin
I got hired with a local ISP/network service group, and my first assignment was to go install a new frac-t1 router in a new client's office (yea, this was ~15 years ago, cheap t1 routers were still ~$1k). So the boss takes me back into the storeroom, digs out a router from a pile, and grabs a random power supply by comparing the size of the plug to the hole in the router. I actually bother to check the rating, and find that the power supply is 24V, and the router wants 18V. The boss tells me to plug it in.
Me: "Um, I don't think this is the right power supply."
Boss: "It'll work, come on, we're in a hurry."
Me: "But this is a 24V supply, and the router wants 18V"
Boss: "I said plug it in, what are you, deaf?"
Me: "OK..."
BANG! Fizzle-smoke-spark!
Boss: "What did you do that for?"
Shortest job I've ever had.
Not sure how much the repaired metal layer cost. Totally my fault, I busted a key piece of a metal rom mask. About 200k of code got shifted late by 4k bytes (a late req from a team member) and all that code was worthless as it was linked for the original layout map.
I did not get fired but the processes were tightened up. I wrote a tool to confirm an elf32 file's preloaded contents were where they should be. Never had this issue since.
During an acquisition, the company being acquired helpfully passed along the list of AS they used in their BGP4 configurations in their core routers.
They helpfully had included the ones from other networks they provided connectivity to as well, but just had sent the AS numbers over in one big list, unlabeled, along with the AS their network originated: "Do these."
So during the network integration I dutifully entered the entire list of AS into the core routers as AS to be originated. Needless to say, hilarity ensued.
So perhaps not entirely my fault - though I should, in hindsight, have asked for more clarification or done more investigation rather than blindly trusting the information I had been given. This was a couple decades ago, and I was not cynical enough yet.
Got this domain "hsa.com" in the *very* early days of the Internet (pre-web). Decided that since we were a Canadian company, we should have a Canadian domain, and surrendered it and got hsa.on.ca. (we weren't allowed to have hsa.ca, since all our offices were in Ontario...)
A three letter .com address would probably have been the most valuable asset of the company :-).
I've found 99 times out of 100 catastrophes were caused by deliberate acts. Usually there is one or more arrogant sysadmins, who are much better at writing root cause analysis. Of course, if you are a senior architect who has designed your system to allow one of these bozos to actually touch it, you're screwed. Then of course you have users. If you have users, no amount of planning can stop the carnage about to be unleashed.
Worst thing (so far) has been formatting a PHP date() DB timestamp wrong for entries associating users and payments. I think it was something like accidentally using 'M' for both month and minute.
At the same time, there was a bug somewhere that periodically caused only one of the 2 tables to be written to. When we noticed that the tables were out of sync we immediately jumped to the timestamps to make some sense of the situation, which of course didn't work in this case.
Took only a few hours to sort out since we could use other available information to fix it, but it was my 1st or 2nd real job at around 18 so I figured I was canned; I wasn't though, it was one of those "lesson learned, watch out for it next time" situations -- my boss was really frustrated though.
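The same class of mistake is easy to reproduce in other languages. The sketch below is Python, not PHP, where `%m` is the month and `%M` is the minute; the format strings and the helper name are made up for illustration and are not the code from that job.

```python
from datetime import datetime

GOOD_FORMAT = "%Y-%m-%d %H:%M:%S"
BAD_FORMAT = "%Y-%M-%d %H:%M:%S"  # minute directive sitting where the month belongs


def parses_as_timestamp(text):
    """Return True if text is a valid timestamp in the intended format."""
    try:
        datetime.strptime(text, GOOD_FORMAT)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    moment = datetime(2015, 3, 7, 13, 45, 9)
    print(moment.strftime(GOOD_FORMAT))  # 2015-03-07 13:45:09
    print(moment.strftime(BAD_FORMAT))   # 2015-45-07 13:45:09
```

The nasty part is that whenever the minute happens to fall between 01 and 12, the bad format produces a perfectly plausible date with the wrong month, so validating the stored string catches only some of the damage.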
... plugging a kettle into your 6-hour UPS is not a recommended way to make a cup of tea. This, however, is exactly what I did a long time ago. 10 or so seconds later, I had a still-cold kettle of water and an entirely drained UPS. Oooops !
Bug in the routing protocol in a custom X.25 network for a major stock exchange in the 1980s. Killed the entire net for a day. Client estimated the cost at a minimum of a million dollars. Found and fixed in 24 hours. The client actually thanked us because their earlier custom network vendor had done much worse.
Back in the pre-Y2K era I designed and wrote a system used by a major transcription company. The transcriptionists were hired as contractors and paid by the word. My system was supposed to do a word count, then pay the transcriptionist on the number of words. Easy, right? Only the word count was more than a little selective. Headings, for instance, were not a part of the count. Certain phrases were exempted from the paid word count, as were other characters and whole words.
Well, in the middle of all this I screwed up the word count process and ended up over-paying the transcriptionists by about 15%. The system ran for over a year before I accidentally caught the error. I think the phrase, "Oh Holy shit, look at this crap." came to mind. After updating my resume, I took the whole mess in to the general manager's office along with a bottle of bourbon and told him what was going on. I finished up by telling him I wasn't going to say anything about it to anyone.
He agreed. I quietly fixed the problem, being careful to drop the cost a point at a time over a three month period so that no one would notice, then left it alone.
My first summer job during studies - 20 years ago.
We were developing and testing a new payment transfer system at one of the largest global banks. I had access to production for monitoring and to test for testing.
Somehow I made a mistake and ran a test batch on the production instance.
The batch was the monthly payment of a large airline for one of the larger airports - slightly over 12 million USD.
Within 2 hours the reports in mgmt went red and my manager got a call from higher management.
My boss was able to negotiate with the airport to reverse the transaction - he knew the manager at the airport as we were training their staff as well.
Nothing happened to me - I just got chastised - my boss got some heat but everything ended fine.
I once forgot to open a water valve before turning on a laser in the lab.
The low-pressure safety switch for sensing water flow had been bypassed (not by me) and the laser tube immediately cracked and broke due to the instant heat buildup. Total cost, about $4000.
Just cruising through this digital world at 33 1/3 rpm...
I once mis-printed a 50000 page sales report.
I worked on both the Mars Climate Orbiter and the Mars Polar Lander, though not on software related to the failures. I did fry a $12k damper during testing though due to a misunderstanding with the thermal engineers on hardware placement (I didn't lose my job, I was fresh out of college). Due to the fact that the capillary pumped loop heat pipe thermal system didn't work, they ended up cutting it off and adding extra heaters/sensors at the last minute. Looming launch deadlines make for crazy times ...
Rebooted the wrong IBM mainframe - twice!
No idea of true "damage" cost but frustrated a lot of users. Figure $100k or so given the number of users and their hourly salary.
Didn't lose job but boy was I made fun of for a couple of weeks . . . Then made a systems programmer.
Go figure.
I stripped the thread from my pedals by incorrectly using a cotterless crank puller.
I think that was the biggest tech mistake I ever made.
It didn't cost anything since I haven't had to replace the cranks on that bike yet.
I work in finance and there was a confluence of a bug, a human error, and a poor process that led me to lose almost a million dollars in about 20 seconds after an economic event. I had been working on the front desk for about a month at the time and I was literally seeing fog for the rest of the day trying to put together the tick by tick scenario of what happened. I wasn't fired, and cleaned up next time the event came along. But I was scared I had just ruined the best job I ever had.
Mostly I make my career out of fixing other people's tech mistakes. Which is not something that uni taught me how to do. Man I'm glad I got out of that place before I ran up any significant student debt. Did I mention I trash talked a uni on a news blag website?
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
burnt electronics.
I worked at a company in the valley and had to correct another engineer's mistake in an offline log parser that calculated per-customer bandwidth usage of our platform from httpd server logs. Customers got billed on a combination of things, with one component being bandwidth usage (X GB per month free then additional charges per GB). This surfaced when a customer complained that they were being overbilled. Turns out that the log parser was not distinguishing HEAD requests from GET requests, so if a HEAD request for a 1MB file came in and resulted in an HTTP 200, the code incremented "bandwidth used" for that customer by 1MB. I fixed the bug, then we had to pull the previous 2+ years of logs from S3 and run a massive offline job to compute how much we overbilled. For many customers it was a non-issue, but there were a few that had been overbilled by a lot. We ended up issuing about $180K of refunds (which was a lot for a company with a few million in revenue per quarter).
The engineer that wrote the original code only worked there for about 6 months and was terminated after failing a PIP, but nobody ever did a thorough audit of his code after he left, so it was used for the next 2 years until the complaints made us look at it.
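To make the bug concrete, here is a rough sketch of the accounting rule, in Python. The record layout (`customer`, `method`, `status`, `object_bytes`) is hypothetical and is not our actual log format or parser, and counting only GET 200 responses is a simplification (a real job would also have to deal with partial responses and so on). The point is just that a HEAD response has no body, so it must not be billed as if the file had been transferred.

```python
from collections import defaultdict


def bandwidth_by_customer(records):
    """Sum billable body bytes per customer from parsed log records."""
    totals = defaultdict(int)
    for rec in records:
        # HEAD responses carry headers only: no body bytes were sent.
        if rec["method"] != "GET":
            continue
        if rec["status"] != 200:
            continue
        totals[rec["customer"]] += rec["object_bytes"]
    return dict(totals)


if __name__ == "__main__":
    one_mb = 1024 * 1024
    log = [
        {"customer": "acme", "method": "GET", "status": 200, "object_bytes": one_mb},
        {"customer": "acme", "method": "HEAD", "status": 200, "object_bytes": one_mb},
    ]
    print(bandwidth_by_customer(log))  # only the GET is counted
```

A unit test with a single HEAD record would have caught this on day one.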
I always liked FinTech places' ability to forgive these things. When I had my million dollar loss, I thought for sure I was going to get fired, but they just wanted accountability, to know what happened EXACTLY, to have the issue fixed, and new procedures in place to make sure it never happened again. I too was praised for being able to do the intricate debugging, picking through packet captures, etc...
Fucked up the coordinates on a fuel rod matrix/grid for a nuclear plant and it had to be scrapped. (a huge part several meters in diameter precision milled out of a single piece) I did not get appointed employee of the month that month.
At a startup - I was the new and first sys-admin they had had, they had two developers running amok before I got there. So I had implemented nightly backups and gotten things under control - on the backup server, which was also the development environment, I wanted to hook up the CD audio cable so I could play music... so rather than waiting to do a clean shutdown and interrupt the developers I decided to do it hot. Needless to say I bumped the scsi cable and put the raid into a wedged state. It took a call to the manufacturer of the box and one of their systems engineers to get the magic command to force the hardware raid card to a "good" state, and the day was saved.
Basically a loading tool with a bug I knew from testing: you could set it correctly once in production, but if you set it twice every user was f*cked up and could only be fixed from the web interface by about 5 clicks per user, no programmatic solution. And of course we had an error in the production setup. I altered that part - which I could - but forgot to take out the "you can run this only once" settings. Hundreds of users borked and the vendor support would take forever or claim there's no other way, what do?
This was a consulting company, trying to bill this would look bad on both our vendor and ourselves and it pretty much broke everything so we gave a benched consultant the assignment from hell. Click here, here, browse, pick, save in this somewhat less than instant web interface. Now do that all day, every day for all users until you're done. Personally I'd be ready to jump off the roof after an hour, but apparently she stuck to it for three days and finished. I don't think we won any popularity points with her though.
Live today, because you never know what tomorrow brings
Melted down a couple of LARGE high end power supplies (worth about 200K - I think the repair was about 50K). Did I lose my job? Nope, not even really called on the carpet. I had a triple redundant fail safe system, approved by management (in writing), and reviewed by both levels of client, and ALL THREE systems failed! (1 software, one independently developed firmware, one mechanical). Failure analysis on just the last one (the mechanical) was it was a once in over a million chance of it failing (yes, we did a failure analysis). Something (surge?) fried the computer, the firmware controller, AND welded the mechanical contactor closed (LOW duty cycle - close at start of test, open at end, 3x safety factor on ratings, something welded them during the test - aka I watched them close, visually inspected, and went home for the night, as per SOP)
One of those freak things, but we changed to a carbon contactor so it could not weld, and changed the firmware unit to a more robust unit, and did some other isolation. As far as I know, never happened again
was created by my boss. I fixed the bug instead of reporting it. The boss was incompetent and was costing the company millions in missed opportunities and in increased turn over of really good people. He couldn't see when his successes were pure accidents and when his mistakes were entirely foreseeable and preventable. I had a few opportunities to get him fired when fixing his messes. I wasn't ruthless. It cost a number of good smart people their jobs and cost the company millions (in fixes, unnecessary delays and missed opportunities). I'd put the dollar figure at around $10mil. But it may be much larger if some of those missed opportunities were first-to-market.
Any guest worker system is indistinguishable from indentured servitude.
... too bad it was here :)
On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
The two biggest I have seen:
- Comms card slips out of box while being carried over to submarine. Worth about $220,000, fell into the water and had to be recovered by divers for security.
- Electrician didn't test circuit was isolated, he went to disconnect 3 phase circuit and decided to start with neutral. He lifted the neutral off, putting up to 400v where there should have been 230v. This destroyed over $300,000 in components, and cost another $200,000 due to lost operations.
No one knows yet, ..., and only the 1st one got backed up.
They deleted everything to install the systems fresh. No one knows why backups didn't work. FWIW, I had specifically told them to ask me if they ever wanted to reinstall the systems fresh. The bug was fixed in the newer version of the code.
I had hardcoded the path for database backup, so one customer had db1, db2, db3,
Back in the 80's I worked for a field service organisation, fixing and maintaining PDP11 and VAX systems, but also CDC-9766 removable disk systems. Big 14" removable disk packs like you see them in old scifi movies. One of my customers had a string of 10 or so attached to a five-node Tandem Non-stop system.
Each week they brought two out of ten off-line for me to work on. I cleaned the heads, then used a servo disk pack to realign those heads.
To do this, I needed to remove the control cable from the string, and plug in an exerciser. One day I forgot to pull the control cable. So instead of moving the heads of my offline drive to a specific track, I moved the heads of *ALL* disks in the string! Without the O/S knowing about it.
Believe me, that will bring a Tandem Non-stop to a grinding halt. That was my last time on the floor for that customer, but I didn't lose my job. Cost? I don't know. Perhaps a weekend of data recovery for the operators?
To Terminate, or not to Terminate, that's the question - SCSIROB
Late 70's. Central datacenter for a state not to be mentioned. I modified the JES2 startup JCL. Our mean-time-to-reboot was typically 2 weeks. Because of important state business, we didn't get a chance to reboot for 3 weeks. So, we reload and JES2 dies for JCL error. Then, we realized that all of our daily backups have the same error. And our last 2-week backup has the same problem. Our next backup, monthly, is stored at a site that is 1.5 hours away. Meanwhile, programs like AFDC and prison support apps are not up. Governor starts getting calls from important folk - where's the system? Governor calls DP director - where's the system? I see the end of my career looming. Fortunately, my boss had an old SVS system on tape that was just enough to allow us to edit the JES2 deck. After this, we changed our backup policy and put in stricter rules on modifying production systems. I just retired after 46 years in computer industry. Still remember the fear on that day.
But it's worth repeating in this context. Thankfully, it wasn't me.
When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.
Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help 4+ PM.
The techs had to be found, gathered and flown out from CA to disassemble said machine and reassemble. No wires until 1 or 2 PM Monday. Much money loss for all customers.
To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.
Posting AC for obvious reasons:
I was an electronic tech in the military. I dropped a blast proof panel onto a nuke and left a dent about the size of a watermelon. Cost unknown, but I should have been wearing Depends.
One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load testing function. I came up with a small design involving a contactor and some minor wiring changes and we were part way through implementing it on every UPS at this site.
This UPS was part of a redundant pair that fed an emergency shutdown system at an oil refinery. In between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue then started the process for the second one. After calling the control room to let them know they would receive an alarm, I switched off the UPS and was suddenly met with a stream of profanities over the radio.
We lost power to 80 field instruments, which triggered a fail safe action on the shutdown system, tripping 4 units at the refinery. One of them was the FCCU, which is core to a lot of refinery processes. To add insult to injury the unit was unable to be hot restarted because of a stuck valve and then thermally contracted, breaking off large chunks of coke from the overhead line which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days. I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.
Total cost of the outage was about $8million. Fortunately only partially my fault.
I used to work at a cloud infrastructure provider and we were in the process of decommissioning a zone (datacenter). We were using CloudStack as our orchestration at the time and it worked fairly well. One of the sysadmins noticed it was going to TAKE FOREVER to delete all of these templates (used for spinning up vms) via the GUI. So I volunteered to write a quick little script to delete the templates through the API.
The sysadmin promptly gives me all of the template ID's that needed to be deleted. I plugged them into my script and away it went. I forgot one crucial detail when I wrote that script... I forgot to specify which zone these templates were to be deleted from. So it deleted the templates from all datacenters...
Didn't cause any downtime but about $20,000 in hours to rebuild all of the templates... and no, we didn't have any backups... that was something management was too cheap to pay for 0_0. I didn't lose my job over it and there was a silver lining. We now had all of our templates up to date with the latest security patches :P.
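In hindsight, the script should have refused to run without an explicit zone and defaulted to a dry run. A rough sketch of that idea in Python follows; the inventory of (template id, zone) pairs and the `delete` callback are hypothetical stand-ins, not the real orchestration API or the script I actually wrote.

```python
def select_for_deletion(inventory, template_ids, zone):
    """Pick only the (template_id, zone) pairs that live in the named zone."""
    if not zone:
        # No zone means "everywhere", which is exactly the accident to prevent.
        raise ValueError("a zone must be named explicitly")
    wanted = set(template_ids)
    return [(tid, z) for (tid, z) in inventory if tid in wanted and z == zone]


def delete_templates(inventory, template_ids, zone, delete, dry_run=True):
    """Delete the selected templates; by default only report what would go."""
    targets = select_for_deletion(inventory, template_ids, zone)
    if not dry_run:
        for tid, z in targets:
            delete(tid, z)  # hypothetical call into the orchestration layer
    return targets
```

Reviewing the dry-run list with the sysadmin before flipping `dry_run` to False costs about two minutes.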
Test tech smokes an 80Mbyte hard disk when they cost $1500 each. Replaces and smokes the next one, hmmm something is wrong. Leaves that machine, goes to the next one, smokes a hard disk, replaces and smokes that one. Hmmm, two machines in a row. Calls in the Engineering tech (me). I quickly determine the cable sets were wired wrong, 12V where 5V should be and vice versa. He never thought to look at that, took me about 30 seconds to figure out the problem.
In the very early 80's, I was tasked with getting a VAX 11/780 onto an internal Ethernet network using a proprietary Ethernet Unibus card (one guess where I worked). This VAX had a Unibus backplane in a separate cabinet cabled to a Unibus adapter board on the system bus in the "main" cabinet. The Unibus adapter backplane was wirewrapped and since this Ethernet card did DMA (it's been a long time, but I think that was why), it needed control of a bus line which was normally jumpered on the backplane bypassing each slot so "dumb" cards didn't have to deal with passing the signal along. Therefore, I had to snip this jumper on the backplane of the slot I was installing the card in.
The VAX wasn't used by our group but was used by other departments during the workweek for some fairly important stuff and there was no backup system. The machine was given to me on a Saturday morning and I was admonished it absolutely had to be up by 8AM (IIRC) on Monday morning. No problem as I had studied the problem and had been in email communication with someone at another site who had performed exactly the same procedure. I had never physically touched a VAX before in person but there really wasn't anyone to help me with the task locally so I was on my own (in retrospect, maybe that wasn't the smartest decision) but, being young and brash, that didn't bother me.
It didn't take me long to find the VAX once I got into the data center -- after all there was only one of them. I shut it down cleanly from the console. I set the switch on the main cabinet front panel to the OFF position (I don't actually recall how it was labeled), the lights on the front panel went off and I could hear the area around me get a little quieter as fans spun down (although there was a lot of other hardware around, so it just reduced the din slightly). I was well prepared and had just the perfect pair of wire cutters to do the job. I opened the Unibus adapter cabinet and put the card in. I then accessed the backplane, carefully identified, double checked, and triple checked, the slot and jumper that I needed to cut. In retrospect, maybe I should have paid attention to a rather obvious condition that was staring me in the face, but I had rehearsed this work flow in my mind and proceeded onward. I confidently stuck the wirecutters into the maze of wires, snipped the relevant wire, and everything was going very well.
Then I withdrew the wirecutters from among the wire-wrap posts and was more than a little surprised as sparks arced from the wirecutters to wirewrap posts that they brushed against. Nearly simultaneously with the arcing, I noticed one little detail that I should have noticed earlier -- the fan in the PDU or power supply in the bottom of the cabinet was still whirring away happily and the light showing it was powered on was clearly glaring at me. Ooops...
Well, I thought, hopefully, no harm done and I closed the cabinet. It was around then that I noticed a very concerned look on the faces of a couple of FEs who were working on an adjacent machine. I walked over to them and their concerns quickly became mine -- turns out they were "downwind" of the VAX and the distinctive odor of scorched electrical bits was strong around them. I guess I made someone happy that day though - they were very relieved that it was my machine, not theirs, that was emitting that lovely unmistakable fragrance.
Unfortunately, although the VAX seemed to boot, a bunch of stuff didn't work... Ooops...
We had 7/24 support with DEC so I called service out and watched a completely incompetent service guy (he was our PDP-11 repair guy who apparently was stuck on call supporting hardware he knew nothing about) fumble around for hours and concluded that the Unibus backplane had been fried and initiated getting a new one counter-to-countered to us (fortunately, that got blocked by someone who knew what they were doing somewhere). The guy didn't even know how to run diagnostics on the VAX and refused to attempt to do so.
In the end, the machine was not up a
Back in the day, the HW development kit for Sega Dreamcast cost upwards of $10k and would fry if you plugged a receiver into the RCA video-out port while the system was on. We had a dev do this. So we had a good laugh, shipped the unit back, got a replacement, gave it back to the dev and the first thing he did was plug his receiver into the unit while it was running...
I missed one character in a regex in a monitoring system that would cause it to think all the hard drives in a machine had failed when the machine was booted. Since it only happens on boot, it wasn't noticed until there was maintenance work that powered off an entire datacenter. When they turned the power back on, ~5000 machines all decided their hard drives had failed simultaneously. Took 2 days to clean up the mess.
kc8apf
The largest IT mistake I ever witnessed happened at a healthcare software company I used to work for not too long ago. The Database team was going through a lot of change at that time and they were bringing on a lot of new people. The senior DBA was out wandering some mountains in South America for his vacation (the dude is awesome). So it was left to one of the more mid range dbas to hold the fort for the next month.
So the perfect storm happens: The senior is out, the backups had been failing due to lack of equipment and a place to house them, a new team of dbas in the environment, and developers who saw their chance to push things at a faster rate because there wasn't someone there with enough experience and balls to tell them no.
The dbas and developers spot checked the script and everything looked green. They ran the script in prod and accidentally dropped over a terabyte of data in the company's largest database, which housed data for their most profitable product. The dba saw that something was wrong within a fraction of a second and tried to cancel it but it took time for ctrl-c to register. Dropped over 400 tables in a 1200 table database. The dba just sat there, speechless with water in his eyes because he knew how bad this was.
It took two days to get the thing back up, the company lost over 2.5 million dollars from it and they tried to fire the dba who ran the script. The senior who was in the mountains cut his vacation short and flew back to Pittsburgh. The senior threatened to quit if they fired the other dba and pointed out all of the flaws in the process. He also showed them all the documented attempts to try to get management to purchase equipment and infrastructure necessary to back up the company's most profitable data.
They didn't fire him then but they definitely wanted to. They treated all of the new dbas like children after that, which eventually led to THE ENTIRE database team leaving the company. Management would have been hard pressed to have handled that situation any worse than they did.
About 15 years ago, a QA engineer in my office (a large Wall Street financial firm) placed a fake trade for 1,000,000 shares of company stock in one of our test systems. The test order somehow got out to the New York Stock Exchange and actually moved the market. Backing out that trade was reportedly quite expensive.
The engineer didn't get fired, because he had done everything correctly. The system infrastructure had been set up wrong.. wasn't his fault.
I rewrote a system for checking out clients, upgrading the backend code (Things like an actual relational database and events to pass messages between systems for viewing updated data). But I had to maintain the work flow and one of the things the system had was everything is put in a current invoices table, at the end of each shift, they'd shut down every computer and run the end of day, it would print out a list of transactions for them to count their drawer and move everything to old invoices.
The old system did no checking to ensure they didn't have the system up, so I added a tool to lock the system and prevent any new machines from coming up, then added a tool to send out an event to all connected machines and ask them if they're still on. Connected machines would reply and abort the end of day; if there were no replies it proceeded. Eventually the users figured out there was no check that the system was locked, so they started launching the end of day without locking it, and while it confirmed the invoice report had printed properly they'd leave it sitting there and pull up other machines. The system was built to roll back if there were any inconsistencies in terms of database relations, but it also used exclusive access for the end of day, so another system being up would cause data loss. I had only tested against inconsistent data, and the other outcome was impossible, right? Users were instructed to do it this way and the system asked if anyone was on, so it shouldn't be possible, right?
Well, they did that once and called me in. I look at it and they've already closed the error and the entire system. I tell them I need the transaction number it gave them so I can roll it back, and they go "well, it put something on the screen about a transaction number but I didn't think it was important so I closed it." So they have to stay late and re-enter the day's worth of sales.
A week later I haven't yet figured out what they're doing to cause it and they run into it again, and they've closed the error message again, and once again they have to re-enter the whole day of sales.
A week later it happens again. They call me, as luckily I'm in the building. I run to the machine, and as I get there the receptionist is moving the mouse to the cancel button. I tell her to stop; she says "but it won't let me do anything else if I don't." I tell her to stop anyway. I sit down and go about rolling back the transaction, and when I look over someone has the system up. I ask her why she has it up during the end of day, and she says she started it after the error. I pull the logs and see when the end of day started and when she pulled up her machine, and I know when I sat down in the chair: she pulled it up well before I sat down (5-10 minutes or so) but well after the end of day started.
They still try to find tricks to get around it. I have a daemon that monitors the end of day now, looks for them trying to get around the lockout, and kicks them off if they do.
Everyone argues nonstop that it's their god-given right to screw stuff up however they please because they don't see how it's bad. I told one girl 3 times, and when I caught her the 4th time she said "well, I don't see how it affects me so I don't care."
It's literally everyone at the counter, so firing offenders isn't an option. I'm working on a training regimen and going to put in policies to make it their problem (i.e. the shift that screws it up is the one that has to fix it, no leaving it for the afternoon shift),
and working on a replacement system that maintains a similar workflow but more or less simulates the whole end-of-day thing so they can't mess it up anymore.
I confessed, I worked on Slashdot Beta ;-P
Table-ized A.I.
Mid 90's. Spent a lovely weekend below the waterline on a frigate updating the ship's maintenance system with a new data picture of its systems. All went wonderfully well and I walked ashore late afternoon on Sunday and flew back to my home city. Fast forward to 4pm Monday and we get a call from the ship at sea saying the maintenance system no longer functioned: get your butt out here and unf*ck it. So, in the car, 3 hours drive to where the ship anchored for the night, RHIB ride out to the ship, up the rope ladder, about 10PM... fix it, you have until 6AM or you are sailing with us (for a week). That, my friends, is great motivation to work fast. To cap it off, there was a small fuel leak in the space outside the computer room: wonderful aroma to deal with. Tried to work out the obscure linkage between existing maintenance jobs and the system description that was causing the issue. Ultimately had to roll the database back to the pre-update state. Off the ship at 6 along with many bags of oil-soaked rags used on the fuel leak. Ship lost a few days of data and a day at sea: captain not happy... and we had to do the whole exercise again later.
Tape for data, $100. Airfare and accommodation, $600. Warship all at sea, priceless.
Not entirely my doing (what is these days) but I was the man that delivered the fun. No names, no pack drill over this.
Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
Upgrading from Netware 2.something to 3.1. Did backups, did more backups, tested the backups, retested the backups. Went ahead with the upgrade. That process wiped out everything to re-install. Started reloading from the backups. 1st one failed at the end of a volume, 2nd one failed at the same place. Started worrying. Re-installed everything. Backups still failed. Never managed to restore anything past the end of the volume.
After much to and fro with Novell, we were told that we had found a new bug. A file over 256 megs as the last file on a volume corrupted the rest of the backups.
Handed in my resignation shortly thereafter and NEVER EVER worked with any shit from Novell....
No, not me, but it's worth noting that the XBox 360 Red Ring of Death was (according to EE Times) caused by someone at MS who thought he could save a couple million bucks by doing the graphics ASIC work in-house instead of paying someone with experience like ATI to do it. That cost $1.3 billion. As far as I know nobody involved in deciding that or doing the ASIC work has ever been named (and I wouldn't blame the poor ASIC guys), but I can only imagine what it would be like to know that was you.
I fried a voice coil on a fairly expensive Hitachi 2.2 GB optical drive back in the late 1980's with a QA stress test while working for FileNet. This led to engineering improvements and I got to keep the burnt out coil as a trophy.
Not me, but a former boss worked for ACDelco at his first job in the late 1970s. He worked on the first generation of CPU-based engine controllers for General Motors, apparently he had a bug that was discovered just as it was entering full production. GM was making so many of these that when they discovered the problem they couldn't just stop production and throw away all the existing modules, so they constructed a new building just to house the defective parts until a workaround was found. And apparently the first few years their ECUs were electrically and RFI-wise very noisy, and GM got the reputation of having car radios with terrible reception as a result. Ironically less technologically advanced competitors without digital ECUs were favored by consumers as a result.
Bought a Buffalo TeraStation. Went on vacation a year later to a country with limited internet access. On the trip, the one-year warranty expired and it died the next day, taking all data with it.
Fortunately, I had a copy of the server with me on a portable hard drive, so I could work remotely. That was our only backup. Sending the accounting database back to the office via GPRS was a lot of fun, but mailing that drive back to the office (after duplicating it of course) scared me to death.
The solution at the time was the right one; we didn't have the money for anything more. Ever since, we have had a hot backup server synchronized to the primary, for a small business. Like most screw-ups, what is important is how you move forward.
I flubbed the script, and while there was no data loss, I, by myself on the night shift, broke about 25k email accounts. I had a long night fixing it.
I still remember the frantic calls from the help desk as I was in panic mode trying to find out how bad it was.
Silence is a state of mime.
Bought it on eBay. Had crappy Verizon firmware on it that wouldn't allow any kind of audio streaming (web page streaming or TuneIn). Loaded Cyanogen on it and it worked fine but still wouldn't stream due to some remnants of Verizon FW.
Backdated Cyanogen to older mod and that mod was corrupt. It destroyed the boot loader so I couldn't flash another copy of non-corrupt OS.
I still have the phone but no way to get an OS on it without a boot loader on it.
I knew a guy who did support for a multi million pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers because he did all the support remotely and it would be a 100 mile trip up to their office if the machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.
So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone actually got fired because they infected the server with a drive-by. Their mail server had a dodgy network card, but it took nearly a year to diagnose because he was terrified of updating the driver in case it didn't come back up, so that was just intermittently not responding or dropping incoming connections for over a year. The driver update fixed it in the end.
The company I worked for was contracted to work for a company that dealt with credit cards: not the big banks who ran their own services, but everybody else with a Visa or Mastercard. We did this at a knock-down rate, looking for more work at the company.
We were certainly the first in the UK to automate balance transfers via the software the call workers were using. Massive commercial advantage. Every couple of days I'd see some sniffers from big banks looking around and asking questions. Things were looking good
until the company who originally contracted us went... uh uh uh! By the terms of this contract we own the IP. And it's true. The moron development manager probably had no idea what he was signing. I certainly lost confidence in him when I asked for a pay rise and he couldn't work out what it would actually cost him with the agency fee on top...
Well, one time, I had a problem with my land line, and I erroneously accused the wrong phone and threw that one out instead of the one that was causing the problem. Then I ended up throwing away two phones.
Since then I've solved the problem more generally by not having a land line anymore.
Wasn't mine, but it's too good not to share. Back in the mid 80s, I was working at (let's call it) SuperBigCorp's IT department. There was a fellow there who maintained the programs that handled the savings elections for employee 401K funds. One day, while making some changes to the COBOL programs that sent which funds to what investment vehicles....he made a little mistake. He got confused in a conditional statement, and all the funds that should have gone to stable investment selections went to the highly speculative vehicles, and vice versa. Even more unfortunately, this area of activity was not supervised and audited half as well as it should have been....by the time it was noticed, several months had gone by, and the stock market had suffered a bit of a setback. Millions of dollars were lost by SuperBigCorp getting it straightened out. They had to let the poor fellow go, in disgrace. The Chief of IT was reported to have said, that if the market had just moved the other way, the programmer would have been a hero...
There is no God, and Dirac is his prophet.
I worked for a college and recent graduate job search site. Users who hadn't logged in in a long time were archived. The code I wrote that re-activated those users if they happened to log in truncated (rounded down) their GPAs. Those users went from having a GPA of, say, 3.9 to 3.0, for several hours. When I found out I restored the GPAs from backup, but the intervening time was very harrowing, thinking that I could have potentially impacted the careers of lots of people.
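A small illustration of how that kind of bug tends to happen. The post doesn't say what the actual code did, so treat the integer conversion below as an assumed cause, and both function names as invented for the example.

```python
from decimal import Decimal


def reactivate_truncating(archived_gpa: str) -> int:
    # Assumed failure mode: passing the value through an integer type
    # silently drops the fractional part.
    return int(float(archived_gpa))


def reactivate_preserving(archived_gpa: str) -> Decimal:
    # Keeping the value as an exact decimal leaves 3.9 as 3.9.
    return Decimal(archived_gpa)
```

A round-trip test (archive a user, re-activate, compare every field) is the sort of check that would flag the first version before it ships.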
Back in the mid 80s, I was fortunate enough to get my first programming job. I worked with an incredibly capable programmer, let's call him Dr. Bob. I learned a great deal about programming from kindly Dr. Bob - he was a whiz at PDP11 and VAX assembly coding, and a great mentor. One day we came back from lunch and he picked up his mail and messages from the department secretary on the way to his desk. He opened one of the envelopes he'd gotten, read the letter within briefly, then started cursing like a sailor and threw the letter in the trash. He stalked off in a rage. I retrieved the letter and saw it was a page from a phone book, with the name "David Alexander" circled. After a couple hours, when Dr. Bob had calmed down, I told him he had to tell me what was going on. It turns out that his very first assembly language programming gig had been at the local University. It involved managing the data for a planned 50 year long psychology experiment, tracking the names, addresses, and project info for all of the participants over time. Now this was the mid 70s, so there was no database, just a bunch of tape files and MACRO programs to do the updating and reporting. Dr. Bob really liked the work, and the folks in the Psych Dept were really friendly, it was a great atmosphere. One day, Bob made....the Big Mistake. Due to some typos, he inadvertently replaced the name and address info in every record in the files with the data from the first record....David Alexander's. This was a tape database and it only went back a few tapes worth....by the time it got noticed it was too late - all the good data was gone. The long range experiment was totally destroyed since they couldn't track the participants. He had to quit in disgrace - he said what really upset him was the way the Psych Dept folks were so nice about it and didn't want to fire him.
Anyway, that's bad enough...but when his "friends" caught wind of it, they started popping up David Alexander references everywhere they could - they'd leave him phone messages from David Alexander, they'd get mailings sent to his address to David Alexander, and so forth. By the time this event I saw occurred, it had been going on for years (for all I know it still is). Anyway, due to kindly Dr. Bob's David Alexander mistake, I always check my code just a leetle more carefully than I otherwise might be bothered to - I personally don't ever want to make my Big Mistake....
Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).
I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".
Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.
Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all datasets in question. I lost very little work (only scripts were executed against these datasets and I still have all the scripts).
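One way to catch a non-recursive snapshot before leaning on it is to compare the dataset list against the snapshot list. This is only a sketch: the helper name is made up, and it assumes you feed it the name columns from something like `zfs list -H -o name -r <pool>` and `zfs list -H -o name -t snapshot`.

```python
def missing_snapshots(datasets, snapshots, snap_name):
    """Return the datasets that have no snapshot called snap_name."""
    existing = set(snapshots)
    # ZFS snapshot names have the form dataset@snapname.
    return [ds for ds in datasets if f"{ds}@{snap_name}" not in existing]
```

If the returned list is non-empty right after taking the snapshot, the `-r` was probably forgotten, and it is better to find out then than 30 minutes into the refactor.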
My real screw up?
Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.
I had become accustomed to my home system being deterministic in the order it listed things. My bad.
This is back at the very beginnings of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).
Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.
Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.
In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.
I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.
Crashed the work car, got fired.
$22M - 6 hrs of downtime for 1 application due to a corrupted DB. I typed what the vendor told me to type into sqlplus. The vendor was clueless, obviously. Took about an hour to determine the root cause, took another hour to find a real DBA (on staff), then some more time to bring him up to speed and restore from daily backups.
Over 20K workers couldn't do anything that day.
The lead technical architect (hired gun), my team, and the direct business clients who knew protected me. S-VPs in the client organization all wanted to fire someone - me. They never found out who to fire. However, I've been stuck in the same position the last 8 yrs. No promotion since.
First job out of school, working as a line tech testing/tuning RF amplifiers. Forgot to put in a 2nd 30 dB attenuator in front of an HP 435 watt meter. Put 1000 watts into a 1 mW (0 dBm) sensor. I felt terrible - the sensor was $1500 at the time, about 2 weeks of my pay. I offered to pay, but my boss said, just don't let it happen again. (I worked there for 5 years, and I never let it happen again). I wish all of my bosses were like him (I was spoiled early on in my career). Thanks, John!
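The arithmetic behind that kind of mistake, as a rough sketch. It assumes the amplifier was putting out about 1000 W and that the first 30 dB attenuator was still in line, which is one reading of the story and not something the post states outright. The function names are made up.

```python
import math


def watts_to_dbm(watts):
    # dBm is decibels relative to 1 milliwatt.
    return 10 * math.log10(watts * 1000)


def dbm_to_watts(dbm):
    return 10 ** (dbm / 10) / 1000


def power_at_sensor(source_watts, attenuators_db):
    # Each attenuator subtracts its rating (in dB) from the signal level.
    return dbm_to_watts(watts_to_dbm(source_watts) - sum(attenuators_db))
```

With both attenuators, 1000 W (60 dBm) drops by 60 dB to 0 dBm, i.e. 1 mW, right at the sensor's rating. With only one, the sensor sees 30 dBm, i.e. about 1 W, a thousand times too much.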
Back in the day when drives were expensive I dropped a 16-drive FC RAID. It was supposed to lock when I pulled it out of the rack but it didn't. At least I jumped back quickly enough so it didn't drop on my feet. Must have been between 20 and 30k.
My personal best was when I was writing the firmware for a customer's laser marker system. It was a big industrial machine that moved the laser head on a very expensive gantry using 15-pound servos that could generate ungodly amounts of torque. I had a bug in the code that drove the servos, and I issued a command to home the gantry, after which the X-axis went zipping across as fast as it would go. Wouldn't have been a problem except there was a faulty limit switch on that end of the axis, so the 25-pound laser head got slammed into the stops at what we estimated was about 100 inches per second. Totally destroyed the laser head (there's nothing more disheartening to hear than the tinkling of broken steering mirrors and seeing a cracked flat field lens as a bonus), and caused some severe mechanical damage to the rest of the assembly. Fortunately the motors shut down automatically when the temperature sensor tripped, but it wasn't fun explaining to the boss that we had to replace about $30,000 of hardware.
My favorites are those I thankfully had nothing at all to do with - where I am now, we write and maintain the warehouse management software for a very, very large snack food vendor, and we have a VPN link to all of the plants to maintain and monitor what's going on. It's happened before where co-workers haven't paid close enough attention and have connected to live plants instead of the test systems, and accidentally shut down the warehouse, which means production gets shut down too since there's nowhere to put those thousands and thousands of bags of chips until the warehouse system comes back up, and it takes them hours to get stuff restarted and settled once that happens. I don't know how much it costs, but it can't be cheap. I'm also not sure why we don't have some kind of two-factor system with a unique key for each plant to keep that from happening. [shrug]
Please stand clear of the doors, por favor mantenganse alejado de las puertas
of millions of $US dollars, who knows how much world-wide. Huge mistake;
I'm sure it claimed the lives of many and even some domestic cats as well...
I was the person responsible for green-lighting the Windows 8.0 release for x86_64 processors...
I nearly cost my employer several million by fixing a bug.
The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line, the parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, door panels, etc.) and when a stillage was full, or when a certain amount of time had passed since the first part was picked, then a label was printed, applied to the stillage, and it was dispatched over the road to the factory.
Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.
I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.
Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.
The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!
I looked back over the code, and found that there were actually two very similar bugs in the code, one of which happened fairly regularly, and one which happened much less frequently, but the same fix worked for both of them.
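For illustration, a rollover from 9999 to 0001 can be written so that it has no special case to get wrong. This is not the original code (the post doesn't show it); `next_serial` and `MAX_SERIAL` are names invented for the sketch.

```python
MAX_SERIAL = 9999  # serials run 0001..9999 and never use 0000


def next_serial(current):
    if not 1 <= current <= MAX_SERIAL:
        raise ValueError(f"serial out of range: {current}")
    # 9999 % 9999 is 0, so the wrap lands on 1 without an if-branch.
    return current % MAX_SERIAL + 1


def format_serial(serial):
    # Zero-padded to four digits, as it would appear on a label.
    return f"{serial:04d}"
```

The other lesson from the story still applies: keep the recovery procedure written down somewhere even after the bug appears to be fixed.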
Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."
Not sure if it counts as it was an Amiga 3000 and they came to my house to fix it for free.
I had a "friend" who brought over a new hard drive to get working on the Amiga. I did my best, then the system just quit. He then says, "Yep, did the same thing to mine."
Oops! Wrong terminal!
I was ssh'd into a production server and did a poweroff. Meant to run it on my own box. I didn't have authority with our host to ask them to turn it back on, and those who did had already left for the day. Probably didn't cost the company much since it was a small SaaS product, but if I pulled that stupidity elsewhere it could have.
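A common guard against the wrong-terminal problem is to make destructive commands ask for the machine's name first. A minimal sketch, with `confirm_host` being a made-up helper that a wrapper script around poweroff could call:

```python
import socket


def confirm_host(typed_name, actual_name=None):
    """Return True only if the operator typed this machine's hostname."""
    # actual_name is injectable so the check can be tested off-box;
    # by default it asks the OS for the local hostname.
    actual = actual_name if actual_name is not None else socket.gethostname()
    return typed_name.strip().lower() == actual.strip().lower()
```

Having to type the production server's name into what you believe is your own box is usually enough of a jolt to stop the command.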
Coworker was using a dialup (110 baud) modem to a computing service to do his university homework with the boss's permission. This was the way smaller businesses used computers back in the 1970s. It was also very expensive: I did a simulation of air flow through a roof structure for energy recovery to heat the building, and for 3 hours of coding, compiling and running the simulation the bill was 40 hours' pay. Coworker's use usually cost less than 1 hour's pay, but the program did not stop when he logged out, so it kept running until completion and the bill was 2 years' pay.
My first job out of college involved work for a private company performing work for the Naval Research Laboratory on a secret project that is now declassified. We were working on a pan-tilt-zoom camera system with thermal imaging to track a Navy SEAL diving in heavy fog conditions. The team at the NRL attempted to track him with hydrophonic sensors (optical fiber acoustical) to detect the popping of microscopic bubbles in closed circuit rebreather equipment. We were competing against big companies like Northrop Grumman for the contract, and we were working on a demo to show off our capabilities.
Our first demo to them of our camera system destroyed itself spectacularly by spinning uncontrollably at a high speed until it ripped its worm gear apart and caused untold amounts of financial damage to the several hundred thousand dollar unit. I was new to threading at the time and our senior developer was not very good at it either, and we had no unit testing or continuous build system of any kind.
He had decided to use a lock with a loop that had a condition that never released the lock in the laser range finder (LRF) code that tracks distance to target. I had warned him I didn't think we needed the lock as long as we allocated new memory for the data and passed it back on the main message queue, which was true. At the time I thought it was a minor inconvenience that he insisted on adding that piece of code. How wrong I was.
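A sketch of the lock-free hand-off described above: the range finder thread builds a fresh object per reading and passes it back on a queue, so the application code never holds a lock of its own. All names here are invented, and the real system was not written in Python.

```python
import queue
import threading


def range_finder_worker(readings, out_queue):
    for distance in readings:
        # A new dict per reading: nothing is shared, so nothing to lock.
        out_queue.put({"distance_m": distance})
    out_queue.put(None)  # sentinel: no more readings


def collect(readings):
    q = queue.Queue()
    worker = threading.Thread(target=range_finder_worker, args=(readings, q))
    worker.start()
    results = []
    while True:
        item = q.get()  # blocks until the worker hands something over
        if item is None:
            break
        results.append(item["distance_m"])
    worker.join()
    return results
```

The queue does its own internal synchronization, which is the point: there is no hand-written acquire/release pair for a loop condition to skip.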
It was tragically the only part of the entire program that he wrote. I ended up writing the entire software, which was silly considering I was right out of college. And, being inexperienced at the time I didn't identify that his code would have seized up our entire system. We had no chance to test his piece of code before the demo because he took too long to finish it. He added it in at the very last moment.
In retrospect I should have designed the plugin framework so that an infinite loop in one of the plugins wouldn't have been able to grind everything to a halt. But, as it was, we were pressured for time and I was inexperienced. We went back to the drawing board, fixed the issue, resubmitted our demo to the NRL and won the contract because the Northrop Grumman system failed every demonstration. However, my company ended up asking me for my resignation instead of firing me in the end because I criticized management. That salaried job only lasted me a year and it was a lesson to me in more ways than one.
USB connectors also fit neatly in RJ45 ports, and this too can lead to interesting side-effects.
lucm, indeed.
In 1996 I spilt a pint on top of a running 486 computer with an open case! Cost me about a grand which was my entire worth. :(
I temporarily ran a copper network cable out of a window to another building while our building to building fiber was being installed.
Over a weekend we had huge lightning storms. The voltages induced in the unshielded twisted pair cable hanging outside 3 floors up fried both switches on either end of the cable.
That was an $8000 mistake.
Some half-experimental but in-production code (back in them cowboy coder days) had a little "logical fault" one morning and dropped a significant number of stores' data from a retailer's database. Fortunately easily fixed, but confidence was shaken and all the morning reports were screwed as I only recovered the data around lunchtime. Cost ... ?? $20k maybe?
Back in the early 80's, I took off a little too fast in my company station wagon, and a $10k DTS Data Terminal hit the road hard. Ooops.
Not me this one, but a classic.
One Friday afternoon a Telecoms tech was checking a remote unmanned exchange. One of the checks was to measure the levels on the analog multiplexer for the trunks to the main exchange, which acted as the brains for the dumb remote.
The procedure was to plug a 6.5 mm phone jack, attached to a large fixed meter, into each channel one at a time. Unfortunately, this chap grabbed the wrong hanging jack, this one having 50 V exchange battery on it. He then proceeded to plug into each channel of the carrier system, and was mystified when there were no readings. As he plugged in the last channel, the exchange went totally silent. Whole exchange was down for 2 days.
What about the guy who sold Slashdot to Dice? :)
During a panel discussion with very senior technical leads, the question came up: "How many of you have made a $1,000,000 mistake?"
Every single one raised their hand. This was a very large semi-conductor company, and everyone had been involved in at least one instance where bad masks were made because a check was skipped or a step was botched in the design flow.
I worked on a chip design where it took six design revs to get clean masks. All five of the prior revs had avoidable (human) errors during the design and build process.
Pay me now (in time running checks) or pay me later (in NRE: non-recurring engineering costs) for bad hardware.
I once wrote a temperature monitoring system for a cargo airline flying 747s. The system would read the loadplan to determine if there was temperature-sensitive cargo onboard, then after takeoff, would send an ACARS message to an aircraft asking the ECS what the temperature was in each section of the aircraft. The rules table could be set to a different frequency of monitoring based on the exact cargo, so AVI (live animals) would be monitored every 5 minutes, pharmaceuticals every 10, etc. Once the temperature report came back, the system would compare that to determine if the temperature was within limits of the cargo onboard. Anyway, I accidentally put zero in the frequency table, and basically DoS'd 5 aircraft that were in-air carrying perishables. Realized the error pretty quickly when the monitoring system freaked out, but the data charges alone were about 30k in 30 seconds. ARINC was very nice and waived the fees though - thanks guys!
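A zero in a polling table is cheap to reject before it goes live. A minimal sketch, assuming intervals are stored in minutes per cargo code; the one-minute floor and the `PHARMA` code are made up for the example, and only `AVI` comes from the story.

```python
MIN_INTERVAL_MINUTES = 1  # assumed floor, not a figure from the post


def validate_intervals(table):
    """Reject any polling interval below the floor before it goes live."""
    bad = {code: minutes for code, minutes in table.items()
           if minutes < MIN_INTERVAL_MINUTES}
    if bad:
        raise ValueError(f"polling interval too short: {bad}")
    return table
```

Running the table through a check like this at load time turns a 30-second, 30k surprise into an error message on the ground.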
Back in the mid to late eighties I was taking a basic higher education qualification, they were teaching us COBOL using hard copy data sheets which would be entered by data entry clerks. We learned a little Pascal, some DB2. I was already coding in 6502 assembly language at this point so I thought it was a little backward. I was writing self modifying code and they wanted me to write out programs with a pencil on data entry sheets.
Second year of the course we got a two week work placement. They put me with a financial services company that specialised in COBOL. They wouldn't let me anywhere near the mainframe or the code, but they were magnanimous enough to let me read the report outputs but not the actual results. I made a mean cup of tea and fetched a lot of lunch till they asked me to wire a plug for an extension. I had never wired a plug before, and long story short, I wired it wrong and blew all the fuses on the mainframe and most of the IT section. I was sent home and asked to never return.
Warranty work: In the late 90's I was repairing a beige desktop Mac (early PPC), I needed to remove the logic board, and while attempting to pry up the logic board I slipped with the screwdriver, which ripped off a resistor in the process. As it was warranty work on behalf of the manufacturer (I was working for a service agent), all parties agreed it was a mistake that could have happened to any technician, so it continued to be covered.
Destroyed keyboard: I once spilt a Fanta on a white Apple keyboard, the clear plastic base with the full height keys, the last of its kind before the current flat aluminium keyboards came in.
Almost lost data: I was click happy once during the process of backing up a laptop for a staff member (planning to upgrade the OS), and instead of hitting backup, I hit erase. I was able to restore the data thanks to hard drive erasing only modifying the first block or two on the disk, instead of going to the time and trouble of erasing the entire disk.
Clicked on a remote desktop shortcut that started a second session on a paper grading server. Software on this server crashed brutally when two instances of it were running. The resulting crash blanked setpoints in the control system for the paper machine causing it to go down hard. The control system is designed to feed from the grading system to maintain a consistent quality of paper.
Three hours downtime from the crash and resulting startup issues. What made it worse was that earlier that day we'd had a similar failure and I'd given explicit instructions to other people not to do what I did.
But my boss's boss's boss.
Wasn't gonna make his numbers ($$), so he decided that attrition was his only hope for a $250k bonus. So he decided to encourage it at his third largest site, by moving everyone there (180 people, 165 programmers) to other sites (one on the East coast, one on the West).
Miscalculations: (1) two of his three most profitable products were centered there. (2) Most of the programmers were married... due to company rules, spouses would have had to transfer one to each coast. (3) A large fraction of the staff had voluntarily moved to this site for quality-of-life reasons.
Result: 155 people (including me) quit and engineering continuity was lost on several products, including the two most profitable.
I went off and found a job at another company. Two years later, my employer is bought by my old employer because they no longer had a competitive product. The week we were acquired, the old boss 'left to pursue other opportunities'.
Cost? Certainly more than $20M. Probable cost? > $200M, and the company has long since been bought out.
Back in the 70's when I was still a junior electrical design engineer working for a distribution transformer company, we used algorithms loaded into TI calculators to compute the electrical, heat, and mechanical stresses. I later got the task of modernizing those codes and merging them with a FORTRAN code that another engineer had written and abandoned because it was too expensive to run. Things went well at first, we saved a lot of time and used that as any good engineer would to optimize our designs using different parameters to reduce cost and improve efficiency, both very important to my company and its customers. Then one day we got a limiting case which we didn't recognize at the time. As usual, one of our engineering assistants used the computer generated design and the old methods to validate the design. The engineer always takes responsibility for the design. After the build, the unit, a 3 phase unit that had 76,000 volt inputs, was tested in our "hi pot" chamber - a voltage pulse of the rated voltage but with reduced current and only for a short pulse. The center core winding turned into shards of copper spaghetti in the 8 foot tall tank. It cost $25,000 to repair, and delayed delivery for 3 weeks. My heart rate hit about 200 when the engineering manager called me and my supervisor into his office. Then he explained that he had run the calculations also, and discovered that our methods had a flaw in the prediction of the axial forces on the center coil. It was a very subtle mistake, and he said it could have been much worse. We were able to revise the code within a few hours, and that incident led to further improvements in methods and automation. It also taught me my most important lesson about computers - human error is the greatest risk. Real tests of your code sometimes do "blow up".
Every change is not progress, but there is no progress without change.
Several years ago, I was deploying a new interface monitor to all of eBay's Solaris database servers. In the code I did a "netstat -i" in order to enumerate the interfaces available to collect statistics from.
Turns out, "netstat -i" reads the entire netstat table AND tries to do reverse lookups on all the IP addresses, and then just spits out the interface names. Oops.
Within 5 minutes of pushing out the new monitor, DNS for the entire company had rolled over, and all the application servers were no longer able to connect back to the DBs. It took a good 15 minutes to figure out to change the command to "netstat -in", roll up the change and push it out. Took another 10-15 mins for things to clear up.
Total cost of the outage: Approximately $800k in lost revenue.
The moral of the story: Always leave a note.
I managed to flood it with enough data that it locked up, and required a manual reset. The second and third time that I did it, the network admins were getting much faster about fixing it, but my boss told me to stop doing it.
I have no idea how much it cost ... but it was the router that fed NASA Goddard's active missions, and I was told that the Hubble folks were getting upset when it kept happening.
I didn't get fired, as I was testing to ensure that we had sufficient bandwidth for SDO data transfers. (we didn't ... and I probably didn't need to run the additional tests to prove it). It did convince them to move us over to an isolated network when we moved offices, though.
Build it, and they will come^Hplain.
I dropped a 50k sensor on the ground but it tested out fine afterward. It was used for development so if there was hidden damage it didn't really matter.
In 1978, I made a programming error on a server for a bank's teller network. The day the problem was discovered, the bank's internal cash control accounts with amounts larger than about 2 billion dollars suddenly started displaying apparently random negative balances. I was late to work that morning, which caused my bosses to suspect that I'd somehow stolen about 20 billion dollars (that was back when a billion dollars was real money). When I finally showed up around 10AM, my coworkers were trying to figure out how I'd done it and whether I had flown to Costa Rica like Robert Vesco.
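The "about 2 billion" threshold is what you would expect from a signed 32-bit counter wrapping around, although the post does not say that was the cause. A minimal sketch, assuming balances were held as whole dollars in a signed 32-bit integer (that storage format and the helper name `to_int32` are assumptions for illustration):

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary integer the way a signed 32-bit field would."""
    value &= 0xFFFFFFFF        # keep only the low 32 bits
    if value >= 0x80000000:    # top bit set means negative in two's complement
        value -= 0x100000000
    return value


INT32_MAX = 2**31 - 1          # 2,147,483,647, i.e. "about 2 billion"

# A balance at the limit is stored correctly...
print(to_int32(INT32_MAX))
# ...but one dollar more wraps around to a large negative number.
print(to_int32(INT32_MAX + 1))
```

Anything above the limit comes back negative, and the exact value depends on how far over it went, which would look "apparently random" on a teller screen.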
One of my job duties besides software development for financial systems is the design of the hardware infrastructure on which the systems are going to be deployed. I also end up doing the initial bill of materials, in which I have to give a first estimate of the cost of the final bill. Mind that these are always >$500K bills of materials and most of the time >$1mil. A lot of the projects happen in developing countries where the technical expertise is almost zero, where there are a lot of changes requested by the client or by the party who finances the project (USAID, World Bank, BERD etc.). We have to integrate the new hardware in the existing environment of the client so there are a lot of restrictions. There is always an army of "consultants" that always fucks up things with last minute changes or a badly made initial description of the existing environment. Several times I forgot to mention some hardware in the initial bill of materials, or added some incompatible hardware that had to be replaced later because some "consultant" made a copy-paste document from another project listing non-existing equipment at the client site. You cannot imagine the sheer amount of money that gets wasted this way in these big infrastructure projects. My record of "additional" costs that ended up being borne on a project by my company is around $100K. In total I think it adds up to more than my last 10 years' salary.
..for two days. The irate engineer on the phone told me it cost a million pounds a day. And all for the lack of a version check in the online maintenance manuals that we were delivering.
AC, I think.
I used to work for a small development company that did not have a proper backup policy. One day, the main hard drive of the development server started making clicking sounds. We took a backup onto another server in the network as soon as we heard the noise for the first time. The replacement disk arrived two days later. I took a fresh backup onto an empty drive that I had connected to the machine, removed the failing disk, connected the replacement disk and got ready to start transferring the contents of the backup. The supplier of the replacement disk took the clicking hard drive with them. Then it happened: instead of formatting the new hard drive, I formatted the drive that contained the fresh backup... I was facing the prospect of losing 1.5 days of the development team's work. It was pretty scary. :(
I quickly rang the guy who had taken the old hard drive away and asked him if he still had it. Fortunately, he had not done anything with it yet. He brought back the disk like right away. I transferred the data once more and we were back on track one hour later. This all happened around lunch time, so none of my bosses were there to stress me out while the drama unfolded. By the time they came back in the office everything had been sorted. I did tell them anyway. They were a little shocked but I wasn't told off or anything. The backup policy remained unchanged.
It's all the hardware designers' fault. If your hardware doesn't include perfectly labeled crocodile clip sized pins and above everything else places direct-to-CPU pins next to the power supply, it's good as fried.
I appreciate that the final hardware has to be tiny but don't expect the software guy working on the prototype boards to be some sort of manual ability prodigy.
If he were, he'd be building premium automatic clocks instead of software to work around your cheap components.
So, as someone tells you the software dude has driven 12V into the UART pin, *again*, be sure to remember it's all your damn fault.
I was tasked with a fiber cabling project for a new upstream connection at a small ISP. I documented the requirements, placed a purchase order, interviewed contractors, recommended one of them and went ahead with the project. My boss was downsized during this process, and when I informed my new boss that the cabling was completed and that his signature was required in some document in order for the contractor to be paid, he said something along the lines of "did nobody tell you that the upstream connection will not use that kind of fiber?" I wanted to die at that moment, but the fact was that it wasn't my fault - it was a consequence of the massive layoffs, the resulting chaos, and the deficient flow of information.
Valar Idiotis.
...by writing a simple page and putting it under load on a Sun E4500... which was the front end of our dot-com's website. We were only invisible to the rest of the world for a few minutes, thankfully...
Village idiot in some extremely smart villages.
/PURGE on your diskcopy is a bad choice when you've just sat the disk on top of an unstable rack shelf.... because *I* would be the only one in there and *I* would know that it's there and *I* won't forget about it or *I* won't trip over the cable.. ummm... I was wrong... about a whole 1TB of wrong.. :( data irrecoverable even by professional recovery crew.. not sacked just self-shamed. (did get budget to fix the problem - the usb was a "patch" because "there was no money".. hummm)
Around 1997 I was working for a subsidiary of British Airways at Gatwick airport, UK. The lab they had there was a token ring network and the topology was completely flat - nothing was segmented.
So it was a Friday afternoon, and we had received these printer adapters that you plug in to the network to turn a normal parallel printer into a network printer. We started plugging them in and went to configure them - but lo and behold they had already picked up an IP address (no bootp or dhcp at work though). So we were all mystified as to how they got configured - but didn't think too much of it and all went home for the weekend.
Come Monday.......
I rock up to work and there's this big hoo haa going on - apparently our little adapter things had managed to pick the IP addresses of the baggage handling systems at Gatwick. This caused delays of a few hours to all the flights - think this cost them in the region of several million pounds.
Query Analyzer open, all set to run a "DELETE FROM TABLE WHERE CONDITION", but didn't realize that only the "DELETE FROM TABLE" was highlighted. Turns out SQL Server runs the highlighted part only. Nuked the sales table. Oops. Boss was good about it. He took half the blame for not having a decent backup routine. Set up a heluva backup after that.
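One common guard against this kind of slip is to run the delete inside a transaction and check the affected row count before committing. A minimal sketch using Python's built-in sqlite3 module rather than SQL Server (a swapped-in stand-in; the `guarded_delete` helper, the `sales` table layout and the `max_rows` limit are all hypothetical):

```python
import sqlite3


def guarded_delete(conn, sql, params=(), max_rows=10):
    """Run a DELETE in a transaction; roll back if it hits more rows than expected."""
    cur = conn.cursor()
    cur.execute(sql, params)       # sqlite3 opens a transaction implicitly here
    affected = cur.rowcount
    if affected > max_rows:
        conn.rollback()            # undo the delete, nothing is lost
        return False, affected
    conn.commit()
    return True, affected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO sales (region) VALUES (?)",
    [("east",)] * 3 + [("west",)] * 50,
)
conn.commit()

# Intended statement: small, targeted delete goes through.
print(guarded_delete(conn, "DELETE FROM sales WHERE region = ?", ("east",)))
# The "highlighted part only" accident: no WHERE clause, so it is rolled back.
print(guarded_delete(conn, "DELETE FROM sales"))
```

This does not replace backups; it only gives the person at the keyboard one chance to notice that a statement touched far more rows than intended.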
Cost $120,000 (AUD but it was about at USD parity at that time). All in just a few minutes. It did influence me directly in that my bonus was smaller, but markets were good and I made most of that back over a month, so didn't hurt too bad.
I learned my lesson that thorough testing applies in every circumstance, even if the change was small.
I once blanked out one of the big boards at a major US exchange in the middle of the trading day. I had no idea until I started getting angry phone calls from floor traders. Not sure the monetary loss, but I don't think it was too bad - I didn't get fired at least.
I read an article years ago about a guy who developed the software that made transacting CDOs (Collateralized debt obligations) much easier. Basically that led to the entire sub-prime mortgage industry, which led to 2008. So I think that he wins this whole discussion.
Once upon a time, I worked for a defense contractor, which did some work for the NSA.
Obviously it wasn't just me working there, and we weren't the only ones working on this project and related projects. Still, I'm pretty sure that in my small way, I cost the civilized world a lot.
I was tasked with updating rsyslog to rsyslog5 on a whole environment of RHEL based systems. Procedurally, I was not allowed auditing access to most of the machines targeted for the upgrade, and was assured that the test environment had the same basic deployment as the working servers.
*Hah*. To update rsyslog to rsyslog5, you need to delete the old "rsyslog" package. On RHEL 5, if you don't happen to have "sysklogd" installed, yum removes *every single package* dependent on the "syslog" metadependency, so it takes out "yum" itself. Hilarity ensues, because it also takes out the daemons which have "syslog" dependencies. It never showed up in testing because most of the servers had been hand installed or imaged by the "architect" who absolutely refused to document *any* of his procedures, because "last time he bothered, the Wiki blew up".
It took down over a hundred servers and led to a lot of panicked restoration work.
I ran into the "Perforce symlink" error. A company, who shall remain named, had a very large network. The clown writing the DNS for all of this insisted on using a single large text file, whitespace instead of tab separated fields, with no verification step. When working with this file, I used it in a build system with symlinks to development or live code, as needed, so I could test it with other components.
All well and good, but Perforce lied about symlink changes. I'm not sure if it still lies: when you *changed* a symlink in Perforce, the local copy would be changed, but it wouldn't get altered in the actual upstream source control. So if you checked out that workspace again, well, you had the original symlink. The only way to reliably change the symlink was to delete the link, commit that, then make a new link and commit the new one. The result was that I made a dev workspace, checked out a clean copy and edited what I thought were files in dev, but wound up editing files in production. And with absolutely no verification tools available for production, I made a mistake, and it got pushed to prod, and I got screamed at for touching the production code.
Forgive me, father, for I have sinned.
I was a small child in a local university's SUMMER FUN COMPUTER CAMP, eighteen or nineteen years ago. Most of the classes were taught by undergrads feigning enthusiasm over HTML and embedded Java applets, but I noticed that the computers (400 MHz, if I remember correctly! State of the art!) had weird breadboards hooked up to a COM port. After the supervisors had all quit the lab for the day, I decided I was going to figure out how those breadboards actually worked, so I booted into DOS. There was no assembler anywhere to be found on the computer, so I started messing around with QBASIC's IN and OUT commands. After half an hour of effort, I managed to get some LEDs on the breadboard to flash. Success!
Then the computer shut down. Disappointed but not surprised, I tried starting it back up. Didn't work. I unplugged the computer and plugged it back in. Didn't work. Then I noticed a smell that I would, a few years later, learn to be the scent of magic smoke, so I very calmly stood up, pushed my chair in, and walked out of the lab so I could find a payphone and call my mom for a ride home. I have never told anyone this story before.
We launched a new version of a popular site at 4am and everything was working fine. It seemed a tiny bit slower than in testing, but nothing worth worrying about. However, as users woke up and started using the site, it began running slower and slower. Once things were in full swing around 9am, it was taking over 10 seconds for each page load. Our first thought was that the server couldn't hold up under the load, but it turned out it wasn't really stressed in the least.
I can't actually remember how we discovered the problem, but it turned out I had left the wrong database connection string. It was pointing to the development database, which was hosted in a different company's data center than the live site, on the other side of the country, but fortunately had the exact same data in it that the live server had. This wouldn't have been a major problem in and of itself, but someone else on the project had left about a dozen queries on each page without WHERE clauses.
So, every single page load was sending about 20 MB of data between data centers, combined with about 50,000 users hitting refresh repeatedly because pages weren't loading. Transfer overages would have been something like $5000, and we had to rush and fix dozens of queries on a live site, and then figure out how to solve the actual problem of cloning the dev database over the live one (now that users had started filling it with real data) and start using the correct server. At that point I had already been awake for 24 hours.
Fortunately, there were actually provisions in the hosting provider that allowed for misconfigurations in the first few days of service, so we didn't end up paying anything. The client was only a little bit upset that the first morning of the launch went poorly, but nothing major.
Beyond that, any major errors have only really cost me my own time (fairly frequently). I haven't been involved in anything that actually caused financial harm to anyone except myself, so I count myself quite lucky, despite having to pull a few overnighters at my own expense.
I worked for a 3PL (third party logistics) company. Years ago, they'd decided they were going to make $$$ with SaaS, basically selling our services to others. A huge undertaking had been embarked upon to make our system usable for other companies. They got a grand total of one client.
A few years later I was working there, and we got a second client! Bad news was, literally no one was still working there that had been when the first SaaS client had been set up. So there was a lot of guesswork trying to recreate it. I was a Junior Developer at the time, and was tracking down why some data loading wasn't working right. I knew the issue was almost definitely a trigger in the database, so that day I made some changes, loaded the day's data import into the Test DB, and checked if my fix worked. It didn't, so I cleared out the load, made another change, and did it again. OK, now it was kind of fixed, but there was a problem somewhere else. Wash, rinse, repeat.
I'm sure you see where things went wrong.
About the sixth or seventh time I did this, I accidentally ran it against production. I distinctly remember the panic that gripped me the moment I hit the F5 key to execute that SQL statement - I realized what I'd done immediately. The drivers (this was a logistics company, remember?) had been out on the road for about two hours at this point, and all of a sudden all their handheld devices just stopped working. Where's the next stop? As far as their handheld was concerned they didn't even have a route, much less anything on the truck. This happened for all of the Office Depot drivers in Florida. And we couldn't just reload the day either. After the initial import happened at around 1:00 am, a lot of virtual paperwork was done by humans to optimize routes and such, work that couldn't be easily duplicated.
I spun around in my cubicle and told him what I'd done immediately (I was told later I looked white as a sheet) and he assured me it'd be OK. An hourly snapshot was taken by the database. We'd lose a bit of data, but it wasn't the end of the world. He went to talk to the DB Admin.
Those snapshots? It turned out six weeks ago they'd just stopped running. Why? I don't think we ever figured out for sure, but either way they weren't there. Now everyone was panicking a bit. This was a new client we'd just picked up and we didn't want to screw the pooch. In the end, they ended up doing an emergency purchase of some software that allowed them to roll the database back using the transaction logs. Fun times.
I was trying to fix a broken backup process on an AIX box, and found that there were a ton of stuck Legato processes on the system. Rather than kill each one individually, I entered the killall command to get the correct syntax to kill all of the processes with legato in the name.
In Linux, entering killall gives you the syntax on how the killall command works. In the old version of AIX this system was using, it killed EVERYTHING with no warning and basically rebooted the box. That's usually not a big deal, except that this was the primary SAP database server for a Fortune 500 company. It took the DBAs about a day to clean up the mess.
The system was clustered, thankfully, but it probably cost about 10K in labor to clean up the mess.
I once built a Windows NT 4 system image that used an older version of a Novell Netware driver that was incompatible with the newer version of Netware that the file servers were using.
It seemed to work fine on the master system that I built, but after that image got deployed to 50 classroom computers it flooded the network with garbage traffic and caused the entire University network (about 500 computers at the time) to crash. It took the network team about two days to figure out what the problem was.
When I was 12 I put the BIOS chip from one motherboard (it was still the kind of EEPROM with pins) into another in an experiment.
Sadly I didn't know what the orientation of the pins was or what the little dot meant (pin 1) so I must have reversed them.
Put the BIOS chips back but I had fried both boards.
Working on Cisco command line, I was in the habit of typing "no " and doing a double-click-middle-click on the line I wanted to delete. Worked very well except for
(IIRC)
redistribute bgp 100 metric 100 metric-type 1 subnets route-map BGP2OSPF
In this specific case, copying the entire line after "no " does not remove the line, it just removes the route-map limitation, and hey presto I was redistributing our full BGP into OSPF. Clincher was that it took some 20 minutes for the network to actually stop working, so by that time I had totally forgotten about it. It took an hour to find out what the problem was and to correct it, during which my ISP was basically off the network.
*I had an off-by-one error in a TopCoder problem (I used > instead of >= in a loop) that I didn't catch that cost me $3000 in prize money and a trip to the finals.
*I was working at an observatory on campus and left the huge, Peltier-cooled CCD for the telescope on a table but still plugged into a computer and left for the day. When I came back, I found that someone had tripped over the cable, smashing the CCD on the floor. They then sat the broken CCD next to the computer without a note or anything. $7000 CCD destroyed.
*Another time I was working with an AFM in a basement of the university, and left for the day. It stormed really hard that night, and when I came back the next day the basement had 6 inches of water in it. It turns out that the water had come from a leak directly above the AFM. I guess the AFM didn't like getting a shower in filthy storm water and it cost $20-$30K to replace.
*However, my biggest save was probably more important than all of that combined. Without divulging too many details, I was writing some tests and caught a serious data-loss bug in production before any customers were affected by it. The bug actually made the news: http://www.theregister.co.uk/2...
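The off-by-one in the first item above is easy to reproduce. The actual TopCoder problem isn't described, so this is only a generic sketch of how `>` instead of `>=` in a loop condition silently skips one element (both function names are made up for illustration):

```python
def sum_backwards_buggy(values):
    """Walk the list from the end; '>' stops one step early and skips index 0."""
    total = 0
    i = len(values) - 1
    while i > 0:           # bug: index 0 is never visited
        total += values[i]
        i -= 1
    return total


def sum_backwards_fixed(values):
    """Same walk with '>=' so index 0 is included."""
    total = 0
    i = len(values) - 1
    while i >= 0:
        total += values[i]
        i -= 1
    return total


data = [5, 1, 2]
print(sum_backwards_buggy(data), sum_backwards_fixed(data))
```

The buggy version still gives the right answer whenever the first element happens to be zero, which is exactly why this kind of mistake can slip past a quick test.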
I'm the amateur programmer who first programmed the code for Lawrence Lessig's Mayday PAC. I don't know if you remember this, but the site went down on May 2, for about 8 hours, when we were raising roughly $10,000/hr. I had built everything on a LAMP stack and sent everything through a single MySQL database, which just didn't scale. (I was - and still am - an amateur). Luckily, pro developers stepped up and staunched the bleeding, and eventually we moved onto a Ruby-on-Rails system for the front-end and a NodeJS/Google App Engine solution for the backend.
Back when Linux was much more primitive I had to set the video monitor parameters by hand coding configuration files. And, by accidentally over-specifying the maximum sync rates, I "smoked" the flyback (horizontal output) transformer in a new 21" Sun monitor in short order. I typed in one wrong number and $$$.
Somebody unhooked the cable from inside a cabinet to a spectrum analyzer I was trying to use to monitor a signal I was setting up to a satellite. I thought something was broken and was messing around with the controls to see if anything happened. I finally found the cable wasn't connected about the same time the satellite controller came across screaming that I was about to burn out the satellite. I didn't, but it was a very close almost. When I plugged in that cable there was a huge spike on the screen.
Three people, working independently, made errors in programming and website updates which nearly bankrupted United Airlines when the errors came together on September 8, 2008. "Shares fell to about $3 from more than $12 in less than an hour, wiping more than $1 billion in value before trading was halted.".
When the market first opened that Monday, United Airlines was trading at over $12 a share. The public summary of the events states that the Chicago Tribune re-indexed their archives, resulting in a six-year-old story about United Airlines' bankruptcy being re-posted on the Web site of The South Florida Sun-Sentinel without a date. Google picked up the "new" article, saw the missing date, and inserted the current date of 9/8/2008. That article was picked up by a research firm, Income Securities Advisers, which then posted a link to it on a page on Bloomberg News, which sent a news alert based on the old article. The news alert triggered automated trading systems to issue sell orders. Nasdaq finally ordered a halt in trading the stock at 11:08 a.m., but the damage had been done: United Airlines stock had lost 75% of its value.
I do not deploy Linux. Ever.
Underestimating time needed happens all the time in the software industry. It probably is worse in the gaming industry where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my job now where I've put in ~100 hour weeks to fulfill (telecommuting many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160 hour weeks in the office for a game release crunch (and no, that isn't all work - I slept on beanbag chairs in the testing room and they catered in meals, but at some point you're just so burned out and stinking of feet that you need a night sleeping at home and a long shower).
I can't think of any instance where I've cost a project, but I'm sure they exist. OTOH, I did have a workaround for a $5 million dollar contract where the customer was going to reject our Linux port due to a bug I found and reported. The developer and pubs person assigned the defect were laid off after 9/11 so the defect slipped through to the customer. Fortunately, I overheard a sales person talking about it and supplied the workaround, saving the contract.
A number of years ago someone else used my PC and opened a window to the production server using my test server background colours.
I dropped an index on the production passport server by mistake.
I panicked, and ran the re-creation script.
That locked the table.
For 30 minutes they closed all the customs and security stations in all the international airports in my country.
I didn't get fired, but I did learn to lock my PC every time I stood up instead of waiting for the screensaver to do it for me.
I work for one of those well known companies with many, many computers. I managed to turn about 130,000 servers into glorified space heaters for about a day.
We fixed the problem, identified the real system issue that allowed a simple mistake to have such catastrophic effects, and wrote code to prevent it from happening again. Because that's what mature companies and teams do instead of firing the one who just happened to be the one pushing buttons that day.
My second big one at a different shop was losing the private key for our internal CA for our financial software package. I had emphasized the consequences of losing the key, but failed to follow up with making sure my team took appropriate steps to protect it, including putting backup copies in appropriate places. Oops.
About a 15 minute outage at roughly $75,000 revenue/minute. So about a $1.1 million typo. Result - still working there and several promotions.
When forced to pull new features on an all-nighter for the demo taking place the next morning, I accidentally committed an empty file to version control around 3am or so, which broke half of the application in front of the potential customer later that day.
The potential customer didn't sign the 30k contract.
I wasn't fired but left some time after that for a job with fewer all-nighters.
So somebody forgot a hyphen in the "computer code instructions" back in 1962 and it cost NASA $80 mill back then, equal to $630 mill or so today. According to this site:
http://priceonomics.com/the-ty...
I applied for and was accepted to my dream job. I had wanted to work there since as early as I can remember. I was told to follow the orders of the higher ups and to never rock the boat. My first team assignment was to help find a piece of missing equipment. We split up into groups and searched around asking people if they had seen it. We saw two guys with machines matching the descriptions and stopped to ask them. I was convinced they were the correct machines but my more senior co-worker talked with the older man and assured me they weren't. I thought he was wrong, but it was my first day so what did I know? In my defense, the protective gear we had to wear was very constrictive. I couldn't hear their conversation, but I could see the documentary crew across the street and didn't want to look foolish.
To make a long story short, I eventually learned those were the droids we were looking for. I don't know the cost, but my mistake led to the entire collapse of the Empire. All I can say now is that I'm really glad I was wearing the protective gear. The video was released but no one could identify me. The work assignments were later destroyed when the server overheated. Some idiot forgot to put a grill in front of the exhaust pipe to prevent any back-flow. I don't think anyone can identify me, but I've tried to stay out of sight anyway, that's why I'm posting AC.
When I was a junior programmer working on a mainframe, I was given a problem ticket for an intermittent issue. I stuck diagnostics into the code, but because my disk quota was far too small, I sent the output to a virtual printer that I looped back to my account. Unfortunately, after I got the whole testcase set up (couple hours) the mainframe crashed and I went for coffee along with the rest of the 300 users on the system, for the 10 mins it took to restart. After several days where I hadn't been able to make progress because of the suddenly frequent mainframe crashes, I got a message from the operator asking me to delete my large spool files, since the mainframe was crashing due to a lack of spool space. That's when the penny dropped that my testcase had been exhausting the system spool space, crashing the mainframe about 8 times. Probably $100,000 in lost labour.
Years later, working on extending some high reliability software, I found some bugs in pre-existing code. The system had some internal checks and watchdog timers that would force a restart if it thought some code was taking too long. Both bugs would trigger the restart system by making something take too long and triggering the watchdog timer. One was in very complicated code, but explained some intermittent issues we'd seen over the years. The other was in a newly released, still unused utility, that didn't work properly on old HW, but would need to be re-written to fix. I only had time to fix and test one bug before going on a month long vacation, so I fixed the complicated one. While I was on vacation, an alpha release of the product went out, and promptly started crashing intermittently with stack corruption issues. I got back, to find six such tickets on my desk. In the meantime, the broken utility had acquired some users, so I decided to spend a couple of days fixing the utility.
It turned out that the stack corruption issue was holding up the production release, worth many millions of dollars.
Of course, I wasn't able to reproduce the intermittent stack corruption.
I spent 3 weeks looking everywhere, trying anything to reproduce it, resorting to rebuilding the alpha load where I could sometimes reproduce it, but not if I loaded my diagnostics.
Meanwhile, management was getting very antsy about the revenue implications.
My boss was very good, and shielded me from the flames, but I didn't like seeing him getting fried, as the release date kept getting pushed.
I tried hunting around to see if anyone had been changing code in that area of the system, but of course, there were only my updates. I asked anyone I could find for suggestions, and nobody had any ideas until one person said it reminded them of one very old issue they'd worked on, and described the problem they'd had.
I went back and checked my archived output. Sure enough, I'd been a bit careless testing the broken utility before fixing it. I only checked that my testcase triggered a restart, not why. It turned out that long before it could trigger the watchdog timer, the utility corrupted the stacks of other processes.
I'd just spent 3 weeks holding up an important release, because I didn't realize I'd already fixed the bug.
Most people don't realize that 100/full forced on one side and Auto on the other does not give you 100/full on both ends. Per the spec, the forced side stays at 100/full and the Auto side falls back to 100/half, which is a duplex mismatch. I've seen that problem many times.
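For anyone who wants that spelled out, here is a minimal sketch of a simplified model of Fast Ethernet link setup. It assumes that a port left on Auto can sense a forced peer's speed but cannot learn its duplex, so it falls back to half. The names `AUTO`, `settle`, and `link_state` are made up for illustration and are not any vendor's API, and real hardware has more cases than this.

```python
AUTO = "auto"  # hypothetical marker for a port left on auto-negotiation


def settle(local, peer, best_common=(100, "full")):
    """Return the (speed, duplex) that `local` ends up using.

    Each side is either AUTO or a forced (speed, duplex) tuple.
    `best_common` is an assumed best mode when both sides negotiate.
    """
    if local != AUTO:
        # A forced port ignores the peer and uses its configured mode.
        return local
    if peer == AUTO:
        # Both sides negotiate and agree on the best common mode.
        return best_common
    # Peer is forced and sends no negotiation data: the auto side can
    # sense the speed, but has to assume half duplex.
    speed, _duplex = peer
    return (speed, "half")


def link_state(a, b):
    """Return (mode of a, mode of b, True if duplex differs)."""
    end_a = settle(a, b)
    end_b = settle(b, a)
    return end_a, end_b, end_a[1] != end_b[1]


print(link_state((100, "full"), AUTO))
print(link_state(AUTO, AUTO))
```

In this model the first call reports a mismatch and the second does not, which is why the usual advice is to configure both ends the same way.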
Learn to love Alaska
Why was there no ILO/BMC/etc? Easy to fix remotely.
...a second too early. Worked as broadcasting engineer and cut short a commercial by one second. Lucky me, that was in the middle of the night, so the damage was not that bad. As you may have guessed, that was in Germany and quite a while ago. When I watch commercials on US TV they get cut off constantly, seems as if the ad customers are more forgiving here. Working as broadcasting engineer was awesome except for the craptastic hours and the constant stress of not being allowed to make even a tiny mistake.
The implications of deciding one way not the other were a million dollars worth of ironmongery (9.925in OD liner pipe) being run and cemented into the hole. That operation occupied a rig crew of 90-odd people for 8 days while I was on leave. When we drilled ahead, it became clear that I had been wrong. Total unnecessary cost was about 2 million dollars.
These days, I don't lose sleep for less than ten million. The fact that I still do work for the client suggests that they figure it's better to have me around than not.
A couple of years ago I got some grief for pointing out a problem on day 10 of a job, which people upstairs from me decided wasn't likely to be a problem. So they shelved the problem, told me in writing to shut up, and continued with the well. 3 months of work later, we'd made a beautifully-tuned geo-steered well ... and had to wait on weather for a major storm. And when we came back on location, the problem I'd been making a fuss about had come back to haunt us and forty million dollars worth of ironmongery and effort was junk. Several embarrassed faces upstairs, but all my fellow contractors knew who had said "We need to deal with this problem, now." when we were five million into the project. Who needs advertising?
Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Found a gaping goatse-sized security vulnerability in a package that had been outsourced and the original contractors long since gone.
It was less expensive to just kill the product which we had been selling for about 3 years than to re-engineer the thing from the ground up with new staff.
two stories actually:
1. Half a class C was supposed to be scanned, a whole class C was scanned - knocked down ALL the printers in ___ National Railways. I took the phone call, she was hot, demanded her money back, they didn't get it. Needless to say, they did not renew.
2. Someone downloaded porn using a customers network. I took the call. He was fired before he got back.
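The first story is the kind of thing a scope check before the scan can catch. A minimal sketch using Python's standard `ipaddress` module, assuming "half a class C" means a /25 and "a whole class C" means a /24. The function name `check_scan_scope` and the example addresses (taken from the 192.0.2.0/24 documentation range) are made up for illustration.

```python
import ipaddress


def check_scan_scope(requested, authorized):
    """Refuse to scan anything outside the range the customer approved.

    Returns the number of addresses that would be scanned.
    """
    req = ipaddress.ip_network(requested)
    auth = ipaddress.ip_network(authorized)
    if not req.subnet_of(auth):
        raise ValueError(
            f"{req} is not inside the authorized range {auth}; refusing to scan"
        )
    return req.num_addresses


# Approved range: half a class C (hypothetical addresses).
print(check_scan_scope("192.0.2.0/25", "192.0.2.0/25"))

try:
    # Someone types the whole class C instead.
    check_scan_scope("192.0.2.0/24", "192.0.2.0/25")
except ValueError as err:
    print(err)
```

The point of the design is that the authorized range is written down once, separately from whatever gets typed into the scanner on the day.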
I've never heard of somebody *heating* a drive to recover a stuck head, but I've done the opposite.
Many a drive has been recovered by a day or two's stint in the freezer in a deflated ziplock bag. I'd imagine the principle is the same.
With cooling, you do have to watch out for condensation build-up as the drive defrosts. With the heating I'd worry about damaging the data on the disk (magnets in general do not like heat, so I'd imagine magnetic storage would similarly be a gamble).
Assumed RAID 5 meant backups were unnecessary.
I was wrong... ate humble pie and company paid around a grand for data recovery.
Live and learn....
Back in the very early 80's I was the technical one of my family, and my father bought us a very early edition APCO pc (Apple Compatible IIe) which had 64k of memory. We even upgraded it with more memory (which came in a plastic tube full of 16 pin chips).
I mistakenly decided one day to open it up and remove one of the expansion cards without powering it down first. Shorted out the entire motherboard in the process, and lost all the BASIC and INTBASIC coding I had done for the past year (and with no one else owning anything compatible near me, this work was lost forever). Probably lost the family a good $2500 (not sure what the list price was back then).
Needless to say, I was very mad (and as a 9 year old, this led to some very sad weeks) but my father came through and next month, we were upgraded to an IBM PC, and that was the last Apple product I owned until 2009.
I am wondering why nobody has posted about the Therac-25 radiotherapy machine. Its software delivered the wrong radiation doses. Result: six known overdose accidents, several of them fatal.
I once worked a co-op job at a company that had a mammoth database system that literally drove their entire business. Literally every line of business, from HR through to procurement, was custom built into a single mammoth database. And this was no small business, we're talking thousands of employees across 20-30 locations.
Anyways, somehow, me, the lowly co-op student, managed to accidentally log into the production database, and DROP a table.
Needless to say, this threw me into a panic once I saw what I had done, and I immediately owned up to the error and ran to the DBA team to tell them what I had done. Thankfully, the company had a good DBA team, and they were able to perform some wizardry to undo the damage.
I wasn't fired, and they actually praised the DBA team for being able to recover the issue so quickly. Needless to say, security was tightened up, and I was never allowed near a production asset again lol
I'm writing firmware today that stores the date as a 16 bit unsigned integer giving the number of days since 1/1/2000. When printed it is converted to an 8 bit unsigned year and formatted with %02u (2 digits). I'm well aware that this will fail on 1/1/2100, but... I'll almost certainly be dead and no-one will be running this code in 85 years' time, surely...
I'm starting to feel bad about it now.
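A rough sketch of how that rollover would look, written in Python as a stand-in for the firmware's C. It assumes the 8 bit year is the offset from 2000, and uses `%02d` where the firmware uses `%02u`. The names `EPOCH`, `encode_days`, and `format_year` are made up for illustration; the real firmware may well differ.

```python
from datetime import date, timedelta

EPOCH = date(2000, 1, 1)  # day 0 of the 16 bit counter


def encode_days(d):
    """Store a date as days since EPOCH, as a 16 bit unsigned value."""
    days = (d - EPOCH).days
    if not 0 <= days <= 0xFFFF:
        raise OverflowError(f"{d} does not fit in 16 bits of days")
    return days


def format_year(days):
    """Print the year the way the post describes: 8 bit value, width 2."""
    year = (EPOCH + timedelta(days=days)).year
    offset = (year - 2000) & 0xFF  # assumed: 8 bit year is offset from 2000
    return "%02d" % offset  # width 2 is a minimum, not a maximum


print(format_year(encode_days(date(2099, 12, 31))))
print(format_year(encode_days(date(2100, 1, 1))))
print(EPOCH + timedelta(days=0xFFFF))  # last day the counter can hold
```

Under these assumptions the day counter itself keeps working long past 2100; it is only the two-digit print field that overflows into three digits.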
I would not be so sure about the code not still running in 2100. Way back in the 70's I was updating existing code and realized that the coding for the date would not work in year 2000, so I modified the code to work in 2000, even though I thought it would be long retired by 2000, which seemed forever away then. Well the code was still running in Y2K and after. Also I happened to have the job of monitoring that code as well as many other programs on Y2K up all night, and it worked. I was glad I had taken the time to fix it.
I am retired now though.
Since I write software that writes software for machine tools, I have extra opportunities to break things.
There's a technology called Electrical discharge machining, which means putting stuff close together in a fluid, running current through them, and having sparks burn off little pieces of material until you've got what you want. One manufacturer makes machines that have sophisticated programming, but it's not at all safe. Once, with the support guy from the company we got these from looking over my shoulder, I made a slight mistake that caused the arm of the EDM machine to slam against the metal we were machining, for a $16K repair.
Another time, a variable contained a Z level (height) that was used for two different things, but for everything we'd done up to then the two different things shared the same value. I was the guy who made the change that made the difference significant, and so some of our CNC mills thought the metal being machined was significantly lower than it was, so the setup moves for the machining that assumed the endmill was moving through air tried slamming through the metal. Some of the results were spectacular, although I never did find the cost.
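One common way to guard against that kind of shared-variable trap is to give each meaning its own named field and check the relationship between them. A minimal sketch follows. The post doesn't say what the two uses of the Z level were, so the split into "top of the stock" and "clearance height for moves through air" is purely a guess, and `SetupLevels` and `rapid_move_is_safe` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupLevels:
    """Two Z levels that used to share one variable (hypothetical split)."""

    stock_top_z: float  # assumed meaning 1: top surface of the metal
    clearance_z: float  # assumed meaning 2: height where moves are in air

    def __post_init__(self):
        # Rapid moves at clearance height must be above the metal.
        if self.clearance_z <= self.stock_top_z:
            raise ValueError(
                f"clearance {self.clearance_z} is not above "
                f"stock top {self.stock_top_z}"
            )


def rapid_move_is_safe(levels, move_z):
    """True if a rapid move at height move_z stays at or above clearance."""
    return move_z >= levels.clearance_z


levels = SetupLevels(stock_top_z=10.0, clearance_z=15.0)
print(rapid_move_is_safe(levels, 12.0))
print(rapid_move_is_safe(levels, 15.0))
```

With separate fields, a change that makes the two values differ shows up as a new number in one place rather than a silent change in the other.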
Fortunately, at least for my self-esteem, people more experienced than me were supervising each of these mistakes, so I didn't feel too stupid, and my colleagues were very understanding.
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
A co-worker of mine had just finished implementing a new caching system for a legacy app that interfaced between multiple systems and the mainframe to track progress and shipping of pilot production runs. Due to a bug in his code, in a very specific use case, one of the cached systems would not get flushed. This was identified a few days after the production release when the company (a multi-billion dollar food sciences multi-national corporation) received a phone call from a Pastor in BFE, Minnesota asking why we had sent him almost 500 gallons of ice cream. Apparently, his church's address was in the system from some charity event we had sponsored. Since the ID and business type didn't flush from the previous transaction, when the pilot plant told the software to print labels for the next order, it pulled the shipping address from the wrong database and the ID just happened to collide.
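The general shape of that bug, and one way to make the flush hard to forget, can be sketched like this. It is an illustration only, not the actual system: `OrderContext`, `transaction`, and the key names are all made up, and the real code surely had more moving parts.

```python
from contextlib import contextmanager


class OrderContext:
    """Hypothetical per-transaction cache of looked-up values."""

    def __init__(self):
        self._cache = {}

    def get(self, key, loader):
        # Only call the loader when the key has not been cached yet.
        if key not in self._cache:
            self._cache[key] = loader()
        return self._cache[key]

    def flush(self):
        self._cache.clear()


@contextmanager
def transaction(ctx):
    """Flush the cache when the transaction ends, even on an error."""
    try:
        yield ctx
    finally:
        ctx.flush()


# Without a flush, the second order silently reuses the first address.
stale = OrderContext()
stale.get("ship_to", lambda: "church")
print(stale.get("ship_to", lambda: "pilot customer"))

# With the wrapper, each order starts from an empty cache.
ctx = OrderContext()
with transaction(ctx) as c:
    c.get("ship_to", lambda: "church")
with transaction(ctx) as c:
    print(c.get("ship_to", lambda: "pilot customer"))
```

Tying the flush to the end of the transaction means no code path has to remember to call it.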
The cost of shipping the ice cream back for disposal was ridiculous. So the company told the Pastor to have a huge ice cream social.
The responsible developer was not fired, but there were running gags about him being the Ice Cream Man for the next year.
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
While examining a new client's server and checking the network, my hand slipped at just the wrong moment and I disabled the LAN on the server.
Lost about an hour's work times 3 or 4 web-designers.
Dad never made good computer purchases:
1. Bought a used TRS-80 with dual floppy drives and a lot of "software". The drives were NOT Tandy drives and eventually failed, and all that software turned out to be labeled blank disks; the seller kept all the original disks.
2. For his 50,000 sq foot retail store, he decided, on advice from his major vendor of products (under whose trade name he ran the business), to buy a new computer system so he could do just-in-time inventory management, back when that was all the rage in retail. He ended up with register systems with laser scanner wands like JC Penney was using at the time. The computer was a WANG minicomputer of some kind, and he contracted and paid $10 grand to a local programmer to code the software to make it all work together, including the interfacing to his vendor's ordering system. You guessed it: the coder never got the software finished, and in about 4-5 years a higher end PC could do the work of the minicomputer for a few grand off the shelf.
3. After retirement I gifted the parents a PC I built from castoff parts from my own gaming PC, which dad promptly re-gifted to a local charity. He then bought a new PC with Win8 on it, and the tile screen acts like kryptonite on him: he can't get past it.
Instead of deleting an old snapshot of a virtual machine I deleted the actual virtual machine datastore (they were named the same) - lost a day's worth of accounting data. The chief accountant wanted to kill me but I survived.