Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?
NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!
But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
I was in charge of ordering a leak correlation system for a water utility that I work for. The system I chose was not quite what we needed, but worked. One week after the warranty expired, I dropped the correlation unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.
I unplugged the wrong thing in a datacenter once which took 20k domains offline. Traced the cable from the machine to the wall 2 or three times before pulling too..
They didn't have any cable management and only one border router..
Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.
I cost our Asian office a day's work after I failed to verify that a deployment completed successfully.
The deployment was done on Friday evening US time, which would have been around 1 or 2am UK time. I couldn't be bothered to stay up for that so figured that I'd check in the morning.
Naturally I forgot to do that.
Throughout the weekend whenever I was out, I'd suddenly remember and think "I'd better check that when I get back in."
Naturally, I forgot to do that.
On Monday morning, I received a lot of phone calls and emails asking where I was and to get into the office ASAP. When I got in, I found out that the deployment had failed and the rollback scripts that I'd asked the team to run had not been run.
After a lot of frantic phone calls, we found a DBA in the Asia office who still had database access to the Production servers and he rolled the changes back.
By then however, Asia had lost a whole day of work and I was given a written warning by my manager.
It's still a running joke amongst my friends that I "took out all of Asia for a day". And if I ever interview and I can see it's going badly, I tell this story in response to the "What's your weakest asset" question, just to see the look on their faces.
Heh - would have to total all that up... sigh... but it still works!
Mark
I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.
I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.
Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.
I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transitioned over the final rsync would be faster.
So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.
Next day everyone is working happily, everything is working smoothly, no worries.
Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.
So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.
Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).
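For anyone wondering what a guard against this could look like, here is a minimal sketch in Python, not what was actually running on that server. It assumes the nightly job is a small wrapper around rsync and that a marker file (the name and paths below are hypothetical) gets dropped on the old machine at cutover. Once the marker exists the job refuses to run, even if someone powers the box back on.

```python
import subprocess
from pathlib import Path


def nightly_sync(source, dest, retired_marker, run=subprocess.run):
    """Run the nightly rsync unless the old server has been marked retired.

    retired_marker is a hypothetical path created by hand at cutover.
    `run` is injectable so the guard can be exercised without calling rsync.
    Returns True if the sync ran, False if it was skipped.
    """
    if Path(retired_marker).exists():
        # Migration is done: copying old data over the new server would
        # clobber whatever people have written since the cutover.
        return False
    run(["rsync", "-a", source, dest], check=True)
    return True


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ran = nightly_sync("/home/", "newserver:/home/", "/etc/MIGRATED_DO_NOT_SYNC")
    print("sync ran" if ran else "sync skipped: server retired")
```

Removing the crontab entry at cutover is still the real fix; the marker only covers the case where the entry gets forgotten.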
In the land of the blind, the one-eyed man is kinky.
In 1993, I failed to file the US Patent on "A means of accessing a relational database via the Internet." If we'd known we could do it, CompuServe might still be around.
About $2 Trillion.
I worked for the Florida Electoral Commission back around 2000.
I have been part of a large mistake costing hundreds of thousands of dollars.
However, most mistakes are part of a chain of little mistakes that combine into a big one. For example, someone happens to trip over a cable and unplugs a production server. Then come the questions: why was the cable out where it could be tripped over, and who decided that it wasn't worth the time and money to get a better system of cable management...
Normally a person will get fired for a mistake if it was due to intentional misconduct or it happens to get political and needs someone to blame, however if it happens you need to be sure that you put the blame back on the system (not an individual), then you will need to follow up to fix the system so it doesn't happen again.
Most of the most expensive mistakes are due to a huge chain of events. A good system should be in place to stop a simple mistake from escalating into a big one.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
When I was 12 years old and hanging out on BBSs in 1989, I didn't realize dialing Gilroy from San Jose was long distance (Both were 408 area code). My parents were not pleased at the nearly $500 phone bill.
As High Proctor of Fahz, I once led my whole species into unrelenting suicidal despair when during the Chinz-Rahl celebration I passed our Ultron onto Chief Groo, who was not prepared to hold such a heavy object and dropped it.
My Mask of Ultimate Embarrassment and Shame is not enough to express the deep chasm of depression into which I sink.
I maneuvered downward the left button of the mouse attached to the computer I was working on which pointer was right on a small gif saying "Send" that technically sent a message I should never have sent. Cost me a lot.
Slashdot, fix the reply notifications... You won't get away with it...
Not me, but a friend. In high school the best computer in the school was a 386SX. They decided to upgrade it to a DX by adding a maths co-processor to the main board. So they ordered one, and when it arrived, they gave it to my friend to install for some reason. Now, the chip had one corner cut, which you are supposed to line up with the cut corner on the socket, so you know it's seated the right way. Of course, my friend put it in completely backwards (because it fit in any direction.) So he tries to boot up the computer and nothing happens. So he looks at it again, and realizes the chip is in backwards. So he turns the box off, pulls out the co-processor, rotates it 180 degrees and puts it back in the socket. Unfortunately, powering it up in the wrong orientation had toasted the chip completely, and when he put it into the socket in the correct orientation, the socket locked itself shut, as it's supposed to do. But, since the chip was fried, this effectively locked the motherboard in an unbootable configuration with a dead chip. Sigh.
- In Soviet Korea, only old people loose all their bases to Natalie Portman's petrified hot grits overlords.
Dropped and broke a $40k USD Symantec Gateway Security Appliance
I made a calculation error that cost $10k per day. Took 9 months to straighten things out.
I later won an award for outstanding work.
Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
- rounding error when programming a timer in an embedded system, resulting in a baud rate to be 10% off, causing problems with several units shipped to customers
- overflow of an 8-bit counter, resulting in a serial protocol failing
Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
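Both of the bugs listed above are easy to reproduce on paper. A rough sketch in Python follows. The clock, baud rate and 16x oversampling are made-up values chosen to give roughly a 10% error, not the numbers from the actual product. The first part shows how truncating a timer divisor instead of rounding it throws the baud rate off. The second shows an 8-bit counter wrapping and a wrap-safe way to compare two readings.

```python
def divisor_truncated(clock_hz: int, target_baud: int, oversample: int = 16) -> int:
    """Timer divisor computed with plain integer division (rounds down)."""
    return clock_hz // (oversample * target_baud)


def divisor_rounded(clock_hz: int, target_baud: int, oversample: int = 16) -> int:
    """Timer divisor rounded to the nearest integer."""
    step = oversample * target_baud
    return (clock_hz + step // 2) // step


def baud_error_percent(clock_hz: int, target_baud: int, divisor: int,
                       oversample: int = 16) -> float:
    """Relative error of the baud rate the divisor really produces."""
    actual = clock_hz / (oversample * divisor)
    return 100.0 * (actual - target_baud) / target_baud


def next_count(counter: int) -> int:
    """Increment an 8-bit counter: 255 wraps to 0."""
    return (counter + 1) & 0xFF


def elapsed(old: int, new: int) -> int:
    """Ticks between two 8-bit counter readings, correct across one wrap."""
    return (new - old) & 0xFF


if __name__ == "__main__":
    clock, baud = 1_000_000, 6300  # hypothetical values
    for name, div in (("truncated", divisor_truncated(clock, baud)),
                      ("rounded", divisor_rounded(clock, baud))):
        print(name, div, f"{baud_error_percent(clock, baud, div):+.2f}%")
    # A naive "new > old" check fails at the wrap; the masked difference does not.
    print(3 > 250, elapsed(250, 3))
```

UART links usually tolerate only a few percent of baud mismatch, which is why a single off-by-one in the divisor is enough to break communication.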
Lost a slide for 3rd party client that was to be featured in a skateboarding magazine.
I think one of the coworkers stole it as I did not get along with them.
Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
Was fired not long after.
Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly, however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.
So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).
Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
Life, the Universe, and Everything... in my image.
First one, I was lucky... there wasn't a switchover to a new database yet, and I made sure to schedule a large downtime window, because I try to do like Scotty... take the time I think will fix something at the worst, then double it. If the PHB gripes, start into detail. A side effect is that users tend to be happy when stuff is back up earlier than planned.
Well, this was a two node HA cluster back in the day where a certain vendor had a passive node and an active node configuration selling for an insane amount. They were connected via serial connections for heartbeats.
Well, it was time to do a simple update of the machines. I staked out 24 hours, just because I wanted to do backups first.
Well, I did the sysbacks, so I had two tapes of the entire boxes.
Ran one set of updates on both machines, rebooted... all fine. Noticed there was a drive array microcode update... just a 0.0.x update. Well, I tossed that on and rebooted... Well, both boxes blew their kernels. All the data on their drives was gone, because the microcode patch got the array in such a state that one machine started writing garbage to all drives.
At least I was able to restore both machines and build the shared data from the tapes.
The second one would have been just as bad. I was cleaning out a source code tree of .o files and executables... came to find one dev had libraries that were only present in binary only format, and whose only backup was in the tree (where the backup program excluded all binaries for space sake.) Thankfully, the tree was on a NetApp, and a simple copy from a snapshot fixed everything. Were it on another server, I'd have had Hell to pay.
digital signal processing chip from TI. The $750 (in 1986 dollars) wasn't the big deal. That the parts had serial numbers hand-lettered on them and I had to go back on the waiting list to get a replacement was.
A long time ago on mainframes. IBM 3083's and VAX's. I was running analysis on some waveform data, took probably about 20 reels of mag tape. Fucking marine seismic data. I sent the big deck of cards down to the floor on a Friday. 1st thing Monday, I had to go the VP's office. He explained that Monday morning, the fucking job was still running. Turns out, instead of sampling the data every 4ms, I accidentally sampled it every 2ms. Back then, you didn't own your mainframes, IBM leased it to you. The VP explained that I cost the company anywhere from $40-60k. Nice guy actually. Texas engineer, cowboy boots and a suit. He politely asked me, "Son, you probably won't be making this mistake again, will you?" I stuck around for another couple of years. Goddamn it took an army to process data back then.
"He's using a quantum encryption scheme! That'll take hours to break!"
Long before Amazon was ever more than a bookseller in the mid 1990s, a friend and I had this idea of a website that would allow for comparison shopping pulling data from other sites allowing folk to buy the cheapest electrical items possible
We never progressed because we couldn't see any way for it to make money. We had no idea that was the absolute last thing we should have cared about.
So now I'm here, an anonymous coward posting about our total lack of foresight and imagination, and not some rich fecker who owns real-estate like /Slashdot
I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.
I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary), and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine, basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell, performance dropped through the floor and customers started calling in about trades timing out. Long story short, turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet"
Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at an (also doomed) dot-com online supermarket.
On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.
I do not deploy Linux. Ever.
i used to insert the cartridges too hard and broke it to the point where i had to spend 15 minutes playing with it every time i wanted to play a game
obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC card disk drives in the switches that were the backbone of the Internet when we kept the vendor engineering on the phone all day for a failed switch... and read the duty cycle of the drives to them, like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.
but I got a haircut indeed when we had to get our stuff out of a colocate that was shutting down. built a mirror data system for that in the new place, had the trunks up, cut over the traffic. then it was time to demanage and power down the old shelf. telcordia assigned a code to the new unit that was one letter different than the old one.
the good news is I got the new one back up in 20 minutes and they didn't stake me out over an anthill.
if this is supposed to be a new economy, how come they still want my old fashioned money?
We were writing a Unix program to parse transactions from some specialized terminals that read customer invoices and the checks that accompanied them, writing the transactions to digital tape to carry over to the mainframe system. During testing our tapes were compared to tapes generated by the legacy IBM system. Our team lead got a call from the customer liaison *early* one morning saying "Do you realize one of your batches was 5 MILLION DOLLARS SHORT?" - yes, she was shouting. Turns out that the $5 million transaction was the largest we'd ever tested with so far. All others were less than $999,999. It was my bug - I'd put the sign nybble (half a byte) on top of the most-significant digit of the packed-decimal payment-amount field on the test tape, dropping that digit from the field. Trivial fix - I had just been auditing the relevant code the previous day.
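For readers who have never met packed decimal: each byte carries two decimal digits, except the last, which carries one digit plus a sign nybble. Below is a small Python sketch of that layout, plus a simulation of what losing the most-significant digit does. The 7-digit field width and whole-dollar amounts are assumptions for illustration, and the helper names are invented. The point is that every amount under 1,000,000 has a leading zero in that field, so the bug was invisible until the first seven-digit payment showed up.

```python
def pack_decimal(value: int, digits: int) -> bytes:
    """Encode an integer as packed decimal: BCD digits, then a sign nybble.

    0xC marks positive and 0xD negative, the common convention.
    The field is padded to an odd digit count so it fills whole bytes.
    """
    if digits % 2 == 0:
        digits += 1
    text = str(abs(value)).zfill(digits)
    if len(text) > digits:
        raise ValueError("value does not fit in the field")
    nybbles = [int(ch) for ch in text] + [0xD if value < 0 else 0xC]
    return bytes((nybbles[i] << 4) | nybbles[i + 1]
                 for i in range(0, len(nybbles), 2))


def unpack_decimal(data: bytes) -> int:
    """Decode a packed-decimal field produced by pack_decimal."""
    nybbles = []
    for byte in data:
        nybbles.extend((byte >> 4, byte & 0x0F))
    sign = nybbles.pop()
    value = 0
    for digit in nybbles:
        if digit > 9:
            raise ValueError("non-decimal nybble in digit position")
        value = value * 10 + digit
    return -value if sign == 0xD else value


def drop_leading_digit(value: int, digits: int) -> int:
    """Simulate the bug: the most-significant digit of the field is lost."""
    return value % 10 ** (digits - 1)


if __name__ == "__main__":
    for amount in (999_999, 5_000_000):
        print(amount, pack_decimal(amount, 7).hex(),
              drop_leading_digit(amount, 7))
```

With these assumptions the 999,999 payment survives untouched and the 5,000,000 payment comes out as zero, which is a batch exactly 5 million short.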
How many people will refrain from posting because the statute of limitations hasn't run out yet?
My worst IT disaster was suffering from a hard drive failure, click of death. I had warning of a few days of it, and I deliberately kept the pc on 24/7 instead of normal switch on/off, to make sure the drive stayed alive until its replacement arrived.
Obviously I had to turn the pc off to change the drive, it was not hot-swappable. When I powered the pc up, the old hard drive failed, didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.
Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, and heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my pc, and powered up.
The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.
Out of interest, the failed drive failed about three months before I would have done a forced drive change as a backup / failure prevention. I got lucky.
Take Nobody's Word For It.
I used to work as an SDH/DWDM admin. In the early 2000s, my colleague screwed up a major firmware update on an STM1/4 ADM, and I as senior (haha - I was in the first half of my 20s) admin had to drive up to the site (since the affected node was unresponsive to the management system). After many unsuccessful attempts to recover it, at about 3 am I decided to hard reboot the node, which caused it to boot up from the corrupt firmware bank (it had two of those), which in turn just erased all the configuration, including traffic connections (which are built very robust btw). Since the site was on a (relatively small) island and had only 2 ADMs at the time, I more or less cut off the entire communication with the mainland. By morning, I had managed to get my colleagues to ferry me another, fully fitted ADM (our last resort backup scenario was to replace the entire node) - but as it turned out, it had in a hurry been fitted with cards with different firmware (the entire network was in the middle of an upgrade process), which resulted in the same kind of useless "brick" I already had at hand. Although it was very cool to fly ~200km/h to port and back in my sporty car to pick up the spare (not many police on the island and I had a very good excuse). By the afternoon, my higher-up manager had mobilized a helicopter to personally deliver me a fully functional ADM, which we promptly installed, restoring the configuration from backup. I still have a copy of the local newspaper's front page, praising how our company heroically saved the day to restore connection with the outer world.
At that time I was already able to make up excuses that would have made BOFH proud, which saved my ass.
I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.
I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.
Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.
Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
Training mission aborted. much sheet metal work needed.
Actual repair cost? Unknown, but easily 5 figures if not more.
Working for a desktop publishing house in IT. Spent just under $4000 on 36 inch flat panel displays. Accidentally plugged in printer power cable. Immediately fried monitor. My boss was not happy. The internship did not go well the rest of the summer.
McAfee on a mass spectrometer data acquisition system. System control would be periodically lost. Cost over $12.6K in lost instrument time and labour to determine that McAfee was blocking serial comms to the instrument (but only when it felt like it).
Lesson learned: never run McAfee or Norton on a mission-critical data system.
I let an upgrade bug slip by me during a software upgrade for the accounting software. In retrospect it should have been caught before it got out of hand. It got out of hand in about 3-4 seconds and had a cascading effect bringing down the whole datacenter for the company.
It happened when a "guaranteed" bid was due for a 2 million dollar job. We had nothing. Not so guaranteed...
Fortunately (?) I had an ownership stake in the company; so I screwed myself too. Figuring ~12% profit on the job was typical and 10% of that was mine ... it cost me personally over $20K on that mistake.
Ooops.
I was working as a Jr. Network admin, helping to install some new cisco PoE switches to facilitate our building's move to VoIP phones. I aligned a brand new 48-port poe switch slightly off when inserting it into the chassis, and bent the insanely-complex connector at the back of the card, rendering it unusable. Fortunately, we had a ridiculous service agreement with cisco, and a new card arrived at our office within 4 hours. I distinctly remember buying burritos and beer for me and the Sr. admin to help make up for the fact that neither of us got to sleep that night.
I was on the NASA Genesis probe team. Only a few hundred million lost on that one when it crashed into Earth...
A year spent in artificial intelligence is enough to make one believe in God.
Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.
The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.
2 months in the new provider vanished, along with all of my data. I wasn't very concerned about the months worth of money I had lost by not getting the 3 months I had paid for, I think it was only about $15. "Okay," I thought. I'll just pull my data out of my nightly backups and move on. It turns out I forgot to adjust my local cron script that pulled the data over rsync to the new IP address. My backups had not been pulled in over 2 months.
Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.
I learned my lesson, though.
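The lesson generalizes: a backup job that fails silently looks exactly like one that works. A minimal Python sketch of a freshness check follows. It assumes the pull script records the time of its last successful run somewhere, and the two-day threshold is arbitrary. Anything that compares "last good backup" against "now" and complains would have flagged the dead IP address within days instead of months.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_success: datetime, now: datetime,
                    max_age: timedelta = timedelta(days=2)) -> bool:
    """True when the last successful backup is older than max_age."""
    return now - last_success > max_age


if __name__ == "__main__":
    # Hypothetical timestamps: the last good pull was two months ago.
    last_good = datetime(2010, 1, 1, 3, 0)
    today = datetime(2010, 3, 1, 9, 0)
    if backup_is_stale(last_good, today):
        print("WARNING: no successful backup since", last_good)
```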
Not me, but my thesis adviser became the Technical Director for JSIMS, which ran through +/- $1B before the pentagon pulled the plug. He is not shy about mentioning that fact.
http://www.nationaldefensemaga...
Oh yea, the "HME lock speed and duplex to full scripts". Knew some admins at a financial services company that didn't remember to run that on building the servers. Servers made it through testing, got turned on in production. The next day was ugly until we looked at the change management book (was really a paper book) and saw the new servers. 5 ethernet cable disconnects later we were back up to our original capacity until they sorted it out.
The total cost was actually weet FA in numbers terms, but I think I put the final nail in the company's coffin.
My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.
I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.
Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.
I will say, the Server was the problem.
It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as a networked file storage. Inside the case were four HDDs.... A pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted though the keys were long lost, along with a floppy drive and a CD Drive with no write ability. Or ability to read DVDs.
It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.
By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.
So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.
The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous because there just wasn't the spare space to hold two (No money to pay for it)
I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.
Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.
I just did what I thought I could to keep the Titanic afloat.
So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.
But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....
Unfortunately - that only works if the damaged disk decides to drop out of the array.
It didn't.
I find th
So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
Not selling the company for $250M because he wanted $300M during the dot-com boom. My boss personally owned about 30% of the company at this point.
The real "Libtards" are the Libertarians!
Two totally incompetent twits from a populous south Asia country. Cost about $32k in salary and 4 month schedule slippage. Another contractor, who is competent, said she suspected they gave 'ghost' interviews, a common practice in her country. I heard managers say the same thing, that the two who showed up for work were not the ones they phone interviewed. They did not know command line basics in either bash or Windows, how to use remote desktop, Java, unit tests, and other things we required.
Oddly enough of the 4 foreign contractors we used recently the two women have been competent, the two men useless.
putting the 'B' in LGBTQ+
Havent caused errors with a quantifiable dollar-amount loss. But have been involved with several errors in various systems, as I suspect is the case for developers who write code that actually goes to production ;)
For an embedded hardware/firmware module for use by a backend application, I made a bug causing the module to reboot if a given parameter passed from the application was missing in certain circumstances where it was supposed to be present. The application wasnt supposed to call with this combination of parameters, and unfortunately the test harness didnt test for this case either. And in fact the application didnt usually call with the wrong parameters. But due to a database crash and associated data integrity error (which turned out to be a bug in the DB software itself which was later fixed) the column corresponding the parameter in question actually became NULL for a few users in the database- And since the application didnt check the validity of parameters but just passed on whatever it got from the DB, this resulted in the firmware receiving the illegal NULL value thus causing a reboot whenever one of these users logged in. The module brought itself up quickly after each reboot and there was redundancy so there wasnt any user impact, but a lot of warnings and alarms went off every time and it took some time to figure out how the error could happen.
I was brought onto a small web startup project as a co-lead. By this time the project was already 2.5 years old and had been rewritten at least three times by progressively less lousy developers. The final iteration was built on CodeIgniter (MVC framework), a decent choice in 2013.
My first day I'm browsing the codebase to see what's what, and a grep finds something like "UPDATE my_table set foo=" . $_POST['bar']. Not in a controller... not in a model... in a view.
So I immediately told the other leads that we needed to do a security audit on the entire codebase; it took a few days for the owners to consent. The audit revealed three different mechanisms for database queries (the standard CI driver and two other crude home-grown libraries, all used inconsistently) and that one of the devs, who not coincidentally had resisted the audit, was actually AFK for 20%-50% of the hours he billed every week. It took two months to do the audit and resolve the redundant code (no one was full time, mind you). Finally the owners told us "give us two weeks to decide whether or not we want to proceed". After six weeks of silence they pulled the plug and abandoned it entirely.
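For anyone who doesn't see why that grep hit was alarming: gluing a POST value straight into an UPDATE lets whoever fills in the form write part of the SQL. A minimal sketch of the usual fix, a parameterized query, follows. It is in Python with the standard sqlite3 module rather than the project's PHP/CodeIgniter. The table and column names echo the grep result, and the `id` column is a hypothetical addition so the statement has a WHERE clause.

```python
import sqlite3


def update_foo(conn: sqlite3.Connection, row_id: int, bar: str) -> None:
    """Update one row, passing user input as a bound parameter.

    The driver sends `bar` as data, so quotes or SQL keywords inside it
    are stored literally instead of being executed.
    """
    conn.execute("UPDATE my_table SET foo = ? WHERE id = ?", (bar, row_id))
    conn.commit()


# Throwaway in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, foo TEXT)")
conn.executemany("INSERT INTO my_table (id, foo) VALUES (?, ?)",
                 [(1, "a"), (2, "b")])

# Input that would be dangerous if concatenated into the SQL string.
hostile = "'x'; DROP TABLE my_table; --"
update_foo(conn, 1, hostile)

rows = conn.execute("SELECT id, foo FROM my_table ORDER BY id").fetchall()
print(rows)  # the hostile string is stored as plain text; row 2 is untouched
```

CodeIgniter's own database driver offers query bindings for the same purpose, which is one more reason three competing query mechanisms in one codebase was a bad sign.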
I had a friend whose job it was to find a way to break satellites. She said she was quite often successful.
(Hey, the OP didn't say it had to be an accident.)
We get big discounts that way.
“He’s not deformed, he’s just drunk!”
Heh. I sort of miss the days when CPUs had pins and the sockets were just a pattern of holes. The ZIF socket of the nineties worked quite well.
Do not look into laser with remaining eye.
I left the cover off of a $40,000 stabilized vsat antenna in a rainstorm once. That did about 10k in damage to the electronics inside. That's nothing compared to what our customers do though. Let's just say communications systems don't belong IN the ocean.
sorry for my comments, I'm drunk
I prepared a powerpoint presentation, where we could see small black dots. These were dirt marks on the lens of the camera.
But I thought they were missiles with nuclear warheads or chemical weapons, and presented that theory to a bunch of idiots. Next thing I knew, we were invading Iraq!
- Colin
I got hired with a local ISP/network service group, and my first assignment was to go install a new frac-t1 router in a new client's office (yea, this was ~15 years ago, cheap t1 routers were still ~$1k). So the boss takes me back into the storeroom, digs out a router from a pile, and grabs a random power supply by comparing the size of the plug to the hole in the router. I actually bother to check the rating, and find that the power supply is 24V, and the router wants 18V. The boss tells me to plug it in.
Me: "Um, I don't think this is the right power supply."
Boss: "It'll work, come on, we're in a hurry."
Me: "But this is a 24V supply, and the router wants 18V"
Boss: "I said plug it in, what are you, deaf?"
Me: "OK..."
BANG! Fizzle-smoke-spark!
Boss: "What did you do that for?"
Shortest job I've ever had.
During an acquisition, the company being acquired helpfully passed along the list of AS they used in their BGP4 configurations in their core routers.
They helpfully had included the ones from other networks they provided connectivity to as well, but just had sent the AS numbers over in one big list, unlabeled, along with the AS their network originated: "Do these."
So during the network integration I dutifully entered the entire list of AS into the core routers as AS to be originated. Needless to say, hilarity ensued.
So perhaps not entirely my fault - though I should, in hindsight, have asked for more clarification or done more investigation rather than blindly trusting the information I had been given. This was a couple decades ago, and I was not cynical enough yet.
Got this domain "hsa.com" in the *very* early days of the Internet (pre-web). Decided that since we were a Canadian company, we should have a Canadian domain, and surrendered it and got hsa.on.ca. (We weren't allowed to have hsa.ca, since all our offices were in Ontario...)
A three letter .com address would probably have been the most valuable asset of the company :-).
Worst thing (so far) has been formatting a PHP date() DB timestamp wrong for entries associating users and payments. I think it was something like accidentally using 'M' for both month and minute.
At the same time, there was a bug somewhere that periodically caused only one of the 2 tables to be written to. When we noticed that the tables were out-of-sync, we immediately jumped to the timestamps to make some sense of the situation, which of course didn't work in this case.
Took only a few hours to sort out since we could use other available information to fix it, but it was my 1st or 2nd real job at around 18 so I figured I was canned; I wasn't though, it was one of those "lesson learned, watch out for it next time" situations -- my boss was really frustrated though.
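Date format strings are full of one-letter traps like this. As an analogous illustration in Python rather than PHP (so not the original code), `strftime` uses `%m` for the month and `%M` for the minute, and mixing them up produces timestamps that look plausible but are wrong. The function names and the sample timestamp below are made up for the sketch.

```python
from datetime import datetime

GOOD_FORMAT = "%Y-%m-%d %H:%M:%S"  # %m = month, %M = minute
BAD_FORMAT = "%Y-%M-%d %H:%M:%S"   # %M used twice: minutes land in the month slot


def format_wrong(ts: datetime) -> str:
    """Reproduce the bug: the month field silently holds the minute."""
    return ts.strftime(BAD_FORMAT)


def format_right(ts: datetime) -> str:
    """Correct formatting that survives a round trip through strptime."""
    return ts.strftime(GOOD_FORMAT)


ts = datetime(2015, 3, 9, 14, 27, 5)
wrong = format_wrong(ts)
right = format_right(ts)
print(wrong)  # 2015-27-09 14:27:05
print(right)  # 2015-03-09 14:27:05

# A cheap regression test: format, parse back, and compare with the input.
assert datetime.strptime(right, GOOD_FORMAT) == ts
```

The round-trip assertion at the end is the kind of one-line check that would catch the swapped letter before it reaches a payments table.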
... plugging a kettle into your 6-hour UPS is not a recommended way to make a cup of tea. This, however, is exactly what I did a long time ago. 10 or so seconds later, I had a still-cold kettle of water and an entirely drained UPS. Oooops !
I once forgot to open a water valve before turning on a laser in the lab.
The low-pressure safety switch for sensing water flow had been bypassed (not by me) and the laser tube immediately cracked and broke due to the instant heat buildup. Total cost, about $4000.
Just cruising through this digital world at 33 1/3 rpm...
I worked on both the Mars Climate Orbiter and the Mars Polar Lander, though not on software related to the failures. I did fry a $12k damper during testing though due to a misunderstanding with the thermal engineers on hardware placement (I didn't lose my job, I was fresh out of college). Due to the fact that the capillary pumped loop heat pipe thermal system didn't work, they ended up cutting it off and adding extra heaters/sensors at the last minute. Looming launch deadlines make for crazy times ...
Pretty much all modern Intel CPUs from the past many years.
Mostly I make my career out of fixing other people's tech mistakes. Which is not something that uni taught me how to do. Man I'm glad I got out of that place before I ran up any significant student debt. Did I mention I trash talked a uni on a news blag website?
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Basically a loading tool with a bug I knew from testing: you could set it correctly once in production, but if you set it twice every user was f*cked up and could only be fixed from the web interface by about 5 clicks per user, no programmatic solution. And of course we had an error in the production setup. I altered that part - which I could - but forgot to take out the "you can run this only once" settings. Hundreds of users borked and the vendor support would take forever or claim there's no other way, what do?
This was a consulting company, trying to bill this would look bad on both our vendor and ourselves and it pretty much broke everything so we gave a benched consultant the assignment from hell. Click here, here, browse, pick, save in this somewhat less than instant web interface. Now do that all day, every day for all users until you're done. Personally I'd be ready to jump off the roof after an hour, but apparently she stuck to it for three days and finished. I don't think we won any popularity points with her though.
Live today, because you never know what tomorrow brings
Melted down a couple of LARGE high end power supplies (worth about 200K - I think the repair was about 50K). Did I lose my job? Nope, not even really called on the carpet. I had a triple redundant fail safe system, approved by management (in writing), and reviewed by both levels of client, and ALL THREE systems failed! (1 software, one independently developed firmware, one mechanical). Failure analysis on just the last one (the mechanical) was it was a once in over a million chance of it failing (yes, we did a failure analysis). Something (surge?) fried the computer, the firmware controller, AND welded the mechanical contactor closed (LOW duty cycle - close at start of test, open at end, 3x safety factor on ratings, something welded them during the test - aka I watched them close, visually inspected, and went home for the night, as per SOP)
One of those freak things, but we changed to a carbon contactor so it could not weld, and changed the firmware unit to a more robust unit, and did some other isolation. As far as I know, never happened again
was created by my boss. I fixed the bug instead of reporting it. The boss was incompetent and was costing the company millions in missed opportunities and in increased turn over of really good people. He couldn't see when his successes were pure accidents and when his mistakes were entirely foreseeable and preventable. I had a few opportunities to get him fired when fixing his messes. I wasn't ruthless. It cost a number of good smart people their jobs and cost the company millions (in fixes, unnecessary delays and missed opportunities). I'd put the dollar figure at around $10mil. But it may be much larger if some of those missed opportunities were first-to-market.
Any guest worker system is indistinguishable from indentured servitude.
... too bad it was here :)
On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
The two biggest I have seen:
- Comms card slips out of box while being carried over to submarine. Worth about $220,000, fell into the water and had to be recovered by divers for security.
- Electrician didn't test circuit was isolated, he went to disconnect 3 phase circuit and decided to start with neutral. He lifted the neutral off, putting up to 400V where there should have been 230V. This destroyed over $300,000 in components, and cost another $200,000 due to lost operations.
Back in the 80's I worked for a field service organisation, fixing and maintaining PDP11 and VAX systems, but also CDC-9766 removable disk systems. Big 14" removable disk packs like you see them in old scifi movies. One of my customers had a string of 10 or so attached to a five-node Tandem Non-stop system.
Each week they brought two out of ten off-line for me to work on. I cleaned the heads, then used a servo disk pack to realign those heads.
To do this, I needed to remove the control cable from the string, and plug in an exerciser. One day I forgot to pull the control cable. So instead of moving the heads of my offline drive to a specific track, I moved the heads of *ALL* disks in the string! Without the O/S knowing about it.
Believe me, that will bring a Tandem Non-stop to a grinding halt. That was my last time on the floor for that customer, but I didn't lose my job. Cost? I don't know. Perhaps a weekend of data recovery for the operators?
To Terminate, or not to Terminate, that's the question - SCSIROB
Late 70's. Central datacenter for a state not to be mentioned. I modified the JES2 startup JCL. Our mean-time-to-reboot was typically 2 weeks. Because of important state business, we didn't get a chance to reboot for 3 weeks. So, we reload and JES2 dies for JCL error. Then, we realized that all of our daily backups have the same error. And our last 2-week backup has the same problem. Our next backup, monthly, is stored at a site that is 1.5 hours away. Meanwhile, programs like AFDC and prison support apps are not up. Governor starts getting calls from important folk - where's the system? Governor calls DP director - where's the system? I see the end of my career looming. Fortunately, my boss had an old SVS system on tape that was just enough to allow us to edit the JES2 deck. After this, we changed our backup policy and put in stricter rules on modifying production systems. I just retired after 46 years in the computer industry. Still remember the fear on that day.
But it's worth repeating in this context. Thankfully, it wasn't me.
When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.
Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help 4+ PM.
The techs had to be found, gathered and flown out from CA to disassemble said machine and reassemble. No wires until 1 or 2 PM Monday. Much money loss for all customers.
To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.
One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load testing function. I came up with a small design involving a contactor and some minor wiring changes and we were part way through implementing it on every UPS at this site.
This UPS was part of a redundant pair that fed an emergency shutdown system at an oil refinery. In between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue then started the process for the second one. After calling the control room to let them know they would receive an alarm, I switched off the UPS and was suddenly met with a stream of profanities over the radio.
We lost power to 80 field instruments, which triggered a fail safe action on the shutdown system, tripping 4 units at the refinery. One of them was the FCCU, which is core to a lot of refinery processes. To add insult to injury, the unit was unable to be hot restarted because of a stuck valve and then thermally contracted, breaking off large chunks of coke from the overhead line, which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days. I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.
Total cost of the outage was about $8million. Fortunately only partially my fault.
In the very early 80's, I was tasked with getting a VAX 11/780 onto an internal Ethernet network using a proprietary Ethernet Unibus card (one guess where I worked). This VAX had a Unibus backplane in a separate cabinet cabled to a Unibus adapter board on the system bus in the "main" cabinet. The Unibus adapter backplane was wirewrapped and since this Ethernet card did DMA (it's been a long time, but I think that was why), it needed control of a bus line which was normally jumpered on the backplane bypassing each slot so "dumb" cards didn't have to deal with passing the signal along. Therefore, I had to snip this jumper on the backplane of the slot I was installing the card in.
The VAX wasn't used by our group but was used by other departments during the workweek for some fairly important stuff and there was no backup system. The machine was given to me on a Saturday morning and I was admonished it absolutely had to be up by 8AM (IIRC) on Monday morning. No problem as I had studied the problem and had been in email communication with someone at another site who had performed exactly the same procedure. I had never physically touched a VAX before in person but there really wasn't anyone to help me with the task locally so I was on my own (in retrospect, maybe that wasn't the smartest decision) but, being young and brash, that didn't bother me.
It didn't take me long to find the VAX once I got into the data center -- after all there was only one of them. I shut it down cleanly from the console. I set the switch on the main cabinet front panel to the OFF position (I don't actually recall how it was labeled), the lights on the front panel went off and I could hear the area around me get a little quieter as fans spun down (although there was a lot of other hardware around, so it just reduced the din slightly). I was well prepared and had just the perfect pair of wire cutters to do the job. I opened the Unibus adapter cabinet and put the card in. I then accessed the backplane, carefully identified, double checked, and triple checked, the slot and jumper that I needed to cut. In retrospect, maybe I should have paid attention to a rather obvious condition that was staring me in the face, but I had rehearsed this work flow in my mind and proceeded onward. I confidently stuck the wirecutters into the maze of wires, snipped the relevant wire, and everything was going very well.
Then I withdrew the wirecutters from among the wire-wrap posts and was more than a little surprised as sparks arced from the wirecutters to wirewrap posts that they brushed against. Nearly simultaneously with the arcing, I noticed one little detail that I should have noticed earlier -- the fan in the PDU or power supply in the bottom of the cabinet was still whirring away happily and the light showing it was powered on was clearly glaring at me. Ooops...
Well, I thought, hopefully, no harm done and I closed the cabinet. It was around then that I noticed a very concerned look on the faces of a couple of FEs who were working on an adjacent machine. I walked over to them and their concerns quickly became mine -- turns out they were "downwind" of the VAX and the distinctive odor of scorched electrical bits was strong around them. I guess I made someone happy that day though - they were very relieved that it was my machine, not theirs, that was emitting that lovely unmistakable fragrance.
Unfortunately, although the VAX seemed to boot, a bunch of stuff didn't work... Ooops...
We had 7/24 support with DEC so I called service out and watched a completely incompetent service guy (he was our PDP-11 repair guy who apparently was stuck on call supporting hardware he knew nothing about) fumble around for hours. He concluded that the Unibus backplane had been fried and initiated getting a new one counter-to-countered to us (fortunately, that got blocked by someone who knew what they were doing somewhere). The guy didn't even know how to run diagnostics on the VAX and refused to attempt to do so.
In the end, the machine was not up a
I missed one character in a regex in a monitoring system that would cause it to think all the hard drives in a machine had failed when the machine was booted. Since it only happens on boot, it wasn't noticed until there was maintenance work that powered off an entire datacenter. When they turned the power back on, ~5000 machines all decided their hard drives had failed simultaneously. Took 2 days to clean up the mess.
kc8apf
About 15 years ago, a QA engineer in my office (a large Wall Street financial firm) placed a fake trade for 1,000,000 shares of company stock in one of our test systems. The test order somehow got out to the New York Stock Exchange and actually moved the market. Backing out that trade was reportedly quite expensive.
The engineer didn't get fired, because he had done everything correctly. The system infrastructure had been set up wrong.. wasn't his fault.
Mid 90's. Spent a lovely weekend below the waterline on a frigate updating the ship's maintenance system with a new data picture of its systems. All went wonderfully well and I walked ashore late afternoon on Sunday and flew back to my home city. Fast forward to 4pm Monday and we get a call from the ship at sea saying the maintenance system no longer functioned: get your butt out here and unf*ck it. So, in the car, 3 hours drive to where the ship anchored for the night, RHIB ride out to the ship, up the rope ladder, about 10PM... fix it, you have until 6AM or you are sailing with us (for a week). That, my friends, is great motivation to work fast. To cap it off, there was a small fuel leak in the space outside the computer room: wonderful aroma to deal with. Tried to work out the obscure linkage between existing maintenance jobs and the system description that was causing the issue. Ultimately had to roll the database back to the pre-update state. Off the ship at 6 along with many bags of oil-soaked rags used on the fuel leak. Ship lost a few days of data and a day at sea: captain not happy... and we had to do the whole exercise again later.
Tape for data, $100, Airfare and accommodation, $600, warship all at sea, priceless.
Not entirely my doing (what is these days) but I was the man that delivered the fun. No names, no pack drill over this.
Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
No, not me, but it's worth noting that the XBox 360 Red Ring of Death was (according to EE Times) caused by someone at MS who thought he could save a couple million bucks by doing the graphics ASIC work in-house instead of paying someone with experience like ATI to do it. That cost $1.3 billion. As far as I know nobody involved in deciding that or doing the ASIC work has ever been named (and I wouldn't blame the poor ASIC guys), but I can only imagine what it would be like to know that was you.
I fried a voice coil on a fairly expensive Hitachi 2.2 GB optical drive back in the late 1980's with a QA stress test while working for FileNet. This led to engineering improvements and I got to keep the burnt out coil as a trophy.
Bought a Buffalo TeraStation. Went on vacation a year later to a country with limited internet access. On trip, one-year warranty expired and it died the next day, taking all data with it.
Fortunately, I had a copy of the server with me on a portable hard drive, so I could work remotely. That was our only backup. Sending the accounting database back to the office via GPRS was a lot of fun, but mailing that drive back to the office (after duplicating it of course) scared me to death.
The solution at the time was the right one; we didn't have the money for anything more. Ever since we have a hot backup server synchronized to the primary, for a small business. Like most screw-ups, what is important is how you move forward.
I flubbed the script and while there was no data loss, I, by myself on the night shift, broke about 25k email accounts. I had a long night fixing it.
I still remember the frantic calls from the help desk as I was in panic mode trying to find out how bad it was.
Silence is a state of mime.
Bought it on eBay. Had crappy Verizon firmware on it that wouldn't allow any kind of audio streaming (web page streaming or TuneIn). Loaded Cyanogen on it and it worked fine but still wouldn't stream due to some remnants of Verizon FW.
Backdated Cyanogen to older mod and that mod was corrupt. It destroyed the boot loader so I couldn't flash another copy of non-corrupt OS.
I still have the phone but no way to get an OS on it without a boot loader on it.
I knew a guy who did support for a multi million pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers because he did all the support remotely and it would be a 100 mile trip up to their office if the machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.
So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone actually got fired because they infected the server with a drive-by. Their mail server had a dodgy network card, but it took nearly a year to diagnose because he was terrified of updating the driver in case it didn't come back up, so that was just intermittently not responding or dropping incoming connections for over a year. The driver update fixed it in the end.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Well, one time, I had a problem with my land line, and I erroneously accused the wrong phone and threw that one out instead of the one that was causing the problem. Then I ended up throwing away two phones.
Since then I've solved the problem more generally by not having a land line anymore.
Wasn't mine, but it's too good not to share. Back in the mid 80s, I was working at (let's call it) SuperBigCorp's IT department. There was a fellow there who maintained the programs that handled the savings elections for employee 401K funds. One day, while making some changes to the COBOL programs that sent which funds to what investment vehicles....he made a little mistake. He got confused in a conditional statement, and all the funds that should have gone to stable investment selections went to the highly speculative vehicles, and vice versa. Even more unfortunately, this area of activity was not supervised and audited half as well as it should have been....by the time it was noticed, several months had gone by, and the stock market had suffered a bit of a setback. Millions of dollars were lost by SuperBigCorp getting it straightened out. They had to let the poor fellow go, in disgrace. The Chief of IT was reported to have said, that if the market had just moved the other way, the programmer would have been a hero...
There is no God, and Dirac is his prophet.
Back in the mid 80s, I was fortunate enough to get my first programming job. I worked with an incredibly capable programmer, let's call him Dr. Bob. I learned a great deal about programming from kindly Dr. Bob - he was a whiz at PDP11 and VAX assembly coding, and a great mentor. One day we came back from lunch and he picked up his mail and messages from the department secretary on the way to his desk. He opened one of the envelopes he'd gotten, read the letter within briefly, then started cursing like a sailor and threw the letter in the trash. He stalked off in a rage. I retrieved the letter and saw it was a page from a phone book, with the name "David Alexander" circled. After a couple hours, when Dr. Bob had calmed down, I told him he had to tell me what was going on. It turns out that his very first assembly language programming gig had been at the local University. It involved managing the data for a planned 50 year long psychology experiment, tracking the names, addresses, and project info for all of the participants over time. Now this was the mid 70s, so there was no database, just a bunch of tape files and MACRO programs to do the updating and reporting. Dr. Bob really liked the work, and the folks in the Psych Dept were really friendly, it was a great atmosphere. One day, Bob made....the Big Mistake. Due to some typos, he inadvertently replaced the name and address info in every record in the files with the data from the first record....David Alexander's. This was a tape database and it only went back a few tapes worth....by the time it got noticed it was too late - all the good data was gone. The long range experiment was totally destroyed since they couldn't track the participants. He had to quit in disgrace - he said what really upset him was the way the Psych Dept folks were so nice about it and didn't want to fire him.
Anyway, that's bad enough...but when his "friends" caught wind of it, they started popping up David Alexander references everywhere they could - they'd leave him phone messages from David Alexander, they'd get mailings sent to his address to David Alexander, and so forth. By the time this event I saw occurred, it had been going on for years (for all I know it still is). Anyway, due to kindly Dr. Bob's David Alexander mistake, I always check my code just a leetle more carefully than I otherwise might be bothered to - I personally don't ever want to make my Big Mistake....
There is no God, and Dirac is his prophet.
Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).
I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".
Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.
Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all datasets in question. I lost very little work (only scripts were executed against these datasets and I still have all the scripts).
My real screw up?
Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.
I had become accustomed to my home system being deterministic in the order it listed things. My bad.
This is back at the very beginnings of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).
Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.
Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.
In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.
I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.
$22M - 6 hrs of downtime for 1 application due to a corrupted DB. I typed what the vendor told me to type into sqlplus. The vendor was clueless, obviously. Took about an hour to determine the root cause, took another hour to find a real DBA (on staff) then some more time to bring him up to speed and restore from daily backups.
Over 20K workers couldn't do anything that day.
The lead technical architect (hired gun), my team, and the direct business clients who knew protected me. S-VPs in the client organization all wanted to fire someone - me. They never found out who to fire. However, I've been stuck in the same position the last 8 yrs. No promotion since.
Not if that database insertion caused money to be moved somewhere else and database entries existed on a system belonging to someone else.
My personal best was when I was writing the firmware for a customer's laser marker system. It was a big industrial machine that moved the laser head on a very expensive gantry using 15-pound servos that could generate ungodly amounts of torque. I had a bug in the code that drove the servos, and I issued a command to home the gantry, after which the X-axis went zipping across as fast as it would go. Wouldn't have been a problem except there was a faulty limit switch on that end of the axis, so the 25-pound laser head got slammed into the stops at what we estimated was about 100 inches per second. Totally destroyed the laser head (there's nothing more disheartening to hear than the tinkling of broken steering mirrors and seeing a cracked flat field lens as a bonus), and caused some severe mechanical damage to the rest of the assembly. Fortunately the motors shut down automatically when the temperature sensor tripped, but it wasn't fun explaining to the boss that we had to replace about $30,000 of hardware.
My favorites are those I thankfully had nothing at all to do with - where I am now, we write and maintain the warehouse management software for a very, very large snack food vendor, and we have a VPN link to all of the plants to maintain and monitor what's going on. It's happened before where co-workers haven't paid close enough attention and have connected to live plants instead of the test systems, and accidentally shut down the warehouse, which means production gets shut down too since there's nowhere to put those thousands and thousands of bags of chips until the warehouse system comes back up, and it takes them hours to get stuff restarted and settled once that happens. I don't know how much it costs, but it can't be cheap. I'm also not sure why we don't have some kind of two-factor system with a unique key for each plant to keep that from happening. [shrug]
Please stand clear of the doors, por favor mantenganse alejado de las puertas
I nearly cost my employer several million by fixing a bug.
The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line, the parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, doors panels, etc.) and when a stillage was full, or when a certain amount of time had passed since the first part was picked, then a label was printed, applied to the stillage, and it was dispatched over the road to the factory.
Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.
I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.
Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.
The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!
I looked back over the code, and found that there were actually two very similar bugs in it, one of which happened fairly regularly and one which happened much more infrequently, but the same fix worked for both of them.
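For anyone who hasn't met this class of bug, here is a minimal Python sketch of a serial counter that wraps from 9999 back to 0001 without ever producing 0000. The original code isn't shown anywhere in this post, so the names `next_serial` and `format_serial` and the four-digit range are assumptions taken only from the "9999 to 0001" description above.

```python
MAX_SERIAL = 9999  # assumed upper bound, based on the "9999 to 0001" description


def next_serial(current: int) -> int:
    """Return the serial that follows `current`, wrapping 9999 -> 1 (never 0)."""
    if not 1 <= current <= MAX_SERIAL:
        raise ValueError(f"serial out of range: {current}")
    # current % MAX_SERIAL is 0 only when current == MAX_SERIAL, so the
    # result is always in 1..MAX_SERIAL.
    return current % MAX_SERIAL + 1


def format_serial(serial: int) -> str:
    """Zero-pad the serial to four digits for printing on a label."""
    return f"{serial:04d}"
```

The point of testing the boundary explicitly (9998, 9999, and the value right after the wrap) is that a second, rarer bug of the same kind can easily hide next to the first one.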
Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."
Not sure if it counts as it was an Amiga 3000 and they came to my house to fix it for free.
I had a "friend" who brought over a new hard drive to get working on the Amiga. I did my best, then the system just quit. He then says, "Yep, did the same thing to mine."
Oops! Wrong terminal!
I was ssh'd into a production server and did a poweroff. Meant to run it on my own box. I didn't have authority with our host to ask them to turn it back on, and those who did had already left for the day. Probably didn't cost the company much since it was a small SaaS product, but if I pulled that stupidity elsewhere it could have.
I don't think that counts: a) it wasn't your mistake; and b) the company should never have had that revenue in the first place, so it wasn't a "loss" but a restitution.
USB connectors also fit neatly in RJ45 ports, and this too can lead to interesting side-effects.
lucm, indeed.
I temporarily ran a copper network cable out of a window to another building while our building to building fiber was being installed.
Over a weekend we had huge lightning storms. The voltages induced in the unshielded twisted pair cable hanging outside 3 floors up fried both switches on either end of the cable.
That was an $8000 mistake.
Back in the early 80's, I took off a little too fast in my company station wagon, and a $10k DTS Data Terminal hit the road hard. Ooops.
Not me this one, but a classic.
One Friday afternoon a Telecoms tech was checking a remote unmanned exchange. One of the checks was to measure the levels on the analog multiplexer for the trunks to the main exchange, which acted as the brains for the dumb remote.
The procedure was to plug a 6.5 mm phone jack, attached to a large fixed meter, into each channel, one at a time. Unfortunately, this chap grabbed the wrong hanging jack, this one having the 50 V exchange battery on it. He then proceeded to plug it into each channel of the carrier system, and was mystified when there was no reading. As he plugged into the last channel, the exchange went totally silent. The whole exchange was down for 2 days.
What about the guy who sold Slashdot to Dice? :)
During a panel discussion with very senior technical leads, the question came up: "How many of you have made a $1,000,000 mistake?"
Every single one raised their hand. This was a very large semiconductor company, and everyone had been involved in at least one instance where bad masks were made because a check was skipped or a step was botched in the design flow.
I worked on a chip design where it took six design revs to get clean masks. All five of the prior revs had avoidable (human) errors during the design and build process.
Pay me now (in time running checks) or pay me later (in NRE: non-recurring engineering expense) for bad hardware.
I once wrote a temperature monitoring system for a cargo airline flying 747s. The system would read the loadplan to determine if there was temperature-sensitive cargo onboard, then after takeoff, would send an ACARS message to an aircraft asking the ECS what the temperature was in each section of the aircraft. The rules table could be set to a different frequency of monitoring based on the exact cargo, so AVI (live animals) would be monitored every 5 minutes, pharmaceuticals every 10, etc. Once the temperature report came back, the system would compare it against the rules to determine if the temperature was within limits for the cargo onboard. Anyway, I accidentally put zero in the frequency table, and basically DoS'd 5 aircraft that were in the air carrying perishables. I realized the error pretty quickly when the monitoring system freaked out, but the data charges alone were about 30k in 30 seconds. ARINC was very nice and waived the fees though - thanks guys!
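A zero like that is the sort of thing a validation check on the rules table can catch before it ever reaches an aircraft. Here is a rough Python sketch, not the actual system: the class name `MonitoringRule`, the field names, and the one-minute floor are all made up for illustration, and it assumes the "frequency" column is really a polling interval in minutes.

```python
from dataclasses import dataclass

# Assumed floor; the real operational limit is not given in the post.
MIN_INTERVAL_MINUTES = 1


@dataclass(frozen=True)
class MonitoringRule:
    """One row of a (hypothetical) cargo monitoring rules table."""
    cargo_code: str        # e.g. "AVI" for live animals
    interval_minutes: int  # how often to request a temperature report

    def __post_init__(self):
        # Reject zero or negative intervals, which would mean "poll continuously".
        if self.interval_minutes < MIN_INTERVAL_MINUTES:
            raise ValueError(
                f"{self.cargo_code}: interval must be at least "
                f"{MIN_INTERVAL_MINUTES} minute(s), got {self.interval_minutes}"
            )
```

With this in place, `MonitoringRule("AVI", 5)` is accepted and `MonitoringRule("AVI", 0)` raises a `ValueError` at load time instead of flooding the data link.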
Warranty work: In the late 90's I was repairing a beige desktop Mac (early PPC), I needed to remove the logic board, and while attempting to pry up the logic board I slipped with the screwdriver, which ripped off a resistor in the process. As it was warranty work on behalf of the manufacturer (I was working for a service agent), all parties agreed it was a mistake that could have happened to any technician, so it continued to be covered.
Destroyed keyboard: I once spilt a Fanta on a white Apple keyboard, the one with the clear plastic base and the full-height keys, the last of its kind before the current flat aluminium keyboards came in.
Almost lost data: I was click-happy once during the process of backing up a laptop for a staff member (planning to upgrade the OS), and instead of hitting backup, I hit erase. I was able to restore the data thanks to hard drive erasing only modifying the first block or two on the disk, instead of going to the time and trouble of erasing the entire disk.
Back in the 70's when I was still a junior electrical design engineer working for a distribution transformer company, we used algorithms loaded into TI calculators to compute the electrical, heat, and mechanical stresses. I later got the task of modernizing those codes and merging them with a FORTRAN code that another engineer had written and abandoned because it was too expensive to run. Things went well at first, we saved a lot of time and used that as any good engineer would to optimize our designs using different parameters to reduce cost and improve efficiency, both very important to my company and its customers. Then one day we got a limiting case which we didn't recognize at the time. As usual, one of our engineering assistants used the computer generated design and the old methods to validate the design. The engineer always takes responsibility for the design. After the build, the unit, a 3 phase unit that had 76,000 volt inputs, was tested in our "hi pot" chamber - a voltage pulse of the rated voltage but with reduced current and only for a short pulse. The center core winding turned into shards of copper spaghetti in the 8 foot tall tank. It cost $25,000 to repair, and delayed delivery for 3 weeks. My heart rate hit about 200 when the engineering manager called me and my supervisor into his office. Then he explained that he had run the calculations also, and discovered that our methods had a flaw in the prediction of the axial forces on the center coil. It was a very subtle mistake, and he said it could have been much worse. We were able to revise the code within a few hours, and that incident led to further improvements in methods and automation. It also taught me my most important lesson about computers - human error is the greatest risk. Real tests of your code sometimes do "blow up".
Every change is not progress, but there is no progress without change.
Can't be worse than the Kenwood TrueX DVD-ROM drives. Those things were fast as hell, but notorious for dying.
I managed to flood it with enough data that it locked up, and required a manual reset. The second and third time that I did it, the network admins were getting much faster about fixing it, but my boss told me to stop doing it.
I have no idea how much it cost ... but it was the router that fed NASA Goddard's active missions, and I was told that the Hubble folks were getting upset when it kept happening.
I didn't get fired, as I was testing to ensure that we had sufficient bandwidth for SDO data transfers. (we didn't ... and I probably didn't need to run the additional tests to prove it). It did convince them to move us over to an isolated network when we moved offices, though.
Build it, and they will come^Hplain.
I dropped a 50k sensor on the ground but it tested out fine afterward. It was used for development so if there was hidden damage it didn't really matter.
I was tasked with a fiber cabling project for a new upstream connection at a small ISP. I documented the requirements, placed a purchase order, interviewed contractors, recommended one of them and went ahead with the project. My boss was downsized during this process, and when I informed my new boss that the cabling was completed and that his signature was required in some document in order for the contractor to be paid, he said something along the lines of "did nobody tell you that the upstream connection will not use that kind of fiber?" I wanted to die at that moment, but the fact was that it wasn't my fault - it was a consequence of the massive layoffs, the resulting chaos, and the deficient flow of information.
...by writing a simple page and putting it under load on a Sun E4500... which was the front end of our dot-com's website. We were only invisible to the rest of the world for a few minutes, thankfully...
Village idiot in some extremely smart villages.
I read an article years ago about a guy who developed the software that made transacting CDOs (collateralized debt obligations) much easier. Basically that led to the entire sub-prime mortgage industry, which led to 2008. So I think that he wins this whole discussion.
I only know of two such instances where this happened or something similar happened. One was only about five years ago and the other was longer - it made the news. Assuming it was the latter then that grocery store chain either begins with an S or a K? I can not recall which one it is but I do recall hearing about a computer mishap that took out warehouse access for a major grocery chain. The more recent one was due to a malware infection that spread across their network (as I recall) and its primary goal had been collecting credit card data but it had spread much further. That one was covered in eWeek and noted, by me, simply due to its proximity to me.
"So long and thanks for all the fish."
I worked for a 3PL (third party logistics) company. Years ago, they'd decided they were going to make $$$ with SaaS, basically selling our services to others. A huge undertaking had been embarked upon to make our system usable for other companies. They got a grand total of one client.
A few years later I was working there, and we got a second client! Bad news was, literally no one was still working there that had been when the first SaaS client had been set up. So there was a lot of guesswork trying to recreate it. I was a Junior Developer at the time, and was tracking down why some data loading wasn't working right. I knew the issue was almost definitely a trigger in the database, so that day I made some changes, loaded the day's data import into the Test DB, and checked if my fix worked. It didn't, so I cleared out the load, made another change, and did it again. OK, now it was kind of fixed, but there was a problem somewhere else. Wash, rinse, repeat.
I'm sure you see where things went wrong.
About the sixth or seventh time I did this, I accidentally ran it against production. I distinctly remember the panic that gripped me the moment I hit the F5 key to execute that SQL statement - I realized what I'd done immediately. The drivers (this was a logistics company, remember?) had been out on the road for about two hours at this point, and all of a sudden all their handheld devices just stopped working. Where's the next stop? As far as their handheld was concerned they didn't even have a route, much less anything on the truck. This happened for all of the Office Depot drivers in Florida. And we couldn't just reload the day either. After the initial import happened at around 1:00 am a lot of virtual paperwork was done by humans to optimize routes and such, work that couldn't be easily duplicated.
I spun around in my cubicle and told him what I'd done immediately (I was told later I looked white as a sheet) and he assured me it'd be OK. An hourly snapshot was taken by the database. We'd lose a bit of data, but it wasn't the end of the world. He went to talk to the DB Admin.
Those snapshots? It turned out six weeks ago they'd just stopped running. Why? I don't think we ever figured out for sure, but either way they weren't there. Now everyone was panicking a bit. This was a new client we'd just picked up and we didn't want to screw the pooch. In the end, they ended up doing an emergency purchase of some software that allowed them to roll the database back using the transaction logs. Fun times.
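In hindsight, even a crude guard in the tooling would have stopped that F5. Here is a minimal Python sketch of the idea, nothing like what we actually had: the function names, the `environment` labels and the "type the environment name to confirm" rule are all hypothetical, and the prefix check is deliberately simplistic (it would not catch a destructive statement hidden behind a comment or a CTE).

```python
# Statement prefixes treated as destructive in this sketch (an assumption,
# not an exhaustive list).
DESTRUCTIVE_PREFIXES = ("delete", "truncate", "drop", "update")


def is_destructive(sql: str) -> bool:
    """Very rough check: does the statement start with a destructive keyword?"""
    return sql.lstrip().lower().startswith(DESTRUCTIVE_PREFIXES)


def check_statement(sql: str, environment: str, confirmation: str = "") -> None:
    """Raise PermissionError for unconfirmed destructive SQL on production.

    `environment` is a label such as "test" or "production" that the caller
    derives from the connection it is about to use.
    """
    if environment != "production" or not is_destructive(sql):
        return
    # Require the operator to type the environment name as confirmation.
    if confirmation != environment:
        raise PermissionError(
            "destructive statement against production needs explicit confirmation"
        )
```

Calling `check_statement("DELETE FROM loads", "test")` passes silently, while the same statement with `"production"` raises unless `confirmation="production"` is supplied. It does nothing about the missing snapshots, of course; checking that backups are actually being produced is a separate job.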
I spent about 32,000 USD upgrading to CD-Rs in ca. 1995. The worst part is that only covered eight of the computers in the office. At the end of the year there was an offering from HP that was under 1,000 USD. By the following summer they were half that. At the end of that year they were half again. Then, not more than a year and a half after that I could find SCSI CD-Rs for near 125 USD. Blank CDs were something like eight bucks when you bought in bulk... My mistake was adopting the tech that early. We were using large data sets (for the time) and the idea was portability. It worked, it *sort of* paid for itself. It would have paid much nicer to wait. I can not say that it lost us money but I can say it sure as hell did not make us any.
"So long and thanks for all the fish."
I was trying to fix a broken backup process on an AIX box, and found that there were a ton of stuck Legato processes on the system. Rather than kill each one individually, I entered the killall command on its own to get the correct syntax for killing all of the processes with legato in the name.
In Linux, entering killall with no arguments shows you the syntax of the killall command. In the old version of AIX this system was using, it killed EVERYTHING with no warning and basically rebooted the box. That's usually not a big deal, except that this was the primary SAP database server for a Fortune 500 company. It took the DBAs about a day to clean up the mess.
The system was clustered, thankfully, but it probably cost about 10K in labor to clean up the mess.
I once built a Windows NT 4 system image that used an older version of a Novell Netware driver that was incompatible with the newer version of Netware that the file servers were using.
It seemed to work fine on the master system that I built, but after that image got deployed to 50 classroom computers it flooded the network with garbage traffic and caused the entire University network (about 500 computers at the time) to crash. It took the network team about two days to figure out what the problem was.
When I was 12 I put the BIOS chip from one motherboard (it was still the kind of EEPROM with pins) into another in an experiment.
Sadly I didn't know what the orientation of the pins was or what the little dot meant (pin 1) so I must have reversed them.
I put the BIOS chips back, but I had fried both boards.
Working on Cisco command line, I was in the habit of typing "no " and doing a double-click-middle-click on the line I wanted to delete. Worked very well except for
(IIRC)
redistribute bgp 100 metric 100 metric-type 1 subnets route-map BGP2OSPF
In this specific case, copying the entire line after "no " does not remove the line, it just removes the route-map limitation, and hey presto, I was redistributing our full BGP into OSPF. The clincher was that it took some 20 minutes for the network to actually stop working, so by that time I had totally forgotten about it. It took an hour to find out what the problem was and to correct it, during which my ISP was basically off the network.
*I had an off-by-one error in a TopCoder problem (I used > instead of >= in a loop) that I didn't catch that cost me $3000 in prize money and a trip to the finals.
*I was working at an observatory on campus and left the huge, Peltier-cooled CCD for the telescope on a table but still plugged into a computer and left for the day. When I came back, I found that someone had tripped over the cable, smashing the CCD on the floor. They then set the broken CCD next to the computer without a note or anything. $7000 CCD destroyed.
*Another time I was working with an AFM in a basement of the university, and left for the day. It stormed really hard that night, and when I came back the next day the basement had 6 inches of water in it. It turns out that the water had come from a leak directly above the AFM. I guess the AFM didn't like getting a shower in filthy storm water and it cost $20-$30K to replace.
*However, my biggest save was probably more important than all of that combined. Without divulging too many details, I was writing some tests and caught a serious data-loss bug in production before any customers were affected by it. The bug actually made the news: http://www.theregister.co.uk/2...
I'm the amateur programmer who first programmed the code for Lawrence Lessig's Mayday PAC. I don't know if you remember this, but the site went down on May 2, for about 8 hours, when we were raising roughly $10,000/hr. I had built everything on a LAMP stack and sent everything through a single MySQL database, which just didn't scale. (I was - and still am - an amateur). Luckily, pro developers stepped up and staunched the bleeding, and eventually we moved onto a Ruby-on-Rails system for the front-end and a NodeJS/Google App Engine solution for the backend.
Back when Linux was much more primitive I had to set the video monitor parameters by hand-coding configuration files. And, by accidentally over-specifying the maximum sync rates, I "smoked" the flyback (horizontal output) transformer in a new 21" Sun monitor in short order. I typed in one wrong number and $$$.
Somebody unhooked the cable from inside a cabinet to a spectrum analyzer I was trying to use to monitor a signal I was setting up to a satellite. I thought something was broken and was messing around with the controls to see if anything happened. I finally found the cable wasn't connected about the same time the satellite controller came across screaming that I was about to burn out the satellite. I didn't, but it was a very close almost. When I plugged in that cable there was a huge spike on the screen.
Three people, working independently, made errors in programming and website updates which nearly bankrupted United Airlines when the errors came together on September 8, 2008. "Shares fell to about $3 from more than $12 in less than an hour, wiping more than $1 billion in value before trading was halted."
When the market first opened that Monday, United Airlines was trading at over $12 a share. The public summary of the events states that the Chicago Tribune re-indexed their archives, causing a six-year-old story about United Airlines' bankruptcy to be re-posted on the Web site of The South Florida Sun-Sentinel without a date. Google picked up the "new" article, saw the missing date, and inserted the current date of 9/8/2008. That article was picked up by a research firm, Income Securities Advisers, which then posted a link to it on a page on Bloomberg News, which sent a news alert based on the old article. The news alert triggered automated trading systems to issue sell orders. Nasdaq finally ordered a halt in trading the stock at 11:08 a.m., but the damage had been done: United Airlines stock had lost 75% of its value.
I do not deploy Linux. Ever.
Underestimating time needed happens all the time in the software industry. It probably is worse in the gaming industry where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my job now where I've put in ~100 hour weeks to fulfill (telecommuting many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160 hour weeks in the office for a game release crunch (and no, that isn't all work - I slept on beanbag chairs in the testing room and they catered in meals, but at some point you're just so burned out and stinking of feet that you need a night sleeping at home and a long shower).
I can't think of any instance where I've cost a project, but I'm sure they exist. OTOH, I did have a workaround for a $5 million contract where the customer was going to reject our Linux port due to a bug I had found and reported. The developer and pubs person assigned the defect were laid off after 9/11, so the defect slipped through to the customer. Fortunately, I overheard a sales person talking about it and supplied the workaround, saving the contract.
When I was a junior programmer working on a mainframe, I was given a problem ticket for an intermittent issue. I stuck diagnostics into the code, but because my disk quota was far too small, I sent the output to a virtual printer that I looped back to my account. Unfortunately, after I got the whole testcase set up (couple hours) the mainframe crashed and I went for coffee along with the rest of the 300 users on the system, for the 10 mins it took to restart. After several days where I hadn't been able to make progress because of the suddenly frequent mainframe crashes, I got a message from the operator asking me to delete my large spool files, since the mainframe was crashing due to a lack of spool space. That's when the penny dropped that my testcase had been exhausting the system spool space, crashing the mainframe about 8 times. Probably $100,000 in lost labour.
Years later, working on extending some high reliability software, I found some bugs in pre-existing code. The system had some internal checks and watchdog timers that would force a restart if it thought some code was taking too long. Both bugs would trigger the restart system by making something take too long and triggering the watchdog timer. One was in very complicated code, but explained some intermittent issues we'd seen over the years. The other was in a newly released, still unused utility, that didn't work properly on old HW, but would need to be re-written to fix. I only had time to fix and test one bug before going on a month long vacation, so I fixed the complicated one. While I was on vacation, an alpha release of the product went out, and promptly started crashing intermittently with stack corruption issues. I got back, to find six such tickets on my desk. In the meantime, the broken utility had acquired some users, so I decided to spend a couple of days fixing the utility.
It turned out that the stack corruption issue was holding up the production release, worth many millions of dollars.
Of course, I wasn't able to reproduce the intermittent stack corruption.
I spent 3 weeks looking everywhere, trying anything to reproduce it, resorting to rebuilding the alpha load where I could sometimes reproduce it, but not if I loaded my diagnostics.
Meanwhile, management was getting very antsy about the revenue implications.
My boss was very good, and shielded me from the flames, but I didn't like seeing him getting fried, as the release date kept getting pushed.
I tried hunting around to see if anyone had been changing code in that area of the system, but of course, there were only my updates. I asked anyone I could find for suggestions, and nobody had any ideas until one person said it reminded them of one very old issue they'd worked on, and described the problem they'd had.
I went back and checked my archived output. Sure enough, I'd been a bit careless testing the broken utility before fixing it. I only checked that my testcase triggered a restart, not why. It turned out that long before it could trigger the watchdog timer, the utility corrupted the stacks of other processes.
I'd just spent 3 weeks holding up an important release, because I didn't realize I'd already fixed the bug.
Most people don't realize that 100/full on one side and Auto on the other will, working exactly as designed, end up as 100/full on one side and 100/half on the other: a duplex mismatch. I've seen that problem many times.
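To spell out why that happens: a port forced to a fixed speed and duplex stops sending autonegotiation signals, so the Auto side can only detect the speed from the link signal and falls back to half duplex. Here is a toy Python model of that rule. The function names and the `"speed/duplex"` string format are invented for this sketch, and it assumes both ends support 100/full as their best common mode when both are set to Auto.

```python
from typing import Tuple


def resolve_link(side_a: str, side_b: str, best_common: str = "100/full") -> Tuple[str, str]:
    """Return the (speed/duplex) each side ends up with.

    Each side is either "auto" or a forced setting such as "100/full".
    """
    def outcome(local: str, remote: str) -> str:
        if local != "auto":
            return local           # a forced port keeps its own setting
        if remote == "auto":
            return best_common     # both negotiate: best common mode (assumed)
        # Partner is forced and silent: speed is detected from the signal,
        # but duplex cannot be, so the auto side falls back to half duplex.
        speed = remote.split("/")[0]
        return f"{speed}/half"

    return outcome(side_a, side_b), outcome(side_b, side_a)


def has_duplex_mismatch(result: Tuple[str, str]) -> bool:
    """True when the two ends disagree on duplex."""
    a, b = result
    return a.split("/")[1] != b.split("/")[1]
```

So `resolve_link("100/full", "auto")` gives `("100/full", "100/half")`, which `has_duplex_mismatch` flags, while Auto on both ends or the same forced setting on both ends comes out clean.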
Learn to love Alaska
It is all good. I can not blame you for not commenting. You may well still work there or still be covered by some sort of contract such as an NDA. I wouldn't recommend violating any such things - a job is not worth losing for idle banter with random pixels nor are said random pixels worth a court case.
"So long and thanks for all the fish."
...a second too early. Worked as broadcasting engineer and cut short a commercial by one second. Lucky me, that was in the middle of the night, so the damage was not that bad. As you may have guessed, that was in Germany and quite a while ago. When I watch commercials on US TV they get cut off constantly, seems as if the ad customers are more forgiving here. Working as broadcasting engineer was awesome except for the craptastic hours and the constant stress of not being allowed to make even a tiny mistake.
The implications of deciding one way not the other were a million dollars worth of ironmongery (9.925in OD liner pipe) being run and cemented into the hole. That operation occupied a rig crew of 90-odd people for 8 days while I was on leave. When we drilled ahead, it became clear that I had been wrong. Total unnecessary cost was about 2 million dollars.
These days, I don't lose sleep for less than ten million. The fact that I still do work for the client suggests that they figure it's better to have me around than not.
A couple of years ago I got some grief for pointing out a problem on day 10 of a job, which people upstairs from me decided wasn't likely to be a problem. So they shelved the problem, told me in writing to shut up, and continued with the well. 3 months of work later, we'd made a beautifully-tuned geo-steered well ... and had to wait on weather for a major storm. And when we came back on location, the problem I'd been making a fuss about had come back to haunt us and forty million dollars worth of ironmongery and effort was junk. Several embarrassed faces upstairs, but all my fellow contractors knew who had said "We need to deal with this problem, now." when we were five million into the project. Who needs advertising?
Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Found a gaping goatse-sized security vulnerability in a package that had been outsourced, and the original contractors were long since gone.
It was less expensive to just kill the product which we had been selling for about 3 years than to re-engineer the thing from the ground up with new staff.
I've never heard of somebody *heating* a drive to recover a stuck head, but I've done the opposite.
Many a drive has been recovered by a day or two's stint in the freezer in a deflated ziplock bag. I'd imagine the principle is the same.
With cooling, you do have to watch out for condensation build-up as the drive defrosts. With the heating I'd worry about damaging the data on the disk (magnets in general do not like heat, so I'd imagine magnetic storage would similarly be a gamble).
Since I write software that writes software for machine tools, I have extra opportunities to break things.
There's a technology called Electrical discharge machining, which means putting stuff close together in a fluid, running current through them, and having sparks burn off little pieces of material until you've got what you want. One manufacturer makes machines that have sophisticated programming, but it's not at all safe. Once, with the support guy from the company we got these from looking over my shoulder, I made a slight mistake that caused the arm of the EDM machine to slam against the metal we were machining, for a $16K repair.
Another time, a variable contained a Z level (height) that was used for two different things, but for everything we'd done up to then the two different things shared the same value. I was the guy who made the change that made the difference significant, and so some of our CNC mills thought the metal being machined was significantly lower than it was, so the setup moves for the machining that assumed the endmill was moving through air tried slamming through the metal. Some of the results were spectacular, although I never did find the cost.
Fortunately, at least for my self-esteem, people more experienced than me were supervising each of these mistakes, so I didn't feel too stupid, and my colleagues were very understanding.
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
A co-worker of mine had just finished implementing a new caching system for a legacy app that interfaced between multiple systems and the mainframe to track progress and shipping of pilot production runs. Due to a bug in his code, in a very specific use case, one of the cached systems would not get flushed. This was identified a few days after the production release when the company (a multi-billion dollar food sciences multi-national corporation) received a phone call from a Pastor in BFE, Minnesota asking why we had sent him almost 500 gallons of ice cream. Apparently, his church's address was in the system from some charity event we had sponsored. Since the ID and business type didn't flush from the previous transaction, when the pilot plant told the software to print labels for the next order, it pulled the shipping address from the wrong database, and the ID just happened to collide.
The cost of shipping the ice cream back for disposal was ridiculous. So the company told the Pastor to have a huge ice cream social.
The responsible developer was not fired, but there were running gags about him being the Ice Cream Man for the next year.
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
Do you happen to work for RIAA? They tend to sue people for causing them losses like these.
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2