Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?
NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!
But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!
I unplugged the wrong thing in a datacenter once, which took 20k domains offline. Traced the cable from the machine to the wall two or three times before pulling, too.
They didn't have any cable management and only one border router..
Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.
I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.
I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.
Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.
I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transitioned over, the final rsync would be faster.
So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.
Next day everyone is working happily, everything is working smoothly, no worries.
Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.
So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.
Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).
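For what it's worth, a guard along these lines is one way to keep a stale migration cron job from firing after cutover. This is only a minimal sketch in Python, not what was actually running there: the marker path, the function names, and the idea of wrapping rsync in a script are all assumptions for illustration. The `--update` flag is a real rsync option that skips files that are newer on the receiving side.

```python
import subprocess
from pathlib import Path

# Hypothetical marker file, created by hand on the old server once the
# cutover is finished. Its presence means "never sync from here again".
CUTOVER_MARKER = Path("/etc/migration_complete")


def build_rsync_command(src: str, dest: str) -> list[str]:
    # -a preserves permissions and timestamps; --update skips files that
    # are newer on the receiving side, as a second line of defence.
    return ["rsync", "-a", "--update", src, dest]


def nightly_sync(src, dest, marker=CUTOVER_MARKER, runner=subprocess.run) -> int:
    """Run the pre-migration sync unless the cutover marker exists."""
    if marker.exists():
        print(f"{marker} exists; migration is complete, skipping sync")
        return 0
    # Return rsync's exit status so cron can report failures.
    return runner(build_rsync_command(src, dest), check=False).returncode
```

The cron entry would then call something like `nightly_sync("/home/", "newserver:/home/")` (path and hostname made up here), and touching the marker file becomes one more step on the cutover checklist.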
In the land of the blind, the one-eyed man is kinky.
Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly; however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.
So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).
Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.
Life, the Universe, and Everything... in my image.
I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.
I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary), and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine, basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell: performance dropped through the floor and customers started calling in about trades timing out. Long story short, it turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, and half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet".
Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at an (also doomed) dot-com online supermarket.
On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.
I do not deploy Linux. Ever.
How many people will refrain from posting because the statute of limitations hasn't run out yet?
My worst IT disaster was suffering a hard drive failure, click of death. I had a few days' warning of it, and I deliberately kept the PC on 24/7 instead of the normal switch on/off, to make sure the drive stayed alive until its replacement arrived.
Obviously I had to turn the PC off to change the drive; it was not hot-swappable. When I powered the PC up, the old hard drive failed and didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.
Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my PC and powered up.
The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.
Out of interest, the drive failed about three months before I was due to do my forced drive change as backup / failure prevention. I got lucky.
Take Nobody's Word For It.
Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.
Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
Training mission aborted. Much sheet metal work needed.
Actual repair cost? Unknown, but easily 5 figures if not more.
The total cost was actually sweet FA in numbers terms, but I think I put the final nail in the company's coffin.
My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.
I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.
Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.
I will say, the Server was the problem.
It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as networked file storage. Inside the case were four HDDs.... A pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted, though the keys were long lost, along with a floppy drive and a CD drive with no write ability. Or ability to read DVDs.
It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.
By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.
So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.
The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous one because there just wasn't the spare space to hold two (no money to pay for it).
I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.
Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.
I just did what I thought I could to keep the Titanic afloat.
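The usual risk with a single overwrite-in-place copy is that a bad backup replaces the last good one. As a rough sketch only, and assuming the weekly backup is a zip archive staged locally before upload (the post doesn't say what format or tooling was used), a check like this could refuse to overwrite the offsite copy with an archive that is unreadable or empty. The function name and the `min_members` threshold are made up for illustration. Note the limit: it only proves the archive is internally consistent, so it would not catch files that were already corrupt when they were archived.

```python
import zipfile
from pathlib import Path


def safe_to_overwrite(new_archive: Path, min_members: int = 1) -> bool:
    """Return True only if the new archive opens, holds at least
    min_members entries, and every entry passes its CRC check."""
    try:
        with zipfile.ZipFile(new_archive) as zf:
            if len(zf.namelist()) < min_members:
                return False
            # testzip() returns the name of the first bad member,
            # or None when every member reads back cleanly.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # Not a zip at all, truncated, missing, or unreadable from disk.
        return False
```

The upload step would then run only when `safe_to_overwrite(...)` returns True, leaving last week's copy in place otherwise.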
So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.
But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....
Unfortunately - that only works if the damaged disk decides to drop out of the array.
It didn't.
I find th
So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
Pretty much all modern Intel CPUs from the past many years.
Now the programmers in the audience could probably think of like 10 different specific things that could be coded into the system to prevent that from happening, but this company didn't. Which really isn't too surprising. I asked one of the devs on the ground systems team if the ground systems were using GMT or UTC. His answer was "What's the difference?" I was able to infer from his answer that it was most likely GMT, and that did appear to be the case. Somewhere deep in the bowels of the system there was presumably some piece of code written by an Indian contractor with a math degree adjusting times for leap seconds, but it wasn't in any code that anyone knew about.
The early history of that company read like a Monty Python sketch. The first satellite exploded on the launch pad. The second satellite fell over and then exploded. The third satellite burned down, fell over, exploded and then sank into the swamp. The fourth satellite got into orbit and was promptly bricked by sending the wrong version of Windows(!) to it. To be fair they only had to do that because they launched it with the wrong version of Windows(!!) in the first place. One would think that ANY version of Windows would be the wrong version of Windows to shoot into space, but that's why you're not the head of a billion dollar satellite company.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
No kidding. I'm glad we didn't. It means I can look at myself in the mirror. Career-wise, I've done okay without it. But it would have been a completely legal patent through which CI$ would have raked in millions and millions of dollars. And, as far as I can determine, it would have been completely legal. There was no MySQL, no Postgres; OraPerl had *just* been released and was barely stable on SunOS, and there were no known instances of a CGI / OraPerl gateway on the Internet until Pacific Power & Light asked us if it was possible to connect their consumer-oriented energy savings database to that new thing called "the world wide web."