Slashdot Mirror


Ask Slashdot: How Much Did Your Biggest Tech Mistake Cost?

NotQuiteReal writes: What is the most expensive piece of hardware you broke (I fried a $2500 disk drive once, back when 400MB was $2500) or what software bug did you let slip that caused damage? (No comment on the details — but about $20K cost to a client.) Did you lose your job over it? If you worked on the Mars probe that crashed, please try not to be the First Post, that would scare off too many people!

65 of 377 comments

  1. I'm retired now by Anonymous Coward · · Score: 5, Funny

    But back in the 1960's, I figured we could save a bit of money by only storing the year in our data records. No one would use my program decades later, right? Boy, was I wrong!

    1. Re:I'm retired now by Rei · · Score: 5, Funny

      I don't have anything nearly that bad - my worst only cost me data. A friend taught me (while I was still learning Linux) a trick, how you could play music with dd by outputting the sound to /dev/dsp. But as I said, I was still learning Linux and hadn't quite gotten all of the device names into my head, and I mixed /dev/dsp up with /dev/sda...

      --
      Dear Lord: One of your creatures may be hurt tonight. Please let it be the other creature.
    2. Re:I'm retired now by JaredOfEuropa · · Score: 5, Interesting

      I over-promised on a time estimate once, or rather: I let myself be convinced to pad the estimate. Not by a vendor but by the client! One of the client's systems was due for an upgrade, and between myself and the support guys in India I figured it would be a 19 man-day job. I would run it as a "small project" meaning that I could run it any way I wanted. However, the client asked me: "Can you make the estimate 21 days?" That meant it would be a "proper" project run according to the client's methodology, which the client preferred for budgetary reasons. I had nothing to worry about according to the manager, a PM would be assigned to me to take care of the project formalities. So I agreed.

      At the time I was not aware of the unbelievable bureaucracy of large multinationals, and what this would do to my project. Normally I estimate the amount of real work, and add 20% for project management overhead. Maybe another 20% for red tape. But in this case, the PM was more or less forced to involve an ever increasing legion of other teams from various Centers of Excellence in the client's organization. A simple upgrade turned into a project that ran for over half a year. And by agreeing to this approach, I probably cost the client around $300,000. Of course it was mostly their own organization that ran up the cost, and they asked for this in the first place, so they never gave me any grief.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    3. Re:I'm retired now by AmiMoJo · · Score: 4, Funny

      I'm writing firmware today that stores the date as a 16 bit unsigned integer giving the number of days since 1/1/2000. When printed it is converted to an 8 bit unsigned year and formatted with %02u (2 digits). I'm well aware that this will fail on 1/1/2100, but... I'll almost certainly be dead and no-one will be running this code in 85 years time, surely...

      I'm starting to feel bad about it now.
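
      A minimal sketch of the rollover described above, written in Python rather than the firmware's own language. The epoch, the 16-bit day count, the 8-bit year and the `%02u` format come from the comment; the function names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(2000, 1, 1)  # day 0 of the 16-bit unsigned day counter


def days_to_date(days: int) -> date:
    """Convert a 16-bit unsigned day count since 1/1/2000 to a calendar date."""
    if not 0 <= days <= 0xFFFF:
        raise ValueError("day count does not fit in 16 bits")
    return EPOCH + timedelta(days=days)


def format_year(d: date) -> str:
    """Format the year as an 8-bit offset from 2000 using %02u, as in the comment."""
    year_offset = (d.year - 2000) & 0xFF  # 8-bit unsigned year
    return "%02u" % year_offset


print(format_year(days_to_date(36524)))  # last day of 2099
print(format_year(days_to_date(36525)))  # 1/1/2100
print(days_to_date(0xFFFF))              # last date the day counter can hold
```

      In this sketch the day counter itself lasts until 2179; the two-digit formatting gives out first. Since `%02u` is a minimum width, the offset 100 prints as three digits rather than wrapping to `00`, so what actually breaks depends on whatever reads the output.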

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    4. Re: I'm retired now by khellendros1984 · · Score: 2

      About 14 years ago, I used Linux for the first time, after having used various versions of DOS and Windows starting around 1993. There was so much different about how you use the system, how things get done, and new mindsets to get used to. On top of that, discoverability of device paths, standard Unix utility names, etc is pretty terrible. So yes, "Learning" seems like the appropriate word.

      --
      It is pitch black. You are likely to be eaten by a grue.
    5. Re:I'm retired now by AmazingRuss · · Score: 2

      The moment I hear "Center of Excellence" I run for the exit.

  2. $24,000 by Anonymous Coward · · Score: 2, Interesting

    I was in charge of ordering a leak correlation system for the water utility that I work for. The system I chose was not quite what we needed, but it worked. One week after the warranty expired, I dropped the correlator unit and it has never worked since. I found out the correlator was unrepairable and we had to order a whole new system.

  3. Outage.. by steveb3210 · · Score: 4, Interesting

    I unplugged the wrong thing in a datacenter once, which took 20k domains offline. Traced the cable from the machine to the wall two or three times before pulling, too..

    They didn't have any cable management and only one border router..

    Didn't lose my job, I was a very young sysadmin who was learning but good at what I did.. everyone kinda shrugged it off as a lesson learned.

    1. Re:Outage.. by Anonymous Coward · · Score: 4, Informative

      DNS servers on the same subnet. You know, the thing you aren't supposed to do, but everyone does anyway.

    2. Re:Outage.. by jellomizer · · Score: 2, Insightful

      As with most mistakes, it is part of a system that is faulty and awaiting one simple mistake to escalate.
      Any one human can make a mistake. However a good system should have built in methods to protect against this.
      Why wasn't there a backup system? Why didn't it have a failover network/power? Why wasn't there proper labeling?

      Chances are there was a culture of trying to save money: paying for a redundant system costs twice as much, or more, and having those network guys spend hours cleaning up and reorganizing takes them away from more profit-driven activities.
      They are so focused on being agile and quick that they let little things slip.

      For 99% of the failures and mistakes that happen it is the fault of the system, and not of the person who happened to make mistakes.

      Organizations need to prioritize these methods and follow up to make sure they work. Not just write them down, post them on some intranet, and blame people when they weren't followed. It takes the full organization to make sure checks are in place.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    3. Re:Outage.. by turbidostato · · Score: 4, Interesting

      "As with most mistakes, it is part of a system that is faulty and awaiting one simple mistake to escalate."

      Couldn't agree more.

      "Chances are there was a culture of trying to save money"

      Sometimes the "cargo cult" is so ingrained that even the techs are unable to see it.

      Anecdote:

      Was in a hiring process, don't remember if it was Google or Amazon. One of the questions (from a hands-on tech team lead) was about a single server that went crazy and couldn't spawn any more processes, so it was almost impossible to do anything with the computer. It was still offering whatever services it hosted just fine.

      It went more or less like this:
      Me: Has this happened before?
      Recruiter: Nope.
      Me: So... Can I try this, or that, or this other one?
      R: No, because you can't run any new process.
      M: OK, reboot it (I of course know that saying something like that is taboo for a Unix/Linux sysadmin). Let's look at the boot messages to see if we get some clue, and let's monitor it afterwards to see if this happens again. If it does, we will be in a better position to diagnose; if not, we will put it on the "computer gnomes" account.
      R: Won't try to diagnose anymore before rebooting?
      M: Nope. My time is valuable and there will surely be more productive things on my to-do list.
      R: But the computer hosts a service that, if turned off, will cost the company a bazillion!
      M: Nope. If that were the case, the powers-that-be would have engineered the service with high availability in mind, which in turn means we could reboot the server without further hesitation. Since that's not the case, the implication is that the business already considers it a non-critical service, so my point above about my time costing money still applies.
      R: But, but, but...
      [...]

      Of course, I knew from the very beginning that the answer he wanted was to find a way to list the processes without spawning a new process, so after a while I went down that route. I vaguely remembered there was some Bash built-in that would allow me to do it, though not exactly which one. But at the time I wanted to see the culture of that place.

      Needless to say, I wasn't hired. But I didn't want to be hired either. Not within that team, at least.
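
      For reference, a sketch of the idea the interviewer was apparently after: on Linux, the process list can be read straight from the `/proc` filesystem without running `ps`. In the interview scenario this would have to be done with shell built-ins alone (for example `echo /proc/[0-9]*` in Bash, where the shell itself expands the glob), since starting Python is itself a new process. The Python below only illustrates what is being read; the function names are hypothetical.

```python
import os
from typing import List


def list_pids(proc_root: str = "/proc") -> List[int]:
    """Return the numeric entries under proc_root; on Linux these are process IDs."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


def read_command_name(pid: int, proc_root: str = "/proc") -> str:
    """Read the short command name exposed in <proc_root>/<pid>/comm."""
    with open(os.path.join(proc_root, str(pid), "comm")) as handle:
        return handle.read().strip()
```

      On a Linux machine, `[(pid, read_command_name(pid)) for pid in list_pids()]` gives a rough equivalent of a process listing, though a process may exit between the two calls.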

    4. Re:Outage.. by JSG · · Score: 2

      My DNS servers are on the same subnet and there isn't one cable anywhere you could unplug that would take them both offline.

      What about:

      * Router misconfig, takes out default gateway for a while for both
      * An extra cable is added and {MR}STP was disabled by accident or something like that.
      * etc etc

      Anyway, your proud boast may one day discover that people do the funniest things. If your DNS servers are in fact the same box with two IPs ...

    5. Re:Outage.. by Anonymous Coward · · Score: 2, Informative

      Be careful about criticizing others. Routers don't have default gateways, they have null routes. They can also be set up to be redundant gateways for others and have many redundant null routes themselves...

      Turning off STP on just one router would never be a problem. There are master and standby root bridges. Even if they both go down, others will step in to take the job. It would require a total network shutdown of all layer three equipment before it would be a problem and even then, ttl limits and excess traffic would cause the routers to drop one of the cables in the loop within seconds.

      This is entry-level networking knowledge.

    6. Re:Outage.. by steveb3210 · · Score: 2

      I unplugged the only border router.

    7. Re:Outage.. by Anonymous Coward · · Score: 3, Interesting

      That makes me think of a cleaner who for some unknown reason had the keys to open all rooms, including the server room. Around Christmas time she needed to find a wall plug for the Christmas tree. She found one in the server room with the switches/routers/UPS/backups/aircos (why she had a key to the server room, nobody knows) and just plugged the Christmas lights into an unused socket, between the UPS and the switches. Of course, as usual, the Christmas lights didn't work and short-circuited the network, which shut down the airco power supply. And she just left it there. It was winter, and the servers weren't heating up that much while just idling, but they started to heat up when work started again after the weekend and they came under heavy load. The servers started to shut down one after the other, and it was over 50 degrees Celsius in the server room. I was a programmer, but was ordered to help in emergencies, like dragging new server hardware in and out of the room, but spare aircos? That's something we didn't have. On top of that, all the airco specialists were on holiday; those bastards could get the days off during the end-of-year holidays, while the 'IT guys' always had to be present in case of failure. While the system administrators were close to having a heart attack, had already pulled out half of their hair because they couldn't find the problem, and were sweating like horses (remember it was over 50 C in that room), I was the one who noticed the Christmas tree, followed the cable that went over the dropped ceiling into the server room, and simply unplugged it. A few moments later the aircos turned on again, one after the other, and within half an hour the temperature went back to 26-27 degrees and the system administrators could restart the servers.

      I never told them what I did. I had some sympathy for the cleaner; she was a pretty smart Hungarian woman with a degree in law and philosophy that was useless in our country, and she worked hard (16 hours a day) to give her only son a chance to study in our country and get a decent degree and job. If I had told, she would certainly have been fired right at the time her son needed lots of money to spend on new books for the second semester. I told her of course that she should never enter the server room, and comforted her with the fact that I also was just a worker and didn't tell anyone.

      She was grateful for the whole time I worked there. I was eventually the one who got fired, for not wanting to create a Java Applet to power the client side of a web shop in 2011 (!!!!). Some marketing guy had read some completely outdated books about web shops (probably from the nineties) and decided that we also need such an advanced Java Applet based web shop.

      They actually wanted to do things with a web client, like editing photos with layers, like a mini Photoshop/Gimp, that simply could not be done with a web client (maybe it could be done with some advanced JavaScript, but I was no expert in JavaScript, and it would still be overkill for a simple website). They actually found a fresh-out-of-college 'Java expert' who was willing to pick up the job. The last time I checked their completely outdated web shop, the Java Applet simply could not be loaded because of security problems. The web shop was marketed to their customers so hard that it backfired enormously. Many customers ended up with malware because Oracle/Sun installed the Ask toolbar (most customers didn't have Java yet) and still couldn't run the Java Applet. So recommendations were made, like using XP with IE 6 to run the web shop, and that was in 2013, when the web shop was finally ready.

      Ultimately the business went bankrupt because once you go the online service way, customers will find other services when yours sucks

      My failure in this was that I could not convince marketing people that they were wrong and I was right. I was fired and found a new, more interesting, higher paying job while they ran their business into the ground in jus

    8. Re: Outage.. by turbidostato · · Score: 2

      "I just read you as, "not my problem""

      Yes, that's the case... from a certain point of view.

      I usually respect others' work enough to give them their due credit. In this case, it means I credit the system architect with being able to design the system properly. No high availability means it's not a critical server, so I adapt my procedures accordingly.

      "figuring out what went wrong is precisely your job"

      No, it isn't. My job is to produce the most value for the company within my assigned competencies. Sometimes that means scratching my head for hours to solve a problem. Other times it means rebooting/destroying a server without a second look and then going to the next item on my to-do list. You know, servers are not pets but cattle.

      "You sound like a dick answering a different question than asked"

      In fact, I didn't. I was asked to solve the problem, not to diagnose the problem and solve it without rebooting the server, and I honestly gave the answer I considered to be the most effective. As it turned out, it was not the answer my interviewer expected or wanted, but I'm fine with that: in a hiring process the prospective employee is interviewing the employer just as much as the other way around.

    9. Re:Outage.. by ultranova · · Score: 2

      Anyway, your proud boast may one day discover that people do the funniest things.

      Hmm...

      1. Create a domain.
      2. Have that domain host a single page saying "Nothing can take down this page."
      3. Have that page and DNS server hosted in a datacenter in an enemy country.
      4. Sit back and watch.

      Weaponized hubris - what could possibly go wrong?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    10. Re: Outage.. by jellomizer · · Score: 2

      The Job interview process is actually a two way process.
      The company needs/wants the resource; that is why there are open positions.
      The Person needs/wants a job or a better job, that is why they are applying.

      Now, even at the height of the last recession, and it was a big one, average unemployment in America was under 10% of the population. While that created a market where employers had the advantage, it was only an advantage, not supreme power.
      1. The employers wanted people who were currently employed (using the outdated reasoning that if they weren't laid off then they must be good enough to have made it). So while these applicants may be looking for a better job, they have a job currently and are only willing to take a better offer.

      2. If your industry isn't offering the type of work people want to do for the money anymore, then people may make life decisions to go a different route. Go back to school and study a new topic. Use their skills in a different industry.

      3. High turnover: Turnover is really expensive; on average it takes 150% of the salary to deal with an employee's departure: having to retrain new employees, catch-up time, etc. If your corporate culture is poison, then you will have a hard time keeping employees.

      I have been on some job interviews where I lost my temper with the recruiter. One company had a very particular piece of software (so particular that I couldn't find a relevant match for it with a Google search, except when I added the industry to it, and then it was a few pages deep). The recruiter kept on hounding me about this tool. I asked what it does, so that I could at least give a general, abstract answer to the questions. They didn't know either. From this interview I got the following impression: the guy who worked on the software (probably the guy who made it) left the company for a better job. They are trying to find someone with the exact skill set and pay them as much as the guy who left for a better job. So they let a good resource leave, and they haven't learned from their mistakes and realized that they will either need to lower the requirements or raise the salary and benefits.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
  4. Improper use of systems by pierced2x · · Score: 4, Interesting

    I used a system improperly over the course of a month. It connected to some services that ran up a $50k bill. I was mortified when my boss told me, thought for sure I'd be canned on the spot. I was only 22 and it was my first job out of college, so the amount was nearly double what I was being paid. The boss basically took the heat for not having explained it to me better, and I was not reprimanded in any way.

  5. Well... by Jethro · · Score: 4, Interesting

    I don't know what monetary cost they assigned to this, but this is the one I got in the most trouble for.

    Frankly, it was something I got blamed for. I guess I can take partial responsibility. You guys tell me.

    I was the only UNIX guy at this place. We were moving our Main Internal Server to a newer machine. I had set up a cron job to rsync all user data nightly, so that when we transition over the rsync would be faster.

    So, the big day comes. I come in on a weekend, do the final rsync, change some DNS entries, shut down old machine, bring new machine up. No problem.

    Next day everyone is working happily, everything is working smoothly, no worries.

    Or so I thought. Turns out the main developer wanted something off the old server, so he turned it back on to copy his files... and then left it up.

    So, during the night, the thing automatically rsyncs and overwrites an entire day's work for about 80 people.

    Definitely partially my fault for not disabling the cron job, but I was the only one who got in any kind of trouble at all for this (to the extent of almost losing my job, and frankly that was the catalyst for me leaving that place).

    --


    In the land of the blind, the one-eyed man is kinky.
    1. Re:Well... by drinkypoo · · Score: 5, Insightful

      Definitely partially my fault for not disabling the cron job,

      Or pulling the network cable. You have to plan for idiots, because there will be idiots. And odds are, they will outrank you.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:Well... by Jethro · · Score: 2

      They weren't supposed to, but the head developers were like gods at that place. They had the root passwords and I wasn't allowed to restrict them in any way.

      It stemmed from them being among the original 10 people when the company started, and even though the place was now a 200+ employee organisation, in some ways they still ran it like a 10-person operation.

      I did vocally complain about this. They quite often went in and overrode stuff I did.

      --


      In the land of the blind, the one-eyed man is kinky.
    3. Re:Well... by Jethro · · Score: 4, Interesting

      You know the old saying, "make something idiot-proof and someone will come up with a better idiot."

      They'd have plugged it back in. Again, the guy physically went into the server room and pushed a button.

      I certainly should've disabled the cron job or, better yet (as pointed out by AC down there) have known what rsync actually was and used that - I know I said I did in the original post but in retrospect I couldn't have as it wouldn't have overwritten everything. This was about 20 years ago...
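
      One way to make the "old server gets switched back on" case harmless is to have the nightly job check for a decommission marker before it copies anything. A minimal sketch, assuming a hypothetical marker file written at cut-over time; `sync` stands in for whatever copy command the cron job actually ran, which the thread itself is unsure about.

```python
import os
from typing import Callable


def should_run_sync(marker_path: str) -> bool:
    """The job may run only while the decommission marker does not exist."""
    return not os.path.exists(marker_path)


def nightly_job(marker_path: str, sync: Callable[[], None]) -> bool:
    """Run sync() unless this host has been marked as decommissioned.

    Returns True if the sync ran, False if it was skipped.
    """
    if not should_run_sync(marker_path):
        return False
    sync()
    return True
```

      The marker lives on the old machine's own disk, so it still applies if someone powers the box back on. It does not help against someone with root who deletes it.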

      --


      In the land of the blind, the one-eyed man is kinky.
    4. Re:Well... by R3d+M3rcury · · Score: 4, Funny

      Can't speak for a cost, but I thought this one was funny...

      A company I used to work for used Lotus Notes. For some reason, and I don't remember exactly what the reason was, I set up my e-mail to copy my mail to another account. I think it was just a "hey, I can do this" thing, playing with the e-mail system. Unfortunately, I made a typo in the name of the account to forward to.

      When I came in the next morning, the e-mail system was running really slowly. Everyone was complaining about it. I logged into my e-mail and, lo and behold, there were all sorts of e-mails in my account complaining about how it couldn't send this message to the other account and, of course, the contents of the e-mail was a message that it couldn't send this message to the other account, and the contents of that message was a complaint that...you get the idea.

      I turned off the script and deleted all the e-mails. And, suddenly, from the office next door, I hear, "Hey! E-mail is working again!"

      Shhhhh...
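
      The loop above can be reproduced with a toy simulation: every message that arrives is forwarded to a mistyped address, the failure report comes back to the same inbox, and that report is forwarded in turn. This is a sketch, not Lotus Notes behaviour; the message fields, the cap and the guard flag are all hypothetical.

```python
from collections import deque

MAX_MESSAGES = 50  # safety cap so the unguarded simulation terminates


def simulate(forward_bounces: bool) -> int:
    """Return how many messages land in the inbox after one incoming mail."""
    queue = deque([{"subject": "hello", "is_bounce": False}])
    inbox_count = 0
    while queue and inbox_count < MAX_MESSAGES:
        message = queue.popleft()
        inbox_count += 1
        if message["is_bounce"] and not forward_bounces:
            continue  # guard: never auto-forward a delivery failure report
        # Forwarding to the mistyped address fails, and the failure report
        # (wrapping the original message) is delivered back to this inbox.
        queue.append({"subject": "Undeliverable: " + message["subject"],
                      "is_bounce": True})
    return inbox_count


print(simulate(forward_bounces=True))   # hits the cap: the loop never ends on its own
print(simulate(forward_bounces=False))  # original mail plus a single bounce
```

      With the guard in place, one typo costs one bounce instead of a mail server's worth of nested complaints.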

  6. Patent filing missed. by Elf+M.+Sternberg · · Score: 2

    In 1993, I failed to file the US Patent on "A means of accessing a relational database via the Internet." If we'd known we could do it, CompuServe might still be around.

    1. Re: Patent filing missed. by Elf+M.+Sternberg · · Score: 3, Interesting

      No kidding. I'm glad we didn't. It means I can look at myself in the mirror. Career-wise, I've done okay without it. But it would have been a completely legal patent through which CI$ would have raked in millions and millions of dollars. And, as far as I can determine, it would have been completely legal. There was no MySQL, no Postgres; OraPerl had *just* been released and was barely stable on SunOS, and there were no known instances of a CGI / OraPerl gateway on the Internet until Pacific Power & Light asked us if it was possible to connect their consumer-oriented energy savings database to that new thing called "the world wide web."

  7. $480 Phone bill by Anonymous Coward · · Score: 2, Funny

    When I was 12 years old and hanging out on BBSs in 1989, I didn't realize dialing Gilroy from San Jose was long distance (Both were 408 area code). My parents were not pleased at the nearly $500 phone bill.

  8. $10k. .... per day by Anonymous Coward · · Score: 2

    I made a calculation error that cost $10k per day. Took 9 months to straighten things out.

    I later won an award for outstanding work.

  9. Software bugs by nodan · · Score: 2

    Some bugs I've been responsible for, although it's hard to tell exactly what they did cost:
    - rounding error when programming a timer in an embedded system, resulting in a baud rate that was 10% off, causing problems with several units shipped to customers
    - overflow of an 8-bit counter, resulting in a serial protocol failing

    Plus tons of other errors I forgot or haven't been aware of. Total damage for sure thousands of Euros. However, that's probably little for a 25+ years career mostly in software development.
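
    Both bugs listed above are easy to reproduce in a few lines. The sketch below is not the poster's code: the clock frequency, baud rate and 16x oversampling are arbitrary illustration values, and the function names are hypothetical. It compares a truncated timer divisor with a rounded one and shows an 8-bit counter wrapping.

```python
def divisor_truncated(clock_hz: int, baud: int, oversample: int = 16) -> int:
    """Timer divisor computed with plain integer division (rounds down)."""
    return clock_hz // (oversample * baud)


def divisor_rounded(clock_hz: int, baud: int, oversample: int = 16) -> int:
    """Timer divisor rounded to the nearest integer."""
    denominator = oversample * baud
    return (clock_hz + denominator // 2) // denominator


def baud_error_percent(clock_hz: int, baud: int, divisor: int,
                       oversample: int = 16) -> float:
    """Relative error of the baud rate actually produced by the divisor."""
    actual = clock_hz / (oversample * divisor)
    return 100.0 * (actual - baud) / baud


def next_counter(value: int) -> int:
    """Increment an 8-bit counter; 255 wraps around to 0."""
    return (value + 1) & 0xFF


CLOCK_HZ, BAUD = 8_000_000, 57_600  # assumed values, for illustration only
for divisor in (divisor_truncated(CLOCK_HZ, BAUD), divisor_rounded(CLOCK_HZ, BAUD)):
    print(divisor, round(baud_error_percent(CLOCK_HZ, BAUD, divisor), 2))
print(next_counter(255))
```

    With these numbers, truncation picks a divisor of 8 and runs about 8.5% fast, while rounding picks 9 and runs about 3.5% slow. Any protocol field that stores the counter has to expect the step from 255 back to 0.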

  10. A Photographic Slide by trabby · · Score: 2

    Lost a slide for 3rd party client that was to be featured in a skateboarding magazine.
    I think one of the coworkers stole it as I did not get along with them.

    Insurance claims for that kind of thing can involve the cost of setting up the shoot again, whatever that entails.
    Was fired not long after.

  11. About $2M -- But not really a mistake... by jnaujok · · Score: 4, Interesting

    Our group at FedEx released code that I wrote on a Saturday night. This was two days before the Apple iPhone 4 shipped. The code worked perfectly, however, despite our repeated warnings about nearly doubling downstream traffic, the downstream systems (like billing and tracking) weren't ready for it.

    So, on the day everyone wanted to track their new iPhone, my code shut down all tracking on FedEx for about 12 hours before we could switch the config setting (10 minutes) and the downstream systems could catch up (11+ hours).

    Estimate of cost was around $2 million in lost time and revenue and extra calls to customer service. Luckily, since I wasn't actually at fault, and we had multiple email chains backing up the volume estimates and warnings, we didn't get the axe.

    --
    Life, the Universe, and Everything... in my image.
    1. Re:About $2M -- But not really a mistake... by Tablizer · · Score: 3, Informative

      The poster was not the boss. The boss calls the final shots. The technician's job is to present the risks (trade-offs) as accurately and clearly as possible. If the boss(es) then choose to ignore the risk warnings, the blame falls on them. If you usurp their power, you are out the door (unless it's a legal matter).

      Incidentally, I was in a somewhat similar situation where marketing planned to release about 30 websites for satellite offices all at once along with a press release about the new sites. I pointed out our "budget-oriented" infrastructure may not be able to handle such a sudden load, and suggested staggering the releases. Other technicians agreed with my warning, but the marketing chief was really disappointed, saying something like, "It's better P/R to have one big release. Staggering the releases takes the punch out of it."

      I was tempted to respond, "30 crashed sites is not good P/R either", but smartly bit my tongue (based on prior experience with "reality" statements). He was a true P-H-B, always looking for a cheap short-sighted shortcut, but tried to blame us when his paper tigers got eaten. He drove one guy to retire early. Later he was under investigation for giving contracts to his buddies instead of basing them on merit. Not surprising, his buddies were also idiots.

  12. Fried an early... by michael_cain · · Score: 2

    digital signal processing chip from TI. The $750 (in 1986 dollars) wasn't the big deal. That the parts had serial numbers hand-lettered on them and I had to go back on the waiting list to get a replacement was.

  13. Took an online trading company offline for a day by Nonesuch · · Score: 4, Interesting

    I was hired as a firewall admin at an online trading company, then quickly discovered the director of IT was insane, but kept management happy because he made his numbers by keeping his team constantly understaffed; I was told to work on not just servers, but installing Sun servers in racks, running cable, and fixing just about anything plugged into the network.

    I made the mistake of showing competence in networking, so was asked to "expand my role" (new title, same salary), and start working on the switches themselves, including executing an "upgrade" to stacked HP ProCurve switches with VLANs (replacing a hodge-podge of random manufacturer switches). The actual upgrade went fine, basic testing (ping) showed everything stable, but as soon as trading opened the next day, everything went to hell, performance dropped through the floor and customers started calling in about trades timing out. Long story short, turned out that Solaris HME cards were unable to negotiate properly with ProCurve switches, half the machines were dropping packets due to duplex mismatches. There's a reason people call the Sun interface cards "Happy Meal Ethernet"

    Cost the company approximately $180,000 in direct and customer exodus losses, and was likely a factor in their eventual collapse. I wasn't fired, but management never trusted me again so I saw the writing on the wall, and quit to do consulting work at an (also doomed) dot-com online supermarket.

    On the upside, I was able to make thousands in consulting income from installing those same "lock speed to 100 and duplex to full" Solaris scripts on servers for various customers who also had performance issues plugging in Sun servers to cheap switches.

  14. I killed three networks, but that was planned. by swschrad · · Score: 2

    obsolescence, I got the task to shut 'em down. I also forced a worldwide recall of PC card disk drives in the switches that were the backbone of the Internet when we kept the vendor engineering on the phone all day for a failed switch... and read the duty cycle of the drives to them, like 5 minutes a shot, 10 minutes an hour, when they were running read/write continuously.

    but I got a haircut indeed when we had to get our stuff out of a colocate that was shutting down. built a mirror data system for that in the new place, had the trunks up, costed over the traffic. then it was time to demanage and power down the old shelf. telcordia assigned a code to the new unit that was one letter different than the old one.

    the good news is I got the new one back up in 20 minutes and they didn't stake me out over an anthill.

    --
    if this is supposed to be a new economy, how come they still want my old fashioned money?
  15. My $5 million bug by llib_xoc · · Score: 2

    We were writing a Unix program to parse transactions from some specialized terminals that read customer invoices and the checks that accompanied them, writing the transactions to digital tape to carry over to the mainframe system. During testing our tapes were compared to tapes generated by the legacy IBM system. Our team lead got a call from the customer liaison *early* one morning saying "Do you realize one of your batches was 5 MILLION DOLLARS SHORT?" Yes, she was shouting. Turns out that the $5 million transaction was the largest we'd ever tested with so far. All others were less than $999,999. It was my bug: I'd put the sign nibble (half a byte) on top of the most-significant digit of the packed-decimal payment-amount field on the test tape, dropping that digit from the field. Trivial fix; I had just been auditing the relevant code the previous day.
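
    A sketch of why this bug stayed hidden until a large amount came along. Packed decimal stores one digit per nibble with the sign in the last nibble (0xC for positive and 0xD for negative is the usual convention). The 4-byte field width and the function names are assumptions, and `drop_top_digit` only simulates the effect described above (losing the most-significant digit) rather than reproducing the original code.

```python
POSITIVE_SIGN = 0xC  # common packed-decimal sign nibbles
NEGATIVE_SIGN = 0xD


def pack_decimal(value: int, num_bytes: int) -> bytes:
    """Pack an integer as BCD digits followed by a trailing sign nibble."""
    digit_count = num_bytes * 2 - 1  # the last nibble is reserved for the sign
    digits = str(abs(value))
    if len(digits) > digit_count:
        raise ValueError("value does not fit in the field")
    nibbles = [int(ch) for ch in digits.rjust(digit_count, "0")]
    nibbles.append(NEGATIVE_SIGN if value < 0 else POSITIVE_SIGN)
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))


def unpack_decimal(packed: bytes) -> int:
    """Inverse of pack_decimal."""
    nibbles = []
    for byte in packed:
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign = nibbles.pop()
    magnitude = int("".join(str(n) for n in nibbles))
    return -magnitude if sign == NEGATIVE_SIGN else magnitude


def drop_top_digit(packed: bytes) -> bytes:
    """Simulate the bug's effect: the most-significant digit nibble is lost."""
    return bytes([packed[0] & 0x0F]) + packed[1:]


for amount in (999_999, 5_000_000):
    damaged = drop_top_digit(pack_decimal(amount, 4))
    print(amount, unpack_decimal(damaged))
```

    In a 7-digit field every amount below 1,000,000 has a leading zero, so losing that digit changes nothing. The first 7-digit amount comes back short by exactly its leading digit's worth.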

  16. I wonder... by waspleg · · Score: 4, Insightful

    How many people will refrain from posting because the statute of limitations hasn't run out yet?

    1. Re:I wonder... by dcollins117 · · Score: 4, Interesting

      How many people will refrain from posting because the statute of limitations hasn't run out yet?

      Well, I'm certainly not going to admit to the most costly mistake as it appears no one realizes it was me and what I had done. So I'm not gonna do it; wouldn't be prudent.

      The most embarrassing mistake was when I inadvertently brought down the client's network (a major hospital) in the middle of the day. Didn't realize what I had done until about three minutes later when about a dozen IT guys flooded the computer room paying particular attention to the area I was just working in. It appears I made an error. To this day I am likely persona non grata in that computer room.

  17. Click of death by Wowsers · · Score: 4, Interesting

    My worst IT disaster was suffering from a hard drive failure, click of death. I had warning of a few days of it, and I deliberately kept the pc on 24/7 instead of normal switch on/off, to make sure the drive stayed alive until its replacement arrived.

    Obviously I had to turn the pc off to change the drive, it was not hot-swappable. When I powered the pc up, the old hard drive failed, didn't work at all. I was faced with losing all the data on it. I left the drive alone for months wondering what to do, reading different ideas online, some of them weird.

    Eventually I decided to try the least destructive idea first. I put a sheet of paper on the failed drive to make sure the label didn't come off, and heated up the clothes iron, then applied the iron directly onto the top of the hard drive. When the drive casing was warm enough (not so hot as to make it hard to carry), I took it to my pc, and powered up.

    The failed hard drive came to life, and I managed to grab all the files on it onto the new hard drive, uncorrupted.

    Out of interest, the drive failed about three months before I was due to do a forced drive change as a backup / failure-prevention measure. I got lucky.

    --
    Take Nobody's Word For It.
    1. Re:Click of death by BlackPignouf · · Score: 2

      Wait, what?

    2. Re:Click of death by Anonymous Coward · · Score: 2, Informative

      Heating it up causes the metal to expand which can unjam a stuck head in some circumstances.

  18. Not sure how much $$$ by minimum · · Score: 2

    I used to work as an SDH/DWDM admin. In the early 2000s, my colleague screwed up a major firmware update on an STM1/4 ADM, and I, as the senior admin (haha - I was in the first half of my 20s), had to drive up to the site, since the affected node was unresponsive to the management system. After many unsuccessful attempts to recover it, at about 3 am I decided to hard reboot the node, which caused it to boot from the corrupt firmware bank (it had two of those), which in turn erased all the configuration, including the traffic connections (which are built very robust, btw). Since the site was on a (relatively small) island and had only 2 ADMs at the time, I more or less cut off all communication with the mainland. By morning, I had managed to get my colleagues to ferry me another, fully fitted ADM (our last-resort backup scenario was to replace the entire node) - but as it turned out, it had been fitted in a hurry with cards running different firmware (the entire network was in the middle of an upgrade process), which resulted in the same kind of useless "brick" I already had at hand. It was very cool, though, to fly at ~200km/h to the port and back in my sporty car to pick up the spare (not many police on the island, and I had a very good excuse). By the afternoon, my higher-up manager had mobilized a helicopter to personally deliver me a fully functional ADM, which we promptly swapped in, restoring the configuration from backup. I still have a copy of the local newspaper's front page, praising how our company heroically saved the day by restoring the connection with the outer world.
    At that time I was already able to make up excuses that would have made BOFH proud, which saved my ass.

  19. Other way round for me by Anonymous Coward · · Score: 2, Interesting

    I let a vendor sell me a product without really testing it. Turns out it didn't work (at all) and we lost €50k on license fees for a product we could not use.

    I was able to lay the blame on an accountant who had locked us into a 5-year contract in exchange for a minor discount. So I didn't get fired.

  20. F-16 panel flew off in flight by YrWrstNtmr · · Score: 4, Interesting

    Some other fool did not install the panel properly, and left one of the three nuts off. Distinctive nuts, used in only one place.
    Someone found it overnight, and held it up at the morning meeting. "Anyone know where this goes?" Unfortunately, I did not recognize it as a part of one of my systems.

    Aircraft flew, panel breaks off, punching several other holes in the side as it departs.
    Training mission aborted. Much sheet metal work needed.

    Actual repair cost? Unknown, but easily 5 figures if not more.

  21. Power cable mistake by Anonymous Coward · · Score: 2, Interesting

    Working for a desktop publishing house in IT. Spent just under $4000 on 36 inch flat panel displays. Accidentally plugged in printer power cable. Immediately fried monitor. My boss was not happy. The internship did not go well the rest of the summer.

  22. ~$60k by fox1324 · · Score: 2

    I was working as a Jr. Network admin, helping to install some new Cisco PoE switches to facilitate our building's move to VoIP phones. I aligned a brand new 48-port PoE switch slightly off when inserting it into the chassis, and bent the insanely-complex connector at the back of the card, rendering it unusable. Fortunately, we had a ridiculous service agreement with Cisco, and a new card arrived at our office within 4 hours. I distinctly remember buying burritos and beer for me and the Sr. admin to help make up for the fact that neither of us got to sleep that night.

  23. Just my time by corychristison · · Score: 2

    Six or so years ago I was using a (fairly cheap) Virtual Private Server as a dev/testing box for a pet project of mine.

    The VPS company was bought by a larger company, and prices were to double on the next billing period. I hastily chose a new provider without doing any research. I paid for 3 months of service in advance, got the container set up the way I like, migrated all of my data over, and was up and running.

    2 months in the new provider vanished, along with all of my data. I wasn't very concerned about the month's worth of money I had lost by not getting the 3 months I had paid for, I think it was only about $15. "Okay," I thought, "I'll just pull my data out of my nightly backups and move on." It turns out I forgot to adjust my local cron script that pulled the data over rsync to the new IP address. My backups had not been pulled in over 2 months.
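
    A silently failing backup job like this is usually caught by checking freshness rather than trusting cron. Below is a rough Python sketch of one common pattern: the backup script records a timestamp in a marker file only after a successful run, and a separate check complains when that marker is missing or too old. The marker file, function names and age threshold are all assumptions for illustration, not part of the poster's setup.

```python
import time
from pathlib import Path
from typing import Optional


def record_success(marker: Path, now: Optional[float] = None) -> None:
    """Call this at the very end of a backup run that actually succeeded."""
    marker.write_text(str(time.time() if now is None else now))


def backup_is_stale(marker: Path, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """Return True if no successful backup was recorded recently enough."""
    current = time.time() if now is None else now
    try:
        last_success = float(marker.read_text())
    except (FileNotFoundError, ValueError):
        # No marker, or an unreadable one, is treated as "never backed up".
        return True
    return current - last_success > max_age_seconds
```

    Run from a second cron job (or any monitor) with a threshold of a day or two, a check like `backup_is_stale(marker, 36 * 3600)` would have flagged the dead rsync target within days instead of months.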

    Luckily it wasn't very important, as it didn't make me any money and was mostly just for fun. I ended up starting over from scratch and ended up with a better system anyway.

    I learned my lesson, though.

  24. The Final Nail by Dartz-IRL · · Score: 4, Interesting

    The total cost was actually sweet FA in numbers terms, but I think I put the final nail in the company's coffin.

    My first 'job' was a jobbridge internship with a 'small' company. Small enough that I was literally person number three on the employee roster. The company worked in the renewable energy sector, and had been hammered pretty hard over the last few years by The Recession as domestic and corporate purse strings were pulled tighter and tighter.

    I was taken as an Engineer, but rapidly found myself wearing a wide range of hats from Sales, to Customer Support, to System Design, to Project Management, web development in PHP, and finally, IT Support.

    Because, one day, I managed to figure out why one of my colleagues couldn't log in to the server upstairs, and corrected the problem.

    I will say, the Server was the problem.

    It was a dinosaur. It was 14 years old - twice as old as the company - and had been bought second hand. It was a monstrous beige tower with a Pentium II processor and God Knows What else inside. It ran Windows Server 2000, and was solely dedicated to serving the company accounts and acting as networked file storage. Inside the case were four HDDs.... A pair of 9GB ones for the OS and programs, and a pair of 32GB ones for files. Both pairs were mirrored in RAID 1. It had a pair of lockable Zip disk drives still fitted, though the keys were long lost, along with a floppy drive and a CD drive with no write ability. Or ability to read DVDs.

    It creaked as it worked, then fumed, whuffed, whirred and occasionally burped. And it sat there, creaking away for years without thought or consideration to its well being or security. Until I came along.

    By this stage, it was obvious the company was dying - the Titanic had hit the iceberg a long time ago, and everything that was happening was just a desperate attempt to bail it out. We might've slowed the sinking - from two months, out to six, even buying a full year - but the abyss of liquidation always loomed.

    So, any suggestion of upgrading the server hardware was met by 'With What Money?'. At the same time, everybody knew the server was the lynchpin. If it broke, that was it - company gone. A suggestion that I use a spare computer from home was quietly discouraged - in case the company went under by surprise and someone decided to liquidate it to pay a creditor rather than give it back to me. Or we turned up to find the doors locked.

    The best I could do was schedule a backup of the accounts and a few other critical systems, and have it go somewhere offsite. I asked our webhost if we could use our spare space for it, and they were happy to let it happen, provided we didn't cause them problems. So, I set it to run the backup every Sunday morning - 1am or so. Each successive backup would overwrite the previous because there just wasn't the spare space to hold two (No money to pay for it)

    I figured even if the server went pop, or we had a building fire or some other catastrophe, at least those copies would survive. I'd figure out what to run them on afterwards.

    Someone, somewhere, should see the potential problem in this. In my defence, I am not, nor ever was, an IT professional. The software education I have is more related to the engineering side of things - making machines and robotics work with a view towards industrial automation, rather than the maintenance and setup of IT infrastructure and data security.

    I just did what I thought I could to keep the Titanic afloat.

    So, one Monday morning, I come to the office and am met by the shrill sound of metal screaming against metal at high speed. There's a heart-in-mouth moment as I realise that it's coming from the server cabinet.

    But, we have backups, I assured myself. The disks are mirrored in RAID 1, so if one drops out, the other should still be clean and working. If that fails, I've my own little backup too....

    Unfortunately - that only works if the damaged disk decides to drop out of the array.

    It didn't.

    I find th

    --
    So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
    1. Re:The Final Nail by drinkypoo · · Score: 2

      There's a clawing feeling that it was somehow 'My Fault'.... and it probably was. With hindsight, maybe I should've set it to run the backup while we were in the building, rather than at home over the weekend. I could've used an external drive to keep one locally too. There were probably a dozen things that I could've done that'd stop it.

      Only one thing which really mattered... verifying your backups. If you don't do that, there's almost no point in making any. (It gives you something to pray for...)
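
      As a small illustration of the cheapest form of verification, here is a Python sketch that compares a SHA-256 digest of the file that was sent with the copy that landed at the destination. The function names are made up for this example. It only proves the copy is byte-identical to the source; it cannot tell you the source was already corrupt, which is why a periodic test restore and more than one generation still matter.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source: Path, backup: Path) -> bool:
    """True when the backup copy is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(backup)
```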

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:The Final Nail by Tablizer · · Score: 3, Informative

      Databases should be backed up with a text-dump (such as an SQL INSERT list), not the actual database file, because of the internal pointers that are fragile. A text-dump "flattens" the pointers. If you do use the actual database file as a backup, shut all DB writing off first, during the backup. And keep multiple generations.
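
      To make the text-dump idea concrete, here is a minimal Python sketch using SQLite purely as a stand-in, since the database behind the accounts package in this thread is unknown. `Connection.iterdump()` is the standard-library call that yields the SQL statements needed to rebuild a database; the two wrapper functions are illustrative names only.

```python
import sqlite3


def dump_to_sql(conn: sqlite3.Connection) -> str:
    """Serialize the whole database as a plain-text SQL script."""
    return "\n".join(conn.iterdump())


def restore_from_sql(script: str) -> sqlite3.Connection:
    """Rebuild a fresh in-memory database from a dump script."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(script)
    return restored
```

      Restoring the dump into a scratch database and running a query or two against it doubles as a backup verification step.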

    3. Re:The Final Nail by Dartz-IRL · · Score: 2

      I honestly had no idea how it actually backed up, it was a function within the accounts application itself to generate the backup. Which it did, to a local disk. I then had an automatic scheduled upload of that backup to the server.

      Ultimately, like I said, I'm not really an IT guy - I was the one with google and enough patience to fuck about until things worked again. We didn't have one. We did pay one company a hundred quid a month for a while in case something went TU, but we stopped paying him six months before the final death just to make the dead plane glide those few hundred yards further.

      The most IT thing I've done is run a simple website off my own desktop at home, and maybe make a datalogger work with remote internet access.

      --
      So there I was, scribbling down some notes off the PC screen by hand, when I reached for the keyboard and Ctrl-S'd.
  25. Not my mistake, but my boss' by whoever57 · · Score: 2

    Not selling the company for $250M because he wanted $300M during the dot-com boom. My boss personally owned about 30% of the company at this point.

    --
    The real "Libtards" are the Libertarians!
  26. Re:Intel CPU sockets are terrible. by Bengie · · Score: 3, Informative

    Pretty much all modern Intel CPUs from the past many years.

  27. Re:Multiple multi-million dollar satellites. by Greyfox · · Score: 5, Funny
    Funnily enough at the satellite company I worked for that one time, one of the older guys there mentioned how he almost lost a satellite once by logging in to his own account and issuing a maneuver command to the satellite. Problem was the satellite was expecting times in GMT and got them in MST. Took them days to get it oriented correctly again.

    Now the programmers in the audience could probably think of like 10 different specific things that could be coded into the system to prevent that from happening, but this company didn't. Which really isn't too surprising. I asked one of the devs on the ground systems team if the ground systems were using GMT or UTC. His answer was "What's the difference?" I was able to infer from his answer that it was most likely GMT, and that did appear to be the case. Somewhere deep in the bowels of the system there was presumably some piece of code written by an Indian contractor with a math degree adjusting times for leap seconds, but it wasn't in any code that anyone knew about.
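
    One of those "10 different things" might look like the Python sketch below: refuse any timestamp that does not carry an explicit time zone, and convert everything to UTC before it goes anywhere near a command. The function name and the fixed UTC-7 offset used for MST are assumptions for illustration; nothing here reflects that company's actual ground software.

```python
from datetime import datetime, timedelta, timezone

# Mountain Standard Time as a fixed offset; daylight saving is ignored here.
MST = timezone(timedelta(hours=-7), "MST")


def to_command_time(when: datetime) -> datetime:
    """Return `when` converted to UTC, rejecting timestamps with no zone."""
    if when.tzinfo is None or when.utcoffset() is None:
        raise ValueError("naive timestamp: the time zone must be explicit")
    return when.astimezone(timezone.utc)
```

    With this in place, noon MST becomes 19:00 UTC on the way in, and an operator who types a bare local time gets an error instead of a maneuver seven hours off.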

    The early history of that company read like a Monty Python sketch. The first satellite exploded on the launch pad. The second satellite fell over and then exploded. The third satellite burned down, fell over, exploded and then sank into the swamp. The fourth satellite got into orbit and was promptly bricked by sending the wrong version of Windows(!) to it. To be fair they only had to do that because they launched it with the wrong version of Windows(!!) in the first place. One would think that ANY version of Windows would be the wrong version of Windows to shoot into space, but that's why you're not the head of a billion dollar satellite company.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  28. On the plus side, it discovered life... by Minupla · · Score: 2

    ... too bad it was here :)

    --
    On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
  29. Posted this before by Oligonicella · · Score: 2

    But it's worth repeating in this context. Thankfully, it wasn't me.

    When I worked at a KC bank, we had a Wire Transfer team manager who loved golf. He was supposed to come in Saturday and test a firmware/OS upgrade, then restore. Nice, sunny day Saturday, so he decided golfing would be better.

    Came in Sunday. Installed firmware/OS upgrade. Tested fine. Forgot to reinstall previous firmware and powered up old OS.
    Incompatible. Froze the machine solid. He panicked and tried for maybe four hours to fix things himself. No go. Finally called Cupertino for help 4+ PM.

    The techs had to be found, gathered and flown out from CA to disassemble said machine and reassemble. No wires until 1 or 2 PM Monday. Much money loss for all customers.

    To answer the obvious question, no - beyond my understanding, he wasn't fired or even demoted.

  30. $8m UPS modification. by thegarbz · · Score: 2

    One of my first engineering jobs out of uni involved modifying a UPS. This UPS had a massive battery bank that was quite dangerous to load test and didn't have an automatic load testing function. I came up with a small design involving a contractor and some minor wiring changes and we were part way through implementing it on every UPS at this site.

    This UPS was part of a redundant pair that fed an emergency shutdown system at an oil refinery. In between the UPSs and the ESD system were about 120 circuit breakers, two for each circuit, and one of them was off. We modified the first UPS without issue then started the process for the second one. After calling the control room to let them know they would receive an alarm I switched off the UPS and was suddenly met with a stream of profanities over the radio.

    We lost power to 80 field instruments which triggered a fail safe action on the shutdown system tripping 4 units at the refinery, one of them was the FCCU which is core to a lot of refinery processes. To add insult to injury the unit was unable to be hot restarted because of a stuck valve and then thermally contracted, breaking off large chunks of coke from the overhead line, which blocked the internal cyclones. The FCCU was down for repair for roughly 10 days, I had made a name for myself and was asked to display the cock-up award (a giant dildo mounted on a plaque) on my desk.

    Total cost of the outage was about $8million. Fortunately only partially my fault.

  31. Re: Multiple multi-million dollar satellites. by GrantRobertson · · Score: 2

    Wow. Just, wow.

  32. Re:Took an online trading company offline for a da by AmiMoJo · · Score: 2

    I knew a guy who did support for a multi million pound company. They had many problems, mostly due to the fact that he was too scared to reboot their servers because he did all the support remotely and it would be a 100 mile trip up to their office if the machine didn't come back up. They insisted that he do maintenance in the evenings or at weekends to avoid disrupting their work.

    So their terminal server was still running IE 7, because he was too afraid to update to IE 9 as it required a reboot. Someone actually got fired because they infected the server with a drive-by. Their mail server had a dodgy network card, but it took nearly a year to diagnose because he was terrified of updating the driver in case it didn't come back up, so that was just intermittently not responding or dropping incoming connections for over a year. The driver update fixed it in the end.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  33. Re: Multiple multi-million dollar satellites. by bitingduck · · Score: 2

    I talked to someone recently who lost a day of science data from a UAV because the Windows system driving the instrument decided to auto update while in the air with something like a 56kbps data rate.

    I recently built a field instrument and made it Linux based specifically to prevent things like that, as well as to keep power and latency down by being able to kill unnecessary background tasks.

  34. interesting synchronicity by epine · · Score: 2

    Just fifteen minutes ago I realized that my script to refactor the primary file server (newly converted to ZFS) into more sensible datasets had an irritating detail wrong (a path element was being duplicated in some paths).

    I said to myself "oh, I'll just roll that whole thing back to the snapshot I made 30 minutes ago".

    Then I go "zfs list -t snapshot" and discover that my snapshot was holding onto 0 GB because I forgot the -r switch to make the snapshot recursive.

    Oh, well. By some impossible-to-separate mixture of good management and good fortune, it turns out I had a set of (different) snapshots from the last two days covering all datasets in question. I lost very little work (only scripts were executed against these datasets and I still have all the scripts).

    My real screw up?

    Back in my second co-op workterm job, I managed not to notice that a system I was backing up changed the order of the listed drives between two very similar screen requests that I made almost immediately one after the other. Unfortunately, on the second pass I selected the active system drive as the recipient of the system backup, picking from the position in the menu where the desired destination drive had appeared moments before.

    I had become accustomed to my home system being deterministic in the order it listed things. My bad.

    This is back at the very beginnings of the 4.77 MHz era, so my PC was actually not yet what we now know as a "PC" (its father had an S-100, and its mother had an itty-bitty CRT).

    Thirty years later I still can't type dd of=/dev/ada3 without making three trips to the metaphorical bathroom.

    Whenever I type a disk-level dd command, I leave the sudo off, until after the third proof-read and several console consultations in which at least two different programs give me the same view of the drive name.

    In dollar costs I couldn't say. In psychic cost, it's indelibly etched onto my permanent record.

    I had a co-worker once (EEng) who claimed that as a junior intern during the late 1990s back when laser gear for fiber optics was all the rage, he routinely fried extremely delicate $2000 DUTs while the old hands just shrugged their shoulders. Dotcom dollars. Who really gave a fuck? It was considered barely worse than ruining a nice chair.

  35. I nearly cost my company millions by PhilHibbs · · Score: 2

    I nearly cost my employer several million by fixing a bug.

    The first task I was given in my new job was to look at an old system that printed labels to be put on containers of car parts. A message would come in on a serial cable saying what part was going to be needed within a few hours at a car assembly line, the parts were packed into stillages (a frame designed to hold a certain number of a certain part, like bonnets, bumpers, door panels, etc.) and when a stillage was full, or when a certain amount of time had passed since the first part was picked, then a label was printed, applied to the stillage, and it was dispatched over the road to the factory.

    Every time the serial number rolled over 9999 to 0001, the system would go wrong and stop working. This happened about once a month, and the help desk had a sheet of instructions on how to fix the problem. Some of the staff knew the fix off by heart.

    I looked at the code, found a roll-over bug, and fixed it. Everything was fine, and a couple of years went by with no problems.
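
    The post does not say what the roll-over bug actually was, so the following is only a generic Python sketch of a serial counter that wraps from 9999 back to 0001 without ever producing 0000 or 10000. The function names and the four-digit label format are assumptions.

```python
def next_serial(current: int, maximum: int = 9999) -> int:
    """Advance a 1-based serial number, wrapping from `maximum` back to 1."""
    if not 1 <= current <= maximum:
        raise ValueError(f"serial {current} outside 1..{maximum}")
    # The modulo maps `maximum` to 0, and the +1 shifts the range back to 1-based.
    return current % maximum + 1


def format_serial(serial: int) -> str:
    """Render the serial as it might appear on a label, zero-padded to 4 digits."""
    return f"{serial:04d}"
```

    The boundary cases worth testing are exactly the ones that bit here: 9998 to 9999, and 9999 to 0001.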

    Then, at 3 in the morning, the help desk called me and said that it had happened again. They didn't have the sheet of paper any more, and no-one could remember how to fix it. I rubbed the sleep from my eyes, and tried to get my brain into gear and remember what to do. It took me about an hour talking with a couple of help desk people, and between us we figured out what the fix was, and they called the warehouse and talked them through it.

    The next day I talked with my colleagues, and found out that we had come within a few minutes of triggering a penalty clause for halting the production line that could have run into millions of pounds. This was back in the '90s when millions of pounds were a lot of money!

    I looked back over the code, and found that there were actually two very similar bugs in the code, one of which happened fairly regularly, and one of which happened much more infrequently, but the same fix worked for both of them.

    Back when I first started working in IT, my boss told me, "One day, you will probably make your million pound mistake. In our business, we build systems that, over the course of our careers, will save millions of pounds in lots of small ways. Eventually you will make a mistake, and one of those systems will go wrong, and it might cost millions. Your employer will bear the cost of it, which is why we don't earn those millions ourselves. You have to be prepared for that eventuality. If it happens while you're working for me then I will kick your arse, and maybe I will fire you, but I'd be wrong to do so, that's just the nature of the business that we are in."

  36. Everyone makes $1,000,000 mistakes by NothingWasAvailable · · Score: 2

    During a panel discussion with very senior technical leads, the question came up: "How many of you have made a $1,000,000 mistake?"

    Every single one raised their hand. This was a very large semi-conductor company, and everyone had been involved in at least one instance where bad masks were made because a check was skipped or a step was botched in the design flow.

    I worked on a chip design where it took six design revs to get clean masks. All five of the prior revs had avoidable (human) errors during the design and build process.

    Pay me now (in time running checks) or pay me later (in NRE: non-recurring engineering expense) for bad hardware.

  37. Powerful mistake by gtarthur · · Score: 2

    Back in the 70's when I was still a junior electrical design engineer working for a distribution transformer company, we used algorithms loaded into TI calculators to compute the electrical, heat, and mechanical stresses. I later got the task of modernizing those codes and merging them with a FORTRAN code that another engineer had written and abandoned because it was too expensive to run. Things went well at first, we saved a lot of time and used that as any good engineer would to optimize our designs using different parameters to reduce cost and improve efficiency, both very important to my company and its customers. Then one day we got a limiting case which we didn't recognize at the time. As usual, one of our engineering assistants used the computer generated design and the old methods to validate the design. The engineer always takes responsibility for the design. After the build, the unit, a 3 phase unit that had 76,000 volt inputs, was tested in our "hi pot" chamber - a voltage pulse of the rated voltage but with reduced current and only for a short pulse. The center core winding turned into shards of copper spaghetti in the 8 foot tall tank. It cost $25,000 to repair, and delayed delivery for 3 weeks. My heart rate hit about 200 when the engineering manager called me and my supervisor into his office. Then he explained that he had run the calculations also, and discovered that our methods had a flaw in the prediction of the axial forces on the center coil. It was a very subtle mistake, and he said it could have been much worse. We were able to revise the code within a few hours, and that incident led to further improvements in methods and automation. It also taught me my most important lesson about computers - human error is the greatest risk. Real tests of your code sometimes do "blow up".

    --
    Every change is not progress, but there is no progress without change.
  38. underestimates... by Creepy · · Score: 2

    Underestimating time needed happens all the time in the software industry. It probably is worse in the gaming industry where publishing deadlines often get set 6 months or more in advance, but I still get hit with guaranteed release dates for customer commitments at my job now where I've put in ~100 hour weeks to fulfill (telecommuting many of these probably saved my marriage, as I would work 4 hours after my wife went to bed). Still, it is nothing like the 160 hour weeks in the office for a game release crunch (and no, that isn't all work - I slept on beanbag chairs in the testing room and they catered in meals, but at some point you're just so burned out and stinking of feet that you need a night sleeping at home and a long shower).

    I can't think of any instance where I've cost a project, but I'm sure they exist. OTOH, I did have a workaround for a $5 million contract where the customer was going to reject our Linux port due to a bug I found and reported. The developer and pubs person assigned the defect were laid off after 9/11 so the defect slipped through to the customer. Fortunately, I overheard a sales person talking about it and supplied the workaround, saving the contract.