Slashdot Mirror


GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

356 comments

  1. Yawn... by Anonymous Coward · · Score: 5, Insightful

    No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

    This has been going on since the dawn of computing and it seems there's no end in sight.

     

    1. Re: Yawn... by Nutria · · Score: 5, Funny

      paki chimps in jungle

      Someone failed geography class...

      --
      "I don't know, therefore Aliens" Wafflebox1
    2. Re: Yawn... by Anonymous Coward · · Score: 3, Insightful

      No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.

      He's just a racist.

    3. Re:Yawn... by Anonymous Coward · · Score: 0

      Fucking amateurs they are. I keep telling this same stuff at work year after year, but nobody seems to listen. We've even had an Active Directory setup wiped out at one of the places I used to work (not kidding: a whole friggin AD setup...).

    4. Re: Yawn... by Applehu Akbar · · Score: 0

      "Genocide everything that shits."

      So, all humans then. You must be a Green.

    5. Re: Yawn... by Nutria · · Score: 0

      And baby deer, and Spotted Owls, and kittens.

      --
      "I don't know, therefore Aliens" Wafflebox1
    6. Re:Yawn... by Anonymous Coward · · Score: 2
    7. Re: Yawn... by volodymyrbiryuk · · Score: 0

      What about birds? They shit too... from above. They deserve to get killed twice.

      --
      sudo rm -r -f --no-preserve-root /
    8. Re: Yawn... by Anonymous Coward · · Score: 0

      paki chimps in jungle

      Someone failed geography class...

      Global warming must have transformed the Netherlands into a tropical jungle and the media did not notice. LOL

    9. Re: Yawn... by Anonymous Coward · · Score: 0

      If he weren't stupid, he might have noticed that GitLab != GitHub.

    10. Re:Yawn... by Big Hairy Ian · · Score: 2, Insightful

      Clearly their DR Plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!

      --

      Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

    11. Re:Yawn... by zifn4b · · Score: 1

      No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

      This has been going on since the dawn of computing and it seems there's no end in sight.

      You'd think so but the level of incompetence these days rivals the incompetence of 20 years ago. I just heard yesterday that a global multi-national company that's been around for years lost a file because "another file from a different source came in too soon and overwrote it". At that point, I did a complete facepalm because I was astounded that we still have software around running critical business operations sometimes even global operations.

      --
      We'll make great pets
    12. Re: Yawn... by Anonymous Coward · · Score: 0

      Humans are a species, not a race.

    13. Re:Yawn... by Anonymous Coward · · Score: 1

      It's one of those called "PowerPoint DR Plan".
      Everyone has one nowadays but unfortunately it isn't as effective as the real deal.

    14. Re: Yawn... by Anonymous Coward · · Score: 0

      Who denied humans were a species. Within the human species, there are different RACES, just as there are different races of dogs - only nowadays they call them 'breeds', so that we don't make the obvious connection between RACE, temperament and intelligence...

      Why aren't Africans coding this?

    15. Re: Yawn... by moronoxyd · · Score: 1

      What does GitHub have to do with Gitlab.com?

    16. Re: Yawn... by gman003 · · Score: 0

      The jury's still out on bears, though.

    17. Re: Yawn... by nitehawk214 · · Score: 0

      I guess only paki bears shit in the jungle.

      --
      I'm a good cook. I'm a fantastic eater. - Steven Brust
    18. Re: Yawn... by nitehawk214 · · Score: 1

      Sounds like they used the "mirror = backup" solution. Best way to destroy everything.

      --
      I'm a good cook. I'm a fantastic eater. - Steven Brust
    19. Re:Yawn... by Anonymous Coward · · Score: 2, Insightful

      And will continue to happen because experienced admins are expensive. Cheaper to hire the new grad who knows the buzzwords than the experienced admin who has lived through a couple of catastrophes and now knows how to plan for them.

    20. Re: Yawn... by Anonymous Coward · · Score: 0

      I'm worrying about what you've got against birds.

    21. Re: Yawn... by Anonymous Coward · · Score: 0

      "You must be a Republican"

      FTFY

    22. Re: Yawn... by Anonymous Coward · · Score: 0

      They got a bigger dick than the OP so he's jealous :)

    23. Re: Yawn... by Anonymous Coward · · Score: 0

      You get what you pay for. Fuck em.

    24. Re:Yawn... by shaitand · · Score: 1

      It is a tough scenario: executing DR plans for real involves disruption to the involved systems, so you either definitely impact your operations from time to time, or you take the risk that there might be some kind of disruption IF you ever need to fall back on DR.

      That is why most organizations actually have reliable backup systems that periodically verify data integrity, alert when agents aren't communicating, etc.

    25. Re:Yawn... by Penguinisto · · Score: 1

      But... but... The Cloud! The Cloud is our DR solution!

      (*chuckle*)

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
    26. Re:Yawn... by Anonymous Coward · · Score: 0

      This, plus to a lot of managers, backups have no ROI. I interviewed at a company where the CTO actually told me, "for a cloud-based company like us, asking us about 'backups' or 'uptime' is like asking a Tesla driver what type of buggy whip they use."

    27. Re: Yawn... by Anonymous Coward · · Score: 0

      And you must be a virtue signaling libtard, tell us of your heroics feats oh great SJW!

    28. Re: Yawn... by Anonymous Coward · · Score: 0

      Testing your fail over plan only impacts things if the plan doesn't work - but since you planned the failure event, you should be able to resume normal operations much more easily.

    29. Re:Yawn... by Anonymous Coward · · Score: 0

      It bothers me more that they blame this on an overworked sysadmin. Pretty much guarantees more problems in the future.

    30. Re: Yawn... by Qzukk · · Score: 3, Interesting

      There's two levels of redundancy. There's "oh my god the database server is on fire! Promote the replicated server to master and failover!" which, depending on the database, should take a few seconds to perform manually. Testing automation for this (pull the plug and see what happens) depends on your setup and how long it takes your heartbeat to decide that the server is dead and how (If we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds, it takes more than one failed connection for automation to decide the server is down).
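
      The "more than one failed connection" point can be sketched as a probe loop that only declares a server down after several consecutive failures. The function name is_down, the threshold, and the stand-in probes are all made up for illustration; real heartbeat tooling would add timeouts and pauses between attempts.

```shell
# is_down THRESHOLD PROBE [ARGS...]
# Runs the probe up to THRESHOLD times. Returns 0 ("down") only if every
# attempt fails; returns 1 ("still up") as soon as one attempt succeeds.
# Hypothetical helper: a real check would also sleep between attempts.
is_down() {
    threshold=$1
    shift
    failures=0
    while [ "$failures" -lt "$threshold" ]; do
        if "$@" >/dev/null 2>&1; then
            return 1    # one success is enough: do not fail over
        fi
        failures=$((failures + 1))
    done
    return 0            # THRESHOLD consecutive failures: treat as down
}

# Stand-in probes: `false` always fails, `true` always succeeds.
if is_down 3 false; then echo "failover"; fi
if ! is_down 3 true; then echo "keep primary"; fi
```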

      Then, there's "oh my god the datacenter is on fire!". This is what people usually call "Disaster Recovery". One dead server isn't a disaster when you have failovers, but when your entire datacenter is dead, THAT's a disaster. It's tough as nails to automate too, since without having at least three datacenters, it's inherently a split-brain issue. If Datacenter A stops responding to Datacenter B, which one is actually down? If you aren't an AS and can't just republish your IPs at Datacenter B with a BGP routing change, that means you're going to have to publish new DNS records and wait one TTL for everyone to see them. If you had an authoritative DNS server at Datacenter A, then hopefully it was able to recognize that it's down and shot itself (or at least updated its zone files with B's IPs) or you can somehow get to it and kill it, otherwise when Datacenter A comes back online, it'll be serving up A's IPs again and conflict with the other DNS server. This also is setting aside replicating your data between datacenters and how much of that is lost when you switch back and forth.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    31. Re: Yawn... by Anonymous Coward · · Score: 0

      Lol. Yes all of this!

    32. Re: Yawn... by Anonymous Coward · · Score: 0

      More like 90% Indians. I don't encounter too many annoying Pakistanis online, with the exception of some who are all following the same "how to make many dollar with SEO spam" YouTube videos.

    33. Re: Yawn... by Anonymous Coward · · Score: 0

      I stood up to a bigot today. What have *YOU* accomplished?

    34. Re:Yawn... by LeftCoastThinker · · Score: 1

      It is the classic human factor. The sysadmin probably knew all the right steps to take, but got lazy thinking he didn't need all the extra work and then, working late he made the mistake that all those steps would have protected against.

      No one is immune to complacency.

      --
      If you disagree, please post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like
    35. Re: Yawn... by Anonymous Coward · · Score: 0

      Backups are the last line of defense yet are also the last thing funded

      Sucks for the admin

    36. Re:Yawn... by The-Ixian · · Score: 1

      Luckily this is a lab environment and not production...

      --
      My eyes reflect the stars and a smile lights up my face.
    37. Re:Yawn... by AK Marc · · Score: 1

      Experts still call it a "backup plan". A "backup" is a copy. Nobody wants a copy. People want a usable service. There should never be a "backup plan". There should only be a "restore plan." So people stop saying "the backup completed without error, it should work". If you didn't restore it, you didn't test the restore.

      I've seen too many places with backups that completed without error that didn't have usable backups. And never tested a restore.

    38. Re: Yawn... by Anonymous Coward · · Score: 0

      Huzzah!

    39. Re:Yawn... by riffraff · · Score: 1

      At one of my jobs we were required to test our backups at least once a quarter. 8T database backups that took hours to restore, then hours to import, so several days of work, so that we made sure the backups were tested and accurate. It was a lot of work, but worth the trouble to make sure something like this doesn't happen. And with deduping and versioning, it should be easy to go back to an older version, even only a few minutes old.

    40. Re: Yawn... by Tablizer · · Score: 1

      No! It's right there in pop-up #73 of "Sarah Palin's Illustrated Geography".

    41. Re: Yawn... by chris_osulliva · · Score: 1

      if you have 2 data centers you should be load-balancing already!

    42. Re:Yawn... by K. S. Kyosuke · · Score: 1

      I thought Git was supposed to be its own backup, on the basis of all copies being complete (and basically a persistent data structure, if I'm not mistaken)? Shouldn't it be possible for developers to just sync the server with their local copies?

      --
      Ezekiel 23:20
    43. Re: Yawn... by Anonymous Coward · · Score: 0

      There's two levels of redundancy. There's "oh my god the database server is on fire! Promote the replicated server to master and failover!" which, depending on the database, should take a few seconds to perform manually. Testing automation for this (pull the plug and see what happens) depends on your setup and how long it takes your heartbeat to decide that the server is dead and how (If we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds, it takes more than one failed connection for automation to decide the server is down).

      But that's not addressing data loss swept away from the DB on a file level rm -rf. No failover is going to compensate for an rm -rf on that master copy of the db, you will still see up to a couple of seconds of data loss which is obviously too much. Accidental or non accidental loss of data. That's what backups are for.... OOPS!

    44. Re: Yawn... by volodymyrbiryuk · · Score: 1

      It was actually a sarcastic reply to "The most worthless race is the human race. Genocide everything that shits." But the sarcasm was never strong with the anonymous coward crowd.

      --
      sudo rm -r -f --no-preserve-root /
    45. Re: Yawn... by Anonymous Coward · · Score: 0

      Or, maybe they have outsourced already...

    46. Re:Yawn... by minstrelmike · · Score: 1

      We have a database we can actually export and import in less than 2 hours. So every night, we export production, drop our test database, and import the latest and greatest data. And that version is what we use for testing and what our power users use to modify data experimentally to see if they like the changes.

      We know fairly quickly if we have a bad "backup." I've been burned many times by files that won't open, won't load, won't whatever.
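
      A restore test along these lines can be sketched with plain tar: make the backup, restore it into a scratch directory, and compare the result against the source. The name verify_backup and the directory layout are assumptions for illustration, and mktemp -d is assumed to be available; a database would need its own export/import tooling in place of tar.

```shell
# verify_backup SRC_DIR ARCHIVE
# Archives SRC_DIR, restores the archive into a scratch directory, and
# compares the restored tree with the source. Succeeds only if the restore
# works and the contents match. ARCHIVE should be an absolute path outside
# SRC_DIR. Hypothetical helper, not a full DR test.
verify_backup() {
    src=$1
    archive=$2
    scratch=$(mktemp -d) || return 1
    status=1
    if tar -C "$src" -cf "$archive" . &&
       tar -C "$scratch" -xf "$archive" &&
       diff -r "$src" "$scratch" >/dev/null; then
        status=0
    fi
    rm -rf -- "$scratch"
    return "$status"
}

# Demo on throwaway data.
demo=$(mktemp -d)
mkdir "$demo/data"
echo "issue #1" > "$demo/data/issues.txt"
if verify_backup "$demo/data" "$demo/backup.tar"; then
    echo "restore verified"
else
    echo "backup is NOT restorable"
fi
```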

    47. Re: Yawn... by peawormsworth · · Score: 1

      ...they used the "mirror = backup" solution.

      I used that solution and it worked ideally. I ran a laptop off an SD card and set RAID to mirror to a 2nd SD card. To make a backup, I just pull one card out and stuff a fresh card in. That's it. The new disk would automatically sync up and be ready for the next backup when I choose to pull that one out. To restore, I just shut down the computer, put in the last backup and boot up. My computer system was instantly restored back to the last time I pulled out that disk.

      I didn't continue with this test set up because laptops do not have 2 SD cards and having a dongle hanging out of a laptop all the time is a recipe for loose and broken USB ports.

      The sysadmins told me that RAID should never be used as a backup. But I found this set up dead simple and the backup restore process was instantaneous.

      Perhaps RAID in mirror mode is not an ideal backup solution. But it is very close to the way backups should work.

    48. Re: Yawn... by nitehawk214 · · Score: 1

      Please repeat after me, "RAID is not backup." Not just, "not an ideal backup..." it is not a backup at all.

      If you accidentally do a "rm -rf /" or ransomware encrypts your entire drive, RAID won't help you a bit. If you don't test your backups, you are equally likely to fall into the same pit.

      That being said I do have mirrored disks on my main computers. A spinning disk is more likely to die than I am to get malware or screw up on the command line, and it is far easier to swap a disk than restore a system from backups. And just about everything supports Raid 1 in hardware these days and disks are dirt cheap.

      --
      I'm a good cook. I'm a fantastic eater. - Steven Brust
    49. Re: Yawn... by psycheitout · · Score: 1

      The importance of things like backups and redundancies are lessons people usually have to learn the hard way. I bet you gitlab won't ever make this mistake again

  2. I feel that lone sysadmin's pain by sixdrum · · Score: 5, Insightful

    A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

    rm -rf /home/user1 /home/user2 /home/ user3

    Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

    1. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Many years ago, there was a commit on gitHUB to a major project that was like this... I think it was nodeJS or something. It was attempting to delete /tmp/somefile and accidentally had / tmp/somefile for a single commit (apparently it wiped the CI test server before anyone caught it)... and the next commit was like 8 hours later, with some "choice" words in it.

    2. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      there is so much to learn from this
      you should either do cd /home; rm -rf user1 user2 user3 or simply use "mc" to delete stuff. i know that it is lame but deleting the wrong stuff is even worse. i rarely ever use rm -rf (especially as root) because it has such disastrous typo consequences
      and still, read twice before pressing enter

    3. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 5, Insightful

      That's why you always always run ls first.

      ls -ld /home/user1 /home/user2 /home/ user3

      Then edit the command to rm. Always.
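
      That habit can be wrapped in a small helper: list every target first and refuse to go on if any argument does not resolve to something that exists, which is what a stray space usually produces. The name preview_targets is made up for this sketch, and it only previews; the actual rm stays a separate, deliberate step.

```shell
# preview_targets PATH...
# Shows `ls -ld` for each argument and fails if any argument is missing.
# Hypothetical helper: it never deletes anything itself.
preview_targets() {
    missing=0
    for target in "$@"; do
        if ! ls -ld -- "$target"; then
            missing=1
        fi
    done
    if [ "$missing" -ne 0 ]; then
        echo "preview_targets: some targets do not exist; check for typos" >&2
        return 1
    fi
}

# Demo on throwaway data.
demo=$(mktemp -d)
mkdir "$demo/user1" "$demo/user2" "$demo/user3"
# The stray space turns one path into two arguments: "$demo/" and "user3".
preview_targets "$demo/user1" "$demo/user2" "$demo/" user3 ||
    echo "not running rm"
```

      Even when the check passes, the listing is the point: a bare parent directory showing up in the output is the cue to stop.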

    4. Re:I feel that lone sysadmin's pain by mmell · · Score: 5, Interesting

      Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).

    5. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      One of the causes of mistakes like this is the human tendency to underestimate their own capability to blunder. When doing something dangerous like this, acknowledge that you're capable of making typos and choose a form that helps you to protect yourself from your own mistakes:

      cd /home
      rm -rf user1 user2 user3

      Better still, don't actually throw anything away until you've checked and double-checked you made the right selection, and even then only after the next backup:

      cd /home
      mkdir removed_home_directories_2017_02_01
      mv user1 user2 user3 removed_home_directories_2017_02_01/
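
      The same move-aside idea as a sketch of a reusable function. The holding directory name is computed once and reused, so the mkdir and the mv cannot drift apart. The name quarantine and the removed_ prefix are assumptions for illustration.

```shell
# quarantine BASE_DIR NAME...
# Moves the named entries of BASE_DIR into a dated holding directory inside
# BASE_DIR instead of deleting them, then prints the holding directory.
# Hypothetical helper for illustration.
quarantine() {
    base=$1
    shift
    holding="$base/removed_$(date +%Y_%m_%d)"
    mkdir -p -- "$holding" || return 1
    for name in "$@"; do
        # Stop at the first entry that cannot be moved.
        mv -- "$base/$name" "$holding/" || return 1
    done
    echo "$holding"
}

# Demo on throwaway data.
demo=$(mktemp -d)
mkdir "$demo/user1" "$demo/user2" "$demo/user3"
quarantine "$demo" user1 user3
ls "$demo"
```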

    6. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Exactly right.
      Too many sysadmins have run rm -rf / because their hand slipped before typing the rest of the command.

    7. Re:I feel that lone sysadmin's pain by arglebargle_xiv · · Score: 4, Funny

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      Actually: Check your privilege! (Especially if rm -rf is involved).

    8. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      i hope you were not using "rm" to securely delete stuff before sending it to a third party :)

    9. Re:I feel that lone sysadmin's pain by Opportunist · · Score: 2

      That's fine until he decides that typing "rm user1 user2 user3 user4..." is too much of a hassle and he replaces it with a script that lists the directories and removes them all. ...blissfully forgetting that there is a ".." directory. Oh .., how many well intended scripts have thee turned into the spawn of hell...
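
      One way such a script can sidestep the ".." trap, as a rough sketch: build the list from a shell glob instead of parsing ls -a output, because a plain * never expands to "." or "..". The name list_subdirs is hypothetical, and hidden directories are deliberately not matched.

```shell
# list_subdirs DIR
# Prints the immediate subdirectories of DIR, one per line, by globbing.
# A plain `*` glob never expands to "." or "..", unlike parsing `ls -a`.
# Hidden (dot) directories are not matched either.
list_subdirs() {
    for entry in "$1"/*/; do
        [ -d "$entry" ] || continue   # unmatched glob stays literal; skip it
        entry=${entry%/}              # drop the trailing slash
        echo "${entry##*/}"           # print just the directory name
    done
}

# Demo on throwaway data.
demo=$(mktemp -d)
mkdir "$demo/user1" "$demo/user2"
touch "$demo/notes.txt"
ls -a "$demo"         # listing includes . and ..
list_subdirs "$demo"  # prints only user1 and user2
```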

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    10. Re:I feel that lone sysadmin's pain by Narcocide · · Score: 1

      cd /home
      mkdir removed_home_directories_2017_02_01
      mv user1 user2 user3 removed_home_directories_2017_02_01/

      chmod 0500 removed_home_directories_2017_02_01

    11. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Wow, that's bad. You don't even check for the cd's result before the deletion. What if /home is for whatever reason inaccessible?

    12. Re:I feel that lone sysadmin's pain by Megane · · Score: 1

      This is when tab completion is your friend, especially when you have path names with spaces in them. Also, for me the big one is overwriting stuff with the mv command (tab completion can make this easier to do), so I have it aliased to "mv -i". I almost never want to delete a file by overwriting it with the mv command.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    13. Re:I feel that lone sysadmin's pain by AmiMoJo · · Score: 4, Interesting

      mkdir ./trash
      mv file_to_delete ./trash

      If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    14. Re:I feel that lone sysadmin's pain by stridebird · · Score: 5, Informative

      Correct pattern is:
      > cd /home && rm ...

      ie don't run rm unless cd worked.
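
      A minimal sketch of that guard as a wrapper, assuming a made-up name in_dir. The subshell means the caller's working directory is left alone, and the command never runs if the cd fails.

```shell
# in_dir DIR COMMAND [ARGS...]
# Runs COMMAND inside DIR, in a subshell, only if the cd succeeds.
# Hypothetical wrapper; the subshell keeps the caller's directory unchanged.
in_dir() {
    dir=$1
    shift
    ( cd -- "$dir" && "$@" )
}

# Demo on throwaway data.
demo=$(mktemp -d)
touch "$demo/old1" "$demo/old2"
in_dir "$demo" rm -f old1 old2            # cd works, files are removed
in_dir "$demo/missing" rm -f old1 old2 || echo "cd failed, rm was never run"
```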

    15. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 1

      Agreed, but there are occasions where this fails too. I had ssh'd into an old server and proceeded to delete a few folders of old data before redoing the backup, the ssh connection had dropped while I was in another window and I ran the rm -rf command on the main server. Still gives me nightmares...
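
      For the wrong-server case, one possible guard is to make the destructive command depend on a hostname check. This is only a sketch: require_host, "old-server" and the path are placeholders, and it relies on uname -n reporting the name you expect.

```shell
# require_host EXPECTED
# Succeeds only if this shell is running on the host named EXPECTED.
# Hypothetical guard; `uname -n` reports the local node name.
require_host() {
    actual=$(uname -n)
    if [ "$actual" != "$1" ]; then
        echo "refusing: on '$actual', expected '$1'" >&2
        return 1
    fi
}

# Usage: the destructive command runs only when the host check passes.
# "old-server" and the path are placeholders.
if require_host old-server; then
    echo "would run: rm -rf /srv/old-data"
else
    echo "skipped"
fi
```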

    16. Re: I feel that lone sysadmin's pain by saloomy · · Score: 2, Informative

      I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!

    17. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      I always like to move items to a temp directory before deleting them and double-checking...

    18. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      when /trash is empty the cron job removes /trash and 48 hours later it does rm -rf /

      Right?

    19. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      So THAT'S what happened to my GeoCities page!

    20. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      There's no need to become all PC now.

    21. Re:I feel that lone sysadmin's pain by Megol · · Score: 1

      Or perhaps the operating system (shell) should prevent these kinds of errors? I guess it isn't macho enough...

    22. Re:I feel that lone sysadmin's pain by jez9999 · · Score: 1

      Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.

    23. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 2, Insightful

      Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?

    24. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P

    25. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Or, as Red Foreman would say, "Check my foot in your ass!"

    26. Re:I feel that lone sysadmin's pain by Angstroem · · Score: 3, Insightful

      The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P

      Then you have never worked on a repository with users of TortoiseSVN and the likes.

      "Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."

    27. Re:I feel that lone sysadmin's pain by Applehu Akbar · · Score: 1

      Moral: the command line is too powerful for puny humans who might not be totally attentive to every character being entered at all times.

    28. Re:I feel that lone sysadmin's pain by Applehu Akbar · · Score: 1

      Including that purist-hated trash can/recycle bin.

    29. Re: I feel that lone sysadmin's pain by gsslay · · Score: 5, Insightful

      This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse; on someone else's server.

      Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
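
      A rough sketch of what such a separately named script might look like, together with the timed cleanup. TRASH_DIR, the batch.XXXXXX naming, and purge_trash are all assumptions for illustration; the find options used (-mindepth, -maxdepth, -mmin) are available in GNU and BSD find but are not POSIX.

```shell
# Hypothetical "rm2": moves targets into a per-invocation subdirectory of a
# trash directory instead of deleting them. TRASH_DIR is an assumed variable.
TRASH_DIR=${TRASH_DIR:-"$HOME/.trash"}

rm2() {
    mkdir -p -- "$TRASH_DIR" || return 1
    # A fresh subdirectory per call keeps same-named files from
    # different calls from overwriting each other.
    batch=$(mktemp -d "$TRASH_DIR/batch.XXXXXX") || return 1
    mv -- "$@" "$batch/"
}

# purge_trash MINUTES: deletes batches older than MINUTES, never TRASH_DIR
# itself (-mindepth 1). Meant to be run from cron; 2880 minutes = 48 hours.
purge_trash() {
    find "$TRASH_DIR" -mindepth 1 -maxdepth 1 -type d -mmin "+$1" \
        -exec rm -rf -- {} +
}
```

      On a machine where the script is missing, typing rm2 just fails with "command not found", which is the whole point of not shadowing rm.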

    30. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Something similar happened here with a RAID1 array. The new disk was rebuilt over the old one instead of the other way around. Backups worked without issue, but damn stressful.

    31. Re:I feel that lone sysadmin's pain by 140Mandak262Jamuna · · Score: 1

      This. I do this.

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    32. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      I once accidentally formatted the wrong hard drive.
      I could feel a cold sweat when I thought "wait, was that the wrong one?"
      Luckily, it was mostly data I had on other drives as well plus some porn, so not much of value was lost.
      Thank fuck for windows backup.

    33. Re:I feel that lone sysadmin's pain by quenda · · Score: 1

      GUI? We don't need no stinkin' GUI!

      # mkdir junk
      # mv file1 dir2 .... junk
      # ls -la junk

        Look carefully!!
      # rm -rf junk

    34. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      I agree: cd into home, then delete the users' directories. Much safer.

    35. Re:I feel that lone sysadmin's pain by Zontar The Mindless · · Score: 1

      This is when tab completion is your friend...

      This. Very first thing I thought of as well.

      --
      Il n'y a pas de Planet B.
    36. Re: I feel that lone sysadmin's pain by rgbatduke · · Score: 5, Interesting

      Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.

      With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.

      Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    37. Re: I feel that lone sysadmin's pain by joboss · · Score: 1

      That's not the pro way to do it. You can have snapshots on the FS so you can do something like restore a file as it was an hour ago. You can do similar with database replication.

    38. Re:I feel that lone sysadmin's pain by donaldm · · Score: 1

      Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.

      May I ask what if you are required to do housekeeping on a corporate server that does not have a GUI?

      Answer: In the case of a corporate server whether it is classified as production, development or test, you raise a change request and get it signed off before you do anything.

      If you own the machine and are not answerable to anyone then any mistakes on your part will hopefully be a good lesson for you.

      --
      There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
    39. Re: I feel that lone sysadmin's pain by Joce640k · · Score: 5, Insightful

      I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!

      This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.

      If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.

      --
      No sig today...
    40. Re: I feel that lone sysadmin's pain by jeremyp · · Score: 1

      This is a bad idea. rm is a sharp tool and you should never do anything to it that makes you think it isn't. One day you'll be working on somebody else's system but you'll have forgotten that rm can be dangerous and you'll merrily delete something career ending, go look for it in /trash and then have to commit ritual suicide.

      --
      All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
    41. Re:I feel that lone sysadmin's pain by Joce640k · · Score: 2

      Exactly right.

      Exactly wrong.

      Learn to use "mv" instead of "rm -rf".

      eg. Create a folder called /trash and move the files there.

      When you see the system is still working and you need some disk space then you can empty the trash. Not before.
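
      A minimal sketch of that move-first workflow. The sandbox paths are made up for the demo; in practice the trash folder would need to sit on the same filesystem as the data so that mv stays a cheap rename.

```shell
work=$(mktemp -d)                       # throwaway sandbox for the demo
mkdir -p "$work/data/old_build" "$work/trash"
echo "maybe still needed" > "$work/data/old_build/notes.txt"

# Step 1: move instead of delete
mv "$work/data/old_build" "$work/trash/"

# Step 2: check that the system still works. If it does not, undo:
#   mv "$work/trash/old_build" "$work/data/"

# Step 3: only when the disk space is actually needed, empty the trash:
#   rm -rf "$work/trash/old_build"
ls "$work/trash"
```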

      --
      No sig today...
    42. Re: I feel that lone sysadmin's pain by RabidReindeer · · Score: 2

      There are many tricks. Personally, I like to tar stuff or do a ZIP-with-delete and keep it for a day or 3 before removal. For large quantities of data, that can take a while, though, so another possibility if one is working with snapshot-capable storage management is to snapshot it and work "offline" on the snapshot. I do this on VM images, for example.

      Hot mirrors updated just infrequently enough that you can break the link before the damage propagates isn't a bad idea, either. Filesystems with "time machine" rollback capabilities, too.

      You can pretty well bet that a backup is never as usable as it's supposed to be. It's going to be outdated, corrupted, or something critical won't be in it.

      To be really viable, you need to devote the same level of attention to backup/restore (accent on the restore) as you do on security management. There is a very strong case for keeping an entire server or set of servers to run frequent checks on backups, including bare-metal restores, and these days, a spare computer or 6 is not a bank-breaking investment any more if your data means anything to you. I also like to employ multiple types of backup media on the premise that not all types of hardware will be affected equally by most failures.

      At least apparently they only lost a few hours worth of work, and although potentially a large amount of data, it's a job that is inherently distributed among many disgruntled clients. Other organizations haven't been so fortunate.

    43. Re:I feel that lone sysadmin's pain by jabuzz · · Score: 2

      Oh I wish that were really the case. Unfortunately where a single run of a job on an HPC facility can produce 1TB of files that is not actually the case in the real world for everyone.

    44. Re:I feel that lone sysadmin's pain by rholtzjr · · Score: 1
      My major whoops early in my career.

      Brand new install of Slackware with kernel 1.2.8 (circa late 1994), which was a statically linked build. Thought I was in /usr/local/lib (the shell prompt only showed the current directory name, not the full path) but was really in /lib. Proceeded to rm -rf * to get rid of a test build (or so I thought). Then I wondered why, after about 10 seconds, the rm command was throwing errors. It seems that once rm hit libc.a, any and all operations ceased.

      After that I always had the root user have the full path in the shell. Luckily no data was lost that a quick reinstall did not fix. But people did start asking me why they were getting a bizarre error when trying to get their mail with their pop client.
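
      For anyone wanting the same safeguard, this is roughly what it looks like in bash. It is an assumed setup, not necessarily the one used back then.

```shell
# In root's ~/.bashrc (bash-specific prompt escapes):
# \u = user, \h = host, \w = full working directory
# (\W would show only the last path component, which is how /lib
# and /usr/local/lib end up looking identical)
PS1='\u@\h:\w\$ '
```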

    45. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Or don't use Linux and the CLI because it's not fucking 1998 anymore?

    46. Re:I feel that lone sysadmin's pain by Calydor · · Score: 1

      I think it was a patch to EVE Online that did the same thing, accidentally deleting / instead of some specific directory within the game.

      --
      -=This sig has nothing to do with my comment. Move along now=-
    47. Re: I feel that lone sysadmin's pain by Zero__Kelvin · · Score: 0

      Then you won't be able to accidentally delete it, will you? DOH!

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    48. Re:I feel that lone sysadmin's pain by John+Allsup · · Score: 1

      Also rm -rf /home/{user1,user2,user3} is safer: if you accidentally include a space, the braces don't get expanded at all:

      rm -rf /home/{user1,user2}

      is equivalent to

      rm -rf /home/user1 /home/user2

      but rm -rf /home/{user1, user2}

      is equivalent to

      rm -rf "/home/{user1," "user2}"

      so 'rm -ageddon' is avoided.
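
      You can check what rm would actually receive without deleting anything. This sketch swaps rm for printf, which prints one <...> per argument. It calls bash explicitly because brace expansion is not a feature of every sh.

```shell
# Well-formed braces: expanded into two real paths
no_space=$(bash -c 'printf "<%s>" /home/{user1,user2}')

# Stray space: the word is split in two and neither half is expanded,
# so rm would only see two literal names that almost certainly do not exist
with_space=$(bash -c 'printf "<%s>" /home/{user1, user2}')

echo "$no_space"     # </home/user1></home/user2>
echo "$with_space"   # </home/{user1,><user2}>
```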

      --
      John_Chalisque
    49. Re:I feel that lone sysadmin's pain by John+Allsup · · Score: 1

      Seriously, though, much more thought needs to be given to two things: one is making accidents harder, the other is making effective backups a no-brainer.

      --
      John_Chalisque
    50. Re:I feel that lone sysadmin's pain by BlackHawk-666 · · Score: 1

      Accenture made exactly this blunder on the London Stock Exchange website root folder (running on IIS). Some nimrod came in and accidentally deleted all the files from that folder taking about 30 different financial products offline. We noticed pretty quick and scrambled to restore from a backup.

      Funny thing is...some other nimrod or the same one did almost the same thing a month later, this time only removing a few key products :-)

      --
      All those moments will be lost in time, like tears in rain.
    51. Re:I feel that lone sysadmin's pain by BlackHawk-666 · · Score: 1

      I habitually shift-delete things because it saves a lot of time moving large folders with massive numbers of files into the recycle bin. I have been caught out by this once or twice over the years, but always had a recent backup and so have never lost anything that way.

      --
      All those moments will be lost in time, like tears in rain.
    52. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      It deleted boot.ini

    53. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      This. There is absolutely no reason for any backup / archiving process or system to have write / modify access to anything. Disaster recovery is a separate process and should be able to write to production only during approved recovery operations.

    54. Re:I feel that lone sysadmin's pain by sinij · · Score: 1

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      Actually: Check your privilege!

      Sudo is a real victim here. Let's not make it worse by engaging in victim-blaming.

    55. Re: I feel that lone sysadmin's pain by PincushionMan · · Score: 2

      You might want to look into Squashfs. The archive command for a single directory (or file) is:

      mksquashfs source_dir target_image.sqfs

      If you want to do multiple directories or files, no problem:

      mksquashfs source_dir1 source_dir2 source_file1 source_file2 target_image.sqfs

      Squashfs generation is comparable to that of tar.gz files. Not only does it do gzip compression natively, it can compress the inodes in the directory tree and also do fs level de-duplication. Squashfs is compatible with any kernel from 2009+ (maybe before), and newer kernels also have the ability to use lzo and xz compressors. It's intended to be used anywhere that you would use tar.gz or cpio, with the added benefit that you can mount it loopback and extract a file that you need without the overhead of sequentially scanning through the tape archive. I've heard the windows version of 7zip can access a squashfs archive as well (as of 16.04 it must be a gzip compressed sqfs image). Squashfs natively detects sparse files - unless you tell it not to.

      The only thing I'm not sure of is how well unsquashfs handles the extraction of sparse files. Linux tar is totally unsuitable when dealing with sparse files, as it requires the full amount of space to extract a sparse file. For Linux tar, there's a workaround for sparse files, and that is to install BSD tar, which seems to extract sparse files correctly.
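
      A small round trip, as a sketch: archive a directory, list the image, and pull a single file back out. The file names are made up, and the demo skips itself if squashfs-tools is not installed.

```shell
if command -v mksquashfs >/dev/null 2>&1 && command -v unsquashfs >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir "$work/source_dir"
    echo "hello" > "$work/source_dir/hello.txt"

    # Archive the directory; its contents become the root of the image
    mksquashfs "$work/source_dir" "$work/target_image.sqfs" >/dev/null

    # List the contents without extracting anything
    unsquashfs -l "$work/target_image.sqfs"

    # Extract just one file into a new directory
    unsquashfs -d "$work/restored" "$work/target_image.sqfs" hello.txt >/dev/null

    # Mounting read-only via loopback normally needs root:
    #   mount -o loop -t squashfs target_image.sqfs /mnt
    squashfs_demo=ran
else
    echo "squashfs-tools not installed; skipping demo"
    squashfs_demo=skipped
fi
```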

    56. Re:I feel that lone sysadmin's pain by aaarrrgggh · · Score: 1

      But for it to be effective you really need to mv into a dated directory such as /trash/$date/, which still makes a restore/recovery a complete pain in the ass... but hopefully avoids one delete from overwriting another.

      When you have a system that uses multiple levels of backup, making sure that all of them are always working takes serious commitment, and the trash concept doesn't change that. Doing good backups is hard, especially for large data sets and tight budgets. We had a fantastic system for our Linux server, but had to migrate to Windows and over a year or two our sensitive data grew from 3TB to 5TB, and the backup system couldn't accommodate our archive process for old data (it doesn't reduce the total backup size). So, you add another marginally tested layer of backups, just in case...

    57. Re: I feel that lone sysadmin's pain by TheRaven64 · · Score: 1

      That's going to make rm very slow unless everything is on a single filesystem, which makes backups difficult. We tend to put each user's home directory in a separate ZFS filesystem and have a cron job creating and pruning snapshots. If a user accidentally deletes anything, the snapshots are all automounted in their ~/.zfs directory so that they can just copy the older version out themselves. On the main network, home directories are all on the NetApp filer that does this automatically (though using their own filesystem and putting snapshots in the ~/.snap directory)
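
      Roughly what such a cron job could do, as a dry run that only prints the zfs commands. The dataset name, snapshot naming scheme, and retention count are all hypothetical, and nothing here touches a real pool.

```shell
dataset="tank/home/alice"   # hypothetical per-user filesystem
keep=3                      # hypothetical number of snapshots to retain

# Command the job would run to take a timestamped snapshot
echo "zfs snapshot ${dataset}@auto-$(date +%Y%m%d-%H%M)"

# Pruning: reads snapshot names oldest-first on stdin (the order printed by
# `zfs list -H -t snapshot -o name -s creation`) and prints a destroy
# command for everything except the newest $keep.
prune_cmds() {
    awk -v keep="$keep" '{ name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print "zfs destroy " name[i] }'
}

printf '%s\n' "$dataset@auto-1" "$dataset@auto-2" "$dataset@auto-3" \
              "$dataset@auto-4" "$dataset@auto-5" | prune_cmds
```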

      --
      I am TheRaven on Soylent News
    58. Re:I feel that lone sysadmin's pain by TheRaven64 · · Score: 1

      And when your home directories are all mounted over NFS, your mv command copies a massive amount of data over the network, fills up the local disk and, if run as root, breaks the system by filling up the emergency part of the FS reserved for the root user. Good plan.

      --
      I am TheRaven on Soylent News
    59. Re: I feel that lone sysadmin's pain by mrchaotica · · Score: 2

      Call your little script something like rm2 instead

      Or better yet, something that doesn't even have the string "rm" in it, like trash.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    60. Re:I feel that lone sysadmin's pain by TheRaven64 · · Score: 2

      I have. It's just as easy to accidentally click on the wrong folder, or delete the foo folder from the window showing bar instead of the window showing baz. This is why good UIs are all about making sure that there's an undo button that works after you've done the stupid thing, not about trying to make the stupid thing impossible. Most GUI systems will move things to the trash, rather than deleting. The problem is that users then get into the habit of reflexively emptying the trash immediately after a delete. You really want a filesystem design that adds blocks from deleted files to the end of a reuse list, so that new file allocation will overwrite the oldest deleted data by default and you can always undelete recently deleted things if you haven't written significant amounts of data in between the delete and the 'oh crap' moment.

      --
      I am TheRaven on Soylent News
    61. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Please don't remove rm -rf. It is our only defense against an impending AI hegemony.

    62. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      oh, look a terabyte of data... how quaint. We generated 2pb per day and handled archival with a 90 day rotation of hdd trays.

    63. Re: I feel that lone sysadmin's pain by budgenator · · Score: 1

      Oh hell yeah,
      format C:\ press any key to continue
        is sooo much easier, safer and modern in powershell.

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
    64. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      I often wondered about using inodes to create an effective trash can for Linux systems.

      Basically every file in the filesystem gets two links to its inode: one in a "trash can" directory and one wherever it normally lives. rm removes the link for the file you delete, leaving the one in the trash can.

      Once a week/month/whatever, a process goes through the "trash" looking for files with a single link. If one exists, delete it.
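
      The idea can be tried with ordinary hard links. This is only a sketch with made-up paths: hard links work only within one filesystem, and nothing here creates the trash link automatically.

```shell
work=$(mktemp -d)
mkdir "$work/trash"
echo "precious" > "$work/report.txt"

# Give the file a second name (hard link) inside the trash directory
ln "$work/report.txt" "$work/trash/report.txt"

# Deleting the visible name only drops one link; the data stays reachable
rm "$work/report.txt"
cat "$work/trash/report.txt"

# The periodic cleanup: trash entries whose link count has fallen to 1
# were deleted everywhere else, so they are the purge candidates
orphans=$(find "$work/trash" -type f -links 1)
echo "purge candidates: $orphans"
```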

    65. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      As a Sysadmin, you are in control.
      It is pretty easy to use the "alias" command if you, for some reason, don't want the "rm" command to do what it is supposed to do.

    66. Re:I feel that lone sysadmin's pain by budgenator · · Score: 1

      I've had the GUI choke on "File too Large" when deleting, which sucks when the reason you're deleting files is because GUI told you the remaining file system free space is 0 b. At that point all you can do is drop into a terminal and start using rm.

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
    67. Re:I feel that lone sysadmin's pain by sunking2 · · Score: 1

      Maybe you should be doing server maintenance on the actual machine that needs the maintenance? If you don't understand where your files actually reside then you should maybe be in another job.

    68. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      I use a "sweeter rm" that avoids these problems: a script that moves the files to be deleted to a folder (in trash) with a timestamp for its name.

    69. Re: I feel that lone sysadmin's pain by tibit · · Score: 1

      I do precisely that: each invocation of rm creates a fresh timestamped folder under mountpoint/rm_saved

      --
      A successful API design takes a mixture of software design and pedagogy.
    70. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Just tell them you got hacked, it works for everyone else. Bonus points if you say Russia hacked you ;) nobody will think you are lying.

    71. Re:I feel that lone sysadmin's pain by fuzzywig · · Score: 1

      Sharp knives are safer than blunt ones.

    72. Re: I feel that lone sysadmin's pain by Penguinisto · · Score: 1

      This, right here... holy shit this!

      On critical stuff, you want to make it a habit to mv stuff you're not familiar with somewhere (/tmp works most cases), test the system, test affected applications, double-check once more, and *then* rm.

      On rm itself, I make it a habit to type the rm, double-check the command forwards and (literally) backwards, and only when satisfied hit enter. Ain't perfect, but I've caught potential disaster more times than I can count by bad regex, misplaced spacing, and other dumb tricks by reading it forwards and (literally!) backwards.

      PS: The very first time I screwed up on rm, I learned the hard way to never, ever, ever type rm -rf .* to blanket-remove hidden files. Tends to nuke your entire server, including NFS mounted disks.
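
      The trouble is that in many shells .* also matches . and .., so the command climbs into the parent directory. Some newer tools guard against this, but it is safer not to rely on that. A sketch of glob patterns that pick up hidden files without ever matching . or .. (demo file names are made up):

```shell
work=$(mktemp -d)
cd "$work" || exit 1
touch .hidden ..odd visible

# .[!.]*  = a dot, one non-dot character, then anything   (.hidden)
# ..?*    = two dots followed by at least one character   (..odd)
matched=""
for f in .[!.]* ..?*; do
    [ -e "$f" ] || continue     # an unmatched pattern is left as literal text
    matched="$matched $f"
done
echo "would remove:$matched"
```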

      --
      Quo usque tandem abutere, Nimbus, patientia nostra?
    73. Re: I feel that lone sysadmin's pain by Jesus+H+Rolle · · Score: 1

      No, sharp knives are safer than dull knives. Blunt knives are safer than either.

    74. Re:I feel that lone sysadmin's pain by mmell · · Score: 1

      No - the SAN's internal wipe. It took nearly thirty minutes to wipe the filesystems. Unfortunately, the fact that I'd wiped the wrong device didn't become evident until four hours after that. In all honesty, I'd have fired me that day. I'm glad my manager was a more understanding fellow than myself.

    75. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Just don't use gitlab for that version control? :D

    76. Re: I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      If you actually try this on a modern Windows system, it will fail because you can't format the system drive.

      You can do rm -r -f -confirm:$false c:\ though.

    77. Re:I feel that lone sysadmin's pain by nasch · · Score: 1

      That is not a perfect substitute for the Windows (and I assume other OSes) trash/recycle bin. For example, try using both techniques to create foo.txt, delete it, create it again in the same place, and delete it again. The recycle bin will have both deleted files and you can restore either. The mv version will have only one copy.
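
      One way around the same-name problem, as a sketch: give every delete its own timestamped slot. The trash helper and the TRASH_ROOT variable are hypothetical names, not an existing tool.

```shell
TRASH_ROOT=${TRASH_ROOT:-$(mktemp -d)}   # demo default; pick a real path in practice

trash() {
    # mktemp -d yields a fresh directory even for two calls in the same second
    slot=$(mktemp -d "$TRASH_ROOT/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$slot"/
}

cd "$(mktemp -d)" || exit 1
echo "first version"  > foo.txt; trash foo.txt
echo "second version" > foo.txt; trash foo.txt

# Both copies survive, one per slot
find "$TRASH_ROOT" -name foo.txt
```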

    78. Re:I feel that lone sysadmin's pain by gosand · · Score: 1

      yep.
      I am not a sysadmin, except on my own linux machine at home. I have been since 1998.
      I have learned that when I write scripts to do things, which is quite often, I always echo the key commands before actually running them.

      for i in 1 2 3 4 5
      do
      echo "rm -f $i"
      done

      I run it, look at what the command is going to do, then remove the echo. When messing around with files that might have spaces in them, or using multiple functions/calculations/variables, there is always something that can go wrong.
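
      The same habit can be wrapped in a small helper so the echo does not have to be edited in and out by hand. The run function and the DRY_RUN variable are made-up names for this sketch.

```shell
# DRY_RUN defaults to 1, so nothing is executed until you opt in with DRY_RUN=0
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'would run:'
        printf ' [%s]' "$@"     # brackets make stray spaces in arguments visible
        printf '\n'
    else
        "$@"
    fi
}

for i in 1 2 "file with spaces"; do
    run rm -f -- "$i"
done
```

      The brackets are there for the spaces-in-filenames case: a name that was split by mistake shows up as two bracketed arguments instead of one.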

      I still remember back in 1999, I was working at a startup and a new developer, as root, did "rm -rf /" on a test server. He didn't live that down.

      --

      My beliefs do not require that you agree with them.

    79. Re:I feel that lone sysadmin's pain by rthille · · Score: 1

      Your manager and company had already paid the cost of "training" you, why would they then fire you and have to hire someone who might not be as careful as you would then be?

      --
      Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
    80. Re: I feel that lone sysadmin's pain by dgatwood · · Score: 1

      Mine is even better because of its simplicity. On my production systems, rm is aliased to 'echo "Use /bin/rm if you really want to do this"' so that it forces me to take a second look at what I'm doing before I run the command in the first place.
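
      A sketch of that guard. A shell function stands in for the alias here so the example behaves the same when pasted into a script; at an interactive prompt an alias in ~/.bashrc would do a similar job.

```shell
# Shadow rm with a reminder that deletes nothing
rm() {
    echo "Use /bin/rm if you really want to do this: rm $*" >&2
}

demo=$(mktemp)        # scratch file for the demo
rm "$demo"            # prints the reminder, the file is untouched
/bin/rm "$demo"       # the deliberate, fully spelled-out delete
```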

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    81. Re:I feel that lone sysadmin's pain by LordLimecat · · Score: 1

      I mean, when literally every chef who has ever cooked has a story about how KniveCo's knives chopped off one of their fingers because the handle was too small, maybe it's worth looking at mitigations.

    82. Re:I feel that lone sysadmin's pain by N!k0N · · Score: 1

      The situation was handled though.

    83. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Has this literally happened to every sysadmin who deleted a folder?

      Ah, methinks that you need a dictionary or a basic understanding of false equivalence...either way you're an idiot.

    84. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Christ, working in a corporate environment for the last 20 years I haven't been able to go more than 2 months without some end user accidentally dragging and dropping a stack of folders into another, creating copies of entire network drives (again because they can't control a fucking mouse properly), or yes, even deleting the wrong directory/file. I have seen even experienced sysadmins click the wrong checkbox on monotonous tasks (or even just when they - as most of us - are overworked).

      We had an "experienced" SysAdmin delete the root group policy that tied our network together - it took at least 3 years for that fiasco to resolve...and you know, he didn't fucking do it through PowerShell.

    85. Re: I feel that lone sysadmin's pain by RabidReindeer · · Score: 1

      I'd never considered using squashfs as a backup mechanism, but if it works for you...

      My Red Hat tar utility does support sparse files, although it's turned off by default, I think. A compressed tarball wouldn't care about holes, since holes compress down to virtually nothing anyway. The real issue is more in how well the receiving filesystem will honor the holes when the files are transferred into it.

      My day-to-day backups are based on Bacula, which supports sparse files. Most of my alternative strategies are for short-term safety or long-term image storage.

      I'm sorry to say that most of the commercial backup products I've worked with over the years have let me down at the worst possible times. Linux tar has not, ZIP has not, as long as I don't have multi-volume ZIPs, and bacula files have not. In addition to being reliable and free, you can work with them on virtually any platform and not get nuked by OS, hardware, or data version issues.

    86. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      had a coworker wipe the config on a switch while he was comparing two of them...he simply clicked on the wrong window. it cost him his job.

    87. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      You are not alone, when I delete something it's usually because I need the space. SHIFT+DEL and ENTER are now built into my fingers without even thinking!
      Goes to show that warning popups don't work, even for professionals.

    88. Re:I feel that lone sysadmin's pain by Cederic · · Score: 1

      Why would anybody not start with

      cd /home

      The moment you ever type 'rm -rf /' you've failed, no matter what you put after it.

    89. Re:I feel that lone sysadmin's pain by grep+-v+'.*'+* · · Score: 1

      With a 40TB user SAN system from years ago, to delete major files from users and groups I told them they were gone but actually moved them to a user inaccessible directory. Then I waited 3 weeks or so. If no one complained the next day (or the next week for vacations) then I was pretty much good to go.

      I also -- being paranoid -- checked the time stamps to make sure no file access had occurred. Then again the users couldn't realistically find or access files from within an open and wet paper bag so I wasn't worried much, but I still checked. (Never mind the 4 hour snapshots, the daily incrementals and weekly full backups. They all worked but GOD were restores slow.)

      --
      If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
    90. Re:I feel that lone sysadmin's pain by UnknownSoldier · · Score: 1

      Tab filename completion would have caught that.

      That, along with double-checking the command line before you do ANY rm stuff ...

    91. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Backup is difficult? WTF man, google this: ZFS.

      Don't tell anyone (not because they don't know ZFS already, but because they'll think you are an idiot to not be using it already)

    92. Re:I feel that lone sysadmin's pain by CanadianMacFan · · Score: 1

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      It isn't a backup until you have verified that you can restore from it.

    93. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      tarsnap!

      It's a great, cheap cloud based backup solution. I run it on all my systems.

    94. Re: I feel that lone sysadmin's pain by david_thornley · · Score: 1

      For most of what I do, I type "ls whatever" and examine the output. If it's what I want to delete, I do a little command-line editing.

      Also, I have a little ritual. When I'm typing something potentially dangerous, I type it, sit on my hands, and examine it carefully. This means concentrating and not paying attention to anything else.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
    95. Re:I feel that lone sysadmin's pain by LinuxIsGarbage · · Score: 1

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      IT always says "Don't store stuff on your local hard drive, store it on the network drive where it's regularly backed up!". I still think I'm better off storing on my hard drive, and managing local backups. Particularly where the user's private network drive has a retention policy of 1 year, after which it guarantees the files are deleted.

      And also after I filed a ticket with IT when a user deleted another user's file on the network drive. Two weeks later IT STILL couldn't find the backup tape from the ninth... That was one file, imagine losing the whole server.

      Shared network drives also have a tendency of people accidentally dragging and dropping a subdir into another subdir.

    96. Re:I feel that lone sysadmin's pain by aaarrrgggh · · Score: 1

      ZFS is not backup just like RAID is not backup. I have used ZFS and BTRFS and do love what they can *add* to a backup system, but they still aren't a substitute for multiple offline revisions off site, and no, taking drives out of a zpool and archiving them isn't the same.

      The threats today are different than what we used tape for way back when. We often get by with the same mindset, but it isn't perfect.

    97. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      Except the look carefully is kind of equivalent to the look carefully at the rm -fr command - at some point you need to get shit right

    98. Re:I feel that lone sysadmin's pain by quenda · · Score: 1

      No, but if you can't be bothered to log in ...

    99. Re:I feel that lone sysadmin's pain by jabuzz · · Score: 1

      Yeah I said a *SINGLE* run mate.

    100. Re: I feel that lone sysadmin's pain by minstrelmike · · Score: 1

      I've heard good things about version control. I use the stuff they have on GitHub. It's incredible.

    101. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 0

      just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.

      That's what Facebook said.

    102. Re: I feel that lone sysadmin's pain by ploppy · · Score: 1

      > The only thing I'm not sure of is how well unsquashfs handles the extraction of sparse files.

      If the file is stored as a sparse file in the Squashfs filesystem (normally the case), then Unsquashfs will create it as a sparse file when extracting it. It doesn't need any more filesystem space than the filled parts of the file when doing so.

      I wrote the code and so I should know :-)

  3. Repeat after me (and others) by Nkwe · · Score: 5, Interesting

    If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
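
    A minimal restore test, as a sketch with made-up paths: back up, restore into a scratch location, and compare against the source. It covers only the "tested restore" half; the offline copy is a separate job.

```shell
work=$(mktemp -d)
mkdir -p "$work/data/sub"
echo "important" > "$work/data/sub/file.txt"

# 1. Take the backup
tar -czf "$work/backup.tar.gz" -C "$work" data

# 2. Restore it somewhere else, never over the original
mkdir "$work/restore_test"
tar -xzf "$work/backup.tar.gz" -C "$work/restore_test"

# 3. Compare the restored tree with the source; any difference is a failure
if diff -r "$work/data" "$work/restore_test/data" >/dev/null; then
    restore_ok=yes
else
    restore_ok=no
fi
echo "restore verified: $restore_ok"
```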

    1. Re:Repeat after me (and others) by dbIII · · Score: 2, Informative

      Good advice but it's a misleading headline above. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
      It is a good example that replication is not a backup and is often a way to just mirror mistakes.

    2. Re:Repeat after me (and others) by glenebob · · Score: 0

      Truth.

    3. Re:Repeat after me (and others) by hcs_$reboot · · Score: 1

      Typical case of "we're unlikely to lose our data, and anyway we've got a backup which in turn is unlikely to fail; so why test an unlikely x unlikely event?"

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    4. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      At some point, testing is more likely to introduce a subtle error than it is to detect one.

    5. Re: Repeat after me (and others) by Anonymous Coward · · Score: 0

      Go read the article, that snapshot was a staging copy for testing something else. Their backups failed due to using wrong pg_dump binaries and no one noticed.

      They royally screwed up.

    6. Re:Repeat after me (and others) by hcs_$reboot · · Score: 1

      Test your backups, at least once, seriously!

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    7. Re:Repeat after me (and others) by MatthiasF · · Score: 5, Informative

      Uh, did you read the article?

      The six hours old snapshot was a fluke manual LVM snapshot run, normally they are 24 hours. The SQL_dumps weren't running at all because of mis-configuration, producing tiny little files and failing silently. Webhooks will need to be rolled back to the 24 hour backup since they were removed in the 6 hour one because of a synchronization process (meaning at best 18 hours of updates will have no webhooks but possibly all 24 hours at worst). Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3 for what I assume is vendor agnostic redundancy has sent no files at all ("the bucket is empty").

      It's like they thought out everything but never made sure any of it was working.
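
      Tiny dump files and an empty bucket are the kind of thing a dumb sanity check can catch. A sketch, with a hypothetical helper name and made-up thresholds, that fails loudly unless a recent and reasonably large backup file exists:

```shell
# Succeeds only if DIR holds a file larger than MIN_BYTES that was
# modified within the last MAX_AGE_MIN minutes.
check_backup() {
    dir=$1 min_bytes=$2 max_age_min=$3
    recent=$(find "$dir" -type f -size "+${min_bytes}c" -mmin "-${max_age_min}" | head -n 1)
    if [ -z "$recent" ]; then
        echo "ALERT: no backup in $dir over $min_bytes bytes and under $max_age_min minutes old" >&2
        return 1
    fi
    echo "ok: $recent"
}

backup_dir=$(mktemp -d)
printf 'x' > "$backup_dir/dump.sql"      # stands in for a silently failed, near-empty dump
check_backup "$backup_dir" 1024 1440 || echo "tiny dump rejected"
```

      This only proves a plausible file exists. It says nothing about whether the file can be restored, so it complements a restore test rather than replacing one.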

    8. Re:Repeat after me (and others) by dbIII · · Score: 5, Insightful

      Uh, did you read the article?

      No, and I got the wrong impression from skimming the article.
      You are correct and I am not.

    9. Re:Repeat after me (and others) by phantomfive · · Score: 1

      I think backups are surprisingly likely to fail. Just like RAID is surprisingly likely to have more than one disk fail at a time, even though intuitively that seems extremely unlikely.

      --
      "First they came for the slanderers and i said nothing."
    10. Re:Repeat after me (and others) by Opportunist · · Score: 5, Funny

      Sing with me, kids:

      One backup in my bunk
      One backup in my trunk
      One backup at the town's other end
      One backup on another continent

      All of them tested and verified sane
      now go to bed, you can sleep once again

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    11. Re:Repeat after me (and others) by tonymercmobily · · Score: 5, Insightful

      "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
      OK, now that I have repeated it, let me add.

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

      THEN you will see, for real, how your company reacts to real disasters.

      The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

      http://www.datacenterknowledge...

      Merc.

    12. Re:Repeat after me (and others) by JaredOfEuropa · · Score: 1

      In other words: IT fire drills. Smart companies conduct them... but somehow I have never seen them done, or even seen companies asking their outsourcing partners to produce some proof of recovery procedures having been tested. No, "they are ISO-over-9000 and that is good enough for us". Good enough to cover your arse when things go south, sure.

      We had plenty of actual fire drills, though.

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    13. Re:Repeat after me (and others) by Daimanta · · Score: 2

      RAID fails because hard disks (probably of the same type and batch) running together wear at the same rate, so their failures are not independent events. Their failure correlation is therefore quite high. This explains why rebuilding a RAID array after a failure can be a very dangerous operation and could easily lead to total failure. Usually, doing (incremental) backups is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.

      --
      Knowledge is power. Knowledge shared is power lost.
    14. Re:Repeat after me (and others) by Applehu+Akbar · · Score: 2

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      Especially now that ransomware is overwriting online backups.

    15. Re:Repeat after me (and others) by Anonymous Coward · · Score: 1

      ...Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3 for what I assume is vendor agnostic redundancy has sent no files at all ("the bucket is empty").

      It's like they thought out everything but never made sure any of it was working.

      It's one thing to not validate your backups with a test restore.

      It's another level of stupidity entirely when you don't even check to see that there are no fucking files at all.

      Perhaps not the best time to rub salt in the wound, but I find myself recalling the wise words of the immortal Red Foreman...

      "Dumbass!"

    16. Re:Repeat after me (and others) by Anonymous Coward · · Score: 1

      Uh, did you read the article?

      No, and I got the wrong impression from skimming the article.
      You are correct and I am not.

      This kind of humility does not belong on Slashdot. Honestly I don't know what's happening to this place...

    17. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
      OK, now that I have repeated it, let me add.

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

      THEN you will see, for real, how your company reacts to real disasters.

      The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

      http://www.datacenterknowledge...

      Merc.

      There are very few CEOs who actually give a shit about that "IT stuff", and if anything really wrong happens, the company is going to lose more than fucking "downtime". Data loss usually holds a dollar value and impact. Doubly so for a public company, where the one getting fired by the Board for pulling dumbass stunts might be the CEO.

      And the example company was maintaining half a dozen data centers at the time of the test, with a literal army of support staff running pre-tests prior to the main DR test. That hardly qualifies as a profile of the average company's infrastructure; the average company can hardly afford to run a second backup data center or the personnel to properly run it. The smart CEO knows how cheap they are, which is why we almost never hear of such tests being executed by the executive staff.

    18. Re:Repeat after me (and others) by MatiasKiviniemi · · Score: 2

      The internetz council convened and decided we will none of that "admit my mistakes"-bullshit here. Please hand in your card and exit the premises immediately.

    19. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      More like at all times. You need to test the real restore only every so often, but the sanity checks (size, files included in the snapshots, and if at all possible, file contents crypto hash) you do for *all* backups a lot more often.
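
      The cheap checks listed above (size, files included, content hash) could look something like this. It's only a sketch: `build_manifest` and `sanity_check` are names made up for illustration, and it assumes the source tree is quiet while you compare (in practice you'd record the manifest at backup time and compare against that).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's relative path to (size in bytes, SHA-256 hex digest)."""
    return {
        str(p.relative_to(root)): (p.stat().st_size, sha256_of(p))
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def sanity_check(source: Path, backup: Path, min_total_bytes: int = 1) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    expected = build_manifest(source)
    actual = build_manifest(backup)
    problems = []
    # Catches the "bucket is empty" case before anything else.
    total = sum(size for size, _ in actual.values())
    if total < min_total_bytes:
        problems.append(f"backup holds only {total} bytes")
    for name, entry in expected.items():
        if name not in actual:
            problems.append(f"missing file: {name}")
        elif actual[name] != entry:
            problems.append(f"content mismatch: {name}")
    return problems
```

      This is not a restore test. It only tells you the files are there and match, which is exactly the part you can afford to run on every backup.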

    20. Re:Repeat after me (and others) by 140Mandak262Jamuna · · Score: 1

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company

      You make the assumption that the CXO wants to save the company. Downtime costs happen this quarter. Benefit accrues to whoever is the CXO five years down the line. Why should the current CEO save the a** of the next CEO? Squeeze the company dry, show as much revenue/profit as possible, cash the stock options and skip town. By the time they discover that the shoddy backup vendor you hired to cut costs had been saving the data on the "1TB" thumb drives bought in some flea market in outer Mongolia, you are already well into wrecking the next company.

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    21. Re:Repeat after me (and others) by cdrudge · · Score: 4, Insightful

      "they are ISO-over-9000 and that is good enough for us"

      Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

      1. Perform backup
      2. Pray nothing goes wrong.

      Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it like the standards require.

    22. Re:Repeat after me (and others) by drinkypoo · · Score: 1

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events.

      As an IT professional, and occasional admin, you MUST have backup for your hardware to switch to, which mitigates the pain of live testing. The hardware is typically a small portion of the total cost of the business, even if you double it.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    23. Re: Repeat after me (and others) by Anonymous Coward · · Score: 0

      This is not entirely correct because replication can be configured to apply with a delay, at least in PostgreSQL. Had there been a programmed 12 or 24 hour delay in one of the replicas, it might have helped. It is also incorrect because the master and replica can be configured to store transaction logs for at least, say, 24 hours if not 72.

    24. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      You've got to be really confident in procedures to seriously do this, and know when you're simulating a disaster rather than causing one. I know one of our customers caused a genuine emergency during a real disaster recovery test like this, by unplugging network cables, and essentially partitioning their network. They then started up a pair of redundant servers that each thought they were the only running server, since they couldn't see the other. Unfortunately these were controlling physical cranes and conveyor systems, and each proceeded to move a lot of physical stuff around, each updating its own copy of the database.

      Untangling the mess when they put the network back together was a big job.
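
      The usual guard against that kind of split brain is a quorum rule: a node only takes control if it can see a strict majority of the cluster, itself included. A minimal sketch, assuming a fixed, known cluster size; `may_act_as_primary` is a made-up name and nothing here describes that customer's actual system.

```python
def may_act_as_primary(reachable_peers: int, cluster_size: int) -> bool:
    """Allow control only if this node plus the peers it can reach form a strict majority."""
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    votes = reachable_peers + 1  # count this node itself
    return votes > cluster_size // 2
```

      With only two servers, a partition leaves each side with one vote out of two, so both stand down. That is why a redundant pair normally needs a third witness or tie-breaker before it can fail over safely.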

    25. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      Yes. Please turn in your Poo-flinger card on the way out as well...

    26. Re:Repeat after me (and others) by Zontar+The+Mindless · · Score: 3, Funny

      Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)

      (A: "Too many backups".)

      --Actual question from the final exam for the Networking 100 class I took in 1998.

      --
      Il n'y a pas de Planet B.
    27. Re:Repeat after me (and others) by tomhath · · Score: 2

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.

    28. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      Nothing is ever truly 0% given enough data points. Take enough backups and the law of large numbers will be in your favor. At some point one of them is going to work because of quantum mechanics.

    29. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      RAID has never been a substitute for backups. RAID is a method to improve uptime, especially if the server supports hotswap.
      At most RAID will save you from data loss caused by manufacturing defects.
      Externally caused disk failure is likely to impact both disks. Any file system error or user error will just be cloned.

      RAID does not keep your data safe. Backups to the same computer protect against some user errors. Backup to a separate computer protects against most hardware failures.
      Backup to another building protects your data against local disasters like fires or a broken waterpipe flooding the server room. Move it to another city and you can cover floods and other large disasters.
      Backups to another country can be good if you worry about data loss caused by a malicious government.

    30. Re: Repeat after me (and others) by Zero__Kelvin · · Score: 0

      Replication when combined with / facilitated by a good distributed SCM (such as git or Mercurial) is in fact an excellent backup system.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    31. Re:Repeat after me (and others) by TheRaven64 · · Score: 3, Informative

      Please mod the parent up. After the uptick in trolling and invective in the last couple of months, this post is a breath of fresh air around here.

      --
      I am TheRaven on Soylent News
    32. Re:Repeat after me (and others) by TheRaven64 · · Score: 1

      That sounds like a great idea, after you've tested that you can bring up a clone of your production system onto a spare [virtual] machine from the backups. If you don't do that first, then it sounds like an expensive way of discovering the bug that caused you to lose all of your customers' data.

      --
      I am TheRaven on Soylent News
    33. Re:Repeat after me (and others) by tonymercmobily · · Score: 1

      That sounds like a great idea, after you've tested that you can bring up a clone of your production system onto a spare [virtual] machine from the backups. If you don't do that first, then it sounds like an expensive way of discovering the bug that caused you to lose all of your customers' data.

      That should be a given. But being able to do that doesn't mean that you WILL be able to recover quickly from a REAL outage (hence the voluntary, self-inflicted outage).

    34. Re:Repeat after me (and others) by aaarrrgggh · · Score: 1

      Ok, but what about the next level? I worked with a bank 10-15 years ago that dropped their data center due to old UPS batteries, restarted the mainframes when the generator kicked on, and then had to make a difficult decision: switch to their DR site, or continue to run on generators until the batteries could be replaced in a couple of weeks.

      It was a difficult decision because while they regularly tested going to DR, there was no way to roll back to primary. (This was true for nearly all banks at the time.) Using the DR site actively cost enough to make their CFO really squirm (more than the fact that he didn't authorize battery replacement in the first place).

      Point being, the best laid plans of mice and men...

    35. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      No, RAID fails because the event of recovering a single disk crash means reading the complete surface of all other disks.
      As some data is "stored once, read never" you actually don't know if you can reliably read that data.
      There is a chance a RAID recovery will end by discovering errors on the other disks as well, resulting in a complete RAID failure.

      Doing complete backups, instead of incremental ones, will actually protect you better against this.

      If you use RAID, you need to do regular disk scrubbing, SMART surface scans, etc...

      You need to read the data in order to detect you can't...
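
      A crude file-level version of "read it to find out you can't" is sketched below. The function name `scrub` is made up, and this is an illustration of the idea rather than a replacement for the array's or filesystem's own scrub: reads may be served from the page cache, and it never touches parity or unused blocks.

```python
import os
from pathlib import Path


def scrub(root: Path, chunk_size: int = 1 << 20) -> list:
    """Read every file under root end to end; return (path, error) for each that fails."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                with path.open("rb") as handle:
                    while handle.read(chunk_size):
                        pass  # discard the data; we only care that the read succeeds
            except OSError as exc:
                failures.append((str(path), str(exc)))
    return failures
```

      Run on a schedule, something like this at least surfaces "stored once, read never" files that have gone bad before a rebuild forces the issue.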

    36. Re:Repeat after me (and others) by Notabadguy · · Score: 3, Funny

      Look at his user ID. Give him time, he'll come around.

    37. Re:Repeat after me (and others) by Hognoxious · · Score: 1

      The old joke was that lead lifejackets were ISO 9000 compliant ... as long as you write down that they're made of lead.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    38. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      "Donald J. Trump"

    39. Re:Repeat after me (and others) by budgenator · · Score: 1

      Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)

      (A: "Too many backups".)

      --Actual question from the final exam for the Networking 100 class I took in 1998.

      I would argue that there is always a point where the expense of another backup exceeds the benefit of having it. Sometimes the expense is monetary, sometimes it's lost availability. I'm lucky, I can rsync the server once a day to another computer and backup from that machine, it's easier to replace any lost data by hand than to do a more fine-grained backup routine.

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
    40. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      "they are ISO-over-9000 and that is good enough for us"

      Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

      1. Perform backup
      2. Pray nothing goes wrong.

      Our factory in China forms electrical wire by handing the operator a rubber mallet and a wooden board, then telling the operator to hammer the wire using the board until it forms the correct shape. Since this is captured in the factory's documentation, this is also their ISO-9001 certified procedure. The ISO-9001 certification has little to do with best practices.

    41. Re:Repeat after me (and others) by sl3xd · · Score: 1

      Is there a tune that's supposed to be sung to?

      --
      -- Sometimes you have to turn the lights off in order to see.
    42. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      The SQL_dumps weren't running at all because of mis-configuration, producing tiny little files and failing silently.

      DBAs should be fired, on the spot, once the system is back up. The end. I might even fire them before then. This is literally their job. Complete failure on the entire DBA team unless Network OPS was monitoring failures and failing to report (in which case it's bad design/ownership).
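
      Whoever owns it, "tiny little files and failing silently" is cheap to catch. Below is a sketch of a wrapper that refuses to call a dump successful unless the command exited cleanly and the output is plausibly large. `run_dump`, `BackupError` and the `min_bytes` threshold are hypothetical names for illustration; the dump command itself is whatever you already run.

```python
import subprocess
from pathlib import Path


class BackupError(RuntimeError):
    """Raised when a dump command fails or produces an implausibly small file."""


def run_dump(command: list, output: Path, min_bytes: int) -> int:
    """Run a dump command, writing its stdout to `output`; fail loudly on any problem."""
    with output.open("wb") as sink:
        result = subprocess.run(command, stdout=sink, stderr=subprocess.PIPE)
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace")
        raise BackupError(f"dump exited with {result.returncode}: {message}")
    size = output.stat().st_size
    if size < min_bytes:
        raise BackupError(f"dump is only {size} bytes, expected at least {min_bytes}")
    return size
```

      The point is that the exception has to reach something a human actually reads. A wrapper that logs the error and exits 0 is the same silent failure with extra steps.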

    43. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      your meter is crap.

    44. Re:Repeat after me (and others) by h4ck7h3p14n37 · · Score: 1

      If you use RAID, you need to do regular disk scrubbing, SMART surface scans, etc...

      Or you could run ZFS and not worry about it.

    45. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      well this article is proof that you really only need one backup method, that is frequently tested, and this is better than five methods that aren't

    46. Re:Repeat after me (and others) by cdrudge · · Score: 1

      The ISO-9001 certification has little to do with best practices.

      Exactly. During my ISO-9001 internal auditor training, we had it drilled into us that the standard said nothing about what was the right or wrong way to do something, best practices, common sense, etc. It was all about documenting how something is done and doing something how it's documented.

    47. Re:Repeat after me (and others) by phantomfive · · Score: 1

      Yeah, I now think of RAID as a way to speed up disk access, rather than as a backup method.

      --
      "First they came for the slanderers and i said nothing."
    48. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      Quoted for truth.

      I worked, full time, as a backup/recovery sysadmin - meaning, backup and recovery was my entire job - for over five years. The tales I can tell of backup servers that weren't operating as they should... oy vey. In one case, I found that there was a common coding error in some database backup scripts. The backup server was reporting success on those backups. When I fixed the coding error, the backup server started reporting failures. "Why have our backups started failing?! What did you break?!" Took them a while to appreciate that I'd actually fixed a problem: the backups had been failing for a long time; all I'd done was fix the problem that made it look like they'd succeeded. (Yes, I did find and fix the problem that was causing the backups to fail. It wasn't particularly difficult to fix - but until it was fixed, they did not have complete, useful backups. I'd rather see an honest failure message than a dishonest 100% success rate.)

      Then there were the DR tests. When I took over on one particular company's account, I did things exactly as instructed in the DR documents. The DR test failed. "This test never failed before!" Turned out that previous guys had been fudging things. Which is great if your goal is to pass the DR test. Not so great if your goal is to verify that DR is fully functional.

      The Russians have a great saying: "Trust, but verify." It applies to backups. You can't verify everything. So verify that which is most important, and something else, picked out at random. It won't give you certainty... but it'll put you streets ahead of most people.
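
      "The most important, plus something picked at random" is easy to automate. A sketch of the selection step only, with a hypothetical `pick_files_to_verify` helper; actually restoring and comparing the chosen files is left to whatever backup tool is in use.

```python
import random


def pick_files_to_verify(all_files, critical, sample_size, seed=None):
    """Return the critical files plus a random sample of the rest, without duplicates."""
    critical_set = set(critical)
    others = sorted(f for f in set(all_files) if f not in critical_set)
    rng = random.Random(seed)  # pass a seed only when you need a repeatable pick
    sample = rng.sample(others, min(sample_size, len(others)))
    return sorted(critical_set) + sample
```

      Leaving the seed unset gives a different sample on every run, so over time the random part covers far more of the data set than any fixed checklist would.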

    49. Re:Repeat after me (and others) by KlomDark · · Score: 1

      Damn whippersnappers!

    50. Re:Repeat after me (and others) by Opportunist · · Score: 1

      Fffft. You whip up a rhyming routine on the spot in a foreign language.

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    51. Re:Repeat after me (and others) by Zontar+The+Mindless · · Score: 1

      You didn't even count the blanks. FAIL.

      --
      Il n'y a pas de Planet B.
    52. Re:Repeat after me (and others) by Anonymous Coward · · Score: 0

      One backup in my bunk
      One backup in my trunk
      One backup at the town's other end
      One backup on another continent

      Lotta good that will do you when it's time to build a hyperspatial express route through your star system.

    53. Re:Repeat after me (and others) by david_thornley · · Score: 1

      I don't really believe in backups unless I've demonstrated I can restore from them. When I'm dealing with anything important, I figure that data that isn't backed up doesn't exist, and data isn't backed up until restoration has been tested.

      I'm reminded of Knuth's comment that he hadn't tested certain code, but merely proved it to be correct.

      --
      "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
  4. Don't use rm! by subk · · Score: 2

    Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

    --
    Now, if you'll excuse me, I have backups to corrupt.
    1. Re:Don't use rm! by hcs_$reboot · · Score: 1

      # rm `which rm`

      --
      Slashdot, fix the reply notifications... You won't get away with it...
    2. Re:Don't use rm! by Anonymous Coward · · Score: 0

      Yeah just mv foo /dev/null

    3. Re:Don't use rm! by subk · · Score: 1

      Yeah just mv foo /dev/null

      No, you're missing the point. mv foo /some/safe/place and when everything is working again... and you're sure you don't need it.. Then and only then use rm.
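
      Scripted, the same habit might look like the sketch below. `set_aside` and `purge` are made-up names and the holding directory is whatever safe place you choose. Keep it on the same filesystem as the data: there the move is a quick rename, whereas across filesystems `shutil.move` falls back to a full copy.

```python
import shutil
import time
from pathlib import Path


def set_aside(target: Path, holding_dir: Path) -> Path:
    """Move `target` into `holding_dir` under a timestamped name instead of deleting it."""
    holding_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    destination = holding_dir / f"{target.name}.{stamp}"
    if destination.exists():
        raise FileExistsError(f"refusing to overwrite {destination}")
    shutil.move(str(target), str(destination))
    return destination


def purge(set_aside_path: Path) -> None:
    """Delete for real, only once you are sure the data is no longer needed."""
    if set_aside_path.is_dir():
        shutil.rmtree(set_aside_path)
    else:
        set_aside_path.unlink()
```

      Run on the wrong server, `set_aside` still causes an outage, but undoing it is a rename back rather than a restore from a six-hour-old snapshot.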

      --
      Now, if you'll excuse me, I have backups to corrupt.
    4. Re:Don't use rm! by infolation · · Score: 5, Funny

      Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      Don't tell the customer anything!! Geez... What's with these semi-pros?

    5. Re:Don't use rm! by Darinbob · · Score: 3, Interesting

      Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.

      This is why sysadmins were created, because the people actually using the computers didn't want to manage them.

    6. Re:Don't use rm! by Anonymous Coward · · Score: 0

      Or, you know, have a few different rsyncs to other drives/machines. I have 3 rsyncs on different days to other storage and always do one manually before any major work. The first time you run them it takes half a day, but after that it's pretty quick: 5-20 minutes. I also have a regular backup to an LTO robot.

    7. Re:Don't use rm! by ThunderBird89 · · Score: 1

      > Don't tell the customer anything until the dust settles!

      That's one way to handle a major crisis, but if you're transparent about an issue, it puts a lot more minds at ease than it upsets, since then at least your customers know that you're aware of the problem, that you're working to fix it, and that they can communicate with you.

      --
      Hyperbole: I use it liberally!
    8. Re:Don't use rm! by Anonymous Coward · · Score: 0

      Everyone wants to be the rockstar

      "You know, I don't know if this means anything to you, but I actually enjoy playing music, and it don't matter who it's for."

    9. Re:Don't use rm! by sodul · · Score: 3, Insightful

      Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is DevOps are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos driven, with anything goes since Sales promised a feature to a prospective customer yesterday, every week.

    10. Re:Don't use rm! by Anonymous Coward · · Score: 1

      Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      Don't tell the customer anything!! Geez... What's with these semi-pros?

      Tell the customers everything is A-OK, then blame everyone else for everything!!! Geez... What's with these upstart pros?

    11. Re:Don't use rm! by c4757p · · Score: 1

      Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      I've never even used them before, and this transparency has moved them to the top of my list for the future.

      Fuckups happen to everybody, despite all the 20/20 Captain Hindsights here pointing out everything that went wrong. I like to see how people handle their fuckups, and they're handling this one with grace.

    12. Re:Don't use rm! by Anonymous Coward · · Score: 0

      Agile is fly by the seat of your pants, blindfolded! Who in the fuck equated "agile" with masterful? lol

    13. Re:Don't use rm! by Anonymous Coward · · Score: 0

      The old school Unix admins do know how to program in C. As for shell, I can't believe anyone could be considered anything above an entry-level sysadmin if that person could not read and write shell scripts.

    14. Re: Don't use rm! by Anonymous Coward · · Score: 0

      This week, Jackie moon attempts to wrestle a bear, come see it folks.

  5. Test your backups! by djinn6 · · Score: 2

    Two things:
    1. Test your backups
    2. TEST your BACKUPS!

    1. Re:Test your backups! by Anonymous Coward · · Score: 0

      When the auditors came to ask about the backups, I told them I had written the backup scripts myself, and I had restored a deleted file for a user just last week, so yes the backups are working. The auditors concluded two things:

      (1) Lack of commercial backup solution means, "There are no backups."
      (2) Lack of procedure to test backups means, "Backups have never been tested."

      I was fired at the end of the month.

    2. Re:Test your backups! by Anonymous Coward · · Score: 0

      If you work in a situation where you could deal with external auditors, and you haven't documented the procedures for everything that you do - or you haven't bothered to ask what the requirements are to cleanly pass an audit - then you deserve to be fired.

    3. Re:Test your backups! by Anonymous Coward · · Score: 0

      "Jesus H Christ, we have too many awkward pencilneck geeks working here. Let's have an audit to clean house and get rid of those idiots. Appoint some auditors for the first time ever, and make up some bullshit requirements to be sure they all fail the audit. Then fire the geeks, hire my golfing buddies to replace them, and suspend audits indefinitely. Good work all around, folks!"

    4. Re:Test your backups! by Anonymous Coward · · Score: 2, Funny

      but NOT on your production hardware running live services.

      me thinks gitlab should have browsed their hosted repos for some backup software.

    5. Re:Test your backups! by asylumx · · Score: 1

      but NOT on your production hardware running live services.

      There are plenty who disagree with this. Right or wrong, their arguments have merit.

    6. Re:Test your backups! by Anonymous Coward · · Score: 0

      That sounds quite harsh.

      What I would have expected to happen is: the CEO is given the auditors' report -> the relevant report sections are then handed to department heads -> the head of IT reviews everything related to technology, including the points you mentioned -> IT actions the auditors' concerns.

      You only need to action the auditors' concerns if you agree. Oftentimes auditors can be very, very picky. However if something goes pear-shaped and you chose not to implement their recommendations, then that can get ... awkward.

    7. Re:Test your backups! by aaarrrgggh · · Score: 1

      All I can say is I sure as hell preferred SnapBack to Veeam. While I originally crafted a solution that provided the same net result, SnapBack was so painless. Throw in btrfs snapshots and you have a fast, robust, reliable system that you can run from a NAS unit doing pull backups.

      I get the benefits of Veeam, but it is an awful tool for hourly backups and deleted file recovery.

    8. Re:Test your backups! by AK+Marc · · Score: 1

      The CIO who demanded we not test backups (the email, printed and filed) was promoted after the server failed, and we found out that the "never errored" backups didn't actually back up the systems in question, but the wrong sets of files from a selection of servers, set up long before I got there, but since the BackupExec job completed successfully every day for years, there couldn't be a problem, and it would be a waste of time to check them. I wasn't fired for that, but I was thrown under the bus by the guy that caused the problem.

  6. Inexcusable for a hosting provider by Anonymous Coward · · Score: 0

    If you haven't attempted a restore and verify, you don't have backups. Everybody in IT knows this.

    1. Re:Inexcusable for a hosting provider by glenebob · · Score: 1

      The first sentence is true. The second one only achieves "should be true" status.

  7. Not quite surprised by Anonymous Coward · · Score: 0

    Having set up an instance of github CE by myself recently, i must say I'm not necessarily surprised. The whole thing just feels weird from a sysadmin perspective and I get the feeling they lack perspective on backend matters. Poor documentation, weird architecture, a general kludginess. I was going to write it off as the usual divide between free/paid versions, but somehow that doesn't have enough explaining power. Maybe it's just a general problem with rails people, i don't know, but yeah, not surprised by TFA.

    1. Re: Not quite surprised by Anonymous Coward · · Score: 1

      Github is not Gitlab

  8. Trust the Cloud by Anonymous Coward · · Score: 0

    No need to host your own stuff!

  9. Simple by Anonymous Coward · · Score: 1

    Recycle bin, restore.

  10. They already started laying off H1Bs by Anonymous Coward · · Score: 0

    Be careful what you wish for.

  11. haha by Anonymous Coward · · Score: 0

    cowsay lol | lolcat

  12. Git are Gits ? by Anonymous Coward · · Score: 0

    I'm shocked, to funded morans will repeatedly fail and you will pay for it

  13. The missing disclaimer to this article by rodia · · Score: 0

    Reading "meltdown", remember that:
    1) GitLab competes with Sourceforge
    2) Sourceforge and /. are both owned by BizX

    1. Re:The missing disclaimer to this article by Anonymous Coward · · Score: 0

      Sourceforge is still a thing..?

    2. Re:The missing disclaimer to this article by rodia · · Score: 1

      Sourceforge is still a thing..?

      That made me so insecure that I actually had to check.. :O)
      Yes, they are still there.

  14. In defence of lone sysadmins by Anonymous Coward · · Score: 0

    There should be some charity or movement defending lone, overworked sysadmins, victims of hawkish and success-hungry startup business leaders without the faintest clue about how complex their technical backends may have become and the effort it takes to run and scale those. Somebody kicking hard this kind of arrogant kids calling themselves CEO or CTO out of the blue without even having finished college (or, worse, just because "they have an MBA"), who think and convince others that they can roll out an IT service worldwide to a billion users with just three clicks on "the cloud" once and forever, ignoring everything that sounds neither "business" nor "cool", from storage to capacity to backups to monitoring alerts, if any. A good "lesson learned" would also be for those greasy, white-haired venture caps to feel a sudden pain in their lower backs and lose some of their peanuts million dollars every once in a while, so they stop thinking you can make money from kids without any proper knowledge or education, who see IT as fancy showbiz and sysadmins and devs as "extras" to dispose of at will, whilst IT is a science, and technical, educated people should be at its core and be its most valuable asset. And be given some time to think to foster creativity, instead of having it killed by the sleep deprivation due to constantly fixing something that was badly rushed to production the night before launch day.

    PS the rant is -IN NO WAY- directed towards GitLab, neither referring to this specific case (of which I don't know anything), but more describing a sick industry trend of the last decennia it would be A Good Thing(TM) to start to revert.

    1. Re: In defence of lone sysadmins by Anonymous Coward · · Score: 0

      There is no excuse for this kind of mistake, it was simply negligence. There are things you are expected to do as a professional at all times, being over worked as a professional to the point you can not do those things puts it on you to make your employer aware or find new employment. Failure to achieve either of those two things means you are not a professional.

      Regardless if they are over worked, they rm -rf on the wrong machine, so dumb.

    2. Re: In defence of lone sysadmins by Anonymous Coward · · Score: 0

      There's no excuse for some people to be overworked while other people are underworked. Why is it considered acceptable for a society to have two kinds of people: those who work 100 hours per week, and those who work 0 hours per week? You know, if you reduced the full time work week to 10 hours then there wouldn't be tired people making stupid mistakes caused by fatigue.

    3. Re: In defence of lone sysadmins by Anonymous Coward · · Score: 1

      libtard

    4. Re: In defence of lone sysadmins by Anonymous Coward · · Score: 0

      When someone is overworked, their IQ drops... so you're right in a way, that was dumb. On the other hand, you're a dick, and no matter how well you sleep tonight, you're still going to be a dick tomorrow.

  15. At least it wasn't github.com by jtara · · Score: 2

    At least it wasn't github.com.

    So, it didn't break the Internet.

    And practically everything else.

    1. Re:At least it wasn't github.com by Actually,+I+do+RTFA · · Score: 0

      Dang, not a programmer (github) or person (facebook). What else am I missing out on?

      --
      Your ad here. Ask me how!
    2. Re:At least it wasn't github.com by Anonymous Coward · · Score: 0

      But it should still be a wake-up call to people who trust github with their source.
      Letting someone else handle backups for you is not a substitute for making your own backups.
      It is a great complement, but any backup you didn't do yourself should be considered unreliable.

    3. Re:At least it wasn't github.com by phantomfive · · Score: 1

      Github goes down from time to time, too. Self-hosting code is so easy (that's what git was designed to do) that there's really no reason to have your company depend on Github. Unless you're an early-stage startup and don't even have an office or something.

      --
      "First they came for the slanderers and i said nothing."
    4. Re:At least it wasn't github.com by Anonymous Coward · · Score: 0

      It uses git, so if you're the type of person who would complain about GitLab's ineffectual backups, you would have your git repo locally anyway, along with effective local backups.

    5. Re:At least it wasn't github.com by Richard_at_work · · Score: 1

      Github isn't just code - there is a heck of a lot there which you don't get locally without lots of third-party tools and the hassle that comes with them.

    6. Re:At least it wasn't github.com by Anonymous Coward · · Score: 0

      Nope. Issue tracker is disabled because we don't like whiny idiots. Pull requests are ignored unless the contributor is a rockstar and then the commit is merged immediately. All is code.

    7. Re:At least it wasn't github.com by Anonymous Coward · · Score: 0

      Hassle which you absolutely should take on when access to these services is critical to your company's business.

      Otherwise, you are either a crazy startup, criminally negligent [if you have responsibility towards shareholders, business partners, or customers], or just plain incompetent.

      The cloud is not there to create a massive business risk. And when your operation is going under, with no hope of fast restore/recovery, SLAs on a piece of paper mean nothing. You are still out of a company, and a lot of people are still going through hell (and those might include your customers, not just co-workers).

    8. Re:At least it wasn't github.com by guruevi · · Score: 1

      That's only the systemd repo, most repos don't do that.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    9. Re:At least it wasn't github.com by Richard_at_work · · Score: 1

      Good for you ... now the vast majority of everyone else, however, doesn't think that because they don't use something it's worthless...

    10. Re:At least it wasn't github.com by phantomfive · · Score: 1

      If you can handle the downtime and have good local backups, then fine, go with that. Otherwise you're going to be in pain and regret, sooner rather than later.

      --
      "First they came for the slanderers and i said nothing."
  16. Made this mistake once... by daid303 · · Score: 2

    I've made this mistake: I deleted all attachments on a live system once.

    After this, I made all the prompts for critical servers a different color:
    export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\]'
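
    A sketch of how the same idea could live in one shared ~/.bashrc, assuming critical machines follow a naming pattern. The function name prompt_for_host and the prod-*/db-* patterns are made up for illustration; the \[ and \] wrappers tell bash the colour codes take up no width, so long command lines wrap correctly.

```shell
# Hypothetical helper: choose a prompt string from the host name.
# The prod-*/db-* patterns are assumptions; adjust to your own naming.
prompt_for_host() {
  case "$1" in
    prod-*|db-*)
      # Red background on critical hosts. \[ and \] mark the colour
      # codes as non-printing so bash counts the prompt width correctly.
      printf '%s' '\[\e[41m\]\u@\h:\w\$\[\e[49m\] '
      ;;
    *)
      # Plain prompt everywhere else.
      printf '%s' '\u@\h:\w\$ '
      ;;
  esac
}

# bash sets HOSTNAME; fall back to uname -n in other shells.
PS1="$(prompt_for_host "${HOSTNAME:-$(uname -n)}")"
```

    With something like this, a red prompt means "think twice" without anybody having to remember to set it per machine.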

    1. Re:Made this mistake once... by serviscope_minor · · Score: 4, Funny

      Good choice. But, I always use this prompt:

      PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\\([^\\]\\{6\\}\\)[^\\]\\{2,\\}/\\1~1/g" ) >'

      --
      SJW n. One who posts facts.
    2. Re:Made this mistake once... by Anonymous Coward · · Score: 0

      I was pretty sure I knew what it was going to do, and I was not disappointed. I have not seen a prompt that fancy since the turn of the millennium.

      For the humor impaired, mod parent funny.

    3. Re:Made this mistake once... by Pascoea · · Score: 1

      Haha. Now that's just awesome enough that I'm going to keep it how you suggested. My Linux skills would be classified as "Knows enough to be dangerous", and I have to admit I had no idea exactly what that would do, but the "C:" intrigued me enough to try...

    4. Re:Made this mistake once... by Anonymous Coward · · Score: 0

      Once I intended rm -rf * [enter], but I accidentally hit / on a numeric keyboard, resulting in a crippled system. Fortunately there wasn't too much valuable stuff there, but still, it meant a reinstall of my system.

  17. How can this keep happening? by Anonymous Coward · · Score: 3, Interesting

    I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?

      http://jefferai.org/2013/03/29/distillation/

    When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.

    1. Re: How can this keep happening? by Anonymous Coward · · Score: 1

      The KDE incident could reasonably be called a flaw with Git (I don't know if it's been fixed since then), but this time it's just a case of someone deleting the wrong data, and that's hardly Git's fault. If anything, distributed systems like Git are more robust against that than centralised ones, because more of the data is copied to the client whenever they clone or update.

    2. Re:How can this keep happening? by Entrope · · Score: 3, Informative

      KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.

      If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.

    3. Re:How can this keep happening? by Anonymous Coward · · Score: 0

      If you don't understand how git works, you should seriously check out the Git for Ages 4 and Up video floating around on YouTube. It's far and away the best explanation of Git I've ever come across; at least for me, it made me extremely comfortable working with git and pushed me into the "git as my preferred DVCS" camp.

    4. Re:How can this keep happening? by Anonymous Coward · · Score: 0

      This problem was not the fault of Git... This was the fault of poor system processes. Git just happens to be the application running on those poorly managed servers...

  18. If only there was another copy of the repo by HxBro · · Score: 5, Funny

    Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this

    1. Re:If only there was another copy of the repo by gweihir · · Score: 2

      Just imagine if you had actually read the story. The git-repos are not affected.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    2. Re:If only there was another copy of the repo by Anonymous Coward · · Score: 0

      You mean just the way git works?

    3. Re:If only there was another copy of the repo by twdorris · · Score: 0

      whoosh

    4. Re:If only there was another copy of the repo by AmiMoJo · · Score: 1

      Repos will be okay, it's all the ancillary stuff, i.e. the things that make them worth using over other git hosting companies. User management, wikis, release management, issue tracking etc.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  19. But it's git, folks just need to push again by Anonymous Coward · · Score: 0

    Little important should be lost. Developers have full copies of active work.

  20. Wouldn't have happened... by Anonymous Coward · · Score: 0

    if they had containerization, agile development, kanban, sprints, ansible, jenkins and all that:

    the sysad wouldn't have the time to issue the rm -Rf :-D

    Continuous masturbation FTW!

  21. Unfortunately common by CustomSolvers2 · · Score: 4, Interesting

    I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not exactly ideal, but within the kind of errors that should be expected): allowing these files to be deleted at all.

    If your whole business is about handling the data that a large number of users generate, you should (after making completely sure that your backup system is rock solid) restrict access to such valuable information as much as possible; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after the whole system has been developed; for example, giving read-only access to everyone except where write access is strictly required, such as high-level admin personnel (who can use these credentials only after passing a further validation step) or automated applications (whose credentials are regenerated regularly and known to nobody).

    These problems usually arise when a system is developed or run without putting the whole focus on the technical aspects and on what is best for it from a technical perspective. They shouldn't exist at all when everything is done properly at each stage, from development to deployment, administration, general policies, etc.

    --
    Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
    1. Re:Unfortunately common by Anonymous Coward · · Score: 0

      Their database replication stopped because the replicating server fell behind during a spam-attack.
      In the process of getting that replication working again the admin was clearing out the data directories on the replicating server and accidentally performed the delete in the wrong console window executing it on the production server and not the replicating server.

      The person who executed the command -is- their high-level admin person who I presume would have this sort of access even in a more restrictive environment?
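
      For the "wrong console window" part specifically, a rough sketch of a guard, assuming the admin names the intended host as part of the command. The wrapper name on_host and the host name in the usage comment are hypothetical; uname -n is the only real tool it relies on, and whether it returns a short name or an FQDN depends on how the machine is configured.

```shell
# Hypothetical wrapper: run a command only if this machine is the one
# the operator believes it is.
# Usage: on_host EXPECTED_HOST command [args...]
on_host() {
  expected="$1"
  shift
  actual="$(uname -n)"
  if [ "$actual" != "$expected" ]; then
    echo "refusing: this is '$actual', not '$expected'" >&2
    return 1
  fi
  # Host matches: run the command exactly as given.
  "$@"
}

# Example (host and path are placeholders, not taken from the incident):
#   on_host replica-host rm -rf /path/to/replica/data
```

      Typed into the wrong terminal, the command refuses instead of deleting. It does not help if you also type the wrong host name, of course.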

    2. Re:Unfortunately common by CustomSolvers2 · · Score: 1

      The person who executed the command -is- their high-level admin person

      If this is the case, the fact of these files being deletable would certainly make (some) sense.

      Although there are many alternatives to further constrain certain deletions (even the ones done by the most-privileged users) and to minimise the chances of problems. For example, by asking for additional confirmation or (as suggested in various posts above) always keeping a copy of the deleted items for some days or always checking whether an actual backup of the deleted files exists before going ahead with the deletion, etc. I am not saying that it is impossible, but with a proper system in place giving a very special treatment to the most important parts, it would be really difficult.
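
      As a minimal sketch of the "keep a copy of the deleted items for some days" idea, assuming a holding directory on the same filesystem (so the move is a cheap rename and not a copy). The name soft_rm and the default TRASH_DIR location are made up for illustration.

```shell
# Hypothetical "soft delete": move targets into a dated holding area
# instead of unlinking them. TRASH_DIR is an assumed location.
TRASH_DIR="${TRASH_DIR:-/var/tmp/trash}"

soft_rm() {
  # One holding directory per invocation, named by timestamp and PID.
  stamp="$(date +%Y%m%d-%H%M%S)"
  dest="$TRASH_DIR/$stamp.$$"
  mkdir -p "$dest" || return 1
  for target in "$@"; do
    if [ ! -e "$target" ]; then
      echo "soft_rm: no such file: $target" >&2
      return 1
    fi
    mv -- "$target" "$dest/" || return 1
  done
  # Print where the files went so they can be restored by hand.
  echo "$dest"
}

# A scheduled job could purge holding areas older than a week, e.g.:
#   find "$TRASH_DIR" -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +
```

      The obvious cost is disk space: a 300GB data directory parked for a week needs 300GB of headroom.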

      Their database replication stopped because the replicating server....

      None of this sounds like a valid excuse to me. A proper backup system shouldn't be affected by almost anything. Some examples of how to minimise these risks: automatically duplicating each user input as it arrives, always having backups of the backups of the backups (e.g., programs running 24/7 that check that the main backup is OK and, if not, switch to alternative 1 and then 2, etc., after having clearly warned everyone about the issue!), etc.

      I don't know the exact situation, which is why I cannot deliver a worthy enough assessment. Additionally, I don't like talking in abstract terms and firmly believe that there are lots of exceptions everywhere and every time. But even bearing all this in mind, I cannot think of many good reasons why a company whose main business is dealing with user data could not avoid data loss by doing everything properly. I can even add that if I were ever personally involved in such a situation, I would feel really ashamed and wouldn't even think about trying to (somehow) justify my behaviour.

      --
      Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
  22. Or you use scripts by perpenso · · Score: 2

    That's why you always always run ls first.

    ls -ld /home/user1 /home/user2 /home/ user3

    Then edit the command to rm. Always.

    Or you use scripts.

    somescript user1 user2 user3
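
    What such a script might look like, as a sketch only: remove_homes and HOME_BASE are made-up names, and the validation rules are one possible choice. It takes user names, never paths, checks every argument before touching anything, and runs the ls for you.

```shell
# Hypothetical body for "somescript": remove home directories by user
# name only, never by path. HOME_BASE is an assumption.
HOME_BASE="${HOME_BASE:-/home}"

remove_homes() {
  # Pass 1: validate every argument before touching anything.
  for user in "$@"; do
    case "$user" in
      ''|.|..|*/*|-*)
        echo "refusing suspicious name: '$user'" >&2
        return 1
        ;;
    esac
    if [ ! -d "$HOME_BASE/$user" ]; then
      echo "no such home directory: $HOME_BASE/$user" >&2
      return 1
    fi
  done
  # Pass 2: show what will go, then delete it.
  for user in "$@"; do
    ls -ld "$HOME_BASE/$user"
    rm -rf -- "$HOME_BASE/$user"
  done
}
```

    A stray space that turns one argument into "/home/" and "user3" is rejected in pass 1 because of the slash, so nothing is deleted at all.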

    1. Re:Or you use scripts by jbrown.za · · Score: 1

      Scripts can be very dangerous as well.

      Many problems have been caused by scripts that are tested while logged in as a user and then run under the root crontab, where the starting directory or environment variables are not the same.
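
      One way to take the surprise out of the cron case, sketched under the assumption that the script deletes inside a directory passed to it. resolve_target is a made-up name: the idea is to pin PATH explicitly and to refuse anything that is not an existing absolute directory, so an unset variable or an unexpected starting directory makes the script stop instead of delete.

```shell
# Hypothetical preamble for a cleanup script that may run from cron.
# Do not rely on whatever PATH or working directory the caller had.
PATH=/usr/local/bin:/usr/bin:/bin
export PATH

# Resolve a target directory and refuse anything unsafe.
# Prints the absolute, symlink-free path on success.
resolve_target() {
  dir="$1"
  case "$dir" in
    /*) ;;  # absolute paths only; an empty or unset value lands below
    *) echo "refusing relative or empty path: '$dir'" >&2; return 1 ;;
  esac
  # cd in a subshell: a failed cd can never leave us in the wrong place.
  resolved="$(cd "$dir" 2>/dev/null && pwd -P)" || {
    echo "cannot enter: $dir" >&2
    return 1
  }
  if [ "$resolved" = "/" ]; then
    echo "refusing to operate on /" >&2
    return 1
  fi
  printf '%s\n' "$resolved"
}

# Intended use inside the script (path is a placeholder):
#   target="$(resolve_target "$CLEANUP_DIR")" || exit 1
#   rm -rf -- "$target"/old-files
```

      It is still worth running the script once from a stripped-down environment before it goes into root's crontab.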

    2. Re:Or you use scripts by perpenso · · Score: 1

      Scripts can be very dangerous as well. Many problems have been caused by scripts that are tested while logged in as a user and then run under the root crontab, where the starting directory or environment variables are not the same.

      You can have a typo or mistake in the script and have it occur once (in a test environment usually). Or you can have a potential typo or mistake (typing from wrong directory here too) every time you manually execute commands on a production system. There are potential typos and mistakes on either path, but one reduces the risks.

    3. Re:Or you use scripts by grep+-v+'.*'+* · · Score: 1

      Or you use scripts. somescript user1 user2 user3

      Certainly: somescript . /user1

      I was originally going to say: .. /user1 but figured that would just be mean on rm /home/$user success. You could always try to make the script smarter but that just breeds more intelligent idiots.

      Signed: Bobby Tables

      --
      If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
  23. Every tech company by Anonymous Coward · · Score: 1

    goes through this kind of incident. It's part of growing up.
    We lost 30TB of data from a flaky NAS a few years back; we're still here!
    (and good luck, Gitlab is great software)

    1. Re:Every tech company by Anonymous Coward · · Score: 0

      Who gives a fuck if you lose a billion customer records. There are 7 billion suckers out there. Market harder bro and every one of those suckers will want to buy your shit.

  24. All my sympathy... by Gumbercules!! · · Score: 4, Insightful

    I don't care if this is a mistake and screw-up of their own making (and it is, on every level) - if you've ever worked as a sysadmin you have got to feel for these guys.

    1. Re:All my sympathy... by malkavian · · Score: 2

      Definitely feel for 'em.. And really feel for the guy who was on the keyboard..

    2. Re:All my sympathy... by Pascoea · · Score: 1

      I think everybody with any kind of root access has done it. I found out the hard way that "delete * from userSettings where username=whatever" is a significantly different query than "delete * from userSettings where username-whatever". Seeing a result of "23134 records affected" when expecting "1 records affected" will wake a guy up in a hurry.

  25. Agile by Anonymous Coward · · Score: 0

    Agile backup... fail and fail often.

  26. An that is why you run BCM and recovery tests by gweihir · · Score: 3, Interesting

    Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.

    Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.
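
    For a plain file backup, the smallest possible recovery test might look like the sketch below. It assumes a tar archive created with "tar -cf ARCHIVE -C SOURCE_DIR ." and a source that has not changed since; restore_drill is a made-up name. A database needs more than this (start an instance on the restored data and query it), so treat it as an illustration of the principle only.

```shell
# Hypothetical restore drill for a tar-based backup: unpack the archive
# into a scratch directory and compare it against the source.
# Usage: restore_drill ARCHIVE.tar SOURCE_DIR
restore_drill() {
  archive="$1"
  source_dir="$2"
  scratch="$(mktemp -d)" || return 1
  if ! tar -xf "$archive" -C "$scratch"; then
    echo "RESTORE FAILED: cannot unpack $archive" >&2
    rm -rf "$scratch"
    return 1
  fi
  # diff -r exits non-zero if any file is missing or differs.
  if diff -r "$source_dir" "$scratch" >/dev/null; then
    status=0
    echo "restore OK"
  else
    status=1
    echo "RESTORE MISMATCH: $archive differs from $source_dir" >&2
  fi
  rm -rf "$scratch"
  return "$status"
}
```

    The point is that the drill actually unpacks the archive; checking that the backup job exited 0 proves nothing about whether it can be restored.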

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:An that is why you run BCM and recovery tests by Entrope · · Score: 1

      BCM? Bravo Company, manufacturer of firearm parts so you can shoot your servers? Buzzword-Centric Methodology? The SourceForge "BCM" project, a file compression utility? Baylor College of Medicine? Bear Creek Mining? Bacau International Airport? Broadcom?

    2. Re:An that is why you run BCM and recovery tests by gweihir · · Score: 0

      Don't play if you have no clue about the game....

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    3. Re: An that is why you run BCM and recovery tests by Entrope · · Score: 1

      You first. If your head is so far up your ass that you can't tell when you're using a buzzword acronym with little exposure in the tech world and a lot of plausible meanings, you might be a tool.

    4. Re: An that is why you run BCM and recovery tests by Anonymous Coward · · Score: 0

      My best guess is "business continuity management" plans. I had the same question since I was unfamiliar with that acronym--I work in university IT where we have DR/DRP (disaster recovery plan). Hope that helps

    5. Re:An that is why you run BCM and recovery tests by msauve · · Score: 1

      He also mentioned bean counters, so maybe it means Bean Counting Management. But based on his later post, it's clear that even he doesn't know what it means.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    6. Re: An that is why you run BCM and recovery tests by gweihir · · Score: 1

      You need both. BCM to continue operating while the DR activities are ongoing and DR to get back to normal. For example, a fail-over system is a BCM measure, while restoring from backups is DR. For a university it may be acceptable to just do without IT until DR is completed. Both are absolute standard terms in enterprise IT.

      @Entrope: Incidentally, a civil question gets a civil answer. The first 2 Google hits also decipher BCM in the context here and if you add "outage" it is a few more. So I guess you were not even trying, except trying to be an ass.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    7. Re: An that is why you run BCM and recovery tests by Anonymous Coward · · Score: 0

      Incidentally, a civil question gets a civil answer.

      For many people, that indeed tends to be the case.

      In your case, however, it is fairly often a different story. Really, go back and check your own posts, where a civil question not seldom gets a snarky response.

    8. Re: An that is why you run BCM and recovery tests by Anonymous Coward · · Score: 0

      >buzzword acronym with little exposure in the tech world

      Look, just because you don't know what an acronym is, don't assume others are as ignorant as you. BCM or BCP is very well known to professional sysadmins.

    9. Re: An that is why you run BCM and recovery tests by AK+Marc · · Score: 1

      Nope. DR is a "solution" to the question of BCM. But you are speaking as if you assume BCM is some manner of redundancy (or diversity). They had BCM. BCM doesn't help. They had backups. They didn't properly test them. That means they had BCM, and BCM was worthless. So bringing it up looks to be more a way to talk buzzwords than help people.

      BCM is "when the site goes down, we spin up the backups in AWS" or something like that. That the backups don't work, and aren't tested is unrelated to BCM, or DR, or any of your other worthless buzzwords.

    10. Re: An that is why you run BCM and recovery tests by gweihir · · Score: 1

      That is really not how this works. DR is how you re-establish normal operations. BCM is what you do before you reach that state.

      Also, have you missed the part where I said "An that is why you run BCM and recovery tests"? You do not just need plans for BCM and DR, you need to test them and that was my whole point.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    11. Re: An that is why you run BCM and recovery tests by gweihir · · Score: 1

      Indeed. There are however a lot of amateurs around and quite a few BCM and DR plans that are not very good or not adequately tested. This story demonstrates that nicely. The thing is that both BCM and DR and respective tests are costly and do not directly create value and so the bean-counters always try to reduce them, because they do not grasp risk management.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    12. Re:An that is why you run BCM and recovery tests by gweihir · · Score: 1

      Your powers of deduction are amazing in their ineffectiveness. By now everybody else in this thread has probably seen that it is of course "Business Continuity Management". Google has about 521'000 hits for it (in quotes, i.e. the full term), hence it can hardly be an obscure concept.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    13. Re: An that is why you run BCM and recovery tests by AK+Marc · · Score: 1

      You don't "run" BCM. Business Continuity Management is about calculating the costs and risks. It is done by the CFO and COO, not the CIO. The CIO (or grunts below) come up with a BCP to meet the BCM. DR is one option for a BCP to meet the BCM.

      Yes, I read your words. They didn't make sense. Expand the acronym, and it's not even proper English. "An that is why you run Business Continuity Management (BCM) and recovery tests". You don't run BCM tests. You run BCP tests.

      Or do you not know the difference between BCM and BCP, and are lecturing others for using the terms wrong?

    14. Re: An that is why you run BCM and recovery tests by gweihir · · Score: 1

      This is Slashdot. You will always find inaccuracies because of lack of detail, as nobody writes a dissertation here. (And even in a dissertation, that problem would exist. I know, I did one.) It is completely clear what I meant and where I simplified. Your argument has no merit.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    15. Re: An that is why you run BCM and recovery tests by AK+Marc · · Score: 1

      You oversimplified to the point it was incorrect. That makes you wrong. Your argument has no merit. If you want to stop being wrong and looking like an argumentative idiot, spend more effort on being accurate. It wouldn't have taken any more words to have been more accurate.

  27. BOFH by Anonymous Coward · · Score: 0

    Our old BOFH at his best.

  28. Service reliability by Anonymous Coward · · Score: 0

    Having supported numerous telecommunications and computer systems for which more than professional pride was at stake if year-to-year availability of five nines (99.999%) was not achieved, I find it astonishing that any service platform supporting content created by paying customers would NOT have a fully tested plan and viable backups in place for bare-metal restores of all servers and databases, virtual and otherwise. This incident reveals negligence and failure at GitLab to provide support that should be implicit in the service they offer.

    Currently we support several large PostgreSQL instances and have scripts that back up to local storage, as well as to network and cloud storage (S3, etc.). Just in case. If backups are NOT successfully created each time the scripts run, then DBAs and sysadmins are notified immediately. Furthermore, backup scripts are regularly tested, and all scripts and processes are logged, with detailed log entries that allow quick identification of failed components, both hardware and software. In 2017, this is not rocket science! Plan, test, and document. Constantly.
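
    To make the "notified immediately" part concrete, here is a rough sketch of the kind of check that could run after each backup job. It is not our actual script: check_backup and notify are made-up names, notify just writes to stderr as a stand-in for mail or a pager, and the one-day default window is an arbitrary assumption.

```shell
# Hypothetical post-backup check: the dump must exist, be non-empty,
# and be newer than the given number of minutes.
notify() {
  # Stand-in for whatever pages your DBAs (mail, chat webhook, ...).
  echo "ALERT: $*" >&2
}

check_backup() {
  file="$1"
  max_age_min="${2:-1440}"
  # -s is true only for files that exist and have a size above zero.
  if [ ! -s "$file" ]; then
    notify "backup missing or empty: $file"
    return 1
  fi
  # find prints the file only if it was modified within the window.
  if [ -z "$(find "$file" -mmin "-$max_age_min" -print)" ]; then
    notify "backup older than $max_age_min minutes: $file"
    return 1
  fi
  echo "backup OK: $file"
}
```

    A check like this only catches missing, empty, or stale files. Whether the dump can actually be restored is a separate test.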

  29. DR Testing as a business model by swb · · Score: 1

    Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?

    Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.

    I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and switching into one of those rock band roadie cases and take it to a business with the idea that they would hand over their backup media and then see what happens when they try to restore their data to your equipment.

    The customer would provide all software and media, just as they would in a real disaster.

    It would eliminate the "we can't restore everything" capacity issue most places have. The fact that the equipment would differ from what they have (even if it's only slight model variations) would be the kind of variation likely in a real DR scenario -- if you have to physically replace hardware, it likely won't be the same model stuff you have now.

    An option would exist to have/not have the staff participate in the process -- I'm sure many CxOs are curious if their "system" can survive being restored by someone else.

    1. Re:DR Testing as a business model by Anonymous Coward · · Score: 0

      I worked in a small computer shop which sold kit to a business just like this. They had a huge truck full of tape drives, storage arrays, servers and even a mobile office. They would pull up in the car park post-disaster and provide DR functions for the business. As part of the service they would take backup tapes and run test restores on their kit to prove they were viable.

    2. Re:DR Testing as a business model by coofercat · · Score: 1

      As a sysadmin, this sounds great (a bit 'brown trousers' for me personally, but great). However, one of my clients is entirely 'in the cloud', so no need for your truck of kit - just provide as many VMs as we like somewhere on t'internet. Ideally you'd be able to do this in a 'little internet' which has a VPN to get into it, has its own DNS servers, and maybe ways to 'bend' or alter requests to other cloudy services, such as Google or Amazon, such that the app 'thinks' it's talking to the real, live production service, but actually it's talking to a test account or some such. That means I can spin up my client's world in your environment and have it think it was on the internet, but actually not interact with anything real - and I don't need to change every account and password baked into the code and config so I don't do any damage to real data.

      Secondly, just like the backups and drills that most companies don't bother to do, they won't bother to hire a service like this either. You'll probably be able to make a few top-dollar sales to some big shops who already have very good DR procedures, but the little place (or even medium place) probably won't bother.

      One way I could imagine this working would be to gain some sort of certification. Say for example, the fiduciary regulations of Elbonia were changed to say that all app providers must have externally verified DR capability, then your business would fit right in and solve that need - and you'd probably get lots of work, and hopefully lots of repeat work too. Short of regulations though, whatever certification you could come up with on your own wouldn't be worth enough to have people want to pay to get it.

    3. Re:DR Testing as a business model by swb · · Score: 1

      Secondly, just like the backups and drills that most companies don't bother to do, they won't bother to hire a service like this either.

      Maybe, but often the real problem is that they don't have the facilities to do it in. It becomes kind of an existential question they can't answer. I think if you attacked the CIO/CFO with the idea of this service and why your staff can't do it now and what they don't know, you'd get more uptake than you might think. You might even get line staff on board with it, too, since a successful restore or the ability to adjust procedures to get a successful restore might (a) make them sleep better at night and (b) be an ace in their pocket if something does go wrong in an actual disaster -- "we hired the service, and tested the system as completely as possible and it worked. This failure is an act-of-god/statistical improbability that you can't blame on us."

      Say for example, the fiduciary regulations of Elbonia were changed to say that all app providers must have externally verified DR capability, then your business would fit right in and solve that need

      I'd bet between SOX, HIPAA, partner agreements, insurance, etc., there are already enough soft requirements that you could say: "Sure, you're not *mandated* to have more than 'just' a DR plan, but if your plan is shit and non-viable, your civil liability is limitless. A proven and certified execution of your DR plan is a get-out-of-jail-free card if it doesn't work for act-of-god reasons."

      The cloud part is tougher, but to be honest, I don't really know how people protect themselves in those environments, and I'd wager a lot don't besides making redundant data copies and hoping that the cloud has them covered -- which it might, from a lot of physical failures, but I think they impart too much faith in cloud systems from a recovery perspective, but that's almost a different discussion.

    4. Re:DR Testing as a business model by _Sharp'r_ · · Score: 1

      An expensive way, which is also pretty bulletproof:
      At least two geographically separate production environments, run in each for approximately half the year total, switching periodically which is the target DR setup and which is the Prod environment.

      Then you always know your backups to your DR are working (hint: use snapshotting/versioning as well, to avoid the replicating-the-disaster issue), because you are periodically forced to actually use it as a real production environment. You know your switchover and switchback processes work and how long they really take, because you routinely follow them. It's not just data. In these days of internetworking, you need to be sure your IP space, firewall rules, partners' firewall rules, routing, proxies, DDoS protection, VPNs, etc. will all function properly if you need to fail over during a disaster. It also helps to go live in an environment with only a set or two of patching/upgrade cycles having passed, rather than hoping years of OS and firmware changes were also properly applied to your backup environment.

      If you've never run in an environment, then you may have some hardware and such, but you don't quite have an actual environment yet.

      --
      The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
    5. Re:DR Testing as a business model by AK+Marc · · Score: 1

      SOX and HIPAA have no requirements around "uptime" other than being able to provide historic records within a matter of weeks, if required. DR and uptime are unrelated to most statutory requirements.

    6. Re:DR Testing as a business model by AK+Marc · · Score: 1

      Sounds like someone who hasn't heard of Blue/Green discovering Blue/Green. You don't have "prod" and "dev" but you build in dev (or test, whatever you like to call it), then promote dev to prod, and down-grade prod to dev. A short hold-down to be able to roll back to previous prod, if problems occur, then old prod becomes new dev. Switching between environments happens every release.

      But you could do that without any backups or DR.

    7. Re:DR Testing as a business model by _Sharp'r_ · · Score: 1

      Sounds like a similar concept in a lot of ways, but some of us have regulatory requirements for separation of duties which (among other things) prohibit the use of production data in test environments.

      --
      The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
  30. Solution by aliquis · · Score: 1

    Maybe they could have uploaded any new changes to some sort of online repository.

  31. HAMMERFS by Anonymous Coward · · Score: 1

    Allows exactly that, as does ZFS, I am told.

    Assuming you have a large enough disk, and small enough space requirements to leave the room necessary for it, a filesystem solution like either of the above allows exactly this, and barring filesystem corruption due to corner cases or hardware issues, should allow reverting all but the largest PEBKAC issues.

  32. Only perform reversible actions by marko123 · · Score: 1

    A lesson always learnt the hard way. Those of us who have learnt it the hard way have known the feeling before: I'll trust that this is correct and the feeling after: Shiat!

    --
    http://pcblues.com - Digits and Wood
  33. did they even try a recovery? by Anonymous Coward · · Score: 0

    Take the disk out of active duty and they could probably have recovered it straight off. I mean, it's not like rm overwrites the actual data...

    1. Re:did they even try a recovery? by ChrisMaple · · Score: 1

      That was my first thought. Recovery isn't guaranteed, and the process might be labor intensive. They'd have to notify their customers that their data might be corrupted.

      --
      Contribute to civilization: ari.aynrand.org/donate
  34. It all boils down to trust. by Anonymous Coward · · Score: 0

    In every IT organisation the most valuable asset, right before the people who work there, is the trust of their customers.
    The whole process has been very transparent, perhaps too much, and it is clear that technical disaster has met naivety.

    Seriously guys? Empty files of DB dumps?

  35. Devops by funkymonkjay · · Score: 1

    Clearly a case of too much deving and not enough oping.
    I see no mention of alarms. With all those failures there must have been some notification.
    Chances are they were lost in the nested folders of Outlook with millions of other false alarm alerts.
    They must have a mirrored test zone that they can copy from.
    Lesson learned. Test your backups, regularly!
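
    A minimal sketch of the kind of check that would raise an alarm on an empty or stale dump file. The function name, thresholds, and file layout are assumptions for illustration, not anything GitLab actually runs:

```python
import os
import time


def check_backup(path, min_bytes=1024, max_age_hours=24, now=None):
    """Return a list of problems found with one backup file (empty list = looks OK)."""
    if not os.path.isfile(path):
        return [f"{path}: missing"]
    problems = []
    info = os.stat(path)
    # An empty or tiny dump usually means the dump job failed silently.
    if info.st_size < min_bytes:
        problems.append(f"{path}: only {info.st_size} bytes")
    # A dump that has not been rewritten recently means the job stopped running.
    now = time.time() if now is None else now
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: {age_hours:.1f} hours old")
    return problems
```

    Anything this returns should go somewhere a human will actually see it, not into a mail folder nobody opens. It only catches the obvious failures; it says nothing about whether the dump can be restored.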

  36. All filesystems suck... YUP by Anonymous Coward · · Score: 0

    Well, users need versioning like VMS, so that when they overwrite something they can go back to another version. MS Volume Shadow Copy is not good enough and slows the system down a LOT. The other thing that all systems need is ZFS, like FreeNAS has (I am sure other OSes, Linux and UNIX, have it as well), so we can just expose the snapshot and copy. To be useful we need an extra file system to copy to, so reads and writes never go random, and we always need lots of slack space.

    So why was the admin so concerned about disk space that he used the rm command? My guess was not enough space....

  37. And of course by John+Allsup · · Score: 1

    And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.

    What is frustrating is that, given all the progress in hardware reliability since when I grew up, people take reliability for granted, whereas way-back-when, people who did that learned pretty f'ing quickly that stuff can and does go wrong.

    --
    John_Chalisque
  38. Go ahead...yawn, but by Provocateur · · Score: 1

    That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!

    So don't blame the guy, praise him and thank him for saving us all!

    --
    WARNING: Smartphones have side effects--most of them undocumented.
  39. Six hours of loss is a "melt-down"? by thesandbender · · Score: 4, Insightful

    Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".

    1. Re:Six hours of loss is a "melt-down"? by sysrammer · · Score: 2

      I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.

      Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" are gone.

      But sometimes it's the entire financial transaction or contractual history. Or the finished version of your novel. Or the priceless-to-you pictures of your baby/SO/parent/nana.

      Pro sysadmins pretty much find that a "no data loss mindset" is career enhancing.

      --
      His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain
    2. Re:Six hours of loss is a "melt-down"? by Anonymous Coward · · Score: 0

      I agree, especially considering Git's decentralized nature. Most users probably won't notice any loss short of recent merges. The biggest positive in tagging this event as a "meltdown" is that it will hopefully spur better QC among similar managed cloud services.

  40. Um, to clarify: by rickb928 · · Score: 1

    "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

    Or, more accurately, fewer than 5 backup/replication techniques were deployed.

    I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.

    I do not miss sysadmin work so much.

    --
    deleting the extra space after periods so i can stay relevant, yeah.
  41. so... by JustNiz · · Score: 1

    What kind of moron just deploys a new backup strategy then just sits back and trusts their entire infrastructure to it, without ever having actually performed a test recovery?

  42. Tickets/issues/tasks in repository by Lennie · · Score: 1

    This is why I think it would be good to keep the tickets in the repository or a second repository. Easy to replicate, easy to keep history, easy to backup.

    --
    New things are always on the horizon
  43. Command line fans by Anonymous Coward · · Score: 0

    It seldom occurs to command line fanboys that they are always subject to typos, sometimes with serious consequences, no matter how clever they imagine themselves to be.

  44. Oh Gitlab... by SpencerWilliams · · Score: 1

    I love you, but when will you stop sucking?

  45. Management by Anonymous Coward · · Score: 0

    "a tired sysadmin", an equal part of the blame should probably go to the management team for allowing their staff to be overworked and "tired".

    Similar to the NASA disasters, VW emissions scandal, and countless other examples where the management chooses an engineer, developer, or some other so-called managerial underling to take the full force of the blame.

    Wouldn't surprise me at all if the management didn't allow for a regular backup process validation due to the "costs" involved, also.

  46. Same story, different day by darkain · · Score: 1

    I preach the same thing every time.

    ZFS snapshots.
    ZFS Send/Recv to other data centers.

    Is it really that hard? That is literally all you have to do. Delete a folder? Copy it from snapshot. Things are more fucked than that? Revert to snapshot. Entire server is nuked? You have 100% replication off-site with snapshotting intact. Don't know how to set it all up? Install FreeNAS and use the built-in web UI for it. No longer are any other excuses viable.
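
    For anyone who has not seen them, the underlying commands look roughly like this. The sketch only builds the argument lists and does not run anything; the dataset names, snapshot names, and remote host in the usage note are hypothetical:

```python
def snapshot_cmd(dataset, name):
    """Command that takes a snapshot called dataset@name."""
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def replicate_cmds(dataset, previous, current, remote_host, remote_dataset):
    """Sender and receiver halves of an incremental send.

    The output of the first command is meant to be piped into the second.
    """
    send = ["zfs", "send", "-i", f"{dataset}@{previous}", f"{dataset}@{current}"]
    receive = ["ssh", remote_host, "zfs", "receive", remote_dataset]
    return send, receive


def rollback_cmd(dataset, name):
    """Command that reverts the dataset to a snapshot, discarding later changes.

    Without extra flags this only works for the most recent snapshot.
    """
    return ["zfs", "rollback", f"{dataset}@{name}"]
```

    For example, `snapshot_cmd("tank/db", "hourly-01")` gives the argument list for `zfs snapshot tank/db@hourly-01`. Recovering a single deleted folder usually does not need a rollback at all: the old files can be copied back out of the dataset's `.zfs/snapshot/` directory.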

  47. Good job by Anonymous Coward · · Score: 0

    Hey, let's take a distributed, naturally fault-tolerant content management system and centralize it. Good job.

  48. Linus' shit is still safe tho by Anonymous Coward · · Score: 0

    Linus said...

    Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)

  49. check your precision! by gosand · · Score: 1

    To be more precise... test your RESTORE PROCESS.

    It is important to not only know that your backups are good, but that your process of restoring them is sound and that you have at least tried it.
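
    A toy version of what "try it" can mean, assuming a plain directory backed up with tar: restore into a scratch location and compare checksums against the source. The function names are made up for illustration, and a real database restore involves far more than this:

```python
import hashlib
import os
import tarfile
import tempfile


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def backup(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=".")


def restore_matches(source_dir, archive_path):
    """Restore the archive into a scratch directory and compare it with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        return tree_digests(scratch) == tree_digests(source_dir)
```

    The comparison only makes sense if the source has not changed since the backup was taken, so for live data you would compare against a snapshot or a recorded manifest instead. The archive should also live outside the directory being backed up.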

    --

    My beliefs do not require that you agree with them.

  50. You should see what they are saying... by Anonymous Coward · · Score: 0

    Apparently not a single backup was correctly configured or deployed, site wide. 5 different backup systems all failed.

  51. I don't feel for the DBAs one bit by Anonymous Coward · · Score: 0

    They failed to do their jobs and ignored failures which is another part of their job. Imagine if this was a vendor used in critical life-saving procedures. OOOO I feel for the guy. Fuck you. These fucks did not do their jobs. They don't deserve any sympathy unless they are sending back their paychecks. They got paid to do work they failed to do.

  52. ouch by D,Petkow · · Score: 1

    a fellow admin once wiped a live prod box serving lots of customers - but it was load balanced and we managed to rsync it from the other host but i still remember his face

  53. Code Source 101 by Anonymous Coward · · Score: 0

    Never allow rm -rf on repo servers.

    You're supposed to stage deletes as well. If not your backup strategy is plain stupid.

    I remember rm -rf was a big issue back in 1998; it appears it still is an issue.
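
    One way to read "stage deletes", sketched under my own assumptions: move the directory into a holding area instead of removing it, and refuse to act at all if the machine is not the one you think it is. The function name, the trash location, and the hostname check are illustrative only:

```python
import os
import shutil
import socket
import time


def staged_delete(path, trash_dir, expected_host=None):
    """Move path into trash_dir instead of deleting it; return the new location."""
    # Guard against running on the wrong server.
    if expected_host is not None and socket.gethostname() != expected_host:
        raise RuntimeError(
            f"refusing: this is {socket.gethostname()}, not {expected_host}"
        )
    os.makedirs(trash_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    base = os.path.basename(os.path.normpath(path))
    target = os.path.join(trash_dir, f"{base}.{stamp}")
    if os.path.exists(target):
        raise FileExistsError(target)
    # A rename on the same filesystem is fast and trivially reversible.
    shutil.move(path, target)
    return target
```

    Permanent removal then becomes a separate, later step that only touches the holding area. This needs enough free space to keep the staged copy around, which is exactly the trade-off people skip when a disk is filling up.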

  54. If you are not testing restores... by DdJ · · Score: 1

    ...then you are not performing backups.

  55. COW filesystem by DrYak · · Score: 1

    A much simpler way to do it, that won't require you to hack standard system command-line tools,
    would be to use some copy-on-write or log-structured file system (e.g. BTRFS, ZFS, etc., depending on your taste),
    and use snapshots to keep older versions of your file tree.
    If anything goes wrong you can still recover from a previous snapshot.

    Some Linux distributions (like openSUSE) have tools (like snapper) that can automate this task for you (and openSUSE uses snapper to similarly snapshot system upgrades).

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]