Slashdot Mirror


GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

16 of 356 comments

  1. Yawn... by Anonymous Coward · · Score: 5, Insightful

    No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

    This has been going on since the dawn of computing and it seems there's no end in sight.

     

    1. Re: Yawn... by Anonymous Coward · · Score: 3, Insightful

      No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.

      He's just a racist.

    2. Re:Yawn... by Big+Hairy+Ian · · Score: 2, Insightful

      Clearly their DR Plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!

      --

      Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

    3. Re:Yawn... by Anonymous Coward · · Score: 2, Insightful

      And will continue to happen because experienced admins are expensive. Cheaper to hire the new grad who knows the buzzwords than the experienced admin who has lived through a couple of catastrophes and now knows how to plan for them.

  2. I feel that lone sysadmin's pain by sixdrum · · Score: 5, Insightful

    A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

    rm -rf /home/user1 /home/user2 /home/ user3

    Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

    1. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 5, Insightful

      That's why you always always run ls first.

      ls -ld /home/user1 /home/user2 /home/ user3

      Then edit the command to rm. Always.
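A minimal sketch of the "ls first, then rm" habit as a shell function. The name `preview_then_rm` is hypothetical, not a standard tool. It assumes a POSIX-like shell with `local` (bash or dash) and an `ls` that accepts `--`. Because `ls` exits non-zero when a target does not exist, a stray space such as `/home/ user3` will usually abort here, but only if no `user3` happens to exist in the current directory, so reading the listing is still on you.

```shell
# preview_then_rm: hypothetical helper, not a standard tool.
# Lists every target with ls -ld, refuses to continue if any target is
# missing (the usual symptom of a stray space), then asks for
# confirmation before running rm -rf on exactly the same arguments.
preview_then_rm() {
    local answer
    if [ "$#" -eq 0 ]; then
        echo "usage: preview_then_rm PATH..." >&2
        return 2
    fi

    # ls exits non-zero if any argument cannot be listed.
    if ! ls -ld -- "$@"; then
        echo "aborting: at least one target could not be listed" >&2
        return 1
    fi

    printf 'Delete the %d item(s) listed above? [y/N] ' "$#"
    read -r answer
    case "$answer" in
        y|Y) rm -rf -- "$@" ;;
        *)   echo "nothing deleted"; return 1 ;;
    esac
}
```

Anything other than `y` or `Y` at the prompt leaves the targets untouched.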

    2. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 2, Insightful

      Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?

    3. Re:I feel that lone sysadmin's pain by Angstroem · · Score: 3, Insightful

      The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P

      Then you have never worked on a repository with users of TortoiseSVN and the likes.

      "Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."

    4. Re: I feel that lone sysadmin's pain by gsslay · · Score: 5, Insightful

      This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse, on someone else's server.

      Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
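The rm2 idea above could look roughly like this. It is a sketch under assumptions: the function names, the `TRASH_DIR` location and the 48-hour retention are illustrative, and `find -mmin`, `-mindepth` and `-maxdepth` are GNU/BSD extensions rather than POSIX. The real `rm` is left alone, so on a machine without the helper the worst case is "command not found".

```shell
# rm2: hypothetical "move to trash" helper; the system rm stays untouched.
# TRASH_DIR and the 48-hour retention are assumptions for illustration.
TRASH_DIR="${TRASH_DIR:-$HOME/.trash}"

rm2() {
    local batch
    if [ "$#" -eq 0 ]; then
        echo "usage: rm2 PATH..." >&2
        return 2
    fi
    mkdir -p -- "$TRASH_DIR" || return 1
    # Each call gets its own subdirectory so same-named files do not collide.
    batch=$(mktemp -d "$TRASH_DIR/batch.XXXXXX") || return 1
    mv -- "$@" "$batch"/
}

# purge_trash: permanently delete batches older than 48 hours (2880 minutes).
purge_trash() {
    [ -d "$TRASH_DIR" ] || return 0
    find "$TRASH_DIR" -mindepth 1 -maxdepth 1 -mmin +2880 -exec rm -rf -- {} +
}
```

`purge_trash` is the part that would run periodically, for example from cron. Keeping the trash on the same filesystem as the data makes `mv` a rename instead of a copy.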

    5. Re: I feel that lone sysadmin's pain by Joce640k · · Score: 5, Insightful

      I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!

      This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.

      If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.

      --
      No sig today...
  3. Re:Repeat after me (and others) by dbIII · · Score: 5, Insightful

    Uh, did you read the article?

    No, and I got the wrong impression from skimming the article.
    You are correct and I am not.

  4. Re:Don't use rm! by sodul · · Score: 3, Insightful

    Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is that DevOps are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos-driven, with anything goes, since Sales promised a feature to a prospective customer yesterday, every week.

  5. Re:Repeat after me (and others) by tonymercmobily · · Score: 5, Insightful

    "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
    OK, now that I have repeated it, let me add.

    As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

    THEN you will see, for real, how your company reacts to real disasters.

    The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.

    Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

    http://www.datacenterknowledge...

    Merc.
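A small-scale version of "test a restore" can be sketched in shell. This is an illustration under assumptions, not anyone's actual procedure: `verify_backup` is a hypothetical name, the backup is assumed to be a plain tar archive created with `tar -cf ARCHIVE -C SOURCE_DIR .`, and the source is assumed not to change between the backup and the check. A live database would need a real restore plus sanity queries, which is not shown here.

```shell
# verify_backup: hypothetical restore drill for a tar-based backup.
# Usage: verify_backup SOURCE_DIR ARCHIVE.tar
# Restores the archive into a scratch directory and compares the result
# with the source tree. Returns 0 only if the restored copy matches,
# 1 if the restore or comparison fails, 2 on bad input.
verify_backup() {
    local src="$1" archive="$2" scratch status
    [ -d "$src" ] && [ -s "$archive" ] || {
        echo "verify_backup: missing source or empty archive" >&2
        return 2
    }
    scratch=$(mktemp -d) || return 1
    if tar -xf "$archive" -C "$scratch" && diff -r "$src" "$scratch" >/dev/null; then
        echo "restore OK: $archive matches $src"
        status=0
    else
        echo "restore FAILED for $archive" >&2
        status=1
    fi
    rm -rf -- "$scratch"
    return "$status"
}
```

A check like this only proves the archive can be read back. It does not replace the full drill described above, where the main server is actually switched off.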

  6. All my sympathy... by Gumbercules!! · · Score: 4, Insightful

    I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as a sysadmin you have got to feel for these guys.

  7. Re:Repeat after me (and others) by cdrudge · · Score: 4, Insightful

    "they are ISO-over-9000 and that is good enough for us"

    Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

    1. Perform backup
    2. Pray nothing goes wrong.

    Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it like the standards require.

  8. Six hours of loss is a "melt-down"? by thesandbender · · Score: 4, Insightful

    Editors. I understand that any loss is bad but holy hyperbole, Batman... the title reads like a nuke was dropped on GitLab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".