GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

Posted by BeauHD on Tuesday January 31, 2017 @07:00PM from the put-in-a-hard-day's-work dept.

An anonymous reader quotes a report from The Register: Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued a sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

Yawn... by Anonymous Coward · 2017-01-31 19:04 · Score: 5, Insightful

No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
1. Re: Yawn... by Nutria · 2017-01-31 20:35 · Score: 5, Funny
  
  paki chimps in jungle
  Someone failed geography class...
  
  --
  "I don't know, therefore Aliens" Wafflebox1
I feel that lone sysadmin's pain by sixdrum · 2017-01-31 19:06 · Score: 5, Insightful

A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

rm -rf /home/user1 /home/user2 /home/ user3

Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
1. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 19:28 · Score: 5, Insightful
  
  That's why you always always run ls first.
  ls -ld /home/user1 /home/user2 /home/ user3
  Then edit the command to rm. Always.
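The "ls first" habit can be wrapped in a small helper. This is only a sketch: `confirm_rm` is a made-up name, not a standard command, and it assumes an interactive shell where the operator can type an answer. It lists every argument exactly as the shell parsed it, stops if any of them cannot be listed, and deletes only after an explicit "yes".

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name confirm_rm is an assumption, not a real tool):
# preview every argument with `ls -ld`, then delete only after a typed "yes".
confirm_rm() {
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_rm PATH..." >&2
        return 2
    fi
    # Show what the shell parsed as separate arguments. A stray space turns
    # "/home/ user3" into the two arguments "/home/" and "user3".
    # If any argument cannot be listed, stop before deleting anything.
    ls -ld -- "$@" || { echo "listing failed; nothing removed" >&2; return 1; }
    printf 'Remove the %d path(s) listed above? Type yes: ' "$#"
    read -r answer
    if [ "$answer" = "yes" ]; then
        rm -rf -- "$@"
    else
        echo "aborted; nothing removed" >&2
        return 1
    fi
}
```

With the typo from the parent comment, `confirm_rm /home/user1 /home/user2 /home/ user3` would show `/home/` on a line of its own, and unless a `user3` happens to exist in the current directory the listing fails and nothing is removed.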
2. Re:I feel that lone sysadmin's pain by mmell · 2017-01-31 19:35 · Score: 5, Interesting
  
  Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).
3. Re:I feel that lone sysadmin's pain by stridebird · 2017-01-31 21:08 · Score: 5, Informative
  
  Correct pattern is:
  > cd /home && rm ...
  i.e. don't run rm unless cd worked.
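A sketch of the same `cd ... && rm ...` pattern as a function. The name `rm_inside` is hypothetical, and the check that rejects names containing a slash is an extra assumption added here, not something the comment above asked for.

```shell
#!/usr/bin/env bash
# Hypothetical helper (rm_inside is an assumed name): delete named entries
# inside one directory, and only after `cd` into that directory succeeded.
rm_inside() {
    if [ "$#" -lt 2 ]; then
        echo "usage: rm_inside DIR NAME..." >&2
        return 2
    fi
    dir=$1
    shift
    # Extra guard (an assumption, not part of the original pattern):
    # accept only plain entry names, never paths, "." or "..".
    for name in "$@"; do
        case $name in
            ''|.|..|*/*)
                echo "refusing suspicious name: '$name'" >&2
                return 1
                ;;
        esac
    done
    # The subshell leaves the caller's working directory untouched.
    # `&&` means rm is never reached when cd fails (typo, unmounted volume).
    ( cd -- "$dir" && rm -rf -- "$@" )
}
```

Usage would look like `rm_inside /home user1 user2 user3`. The point of `&&` over `;` is that with `cd /home; rm -rf user1` a failed `cd` still lets `rm` run in whatever directory you happened to be in.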
4. Re: I feel that lone sysadmin's pain by gsslay · 2017-01-31 23:10 · Score: 5, Insightful
  
  This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse, on someone else's server.
  
  Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
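A rough sketch of what such an `rm2` could look like, written as a shell function so that plain `rm` is left alone. `TRASH_DIR` is an assumed variable name; the thread only mentions a `/trash` directory, which is used here as the default. Each moved path gets its own subdirectory, so two files with the same name do not overwrite each other in the trash.

```shell
#!/usr/bin/env bash
# Hypothetical "rm2": move paths into a trash directory instead of deleting.
# TRASH_DIR is an assumed variable; /trash is the directory named in the thread.
rm2() {
    if [ "$#" -eq 0 ]; then
        echo "usage: rm2 PATH..." >&2
        return 2
    fi
    trash=${TRASH_DIR:-/trash}
    stamp=$(date +%Y%m%dT%H%M%S)
    for target in "$@"; do
        # One fresh slot per path, so same-named files (two README files,
        # or a file removed, recreated and removed again) never collide.
        slot=$(mktemp -d "$trash/$stamp.XXXXXX") || return 1
        mv -- "$target" "$slot"/ || return 1
    done
}
```

If the trash directory does not exist, `mktemp` fails and nothing is moved. And on a machine where the function was never installed, typing `rm2` gives "command not found" rather than a real delete, which is the behaviour argued for above.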
5. Re: I feel that lone sysadmin's pain by rgbatduke · 2017-02-01 00:10 · Score: 5, Interesting
  
  Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.
  With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.
  Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...
  
  --
  Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
6. Re: I feel that lone sysadmin's pain by Joce640k · 2017-02-01 00:22 · Score: 5, Insightful
  
  "I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!"
  This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.
  If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.
  
  --
  No sig today...
Repeat after me (and others) by Nkwe · 2017-01-31 19:12 · Score: 5, Interesting

If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
1. Re:Repeat after me (and others) by MatthiasF · 2017-01-31 20:24 · Score: 5, Informative
  
  Uh, did you read the article?
  
  The six-hour-old snapshot was a fluke, a manually run LVM snapshot; normally they are taken every 24 hours. The SQL dumps weren't running at all because of a misconfiguration, producing tiny files and failing silently. Webhooks will need to be rolled back to the 24-hour backup, since a synchronization process removed them from the 6-hour one (meaning at best 18 hours of updates will have no webhooks, and at worst all 24). Lastly, the replication of their backups from Microsoft's Azure to Amazon's S3, which I assume is there for vendor-agnostic redundancy, has sent no files at all ("the bucket is empty").
  
  It's like they thought out everything but never made sure any of it was working.
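Tiny dump files that fail silently are the kind of thing a scheduled sanity check could catch. The sketch below is an assumption-laden illustration, not what GitLab ran: `check_backup`, its arguments and its thresholds are all made up here, and `find -mmin` is a GNU/BSD extension rather than POSIX. It only verifies that a file exists, is big enough and is recent enough. It is not a restore test, which is still the only real proof.

```shell
#!/usr/bin/env bash
# Hypothetical check (name and arguments are assumptions): fail loudly when
# a backup file is missing, suspiciously small, or older than expected.
# usage: check_backup FILE MIN_BYTES MAX_AGE_MINUTES
check_backup() {
    file=$1
    min_bytes=$2
    max_age_min=$3
    if [ ! -f "$file" ]; then
        echo "FAIL: $file does not exist" >&2
        return 1
    fi
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is only $size bytes, expected at least $min_bytes" >&2
        return 1
    fi
    # find prints the path only if it was last modified more than
    # max_age_min minutes ago (-mmin is available in GNU and BSD find).
    if [ -n "$(find "$file" -mmin +"$max_age_min")" ]; then
        echo "FAIL: $file is older than $max_age_min minutes" >&2
        return 1
    fi
    echo "OK: $file ($size bytes)"
}
```

Run from cron with the exit status wired to an alert, something like `check_backup /path/to/latest.dump 1000000 1500` (path and numbers are placeholders) would at least have flagged dumps that were a few bytes long.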
2. Re:Repeat after me (and others) by dbIII · 2017-01-31 20:30 · Score: 5, Insightful
  
  "Uh, did you read the article?"
  No, and I got the wrong impression from skimming the article.
  You are correct and I am not.
3. Re:Repeat after me (and others) by Opportunist · 2017-01-31 20:36 · Score: 5, Funny
  
  Sing with me, kids:
  One backup in my bunk
  One backup in my trunk
  One backup at the town's other end
  One backup on another continent
  All of them tested and verified sane
  now go to bed, you can sleep once again
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
4. Re:Repeat after me (and others) by tonymercmobily · 2017-01-31 21:16 · Score: 5, Insightful
  
  "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
  OK, now that I have repeated it, let me add.
  As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.
  THEN you will see, for real, how your company reacts to real disasters.
  The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.
  Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.
  http://www.datacenterknowledge...
  Merc.
Re:Don't use rm! by infolation · 2017-01-31 19:58 · Score: 5, Funny

Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Don't tell the customer anything!! Geez... What's with these semi-pros?
If only there was another copy of the repo by HxBro · 2017-01-31 20:28 · Score: 5, Funny

Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this