GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

Posted by BeauHD on Tuesday January 31, 2017 @07:00PM from the put-in-a-hard-day's-work dept.

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

Yawn... by Anonymous Coward · 2017-01-31 19:04 · Score: 5, Insightful

No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
1. Re: Yawn... by Nutria · 2017-01-31 20:35 · Score: 5, Funny
  
  paki chimps in jungle
  Someone failed geography class...
  
  --
  "I don't know, therefore Aliens" Wafflebox1
I feel that lone sysadmin's pain by sixdrum · 2017-01-31 19:06 · Score: 5, Insightful

A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

rm -rf /home/user1 /home/user2 /home/ user3

Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
1. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 19:28 · Score: 5, Insightful
  
  That's why you always always run ls first.
  ls -ld /home/user1 /home/user2 /home/ user3
  Then edit the command to rm. Always.
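The list-then-delete habit above can be sketched as a small shell function. This is a minimal sketch, not a standard tool: the name `preview_rm` and the yes/no prompt are assumptions made for illustration. It passes the same argument list to `ls` and to `rm`, so what is shown is exactly what would be removed.

```shell
# Hypothetical helper: show the exact arguments first, delete only after "yes".
preview_rm() {
    # ls exits non-zero if any path is missing, which aborts before rm runs.
    ls -ld -- "$@" || return 1
    printf 'Delete the %d path(s) listed above? [yes/N] ' "$#"
    read -r answer
    # Anything other than a literal "yes" (including end of input) deletes nothing.
    [ "$answer" = "yes" ] && rm -rf -- "$@"
}
```

With the typo from the parent comment, run from a directory that has no `user3` entry, `preview_rm /home/user1 /home/user2 /home/ user3` lists `/home/` itself, reports four paths instead of three, and stops because `user3` cannot be found.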
2. Re:I feel that lone sysadmin's pain by mmell · 2017-01-31 19:35 · Score: 5, Interesting
  
  Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).
3. Re:I feel that lone sysadmin's pain by arglebargle_xiv · 2017-01-31 20:04 · Score: 4, Funny
  
  "Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!"
  Actually: Check your privilege! (Especially if rm -rf is involved).
4. Re:I feel that lone sysadmin's pain by AmiMoJo · 2017-01-31 21:07 · Score: 4, Interesting
  
  mkdir ./trash
  mv file_to_delete ./trash
  If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.
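A slightly more defensive variant of the mkdir/mv idea might look like the sketch below. The function name `trash` and the `TRASH_DIR` variable are assumptions for illustration, not existing commands. Each call moves its arguments into a fresh timestamped subdirectory, so trashing a second file with the same name does not overwrite the first.

```shell
# Hypothetical helper; TRASH_DIR and the name "trash" are illustrative assumptions.
trash() {
    dest="${TRASH_DIR:-./trash}"
    mkdir -p -- "$dest" || return 1
    # One new subdirectory per call, so a later call never overwrites an earlier one.
    batch=$(mktemp -d "$dest/$(date +%Y%m%dT%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$batch"/
}
```

Usage would be `trash file_to_delete`. Nothing in this sketch ever empties the trash; that stays a separate, deliberate step.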
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
5. Re:I feel that lone sysadmin's pain by stridebird · 2017-01-31 21:08 · Score: 5, Informative
  
  Correct pattern is:
  > cd /home && rm ...
  i.e., don't run rm unless cd worked.
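The same pattern can be wrapped in a function, as a rough sketch. The name `rm_in` is hypothetical. The subshell keeps the caller's working directory unchanged, and `&&` ensures that `rm` never runs if `cd` fails.

```shell
# Hypothetical helper: delete names relative to a directory, only if cd succeeds.
rm_in() {
    dir=$1
    shift
    # Parentheses run cd and rm in a subshell; the caller's cwd is untouched.
    ( cd -- "$dir" && rm -rf -- "$@" )
}
```

For example, `rm_in /home user1 user2 user3` removes only entries under `/home`, and a mistyped directory makes the whole command fail without deleting anything.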
6. Re: I feel that lone sysadmin's pain by gsslay · 2017-01-31 23:10 · Score: 5, Insightful
  
  This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse; on someone else's server.
  
  Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
7. Re: I feel that lone sysadmin's pain by rgbatduke · 2017-02-01 00:10 · Score: 5, Interesting
  
  Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.
  With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.
  Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...
  
  --
  Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
8. Re: I feel that lone sysadmin's pain by Joce640k · 2017-02-01 00:22 · Score: 5, Insightful
  
  "I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!"
  This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.
  If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.
  
  --
  No sig today...
Repeat after me (and others) by Nkwe · 2017-01-31 19:12 · Score: 5, Interesting

If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
1. Re:Repeat after me (and others) by MatthiasF · 2017-01-31 20:24 · Score: 5, Informative
  
  Uh, did you read the article?
  
  The six-hour-old snapshot was a fluke, a manually run LVM snapshot; normally they are taken every 24 hours. The SQL dumps weren't running at all because of a misconfiguration, producing tiny files and failing silently. Webhooks will need to be rolled back to the 24-hour backup, since a synchronization process removed them from the 6-hour one (meaning at best 18 hours of updates will have no webhooks, and at worst all 24 hours). Lastly, the replication of their backups from Microsoft's Azure to Amazon's S3, for what I assume is vendor-agnostic redundancy, has sent no files at all ("the bucket is empty").
  
  It's like they thought out everything but never made sure any of it was working.
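A silently failing dump of the kind described above is cheap to detect. The following is a minimal sketch with assumed names and thresholds: `check_backup` is hypothetical, and the size and age limits are whatever the operator chooses. It only catches missing, tiny, or stale files; it is not a substitute for an actual restore test.

```shell
# Hypothetical check; the function name and thresholds are illustrative assumptions.
check_backup() {
    file=$1
    min_bytes=$2
    max_age_min=$3
    if [ ! -f "$file" ]; then
        echo "MISSING: $file" >&2
        return 1
    fi
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "TOO SMALL: $file is $size bytes" >&2
        return 1
    fi
    # find prints the path only if it was modified within the last max_age_min minutes.
    if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
        echo "STALE: $file is older than $max_age_min minutes" >&2
        return 1
    fi
}
```

Run from cron after each dump, a call such as `check_backup /path/to/dump.sql 1000000 1500` (path and numbers invented for this example) would exit non-zero for an empty or day-old file, which is enough to trigger an alert.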
2. Re:Repeat after me (and others) by dbIII · 2017-01-31 20:30 · Score: 5, Insightful
  
  "Uh, did you read the article?"
  No, and I got the wrong impression from skimming the article.
  You are correct and I am not.
3. Re:Repeat after me (and others) by Opportunist · 2017-01-31 20:36 · Score: 5, Funny
  
  Sing with me, kids:
  One backup in my bunk
  One backup in my trunk
  One backup at the town's other end
  One backup on another continent
  All of them tested and verified sane
  now go to bed, you can sleep once again
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
4. Re:Repeat after me (and others) by tonymercmobily · 2017-01-31 21:16 · Score: 5, Insightful
  
  "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
  OK, now that I have repeated it, let me add.
  As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.
  THEN you will see, for real, how your company reacts to real disasters.
  The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.
  Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.
  http://www.datacenterknowledge...
  Merc.
5. Re:Repeat after me (and others) by cdrudge · 2017-02-01 00:37 · Score: 4, Insightful
  
  "they are ISO-over-9000 and that is good enough for us"
  Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:
  1. Perform backup
  2. Pray nothing goes wrong.
  Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it like the standards require.
Re:Don't use rm! by infolation · 2017-01-31 19:58 · Score: 5, Funny

"Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?"
Don't tell the customer anything!! Geez... What's with these semi-pros?
If only there was another copy of the repo by HxBro · 2017-01-31 20:28 · Score: 5, Funny

Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
Unfortunately common by CustomSolvers2 · 2017-01-31 20:56 · Score: 4, Interesting

I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not ideal, but within the kind of errors that should be expected): the fact that these files could be deleted at all.

If your whole business is about handling the data that a large number of users generate, you should (after making completely sure that your backup system is rock solid) restrict access to such valuable information as much as possible, not just to avoid unintended deletions but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after the whole system has been developed; for example, granting read-only access by default and write access only where strictly required, such as to high-level admin personnel (who can use these credentials only after passing a further validation step) or to automated applications (whose credentials are regularly regenerated and known to nobody).

These problems usually arise when a system is developed or run without the full focus being on technical aspects and on what is best for it from a technical perspective. They shouldn't exist at all when everything is done properly at each stage, from development to deployment, administration, general policies, etc.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
Re:Made this mistake once... by serviscope_minor · 2017-01-31 21:00 · Score: 4, Funny

Good choice. But, I always use this prompt:
PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\\([^\\]\\{6\\}\\)[^\\]\\{2,\\}/\\1~1/g" ) >'

--
SJW n. One who posts facts.
All my sympathy... by Gumbercules!! · 2017-01-31 22:23 · Score: 4, Insightful

I don't care if this is a mistake and screw-up of their own making (and it is, on every level) - if you've ever worked as a sysadmin you have got to feel for these guys.
Six hours of loss is a "melt-down"? by thesandbender · 2017-02-01 01:53 · Score: 4, Insightful

Editors. I understand that any loss is bad but holy hyperbole, Batman... the title reads like a nuke was dropped on GitLab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".