GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

Posted by BeauHD on Tuesday January 31, 2017 @07:00PM from the put-in-a-hard-day's-work dept.

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

34 of 356 comments

Yawn... by Anonymous Coward · 2017-01-31 19:04 · Score: 5, Insightful

No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
1. Re: Yawn... by Nutria · 2017-01-31 20:35 · Score: 5, Funny
  
  paki chimps in jungle
  Someone failed geography class...
  
  --
  "I don't know, therefore Aliens" Wafflebox1
2. Re: Yawn... by Anonymous Coward · 2017-01-31 21:16 · Score: 3, Insightful
  
  No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.
  He's just a racist.
3. Re: Yawn... by Qzukk · 2017-02-01 06:02 · Score: 3, Interesting
  
  There are two levels of redundancy. There's "oh my god the database server is on fire! Promote the replicated server to master and failover!" which, depending on the database, should take a few seconds to perform manually. Testing automation for this (pull the plug and see what happens) depends on your setup and on how, and how quickly, your heartbeat decides that the server is dead. (If we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds; it takes more than one failed connection for automation to decide the server is down.)
  Then, there's "oh my god the datacenter is on fire!". This is what people usually call "Disaster Recovery". One dead server isn't a disaster when you have failovers, but when your entire datacenter is dead, THAT's a disaster. It's tough as nails to automate too, since without having at least three datacenters, it's inherently a split-brain issue. If Datacenter A stops responding to Datacenter B, which one is actually down? If you aren't an AS and can't just republish your IPs at Datacenter B with a BGP routing change, that means you're going to have to publish new DNS records and wait one TTL for everyone to see them. If you had an authoritative DNS server at Datacenter A, then hopefully it was able to recognize that it's down and shot itself (or at least updated its zone files with B's IPs), or you can somehow get to it and kill it; otherwise, when Datacenter A comes back online, it'll be serving up A's IPs again and conflict with the other DNS server. This also sets aside replicating your data between datacenters and how much of that is lost when you switch back and forth.
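
The point about needing more than one failed connection before declaring a server dead can be sketched as a small shell function. This is only an illustration, assuming a POSIX shell: the name is_down, its argument order, and the idea of passing the health check as a command are all made up here and do not come from any particular failover tool.

```shell
# is_down THRESHOLD INTERVAL_SECONDS CHECK_CMD [ARGS...]
# Hypothetical helper: returns 0 ("down") only after THRESHOLD
# consecutive failures of CHECK_CMD; a single success returns 1 ("alive").
is_down() {
    threshold=$1 interval=$2
    shift 2
    failures=0
    while [ "$failures" -lt "$threshold" ]; do
        # Run the caller-supplied health check, discarding its output.
        if "$@" >/dev/null 2>&1; then
            return 1
        fi
        failures=$((failures + 1))
        # Wait between attempts, but not after the last one.
        if [ "$failures" -lt "$threshold" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

A call might look like `is_down 5 2 pg_isready -h db1 && promote_replica`, where the host name db1 and the promote_replica step are placeholders for whatever check and failover action a given setup uses.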
  
  --
  If I have been able to see further than others, it is because I bought a pair of binoculars.
I feel that lone sysadmin's pain by sixdrum · 2017-01-31 19:06 · Score: 5, Insightful

A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

rm -rf /home/user1 /home/user2 /home/ user3

Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
1. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 19:28 · Score: 5, Insightful
  
  That's why you always always run ls first.
  ls -ld /home/user1 /home/user2 /home/ user3
  Then edit the command to rm. Always.
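
The ls-first habit can also be wrapped in a function so that the listing and the deletion always see the same arguments. A minimal sketch, assuming a POSIX shell; confirm_rm is a made-up name, not a standard tool.

```shell
# confirm_rm PATH...
# Hypothetical helper: lists every target with ls -ld, then deletes
# only after the user types "yes".
confirm_rm() {
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_rm PATH..." >&2
        return 2
    fi
    # Show exactly what the shell expanded the arguments to.
    # ls fails if any target is missing, which catches a stray "user3".
    ls -ld -- "$@" || {
        echo "aborting: some targets could not be listed" >&2
        return 1
    }
    printf 'Delete the %d path(s) listed above? Type yes to continue: ' "$#"
    read -r answer
    if [ "$answer" = "yes" ]; then
        rm -rf -- "$@"
    else
        echo "aborted, nothing deleted" >&2
        return 1
    fi
}
```

With the typo from the parent post, `confirm_rm /home/user1 /home/user2 /home/ user3` would stop at the listing step whenever there is no `user3` in the current directory, and would otherwise show `/home/` as a target before asking.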
2. Re:I feel that lone sysadmin's pain by mmell · 2017-01-31 19:35 · Score: 5, Interesting
  
  Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).
3. Re:I feel that lone sysadmin's pain by arglebargle_xiv · 2017-01-31 20:04 · Score: 4, Funny
  
  Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
  Actually: Check your privilege! (Especially if rm -rf is involved).
4. Re:I feel that lone sysadmin's pain by AmiMoJo · 2017-01-31 21:07 · Score: 4, Interesting
  
  mkdir ./trash
  mv file_to_delete ./trash
  If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
5. Re:I feel that lone sysadmin's pain by stridebird · 2017-01-31 21:08 · Score: 5, Informative
  
  Correct pattern is:
  > cd /home && rm ...
  i.e., don't run rm unless cd worked.
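
The same pattern can be kept in a function for scripts. A minimal sketch, assuming a POSIX shell; clean_dir is a hypothetical name used only for illustration.

```shell
# clean_dir DIR NAME...
# Hypothetical helper illustrating the "cd && rm" pattern.
# The subshell keeps the caller's working directory unchanged.
clean_dir() {
    dir=$1
    shift
    (
        # If cd fails, && short-circuits and rm never runs,
        # so nothing is deleted relative to the wrong directory.
        cd -- "$dir" && rm -rf -- "$@"
    )
}
```

Because the names are resolved relative to DIR, a mistyped or missing directory produces an error from cd instead of a deletion somewhere else.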
6. Re:I feel that lone sysadmin's pain by Angstroem · 2017-01-31 22:45 · Score: 3, Insightful
  
  The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P
  Then you have never worked on a repository with users of TortoiseSVN and the likes.
  "Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."
7. Re: I feel that lone sysadmin's pain by gsslay · 2017-01-31 23:10 · Score: 5, Insightful
  
  This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse; on someone else's server.
  
  Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
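
A separately named trash command along these lines might look like the following. It is a rough sketch, assuming a POSIX shell plus a mktemp that accepts a template (GNU and BSD versions do); the TRASH_DIR variable and its ~/.trash default are assumptions made for this example.

```shell
# rm2 PATH...
# Hypothetical "move to trash" helper; it never shadows the real rm.
# TRASH_DIR is an assumed variable, defaulting to ~/.trash.
rm2() {
    trash_root=${TRASH_DIR:-"$HOME/.trash"}
    mkdir -p -- "$trash_root" || return 1
    # One fresh subdirectory per call, so two files with the same
    # name (e.g. two READMEs) never overwrite each other in the trash.
    batch=$(mktemp -d "$trash_root/$(date +%Y%m%dT%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$batch"/
}
```

On a machine where the function is not defined, typing rm2 simply fails with "command not found", which is the safe failure mode described above. Note that mv across filesystems copies the data, so this can be slow for large trees.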
8. Re: I feel that lone sysadmin's pain by rgbatduke · 2017-02-01 00:10 · Score: 5, Interesting
  
  Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.
  With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.
  Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...
  
  --
  Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
9. Re: I feel that lone sysadmin's pain by Joce640k · 2017-02-01 00:22 · Score: 5, Insightful
  
  I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
  This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.
  If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.
  
  --
  No sig today...
Repeat after me (and others) by Nkwe · 2017-01-31 19:12 · Score: 5, Interesting

If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
1. Re:Repeat after me (and others) by MatthiasF · 2017-01-31 20:24 · Score: 5, Informative
  
  Uh, did you read the article?
  
  The six-hour-old snapshot was a fluke manual LVM snapshot run; normally they are taken every 24 hours. The SQL dumps weren't running at all because of misconfiguration, producing tiny little files and failing silently. Webhooks will need to be rolled back to the 24-hour backup since they were removed in the 6-hour one because of a synchronization process (meaning at best 18 hours of updates will have no webhooks, but possibly all 24 hours at worst). Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3, for what I assume is vendor-agnostic redundancy, has sent no files at all ("the bucket is empty").
  
  It's like they thought out everything but never made sure any of it was working.
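
Silent failures like the tiny dump files are the kind of thing a basic sanity check can catch. The following is a minimal sketch, not anything GitLab is known to run: check_backup and its thresholds are hypothetical, and it relies on find's -mmin option, which GNU and BSD find provide but POSIX does not require.

```shell
# check_backup FILE MIN_BYTES MAX_AGE_MINUTES
# Hypothetical helper: fails loudly when a backup file is missing,
# suspiciously small, or older than expected.
check_backup() {
    file=$1 min_bytes=$2 max_age_min=$3
    if [ ! -f "$file" ]; then
        echo "FAIL: $file does not exist" >&2
        return 1
    fi
    # Arithmetic expansion strips the padding some wc versions print.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is only $size bytes (expected at least $min_bytes)" >&2
        return 1
    fi
    # find prints the file only if it was modified within the last
    # max_age_min minutes; empty output means the backup is stale.
    if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
        echo "FAIL: $file is older than $max_age_min minutes" >&2
        return 1
    fi
    echo "OK: $file ($size bytes)"
}
```

A check like this only proves that a plausible-looking file exists. It is no substitute for actually restoring the dump somewhere and querying it.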
2. Re:Repeat after me (and others) by dbIII · 2017-01-31 20:30 · Score: 5, Insightful
  
  Uh, did you read the article?
  No, and I got the wrong impression from skimming the article.
  You are correct and I am not.
3. Re:Repeat after me (and others) by Opportunist · 2017-01-31 20:36 · Score: 5, Funny
  
  Sing with me, kids:
  One backup in my bunk
  One backup in my trunk
  One backup at the town's other end
  One backup on another continent
  All of them tested and verified sane
  now go to bed, you can sleep once again
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
4. Re:Repeat after me (and others) by tonymercmobily · 2017-01-31 21:16 · Score: 5, Insightful
  
  "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
  OK, now that I have repeated it, let me add.
  As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.
  THEN you will see, for real, how your company reacts to real disasters.
  The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.
  Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.
  http://www.datacenterknowledge...
  Merc.
5. Re:Repeat after me (and others) by cdrudge · 2017-02-01 00:37 · Score: 4, Insightful
  
  "they are ISO-over-9000 and that is good enough for us"
  Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:
  1. Perform backup
  2. Pray nothing goes wrong.
  Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it as the standards require.
6. Re:Repeat after me (and others) by Zontar+The+Mindless · 2017-02-01 01:08 · Score: 3, Funny
  
  Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)
  (A: "Too many backups".)
  --Actual question from the final exam for the Networking 100 class I took in 1998.
  
  --
  Il n'y a pas de Planet B.
7. Re:Repeat after me (and others) by TheRaven64 · 2017-02-01 02:47 · Score: 3, Informative
  
  Please mod the parent up. After the uptick in trolling and invective in the last couple of months, this post is a breath of fresh air around here.
  
  --
  I am TheRaven on Soylent News
8. Re:Repeat after me (and others) by Notabadguy · 2017-02-01 03:39 · Score: 3, Funny
  
  Look at his user ID. Give him time, he'll come around.
Re:Don't use rm! by infolation · 2017-01-31 19:58 · Score: 5, Funny

Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Don't tell the customer anything!! Geez... What's with these semi-pros?
Re:Don't use rm! by Darinbob · 2017-01-31 20:07 · Score: 3, Interesting

Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.
This is why sysadmins were created, because the people actually using the computers didn't want to manage them.
How can this keep happening? by Anonymous Coward · 2017-01-31 20:17 · Score: 3, Interesting

I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?
http://jefferai.org/2013/03/29/distillation/
When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.
1. Re:How can this keep happening? by Entrope · 2017-02-01 00:46 · Score: 3, Informative
  
  KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.
  If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.
If only there was another copy of the repo by HxBro · 2017-01-31 20:28 · Score: 5, Funny

Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
Re:Don't use rm! by sodul · 2017-01-31 20:43 · Score: 3, Insightful

Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is that DevOps are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos-driven, with anything goes, since Sales promised a feature to a prospective customer yesterday, every week.
Unfortunately common by CustomSolvers2 · 2017-01-31 20:56 · Score: 4, Interesting

I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not exactly ideal, but within the kind of errors which should be expected): allowing these files to be deleted at all.

If your whole business is about dealing with the data which a large number of users generate, you should (after having made completely sure that your backup system is rock solid) restrict access to such valuable information as much as possible; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after the whole system has been developed; for example, giving read-only access by default and reserving write access for cases where it is strictly required, such as high-level admin personnel (who can use these credentials only after passing through a further validation step) or automated applications (whose credentials are regularly regenerated and known to nobody).

These problems usually arise when a system is developed or run without putting the whole focus on the technical aspects, i.e., on what is best for it from a technical perspective. They shouldn't exist at all when everything is done properly at each stage, from development to deployment, administration, general policies, etc.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
Re:Made this mistake once... by serviscope_minor · 2017-01-31 21:00 · Score: 4, Funny

Good choice. But, I always use this prompt:
PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\$[^\\]\\{6\\}\$[^\\]\\{2,\\}/\\1~1/g" ) >'

--
SJW n. One who posts facts.
All my sympathy... by Gumbercules!! · 2017-01-31 22:23 · Score: 4, Insightful

I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
And that is why you run BCM and recovery tests by gweihir · 2017-01-31 22:36 · Score: 3, Interesting

Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.
Of course, in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Six hours of loss is a "melt-down"? by thesandbender · 2017-02-01 01:53 · Score: 4, Insightful

Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".