Slashdot Mirror


GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

6 of 356 comments

  1. Re:Repeat after me (and others) by dbIII · · Score: 2, Informative

    Good advice, but the headline above is misleading. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
    It is a good example of why replication is not a backup: it is often just a way to mirror mistakes.

  2. Re:Repeat after me (and others) by MatthiasF · · Score: 5, Informative

    Uh, did you read the article?

    The six-hour-old snapshot was a fluke: a manually run LVM snapshot. Normally they are taken every 24 hours. The SQL_dumps weren't running at all because of a misconfiguration, producing tiny files and failing silently. Webhooks will need to be rolled back to the 24-hour backup, since a synchronization process removed them from the 6-hour one (meaning at best 18 hours of updates will have no webhooks, and at worst all 24 hours). Lastly, their replication of backups from Microsoft's Azure to Amazon's S3, which I assume is for vendor-agnostic redundancy, has sent no files at all ("the bucket is empty").

    It's like they thought out everything but never made sure any of it was working.
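
The silent failure described above (dumps that ran but produced tiny files) is the kind of thing a small post-backup check can catch. A minimal sketch, assuming a dump lands at a known path: the function name check_backup and all three thresholds are hypothetical, not anything GitLab used.

```shell
#!/bin/sh
# check_backup FILE MIN_BYTES MAX_AGE_MINUTES
# Succeeds only if FILE exists, is at least MIN_BYTES long, and was
# modified within the last MAX_AGE_MINUTES. The name and thresholds are
# illustrative assumptions; pick values that fit your own dumps.
check_backup() {
    file=$1 min_bytes=$2 max_age_min=$3

    if [ ! -f "$file" ]; then
        echo "FAIL: $file does not exist" >&2
        return 1
    fi

    # Some wc implementations pad the count with spaces, so strip them.
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is only $size bytes (expected >= $min_bytes)" >&2
        return 1
    fi

    # find prints the path only if it was modified less than
    # max_age_min minutes ago; empty output means the file is stale.
    if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
        echo "FAIL: $file is older than $max_age_min minutes" >&2
        return 1
    fi

    echo "OK: $file ($size bytes)"
}
```

Run something like this right after the dump job and alert on a non-zero exit. A size and age check only proves a file was written recently; it does not prove the backup restores, so a periodic test restore is still needed.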

  3. Re:I feel that lone sysadmin's pain by stridebird · · Score: 5, Informative

    Correct pattern is:
    > cd /home && rm ...

    i.e., don't run rm unless cd worked.
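
The same guard can be wrapped in a function. A minimal sketch: clean_dir is a hypothetical name, and the empty-argument check is an addition to the pattern above, because `cd ""` can succeed without changing directory in some shells.

```shell
#!/bin/sh
# clean_dir DIR: delete the contents of DIR, but only if cd into DIR works.
# clean_dir is a hypothetical helper name used for illustration.
clean_dir() {
    # Refuse an empty argument outright.
    [ -n "$1" ] || { echo "usage: clean_dir DIR" >&2; return 1; }

    # The subshell keeps the caller's working directory unchanged.
    # "&&" short-circuits: if cd fails, rm never runs.
    # Note that ./* does not match dotfiles.
    ( cd -- "$1" && rm -rf -- ./* )
}
```

If the directory is missing or mistyped, cd prints an error and the function returns non-zero without deleting anything. It does nothing for a valid directory on the wrong server, which is what happened here.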

  4. Re: I feel that lone sysadmin's pain by saloomy · · Score: 2, Informative

    I usually have a /trash directory on my Linux servers. I have renamed the rm command to "removed" and written a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from /trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
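
A rough sketch of that kind of setup, not the poster's actual script. The function names trash and purge_trash are hypothetical, and it keeps the real rm in place instead of renaming it. Because mv preserves a file's own modification time, each call moves its arguments into a freshly created batch directory, and the purge ages batches by the directory's timestamp. The -mindepth and -maxdepth options are GNU/BSD find extensions.

```shell
#!/bin/sh
# TRASH_DIR defaults to /trash as in the comment above; override as needed.
TRASH_DIR=${TRASH_DIR:-/trash}

# trash PATH...: move each path into a new timestamped batch directory
# under TRASH_DIR instead of deleting it. (Hypothetical helper name.)
trash() {
    [ "$#" -gt 0 ] || { echo "usage: trash PATH..." >&2; return 1; }
    mkdir -p "$TRASH_DIR" || return 1
    batch=$(mktemp -d "$TRASH_DIR/$(date +%Y%m%dT%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$batch"/
}

# purge_trash [MINUTES]: permanently delete batches whose directory was
# last modified more than MINUTES ago (default 2880 minutes = 48 hours).
purge_trash() {
    find "$TRASH_DIR" -mindepth 1 -maxdepth 1 -type d \
        -mmin "+${1:-2880}" -exec rm -rf -- {} +
}
```

purge_trash is the part a cron job would call. As the poster notes, this only helps while there is disk space to spare, and it would not have saved a 300GB directory unless /trash had room for it on the same filesystem.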

  5. Re:How can this keep happening? by Entrope · · Score: 3, Informative

    KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.

    If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.

  6. Re:Repeat after me (and others) by TheRaven64 · · Score: 3, Informative

    Please mod the parent up. After the uptick in trolling and invective in the last couple of months, this post is a breath of fresh air around here.

    --
    I am TheRaven on Soylent News