GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)
An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:
/home/user1 /home/user2 /home/ user3
rm -rf
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Don't tell the customer anything!! Geez... What's with these semi-pros?
Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not precisely ideal, but within the kind of errors which should be expected): allowing to delete these files at all.
If your whole business is about dealing with the data which a big number of users generate at any point, you should (after having made completely sure that your backup system is rock solid) restrict as much as possible the access to such valuable information; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after having developed the whole system; for example, giving read-only access unless strictly required like high-level admin personnel (who can use these credentials only after passing through a further validation step) or automated applications (whose credentials are regularly generated and nobody knows).
These problems are usually provoked when developing/dealing with a system without putting the whole focus on technical aspects/what is best for it from a technical perspective. They shouldn't exist at all when doing everything properly at each stage from development to deployment, administration, general policies, etc.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
Good choice. But, I always use this prompt:
PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\\([^\\]\\{6\\}\\)[^\\]\\{2,\\}/\\1~1/g" ) >'
SJW n. One who posts facts.
I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".