GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)
An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:
rm -rf /home/user1 /home/user2 /home/ user3
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Now, if you'll excuse me, I have backups to corrupt.
Two things:
1. Test your backups
2. TEST your BACKUPS!
The first sentence is true. The second one only achieves "should be true" status.
Recycle bin, restore.
Github is not Gitlab
At least it wasn't github.com.
So, it didn't break the Internet.
And practically everything else.
I've made this mistake, deleted all attachments on a live system once.
After this, I made all the prompts for critical servers a different color:
export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\] '
I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?
http://jefferai.org/2013/03/29/distillation/
When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.
Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
libtard
I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not precisely ideal, but within the kind of errors which should be expected): allowing these files to be deleted at all.
If your whole business is about handling the data that a large number of users generate, you should (after making completely sure that your backup system is rock solid) restrict access to such valuable information as much as possible, not just to avoid unintended deletions but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after the whole system has been developed; for example, granting read-only access by default and write access only where strictly required, such as to high-level admin personnel (who can use these credentials only after passing a further validation step) or to automated applications (whose credentials are regenerated regularly and known to nobody).
These problems usually arise when a system is developed or run without the full focus being on its technical aspects and on what is best for it from a technical perspective. They shouldn't exist at all when everything is done properly at each stage, from development to deployment, administration, general policies, etc.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
That's why you always always run ls first.
ls -ld /home/user1 /home/user2 /home/ user3
Then edit the command to rm. Always.
Or you use scripts.
somescript user1 user2 user3
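A minimal sketch of what such a script might look like, assuming home directories sit directly under one base directory. The function name `retire_homes`, the `HOME_BASE` and `TRASH_DIR` variables, and the trash location are all hypothetical, not anything the posters actually ran. It takes bare user names rather than paths, so a stray space cannot turn into `/home/`, and it moves directories aside (in the spirit of "use mv") instead of deleting them.

```shell
#!/bin/sh
# Hypothetical helper: retire home directories by user name, never by path.
# HOME_BASE and TRASH_DIR are assumptions for this sketch; adjust to taste.
HOME_BASE="${HOME_BASE:-/home}"
TRASH_DIR="${TRASH_DIR:-/home/.retired}"

retire_homes() {
    # Returns 0 only if every name was moved; non-zero if any was skipped.
    status=0
    for user in "$@"; do
        case "$user" in
            ''|.|..|*/*)
                # Reject empty names, dot entries, and anything with a slash.
                echo "refusing suspicious name: '$user'" >&2
                status=1
                continue
                ;;
        esac
        target="$HOME_BASE/$user"
        if [ ! -d "$target" ] || [ -L "$target" ]; then
            echo "skipping $target: not a real directory" >&2
            status=1
            continue
        fi
        mkdir -p "$TRASH_DIR" || return 1
        # Move instead of delete; a later, separate job can purge the trash.
        mv -- "$target" "$TRASH_DIR/$user.$(date +%Y%m%d%H%M%S)" || status=1
    done
    return $status
}
```

Called as `retire_homes user1 user2 user3`, a typo such as an empty argument or `user3/..` is refused rather than expanded into a path, and anything moved by mistake can be moved back.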
goes through this kind of incident. It's part of growing up.
We lost 30TB of data from a flaky NAS a few years back, and we're still here!
(and good luck, Gitlab is great software)
I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.
Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?
Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.
I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and switching into one of those rock band roadie cases and take it to a business with the idea that they would hand over their backup media and then see what happens when they try to restore their data to your equipment.
The customer would provide all software and media, just as they would in a real disaster.
It would eliminate the "we can't restore everything" capacity issue most places have. The fact that the equipment would differ from what they have (even if only by slight model variations) would be the kind of variation likely in a real DR scenario -- if you have to physically replace hardware, it likely won't be the same model you have now.
An option would exist to have/not have the staff participate in the process -- I'm sure many CxOs are curious if their "system" can survive being restored by someone else.
Maybe they could have uploaded any new changes to some sort of online repository.
Allows exactly that, as does, I am told, ZFS.
Assuming you have a large enough disk, and small enough space requirements to leave the room necessary for it, a filesystem solution like either of the above allows exactly this, and barring filesystem corruption due to corner cases or hardware issues, should allow reverting all but the largest PEBKAC issues.
A lesson always learnt the hard way. Those of us who have learnt it the hard way have known the feeling before: I'll trust that this is correct and the feeling after: Shiat!
http://pcblues.com - Digits and Wood
Clearly a case of too much deving and not enough oping.
I see no mention of alarms. With all those failures there must have been some notification.
Chances are they were lost in the nested folders of Outlook with millions of other false-alarm alerts.
They must have a mirrored test zone that they can copy from.
Lesson learned. Test your backups, regularly!
And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.
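The hot-spare idea can be sketched roughly as follows. This is only an illustration: the function name `push_to_spare`, the remote name, and the URL are placeholders, and it assumes an empty repository already exists on the second host. Only what lives in git travels this way; issues and merge requests, which is what the incident above affected, do not.

```shell
#!/bin/sh
# Hypothetical helper: keep a second git host in step with a local clone.
# Usage: push_to_spare /path/to/clone <remote-name> <remote-url>
push_to_spare() {
    repo="$1"; name="$2"; url="$3"
    # Add the remote, or just update its URL if it already exists.
    git -C "$repo" remote add "$name" "$url" 2>/dev/null ||
        git -C "$repo" remote set-url "$name" "$url" || return 1
    # --all pushes every local branch; tags need a separate push.
    git -C "$repo" push --quiet "$name" --all || return 1
    git -C "$repo" push --quiet "$name" --tags
}
```

Something like `push_to_spare ~/src/myproject spare git@example.com:me/myproject.git` (path and URL made up) could be run by hand or from a scheduled job after pushing to the primary host.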
What is frustrating is that, given all the progress in hardware reliability since when I grew up, people take reliability for granted, whereas way-back-when, people who did that learned pretty f'ing quickly that stuff can and does go wrong.
John_Chalisque
That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!
So don't blame the guy, praise him and thank him for saving us all!
WARNING: Smartphones have side effects--most of them undocumented.
Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".
"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
Or, more accurately, fewer than 5 backup/replication techniques were deployed.
I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.
I do not miss sysadmin work so much.
deleting the extra space after periods so i can stay relevant, yeah.
What kind of moron just deploys a new backup strategy then just sits back and trusts their entire infrastructure to it, without ever having actually performed a test recovery?
This is why I think it would be good to keep the tickets in the repository or a second repository. Easy to replicate, easy to keep history, easy to back up.
New things are always on the horizon
Sourceforge is still a thing..?
That made me so insecure that I actually had to check.. :O)
Yes, they are still there.
I love you, but when will you stop sucking?
I preach the same thing every time.
ZFS snapshots.
ZFS Send/Recv to other data centers.
Is it really that hard? That is literally all you have to do. Delete a folder? Copy it from snapshot. Things are more fucked than that? Revert to snapshot. Entire server is nuked? You have 100% replication off-site with snapshotting intact. Don't know how to set it all up? Install FreeNAS and use the built-in web UI for it. No longer are any other excuses viable.
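For anyone who has not seen it, the workflow described above looks roughly like this. It is a sketch under assumptions: the dataset `tank/data`, the host `backup-host`, the target `backup/data`, and the helper names are all made up, and the `run` wrapper only prints commands unless `DRY_RUN=0` is set. Arguments are pasted into a command string unquoted, so it will not cope with names containing spaces.

```shell
#!/bin/sh
# Sketch of a snapshot-and-replicate cycle. Dataset, host, and snapshot names
# are hypothetical. With DRY_RUN=1 (the default here) commands are only printed.
DATASET="${DATASET:-tank/data}"
REMOTE_HOST="${REMOTE_HOST:-backup-host}"
REMOTE_DATASET="${REMOTE_DATASET:-backup/data}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command line; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$*"
    fi
}

snapshot_and_send() {
    prev="$1"   # name of the last snapshot already on the remote side
    new="$2"    # name for the snapshot to take now
    run "zfs snapshot $DATASET@$new"
    # Incremental stream: only blocks changed between prev and new travel.
    run "zfs send -i $DATASET@$prev $DATASET@$new | ssh $REMOTE_HOST zfs recv $REMOTE_DATASET"
}

restore_file() {
    # Snapshots are browsable read-only under .zfs/snapshot at the mountpoint.
    mountpoint="$1"; snap="$2"; relpath="$3"
    run "cp -a $mountpoint/.zfs/snapshot/$snap/$relpath $mountpoint/$relpath"
}

rollback_to() {
    # Discards everything written after the snapshot; -r also destroys
    # any snapshots newer than the target.
    run "zfs rollback -r $DATASET@$1"
}
```

The first replication needs a full `zfs send` without `-i`, which this sketch leaves out. Printing the commands first and only then running them is the same habit as typing `ls` before `rm`.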
To be more precise... test your RESTORE PROCESS.
It is important to not only know that your backups are good, but that your process of restoring them is sound and that you have at least tried it.
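As a rough illustration of exercising the restore path rather than just the backup job, here is a minimal sketch that unpacks a tar archive into a scratch directory and compares it with the live tree. The function name `verify_restore` and its arguments are hypothetical, and tar is only a stand-in for whatever backup tool is really in use. On a busy system the live data will have drifted since the backup was taken, so a real drill would more likely compare against a manifest recorded at backup time.

```shell
#!/bin/sh
# Hypothetical restore drill: unpack a backup somewhere harmless and diff it.
# Usage: verify_restore /path/to/backup.tar /path/to/source_dir
verify_restore() {
    archive="$1"
    source_dir="$2"
    scratch="$(mktemp -d)" || return 2

    # Step 1: the restore itself must succeed, not just the backup.
    if ! tar -xf "$archive" -C "$scratch"; then
        echo "RESTORE FAILED: could not unpack $archive" >&2
        rm -rf "$scratch"
        return 1
    fi

    # Step 2: restored contents must match what we think we backed up.
    # Assumes the archive was made with: tar -cf backup.tar -C "$source_dir" .
    if diff -r "$source_dir" "$scratch" >/dev/null; then
        echo "restore OK: $archive matches $source_dir"
        result=0
    else
        echo "RESTORE MISMATCH: $archive differs from $source_dir" >&2
        result=1
    fi

    rm -rf "$scratch"
    return $result
}
```

Run on a schedule with the exit status wired into whatever alerting is actually read, this also gives a measured answer to "how long does a restore take".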
My beliefs do not require that you agree with them.
That was my first thought. Recovery isn't guaranteed, and the process might be labor intensive. They'd have to notify their customers that their data might be corrupted.
Contribute to civilization: ari.aynrand.org/donate
a fellow admin once wiped a live prod box serving lots of customers - but it was load balanced and we managed to rsync it from the other host but i still remember his face
...then you are not performing backups.
A much simpler way to do it, that won't require you to hack standard system command-line tools, would be to use some copy-on-write or log-structured file system (e.g.: BTRFS, ZFS, etc. depending on your taste), and use snapshots to keep older versions of your file tree.
If anything goes wrong you can still recover from a previous snapshot.
Some Linux distributions (like openSUSE) have tools (like snapper) that can automate this task for you (and openSUSE uses snapper to similarly snapshot system upgrades).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]