GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)
An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:
rm -rf /home/user1 /home/user2 /home/ user3
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Now, if you'll excuse me, I have backups to corrupt.
Two things:
1. Test your backups
2. TEST your BACKUPS!
If you haven't attempted a restore and verify, you don't have backups. Everybody in IT knows this.
Having set up an instance of GitLab CE by myself recently, I must say I'm not necessarily surprised. The whole thing just feels weird from a sysadmin perspective and I get the feeling they lack perspective on backend matters. Poor documentation, weird architecture, a general kludginess. I was going to write it off as the usual divide between free/paid versions, but somehow that doesn't have enough explaining power. Maybe it's just a general problem with rails people, I don't know, but yeah, not surprised by TFA.
No need to host your own stuff!
Recycle bin, restore.
Be careful what you wish for.
cowsay lol | lolcat
I'm shocked, to funded morans will repeatedly fail and you will pay for it
Reading "meltdown", remember that:
1) GitLab competes with Sourceforge
2) Sourceforge and /. are both owned by BizX
There should be some charity or movement defending lone, overworked sysadmins, victims of hawkish and success-hungry startup business leaders without the faintest clue about how complex their technical backends may have become and the effort it takes to run and scale those. Somebody kicking hard these kinds of arrogant kids calling themselves CEO or CTO out of the blue without even having finished college (or, worse, just because "they have an MBA"), who think and convince others that they can roll out an IT service worldwide to a billion users by just three clicks on "the cloud" once and forever, ignoring everything that doesn't sound "business" or "cool", from storage to capacity to backups to monitoring alerts, if any. Good "lessons learned" also come for those greasy, white-haired venture caps to feel a sudden pain in their lower backs and lose some of their peanuts million dollars every once in a while, so they stop thinking you can make money from kids without any proper knowledge or education, who see IT as fancy showbiz and sysadmins and devs as "extras" of which to dispose at will, whilst IT is a science, and technical, educated people should be at its core and its most valuable asset. And be given some time to think to foster creativity, instead of having it killed by the sleep deprivation due to constantly fixing something that was badly rushed to production the night before launch day.
PS the rant is -IN NO WAY- directed towards GitLab, neither referring to this specific case (of which I don't know anything), but more describing a sick industry trend of the last decennia it would be A Good Thing(TM) to start to revert.
At least it wasn't github.com.
So, it didn't break the Internet.
And practically everything else.
I've made this mistake, deleted all attachments on a live system once.
After this, I made all the prompts for critical servers a different color:
export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\]'
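One way to apply this only on critical machines is to pick the prompt from the hostname. This is a minimal sketch for a bash startup file; the function name `prompt_for_host` and the `prod-*` / `db*` naming convention are assumptions for illustration, not anything from the post above. The `\[` and `\]` markers tell bash the color codes take up no width, so long command lines still wrap correctly.

```shell
# Hypothetical ~/.bashrc fragment: red background on hosts that look like production.
# The "prod-*" and "db*" patterns are an assumed naming convention; adjust to taste.
prompt_for_host() {
    case "$1" in
        # Critical host: red background around user@host:dir
        prod-*|db*) printf '%s' '\[\e[41m\]\u@\h:\w\$\[\e[49m\] ' ;;
        # Everything else: plain prompt
        *)          printf '%s' '\u@\h:\w\$ ' ;;
    esac
}

# uname -n prints the node (host) name.
PS1=$(prompt_for_host "$(uname -n)")
```

Keeping the choice in one function means the same startup file can be dropped on every box, and only the ones matching the pattern light up.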
I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?
http://jefferai.org/2013/03/29/distillation/
When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.
Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
Little important should be lost. Developers have full copies of active work.
if they had containerization, agile development, kanban, sprints, ansible, jenkins and all that:
the sysad wouldn't have the time to issue the rm -Rf :-D
Continuous masturbation FTW!
I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not precisely ideal, but within the kind of errors which should be expected): allowing to delete these files at all.
If your whole business is about dealing with the data which a big number of users generate at any point, you should (after having made completely sure that your backup system is rock solid) restrict as much as possible the access to such valuable information; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after having developed the whole system; for example, giving read-only access by default and write access only where strictly required, such as to high-level admin personnel (who can use these credentials only after passing through a further validation step) or automated applications (whose credentials are regularly generated and nobody knows).
These problems are usually provoked when developing/dealing with a system without putting the whole focus on technical aspects/what is best for it from a technical perspective. They shouldn't exist at all when doing everything properly at each stage from development to deployment, administration, general policies, etc.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
That's why you always always run ls first.
ls -ld /home/user1 /home/user2 /home/ user3
Then edit the command to rm. Always.
Or you use scripts.
somescript user1 user2 user3
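Such a script might look like the sketch below. Everything in it is an assumption for illustration: the function name `retire_homes`, the `BASE` and `TRASH` locations, and the rule that a username contains only lowercase letters, digits, `_` and `-`. It takes names rather than paths, checks every argument before touching anything, and uses mv instead of rm so a mistake can still be undone.

```shell
#!/bin/sh
# Hypothetical sketch: retire home directories by user name, never by path.
# BASE and TRASH are assumed locations; override them via the environment.
BASE="${BASE:-/home}"
TRASH="${TRASH:-/home/.retired}"

# Usage: retire_homes user1 user2 user3
retire_homes() {
    # Pass 1: validate every argument before changing anything.
    for user in "$@"; do
        case "$user" in
            ''|*[!a-z0-9_-]*)
                # Empty strings, slashes, dots, spaces etc. are all rejected.
                echo "refusing suspicious name: '$user'" >&2
                return 1 ;;
        esac
        if [ ! -d "$BASE/$user" ]; then
            echo "no such home directory: $BASE/$user" >&2
            return 1
        fi
    done
    # Pass 2: move aside instead of deleting, with a timestamp suffix.
    mkdir -p "$TRASH" || return 1
    for user in "$@"; do
        mv -- "$BASE/$user" "$TRASH/$user.$(date +%Y%m%d%H%M%S)" || return 1
    done
}
```

With this, a stray space that turns one argument into `/home/` and `user3` gets rejected in the first pass, because `/home/` is not a plain name. The retired directories can be purged later, once nobody has asked for them back.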
goes through this kind of incident. It's part of growing up.
We lost 30TB of data from a flaky NAS a few years back, still there !
(and good luck, Gitlab is great software)
I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
Agile backup... fail and fail often.
Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.
Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Our old BOFH at his best.
Having supported numerous telecommunications and computer systems for which more than professional pride was at stake if year-to-year availability of five-nines (99.999%) was not achieved, it is astonishing that any service platform supporting content created by paying customers would NOT have a fully-tested plan and viable backups in place for bare-metal restores for all servers and databases, virtual and otherwise. This incident reveals negligence and failure at GitLab to provide support that should be implicit with the service they offer.
Currently we support several large PostgreSQL instances and have scripts that back up to local storage, as well as to network and cloud storage (S3, etc). Just in case. If backups are NOT successfully created each time scripts run, then DBAs and sysadmins are notified immediately. Furthermore, backup scripts are regularly tested, and all scripts and processes are logged and include detailed log entries that allow for quick identification of failure components, both hardware and software. In 2017, this is not rocket science! Plan, test, and document. Constantly.
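The "notify if the backup was not created" part can be sketched roughly as below. This is not the poster's actual script; `check_backup` and its arguments are hypothetical, and it relies on `find -mmin`, which GNU and BSD find provide but POSIX does not require. It only catches the cheapest failures, a dump that is missing, zero bytes long, or stale. It says nothing about whether the dump can be restored.

```shell
#!/bin/sh
# Hypothetical sketch: fail loudly if the newest dump is missing, empty, or stale.
# check_backup FILE MAX_AGE_MINUTES -> returns 0 only if FILE looks plausible.
check_backup() {
    file="$1"
    max_age="$2"
    # -s is true only for a file that exists and is larger than zero bytes.
    if [ ! -s "$file" ]; then
        echo "BACKUP FAILED: $file is missing or empty" >&2
        return 1
    fi
    # find prints the path only if it was modified less than max_age minutes ago.
    if [ -z "$(find "$file" -mmin "-$max_age" 2>/dev/null)" ]; then
        echo "BACKUP FAILED: $file is older than $max_age minutes" >&2
        return 1
    fi
    return 0
}
```

The non-zero return status is the hook: run it right after the dump job and wire a failure into whatever mail or paging tool is already in place.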
Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?
Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.
I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and switching into one of those rock band roadie cases and take it to a business with the idea that they would hand over their backup media and then see what happens when they try to restore their data to your equipment.
The customer would provide all software and media, just as they would in a real disaster.
It would eliminate the "we can't restore everything" capacity issue most places have, the fact that the equipment would differ from what they have (even if it's only slight model derivations) would be the kind of variation likely in a real DR scenario -- if you have to physically replace hardware, it likely won't be the same model stuff you have now.
An option would exist to have/not have the staff participate in the process -- I'm sure many CxOs are curious if their "system" can survive being restored by someone else.
Maybe they could have uploaded any new changes to some sort of online repository.
Allows exactly that, as does I am told ZFS.
Assuming you have a large enough disk, and small enough space requirements to leave the room necessary for it, a filesystem solution like either of the above allows exactly this, and barring filesystem corruption due to corner cases or hardware issues, should allow reverting all but the largest PEBKAC issues.
A lesson always learnt the hard way. Those of us who have learnt it the hard way have known the feeling before: I'll trust that this is correct and the feeling after: Shiat!
http://pcblues.com - Digits and Wood
take the disk off from active duty and could probably have recovered it straight off - I mean it's not like rm overwrites the actual data......
In every IT organisation the most valuable asset, right before the people who work there, is the Trust of their Customers.
The whole process has been very transparent, perhaps too much, and it is clear that technical disaster has met naivety.
Seriously guys? Empty files of DB dumps?
Clearly a case of too much deving and not enough oping.
I see no mention of alarms. With all those failures there must have been some notification.
Chances are they were lost in the nested folders of outlook with millions of other false alarm alerts.
They must have a mirrored test zone that they can copy from.
Lesson learned. Test your backups, regularly!
Well, users need file versioning like VMS has, so when they overwrite something they can go back to another version. And no, MS Volume Shadow Copy is not good enough, and it slows the system down a LOT. The other thing that all systems need is ZFS, like FreeNAS has (I am sure other OSes, Linux and UNIX, have it as well), so we can just expose the snapshot and copy. To be useful we need an extra file system to copy to, so reads and writes never go random, and we always need lots of slack space.
So why was the admin so concerned about disk space that he used the rm command? My guess was not enough space....
And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.
What is frustrating is that, given all the progress in hardware reliability since when I grew up, people take reliability for granted, whereas way-back-when, people who did that learned pretty f'ing quickly that stuff can and does go wrong.
John_Chalisque
That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!
So don't blame the guy, praise him and thank him for saving us all!
WARNING: Smartphones have side effects--most of them undocumented.
Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".
"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
Or, more accurately, fewer than 5 backup/replication techniques were deployed.
I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.
I do not miss sysadmin work so much.
deleting the extra space after periods so i can stay relevant, yeah.
What kind of moron just deploys a new backup strategy then just sits back and trusts their entire infrastructure to it, without ever having actually performed a test recovery?
This is why I think it would be good to keep the tickets in the repository or a second repository. Easy to replicate, easy to keep history, easy to backup.
New things are always on the horizon
It seldom occurs to command line fanboys that they are always subject to typos, sometimes with serious consequences, no matter how clever they imagine themselves to be.
I love you, but when will you stop sucking?
"a tired sysadmin", an equal part of the blame should probably go to the management team for allowing their staff to be overworked and "tired".
Similar to the NASA disasters, VW emissions scandal, and countless other examples where the management chooses an engineer, developer, or some other so called managerial underlings to take the full force of the blame.
Wouldn't surprise me at all if the management didn't allow for a regular backup process validation due to the "costs" involved, also.
I preach the same thing every time.
ZFS snapshots.
ZFS Send/Recv to other data centers.
Is it really that hard? That is literally all you have to do. Delete a folder? Copy it from snapshot. Things are more fucked than that? Revert to snapshot. Entire server is nuked? You have 100% replication off-site with snapshotting intact. Don't know how to set it all up? Install FreeNAS and use the built-in web UI for it. No longer are any other excuses viable.
Hey, let's take a distributed, naturally fault-tolerant content management system and centralize it. Good job.
Linus said...
Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)
To be more precise... test your RESTORE PROCESS.
It is important to not only know that your backups are good, but that your process of restoring them is sound and that you have at least tried it.
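A rough sketch of what "at least tried it" could mean for a plain file backup follows. The function name `verify_restore` is hypothetical, and it assumes the archive was made with `tar -czf ARCHIVE -C SOURCE_DIR .` so that its contents unpack at the top of the scratch directory. It restores into a throwaway directory and compares the result with the source tree.

```shell
#!/bin/sh
# Hypothetical sketch: restore a tar.gz into a scratch directory and compare
# it with the source tree. Returns 0 only if the restore works and matches.
verify_restore() {
    archive="$1"     # assumed to be created with: tar -czf ARCHIVE -C SOURCE_DIR .
    source_dir="$2"  # the directory the archive is supposed to contain
    scratch=$(mktemp -d) || return 1
    # Both steps must succeed: the extraction itself, and a recursive diff.
    if tar -xzf "$archive" -C "$scratch" 2>/dev/null &&
       diff -r "$source_dir" "$scratch" >/dev/null 2>&1; then
        status=0
    else
        status=1
    fi
    rm -rf -- "$scratch"   # removes only the mktemp scratch directory
    return "$status"
}
```

A tree that changes between backup and check will naturally fail the diff, so for live data something like a checksum recorded at backup time would be the thing to compare against. That part is left out here.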
My beliefs do not require that you agree with them.
Apparently not a single backup was correctly configured or deployed, site wide. 5 different backup systems all failed.
They failed to do their jobs and ignored failures which is another part of their job. Imagine if this was a vendor used in critical life-saving procedures. OOOO I feel for the guy. Fuck you. These fucks did not do their jobs. They don't deserve any sympathy unless they are sending back their paychecks. They got paid to do work they failed to do.
a fellow admin once wiped a live prod box serving lots of customers - but it was load balanced and we managed to rsync it from the other host but i still remember his face
Never allow rm -rf on repos servers.
You're supposed to stage deletes as well. If not your backup strategy is plain stupid.
I remember rm -rf was a big issue back in 1998, appears it still is an issue.
...then you are not performing backups.
A much simpler way to do it, that won't require you to hack standard system command-line tools,
would be to use some copy-on-write or log-structured files system (e.g.: BTRFS, ZFS, etc. depending of your taste),
and use snapshots to keep older versions of your file tree.
If anything goes wrong you can still recover from a previous snapshot.
Some Linux distributions (like openSUSE) have tools (like snapper) that can automate this task for you (and openSUSE uses snapper to similarly snapshot system upgrades).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]