Too Perfect a Mirror
Carewolf writes "Jeff Mitchell writes on his blog about what almost became 'The Great KDE Disaster Of 2013.' It all started as simple update of the root git server and ended up with a corrupt git repository automatically mirrored to every mirror and deleting every copy of most KDE repositories. It ends by discussing what the problem is with git --mirror and how you can avoid similar problems in the future."
Preferably, before using them? This sounds very much like plain old incompetence, possibly coupled with plain old arrogance. Thinking that using a version control system does absolve one from making backups is just plain old stupid. Then, with what I have seen from the KDE project, that would be consistent.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
This is not a problem with git --mirror: rsync or any other mirroring tool would end up in the same situation.
It's up to the master to deliver the goods and upgrading a master should include performing a test run as well as making a backup prior to the real upgrade. This was a procedural failure, not a software failure. But good to hear disaster was averted.
You know, calling it a disaster really depends on your point of view.
Good grief!
After all of that, not a single proposed solution is a proper, rotational backup.
This is what rotational backups are FOR. They let you go back months in time, and even do post-corruption, or post-cracking examination of the machine that went down!
Backups do *not* need to be done to tape, but a mirror or a raid card is NOT a backup. This is actually simple, simple stuff, and it seems like the admins at KDE are a bit wet behind the ears, in terms of backups.
They probably think that because backups used to mean tape, that's old tech, and no one does that.
Not so! Many organizations I admin, and many others I know of, simply do off-site rotational backups using rsync + rotation scripts. This is the key part, copies of the data as it changes over time. You *never* overwrite your backups, EVER.
And with proper rotational backups, only the changed data is backed up, so the daily backup size is not as large as you might think. I doubt the entire KDE git tree changes by even 0.1% every day.
Rotational backups -- works like a charm, would completely prevent any concern or issue with a problem like this, and IT IS WHAT YOU NEED TO BE DOING, ALWAYS!
There is nothing wrong with using the internet as a backup machine - with the caveat that you know what you're doing and you're using the right service/tool properly.
Personally, I have all my very important documents in an encrypted archive labelled "Area_51_Aliens_Proof.rar" with the note "It is dangerous for me to provide the key, but in the event of my death or imprisonment, a key will be provided EXPOSING EVERYTHING!!!" and uploaded to various paranormal bittorrent trackers and mirrored by various denizens of /x/.
I expect my documents to be archived in perpetuity.
--
BMO
The files were corrupted, Git didn't report squat about the problems. The sync got different versions each time. Sure there are two layers of failure here, but one of them certainly is Git.
What he's saying is simple, Torvalds comment is not completely true:
"If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums."
He's saying that if the commits are corrupted:
"If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. "
So there's a clear room for improvement. Sure the fault was a corrupt file, but the second layer of protection, Git's checking, ALSO FAILED. Denial isn't helpful here, Git should also be fixed.
And another amateur-level solution. Does nobody know how to do backups anymore? O.k., here is the very basics of mandatory characteristics of a backup:
- Backup data storage independent of the system being backed up
- Several generation of backups kept for long enough to be absolutely sure you can recover (yes, that can mean years) and frequently enough that loss is acceptable.
- Expect that one backup generation can be faulty and ensure that even then, recovery is possible and data-losses are acceptable.
- Full disaster recovery possible, even if your original system is stolen by aliens.
- Disaster recovery is tested regularly
- Data is verified (full compare or 2-sided crypto-hash compare) on backup
This really is "IT operations 101". Forget about all these halve-ba(c)ked amateur stuff, IT DOES NOT WORK.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Git does not have the magic "integrity check" on making mirrors. If they had bothered to look at the documentation they would have known. If they has thought about it for a second, they would have realized that expensive integrity checks might be switched off on a fast mirror operation. If they had even be a bit careful, they would have checked the documentation and known. They failed in every way possible.
Stop blaming the tool. This is correct and documented behavior. Start blaming the people that messed up badly.
And no, nothing done within the system being backed up is a backup. A backup needs to be stored independent of the system being backed up. Stop spreading nonsense.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
I believe you are not talking about backup. A backup allows system recovery after a disaster and cannot ever be stored in the system itself. What you are talking about is availability improvement. That _can_ be part of the primary system. RAID, for example, exclusively serves this purpose (except RAID0). But backups must also protect against user and administrator error, software errors, the data-center burning down, sabotage, etc.
Replication is not the tool for that. The problem is that any data copy part of the system itself can be corrupted by the system as the system still has access to it. That is why a backup must be both removed from the system so it is independent, and allow full reconstruction, even if the original system is completely destroyed.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does. A backup makes sure you do not lose your data permanently. What uptime improvement does is to make it less likely that you need to go back to the backup.
Or to put it differently, backup is for Disaster Recovery. Uptime improvement is for reducing DR cost reduction by reducing the probability of it becoming necessary and for reducing downtime cost.
I do agree to the political angle though.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
It is UNIX-style design where the user is expected to actually understand what they are doing.
No, it is not, and never was. It is infact the opposite of that. man pages, as one obvious example, are there so people who don't know what they are doing can figure it out. It is designed to be intuitive and provide you with the information needed to get the job done. It was built to have small, simple tools that were easy to understand. They can perform simple tasks on their own or when working together, perform some complex ones ... hence the powerful unix command line. The original UNIX design considered but new, inexperienced users and how to bring them up to speed as well as how to empower users with more knowledge of the system.
What you are referring to is a Linux/OSS attribute, not a UNIX attribute. Linux/OSS developers typically expect the user of the software to be a developer as well. This is the result of everyone scratching their own itch only and most code being written by people for themselves without any consideration of others. No one WANTS to write the things that makes it intuitive or easy for someone else who doesn't understand all the quirks. Obviously this isn't true for some of the paid developers, but the majority of them aren't.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager