Too Perfect a Mirror

← Back to Stories (view on slashdot.org)

Posted by timothy on Sunday March 24, 2013 @01:20AM from the computer-behaving-badly dept.

Carewolf writes "Jeff Mitchell writes on his blog about what almost became 'The Great KDE Disaster Of 2013.' It all started as simple update of the root git server and ended up with a corrupt git repository automatically mirrored to every mirror and deleting every copy of most KDE repositories. It ends by discussing what the problem is with git --mirror and how you can avoid similar problems in the future."

4 of 192 comments (clear)

Min score:

Reason:

Sort:

No Git also failed by Anonymous Coward · 2013-03-24 02:03 · Score: 5, Informative

The files were corrupted, Git didn't report squat about the problems. The sync got different versions each time. Sure there are two layers of failure here, but one of them certainly is Git.
What he's saying is simple, Torvalds comment is not completely true:
"If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums."
He's saying that if the commits are corrupted:
"If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. "
So there's a clear room for improvement. Sure the fault was a corrupt file, but the second layer of protection, Git's checking, ALSO FAILED. Denial isn't helpful here, Git should also be fixed.
Re:delayed update to servers.. by gweihir · 2013-03-24 02:22 · Score: 4, Informative

And another amateur-level solution. Does nobody know how to do backups anymore? O.k., here is the very basics of mandatory characteristics of a backup:
- Backup data storage independent of the system being backed up
- Several generation of backups kept for long enough to be absolutely sure you can recover (yes, that can mean years) and frequently enough that loss is acceptable.
- Expect that one backup generation can be faulty and ensure that even then, recovery is possible and data-losses are acceptable.
- Full disaster recovery possible, even if your original system is stolen by aliens.
- Disaster recovery is tested regularly
- Data is verified (full compare or 2-sided crypto-hash compare) on backup
This really is "IT operations 101". Forget about all these halve-ba(c)ked amateur stuff, IT DOES NOT WORK.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:But it is SUPPOSED to by gweihir · 2013-03-24 02:44 · Score: 4, Informative

Git does not have the magic "integrity check" on making mirrors. If they had bothered to look at the documentation they would have known. If they has thought about it for a second, they would have realized that expensive integrity checks might be switched off on a fast mirror operation. If they had even be a bit careful, they would have checked the documentation and known. They failed in every way possible.
Stop blaming the tool. This is correct and documented behavior. Start blaming the people that messed up badly.
And no, nothing done within the system being backed up is a backup. A backup needs to be stored independent of the system being backed up. Stop spreading nonsense.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:A thousand times. (Unless online mirrors roll b by gweihir · 2013-03-24 03:18 · Score: 4, Informative

I believe you are not talking about backup. A backup allows system recovery after a disaster and cannot ever be stored in the system itself. What you are talking about is availability improvement. That _can_ be part of the primary system. RAID, for example, exclusively serves this purpose (except RAID0). But backups must also protect against user and administrator error, software errors, the data-center burning down, sabotage, etc.
Replication is not the tool for that. The problem is that any data copy part of the system itself can be corrupted by the system as the system still has access to it. That is why a backup must be both removed from the system so it is independent, and allow full reconstruction, even if the original system is completely destroyed.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does. A backup makes sure you do not lose your data permanently. What uptime improvement does is to make it less likely that you need to go back to the backup.
Or to put it differently, backup is for Disaster Recovery. Uptime improvement is for reducing DR cost reduction by reducing the probability of it becoming necessary and for reducing downtime cost.
I do agree to the political angle though.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.