Too Perfect a Mirror

← Back to Stories (view on slashdot.org)

Posted by timothy on Sunday March 24, 2013 @01:20AM from the computer-behaving-badly dept.

Carewolf writes "Jeff Mitchell writes on his blog about what almost became 'The Great KDE Disaster Of 2013.' It all started as simple update of the root git server and ended up with a corrupt git repository automatically mirrored to every mirror and deleting every copy of most KDE repositories. It ends by discussing what the problem is with git --mirror and how you can avoid similar problems in the future."

10 of 192 comments (clear)

Min score:

Reason:

Sort:

No backup of the KDE sources! by Anonymous Coward · 2013-03-24 01:58 · Score: 2, Informative

They had/have no fucking backup! And complain about some git mirror issues. I can't fucking believe it that they can be so stupid.
The solution: MAKE BACKUPS!
No Git also failed by Anonymous Coward · 2013-03-24 02:03 · Score: 5, Informative

The files were corrupted, Git didn't report squat about the problems. The sync got different versions each time. Sure there are two layers of failure here, but one of them certainly is Git.
What he's saying is simple, Torvalds comment is not completely true:
"If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums."
He's saying that if the commits are corrupted:
"If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. "
So there's a clear room for improvement. Sure the fault was a corrupt file, but the second layer of protection, Git's checking, ALSO FAILED. Denial isn't helpful here, Git should also be fixed.
Re:No backups?! by Blymie · 2013-03-24 02:08 · Score: 3, Informative

Git has no rotational backup ability in it. You can't do rotational backups of the machine, on the machine for starters!
ZFS is not a rotational backup as well!
Failure, 101, backups. Go back to school.
Both of the above solutions do not prevent slow corruption, and they do not prevent issues where the machine is suspect. (Yes, ZFS can have bugs). They also do not help if the machine has been hacked into. They don't help if there is a fire, flood, or theft of the local box.
Modern backup methodology has been developed over decades of people suffering JUST THROUGH THIS VERY THING. If you plan to just throw all that away, and pretend everyone doing backups is an idiot -- MAKE SURE YOU KNOW WHAT YOU ARE DOING.
Because -- this very issue would not have been even a tiny concern, if proper, off machine, rotational backups were being done. And, if you aren't going to follow proper backup methodology, then you'd better sit down in a quite place for a few hours, and think of every possible disaster scenario, AND issues with the code you're going to be using for those backups.
Hell, this whole KDE problem started, because the people using it did not even know how git works, 100%! Now, you're suggesting that using another tool, ON THE SAME BOX, is the answer? What will someone miss on ZFS?
No, please, think about this more carefully.
Re:No backups?! by Doc+Hopper · 2013-03-24 02:20 · Score: 3, Informative

I do storage & backup for a living on an extremely large scale. Your post is correct in the main, except for this:

You *never* overwrite your backups, EVER.
You must overwrite tapes if you want to keep media costs reasonable. In our enterprise, we typically use $30,000 T10Kc tape drives with $300 T10K "t2" tapes. Destroyed/broken/worn-out media costs already eat the equivalent of several well-paid sysadmin salaries each year. Adding additional cost for indefinite retention is a huge and unnecessary cost.
Agreed, though, this KDE experience isn't quite like that. Source code repositories commonly have 7-year-retention backups for SLA reasons with customers; most of my work deals with customer Cloud data, which kind of by definition is more ephemeral and we typically only provide 30, 60, or 90-day backups at most, in addition to typical snapshotting & near-line kinds of storage.
No reasonable-cost disk-based storage solution in the world today provides a cost-effective way to store over a hundred petabytes of data on site, available within a couple of hours, and consuming just a trickle of electricity. But if you have a million bucks, a Sun SL8500 silo with 13,000+ tape capacity in the silo will do so. All for the cost of a little extra real-estate, and a power bill that's a tiny fraction of disk-based online storage.
Tape has a vital place in the IT administration world. Ignore this fact to your peril and future financial woes.

--
Matthew P. Barnson
I learn what I think when I read what I write
Re:delayed update to servers.. by gweihir · 2013-03-24 02:22 · Score: 4, Informative

And another amateur-level solution. Does nobody know how to do backups anymore? O.k., here is the very basics of mandatory characteristics of a backup:
- Backup data storage independent of the system being backed up
- Several generation of backups kept for long enough to be absolutely sure you can recover (yes, that can mean years) and frequently enough that loss is acceptable.
- Expect that one backup generation can be faulty and ensure that even then, recovery is possible and data-losses are acceptable.
- Full disaster recovery possible, even if your original system is stolen by aliens.
- Disaster recovery is tested regularly
- Data is verified (full compare or 2-sided crypto-hash compare) on backup
This really is "IT operations 101". Forget about all these halve-ba(c)ked amateur stuff, IT DOES NOT WORK.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Welcome to "rsnapshot" by Anonymous Coward · 2013-03-24 02:28 · Score: 2, Informative

Rsnapshot provides cheap, userland hardlinked rotating snapshots work very well. Simply do the rsnapshots in one location, and three are dozen ways to make the completed, synchronized content accessible for download or other mirrors when the mirror is complete.
The only thing I dislike about it is the often requested, always refused feature of using "daily.YYYYMMDD-HHMMSS" or a similar naming scheme, instead of the rotating "daily.0, daily.1, daily.2" names which are quite prone to rotating in mid-download for anyone accessing the snapshots via NFS or a web browser. The only way you can tell the rotations apart is by the timestamp on the top level directory, and that's very confusing when it rotates out from under you in mid-operations.
Re:But it is SUPPOSED to by gweihir · 2013-03-24 02:44 · Score: 4, Informative

Git does not have the magic "integrity check" on making mirrors. If they had bothered to look at the documentation they would have known. If they has thought about it for a second, they would have realized that expensive integrity checks might be switched off on a fast mirror operation. If they had even be a bit careful, they would have checked the documentation and known. They failed in every way possible.
Stop blaming the tool. This is correct and documented behavior. Start blaming the people that messed up badly.
And no, nothing done within the system being backed up is a backup. A backup needs to be stored independent of the system being backed up. Stop spreading nonsense.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:A thousand times. (Unless online mirrors roll b by gweihir · 2013-03-24 03:18 · Score: 4, Informative

I believe you are not talking about backup. A backup allows system recovery after a disaster and cannot ever be stored in the system itself. What you are talking about is availability improvement. That _can_ be part of the primary system. RAID, for example, exclusively serves this purpose (except RAID0). But backups must also protect against user and administrator error, software errors, the data-center burning down, sabotage, etc.
Replication is not the tool for that. The problem is that any data copy part of the system itself can be corrupted by the system as the system still has access to it. That is why a backup must be both removed from the system so it is independent, and allow full reconstruction, even if the original system is completely destroyed.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does. A backup makes sure you do not lose your data permanently. What uptime improvement does is to make it less likely that you need to go back to the backup.
Or to put it differently, backup is for Disaster Recovery. Uptime improvement is for reducing DR cost reduction by reducing the probability of it becoming necessary and for reducing downtime cost.
I do agree to the political angle though.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:programming != IT by vurian · 2013-03-24 05:16 · Score: 1, Informative

"an organization as large as KDE have backups?" You mean one full-time secretary and a couple of volunteer sysadmins? That's how large KDE's support organization is. How much money do you think KDE has? It is less than 200k euros. That's how large the budget is -- and it has to pay for everything.
response from a core Git developer by nluv4hs · 2013-03-24 08:44 · Score: 3, Informative
Jeff King responded on Git's mailing list:

Jeff King at 2013-03-24 18:31:33 GMT
propagating repo corruption across clone

"So I think at the very least we should:
1. Make sure clone propagates errors from checkout to the final exit code.
2. Teach clone to run check_everything_connected.
"