Too Perfect a Mirror
Carewolf writes "Jeff Mitchell writes on his blog about what almost became 'The Great KDE Disaster Of 2013.' It all started as a simple update of the root git server and ended up with a corrupt git repository automatically mirrored to every mirror and deleting every copy of most KDE repositories. It ends by discussing what the problem is with git --mirror and how you can avoid similar problems in the future."
Preferably, before using them? This sounds very much like plain old incompetence, possibly coupled with plain old arrogance. Thinking that using a version control system does absolve one from making backups is just plain old stupid. Then, with what I have seen from the KDE project, that would be consistent.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
This is not a problem with git --mirror: rsync or any other mirroring tool would end up in the same situation.
It's up to the master to deliver the goods and upgrading a master should include performing a test run as well as making a backup prior to the real upgrade. This was a procedural failure, not a software failure. But good to hear disaster was averted.
You know, calling it a disaster really depends on your point of view.
Good grief!
After all of that, not a single proposed solution is a proper, rotational backup.
This is what rotational backups are FOR. They let you go back months in time, and even do post-corruption, or post-cracking examination of the machine that went down!
Backups do *not* need to be done to tape, but a mirror or a raid card is NOT a backup. This is actually simple, simple stuff, and it seems like the admins at KDE are a bit wet behind the ears, in terms of backups.
They probably think that because backups used to mean tape, that's old tech, and no one does that.
Not so! Many organizations I admin, and many others I know of, simply do off-site rotational backups using rsync + rotation scripts. This is the key part: copies of the data as it changes over time. You *never* overwrite your backups, EVER.
And with proper rotational backups, only the changed data is backed up, so the daily backup size is not as large as you might think. I doubt the entire KDE git tree changes by even 0.1% every day.
Rotational backups -- works like a charm, would completely prevent any concern or issue with a problem like this, and IT IS WHAT YOU NEED TO BE DOING, ALWAYS!
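For anyone who has never seen one, a rotation script is not much code. Here is a minimal sketch in Python, not anyone's production setup: the function name, the daily.N layout and the keep count are all assumptions made for illustration, and it takes a plain full copy each run. A real setup would typically hard-link unchanged files against the previous generation (rsync's --link-dest does that) so only changed data costs space.

```python
import os
import shutil


def rotate_snapshots(root, source, keep=7):
    """Shift daily.N directories under root up by one, then copy source to daily.0.

    Hypothetical helper for illustration only. It makes a full copy each time;
    it does not hard-link unchanged files the way rsync --link-dest would.
    """
    # Drop the oldest generation so the number of snapshots stays bounded.
    oldest = os.path.join(root, f"daily.{keep - 1}")
    if os.path.isdir(oldest):
        shutil.rmtree(oldest)

    # Shift daily.N to daily.N+1, highest N first, so nothing gets overwritten.
    for n in range(keep - 2, -1, -1):
        current = os.path.join(root, f"daily.{n}")
        if os.path.isdir(current):
            os.rename(current, os.path.join(root, f"daily.{n + 1}"))

    # Take the new snapshot. Older generations are never touched again.
    shutil.copytree(source, os.path.join(root, "daily.0"))
```

The point of the design is that a corrupted source only ever lands in daily.0; yesterday's copy is moved aside first, never rewritten.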
There is nothing wrong with using the internet as a backup machine - with the caveat that you know what you're doing and you're using the right service/tool properly.
Personally, I have all my very important documents in an encrypted archive labelled "Area_51_Aliens_Proof.rar" with the note "It is dangerous for me to provide the key, but in the event of my death or imprisonment, a key will be provided EXPOSING EVERYTHING!!!" and uploaded to various paranormal bittorrent trackers and mirrored by various denizens of /x/.
I expect my documents to be archived in perpetuity.
--
BMO
They had/have no fucking backup! And complain about some git mirror issues. I can't fucking believe that they can be so stupid.
The solution: MAKE BACKUPS!
The files were corrupted, Git didn't report squat about the problems. The sync got different versions each time. Sure there are two layers of failure here, but one of them certainly is Git.
What he's saying is simple: Torvalds' comment is not completely true:
"If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums."
He's saying that if the commits are corrupted:
"If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. "
So there's clear room for improvement. Sure the fault was a corrupt file, but the second layer of protection, Git's checking, ALSO FAILED. Denial isn't helpful here, Git should also be fixed.
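To make the "we check checksums" part concrete: a Git object's name is the SHA-1 of a short header plus its content, so corruption is detectable, but only if something actually rehashes the object. Here is a minimal sketch in Python of that check for a single loose object. The function names are made up for illustration, it ignores packfiles entirely, and it is not how Git's own code is organized. The real full check is what git fsck runs.

```python
import hashlib
import zlib


def git_object_id(obj_type, payload):
    """Compute the SHA-1 name Git gives an object: sha1(b"<type> <size>\\0" + payload)."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_is_intact(expected_id, compressed):
    """Return True if a zlib-compressed loose object still hashes to its name.

    Hypothetical helper: `compressed` is the raw file content of one loose
    object, `expected_id` the 40-hex-digit name it is stored under.
    """
    try:
        raw = zlib.decompress(compressed)
    except zlib.error:
        # A damaged or truncated zlib stream is corruption too.
        return False
    return hashlib.sha1(raw).hexdigest() == expected_id
```

Nothing here runs by itself during a mirror clone, which is the whole complaint: the check exists, but a copy that never rehashes the objects will carry the damage along.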
No. Backup is out of scope for version control. Anybody with actual common sense would not expect it to make backups "magically" by itself and check to make sure. Then they would implement backups. But that does actually require said common sense.
And another amateur-level solution. Does nobody know how to do backups anymore? O.k., here are the very basics, the mandatory characteristics of a backup:
- Backup data storage independent of the system being backed up
- Several generations of backups, kept for long enough to be absolutely sure you can recover (yes, that can mean years) and taken frequently enough that loss is acceptable.
- Expect that one backup generation can be faulty and ensure that even then, recovery is possible and data-losses are acceptable.
- Full disaster recovery possible, even if your original system is stolen by aliens.
- Disaster recovery is tested regularly
- Data is verified (full compare or 2-sided crypto-hash compare) on backup
This really is "IT operations 101". Forget about all this half-ba(c)ked amateur stuff, IT DOES NOT WORK.
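The "data is verified" item in the list above is the one people skip most often, so here is a minimal sketch of a two-sided hash compare in Python. The function names are hypothetical and SHA-256 is my choice, not something the list prescribes; it only checks that every source file exists in the backup with identical content.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(full, root)] = digest.hexdigest()
    return digests


def verify_backup(source, backup):
    """Return the sorted relative paths that are missing or differ in the backup."""
    src = tree_digests(source)
    dst = tree_digests(backup)
    return sorted(path for path in src if dst.get(path) != src[path])
```

An empty list means the backup matched at the moment of the check. It says nothing about whether the source itself was already corrupt, which is why the multiple generations are on the list too.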
"Not the fault of git but those that did not bother to find out"
No, Git has the integrity check, the integrity check didn't work. If the integrity check had worked as claimed then their backups were solid.
I know people are saying "keep backups", but they're really missing the point. A backup is a copy of something, the more up to date the better, better still if it keeps a historic set of backups. Perhaps with some sort of software to minimize the size, perhaps only keep changes..... you can see where I'm going with this.
Git sync to a lot of drives IS A BACKUP. It is exactly what an ideal backup should be, historic, up to date, minimizes storage. What is that system if it isn't an automatic backup!
Except for this bug, which needs to be fixed, and a little less faith in git too would also be a good thing.
It's really no different than if you use the backup software, and it made careful backups and kept historic copies, and then one day your disk got corrupted, you promptly went to your backups only to find the backup software had been chomping those because it didn't notice the integrity was corrupt and had happily been corrupting the backups it was keeping.
So I see comments saying they didn't have backups OMG! But no, their problem was they only used ONE TYPE OF BACKUP SOFTWARE: Git sync. I bet all of you use only ONE type of backup software and are equally vulnerable to this failure.
Rsnapshot provides cheap, userland, hardlinked rotating snapshots that work very well. Simply do the rsnapshots in one location, and there are a dozen ways to make the completed, synchronized content accessible for download or other mirrors when the mirror is complete.
The only thing I dislike about it is the often requested, always refused feature of using "daily.YYYYMMDD-HHMMSS" or a similar naming scheme, instead of the rotating "daily.0, daily.1, daily.2" names which are quite prone to rotating in mid-download for anyone accessing the snapshots via NFS or a web browser. The only way you can tell the rotations apart is by the timestamp on the top level directory, and that's very confusing when it rotates out from under you in mid-operations.
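To show what I mean by the naming scheme, here is a minimal sketch in Python. To be clear, this is not an rsnapshot feature or option; the function names and the "daily" prefix are assumptions for illustration. Because the timestamp is zero-padded, a plain string sort is also a chronological sort, so pruning never has to rename a directory somebody is reading from.

```python
import os
import shutil
from datetime import datetime


def snapshot_name(now, prefix="daily"):
    """Build a fixed name like daily.20130324-183133 from a datetime."""
    return f"{prefix}.{now.strftime('%Y%m%d-%H%M%S')}"


def prune_snapshots(root, keep, prefix="daily"):
    """Delete all but the newest `keep` snapshots under root; return removed names."""
    # Zero-padded timestamps sort lexicographically in chronological order.
    names = sorted(n for n in os.listdir(root) if n.startswith(prefix + "."))
    doomed = names[:-keep] if keep > 0 else names
    for name in doomed:
        shutil.rmtree(os.path.join(root, name))
    return doomed
```

A snapshot keeps the same path for its whole life, so a long download only breaks if that particular snapshot is pruned, not on every rotation.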
you ALWAYS have incremental backups on MULTIPLE MEDIUMS.
If you think your Git repositories are your backup, then you need to learn what the word Backup means.
Do not look at laser with remaining good eye.
I believe you are not talking about backup. A backup allows system recovery after a disaster and cannot ever be stored in the system itself. What you are talking about is availability improvement. That _can_ be part of the primary system. RAID, for example, exclusively serves this purpose (except RAID0). But backups must also protect against user and administrator error, software errors, the data-center burning down, sabotage, etc.
Replication is not the tool for that. The problem is that any data copy that is part of the system itself can be corrupted by the system, as the system still has access to it. That is why a backup must both be removed from the system, so it is independent, and allow full reconstruction, even if the original system is completely destroyed.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does. A backup makes sure you do not lose your data permanently. What uptime improvement does is to make it less likely that you need to go back to the backup.
Or to put it differently, backup is for Disaster Recovery. Uptime improvement is for reducing DR cost by reducing the probability of it becoming necessary, and for reducing downtime cost.
I do agree with the political angle though.
They should be backing up daily and, even if not, they should certainly have done a backup before doing a software upgrade.
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
Jeff King at 2013-03-24 18:31:33 GMT
propagating repo corruption across clone
"So I think at the very least we should:
"
Your remark is typically said by the guy who doesn't understand that a project like KDE is not an organization comparable to a Fortune 500 company. It is not a company. There are no employees. There is no significant income. Everything is done by volunteers. Everything. All of it. It is a large open source community, but it is not a company. There is no one responsible for telling anyone what to do. There is no one who said "you have this budget", because there is no budget. This is completely outside your experience. There are no "they" who take care of things -- there is just an "us" -- and if you think your experience can be of use, you can be part of the "us", but you won't be paid, and every bit of hardware and bandwidth you use, you'll have to beg for. And it still works. Isn't that effing amazing?
Very good point. Many, many programmers do not get how to operate IT competently.
Yes. And this is a problem.
It leads to the atrocities that are the Adobe and Apple installers, among other things. Apparently an "application developer" these days doesn't need to trouble himself* with how his priceless treasures actually interact with the operating system they will be installed on. Because that's, like, the IT grunt's job? And anyway isn't some file copies and maybe a few registry hacks just a small matter of scripting, and not really coding at all?
I'd like to dream that one day IT will be taught in computer science courses, with the same level of theoretical abstraction, and given the same kind of functional-programming toolsets that... well, haven't made it into mainstream "software engineering" either... but at least could get us all talking in the same room again. You know, like some lectures about how just tossing a bunch of files into a filesystem is sorta like coding in raw assembler in the 1960s where we had global variables for everything? And maybe couldn't there be a slightly smarter way of organising our lives so that we didn't....? And maybe how we could apply some of that "object oriented" and "functional" stuff that exists inside a running process, to the OS layer? At a slightly finer level of granularity than "spin up an emulated image of an entire server"? And maybe even the network infrastructure guys could have some kind of version control system for all the text config files for their DHCP servers and routers? Pretty please?
Well, not next year. But maybe by 2030?
* Theoretically that could be "herself", except that this level of arrogance/ignorance really does seem to be a uniquely male failure mode. Most females are smarter than to believe that they know everything about subjects they haven't learned.
You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
From my 34 years of constructing, coding and maintaining applications on computers, I learned the hard way the 4 most important points:
1. Backup.
2. Backup.
3. Backup.
4. The rest.
Mundus Vult Decipi