Too Perfect a Mirror
Carewolf writes "Jeff Mitchell writes on his blog about what almost became 'The Great KDE Disaster Of 2013.' It all started as a simple update of the root git server and ended up with a corrupt git repository automatically mirrored to every mirror and deleting every copy of most KDE repositories. It ends by discussing what the problem is with git --mirror and how you can avoid similar problems in the future."
Preferably, before using them? This sounds very much like plain old incompetence, possibly coupled with plain old arrogance. Thinking that using a version control system does absolve one from making backups is just plain old stupid. Then, with what I have seen from the KDE project, that would be consistent.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
This is not a problem with git --mirror: rsync or any other mirroring tool would end up in the same situation.
It's up to the master to deliver the goods and upgrading a master should include performing a test run as well as making a backup prior to the real upgrade. This was a procedural failure, not a software failure. But good to hear disaster was averted.
You know, calling it a disaster really depends on your point of view.
Neither are online mirrors.
A thousand times this. Say it with me - a mirror is not a backup. A RAID mirror is not a backup, a cluster mirror is not a backup, and a git mirror is not a backup.
Unless of course the mirroring system integrates rollback to earlier mirrors, something like Clonebox for example.
Good grief!
After all of that, not a single proposed solution is a proper, rotational backup.
This is what rotational backups are FOR. They let you go back months in time, and even do post-corruption, or post-cracking examination of the machine that went down!
Backups do *not* need to be done to tape, but a mirror or a raid card is NOT a backup. This is actually simple, simple stuff, and it seems like the admins at KDE are a bit wet behind the ears, in terms of backups.
They probably think that because backups used to mean tape, that's old tech, and no one does that.
Not so! Many organizations I admin, and many others I know of, simply do off-site rotational backups using rsync + rotation scripts. This is the key part, copies of the data as it changes over time. You *never* overwrite your backups, EVER.
And with proper rotational backups, only the changed data is backed up, so the daily backup size is not as large as you might think. I doubt the entire KDE git tree changes by even 0.1% every day.
Rotational backups -- works like a charm, would completely prevent any concern or issue with a problem like this, and IT IS WHAT YOU NEED TO BE DOING, ALWAYS!
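For anyone wondering what "rsync + rotation scripts" boils down to, here is a minimal sketch in Python rather than rsync. Everything in it is an assumption for illustration: the function name `take_snapshot`, the `keep` count, and the idea that snapshot names sort chronologically (ISO dates, say). Unchanged files are hardlinked to the previous snapshot, so only changed data costs new disk space. It handles regular files and directories only.

```python
import filecmp
import os
import shutil
from pathlib import Path


def take_snapshot(source: Path, backup_root: Path, name: str, keep: int = 30) -> Path:
    """Copy `source` into backup_root/name, hardlinking files that are
    byte-identical to the previous snapshot. Hypothetical helper, not rsync."""
    backup_root.mkdir(parents=True, exist_ok=True)
    existing = sorted(p for p in backup_root.iterdir() if p.is_dir())
    previous = existing[-1] if existing else None

    target = backup_root / name
    target.mkdir()  # fails if the snapshot already exists: never overwrite

    for src in source.rglob("*"):
        rel = src.relative_to(source)
        dst = target / rel
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
            os.link(old, dst)        # unchanged: share the bytes already stored
        else:
            shutil.copy2(src, dst)   # new or changed: store a fresh copy

    # Rotation: drop the oldest snapshots beyond `keep` (assumes keep >= 1).
    snapshots = sorted(p for p in backup_root.iterdir() if p.is_dir())
    for stale in snapshots[:-keep]:
        shutil.rmtree(stale)
    return target
```

Note that hardlinked snapshots share the same bytes on disk, so they protect against changes at the source, not against the backup disk itself going bad. That is why the off-site copy still matters.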
...someone has been using Internets as a backup machine? :)
"With great power comes great responsibility" - Spider Man, issue #1.
--
BMO
Rollbacks are also not backups.
Practice what you preach.
Common sense would dictate that git manages its own backup automatically anyways, so you don't need additional ones. Well, that didn't work out that great in this case.
Set up servers that get delayed updates, i.e. a 1-day delayed copy, a 1-week delayed copy and perhaps a 1-month delayed copy. Hopefully someone will notice and stop the sync between servers before everything is gone. Even if some part is lost, not all is lost.
They had/have no fucking backup! And complain about some git mirror issues. I can't fucking believe it that they can be so stupid.
The solution: MAKE BACKUPS!
Also, a SCM is not a backup, not even git! Every software can fuck up.
The files were corrupted, Git didn't report squat about the problems. The sync got different versions each time. Sure there are two layers of failure here, but one of them certainly is Git.
What he's saying is simple: Torvalds' comment is not completely true:
"If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums."
He's saying that if the commits are corrupted:
"If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. "
So there's a clear room for improvement. Sure the fault was a corrupt file, but the second layer of protection, Git's checking, ALSO FAILED. Denial isn't helpful here, Git should also be fixed.
No. Backup is out of scope for version control. Anybody with actual common sense would not expect it to make backups "magically" by itself and check to make sure. Then they would implement backups. But that does actually require said common sense.
Isn't that what every major release is called? Except for the "2013" part?
"Not the fault of git but those that did not bother to find out"
No, Git has the integrity check, the integrity check didn't work. If the integrity check had worked as claimed then their backups were solid.
I know people are saying "keep backups", but they're really missing the point. A backup is a copy of something, the more up to date the better, better still if it keeps a historic set of backups. Perhaps with some sort of software to minimize the size, perhaps only keep changes..... you can see where I'm going with this.
Git sync to a lot of drives IS A BACKUP. It is exactly what an ideal backup should be, historic, up to date, minimizes storage. What is that system if it isn't an automatic backup!
Except for this bug, which needs to be fixed, and a little less faith in git too would also be a good thing.
It's really no different than if you use the backup software, and it made careful backups and kept historic copies, and then one day your disk got corrupted, you promptly went to your backups only to find the backup software had been chomping those because it didn't notice the integrity was corrupt and had happily been corrupting the backups it was keeping.
So I see comments saying they didn't have backups OMG! But no, their problem was they only used ONE TYPE OF BACKUP SOFTWARE Git sync. I bet all of you use only ONE type of backup software and are equally vulnerable to this failure.
Rsnapshot provides cheap, userland, hardlinked rotating snapshots that work very well. Simply do the rsnapshots in one location, and there are a dozen ways to make the completed, synchronized content accessible for download or for other mirrors when the mirror is complete.
The only thing I dislike about it is the often requested, always refused feature of using "daily.YYYYMMDD-HHMMSS" or a similar naming scheme, instead of the rotating "daily.0, daily.1, daily.2" names which are quite prone to rotating in mid-download for anyone accessing the snapshots via NFS or a web browser. The only way you can tell the rotations apart is by the timestamp on the top level directory, and that's very confusing when it rotates out from under you in mid-operations.
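To make the naming complaint concrete, here is a rough sketch of the timestamped scheme. This is not something rsnapshot provides (per the above, the feature was refused); `snapshot_name` and `snapshots_to_prune` are hypothetical helpers for a wrapper script. The point is that a name like "daily.YYYYMMDD-HHMMSS" never changes once created, and plain string sorting gives chronological order.

```python
from datetime import datetime


def snapshot_name(prefix: str, when: datetime) -> str:
    """Build a fixed, sortable snapshot name such as daily.20130324-183133."""
    return f"{prefix}.{when.strftime('%Y%m%d-%H%M%S')}"


def snapshots_to_prune(names: list[str], keep: int) -> list[str]:
    """Return the oldest names beyond the `keep` most recent ones.

    Assumes every name shares one prefix, so lexicographic order is
    chronological order.
    """
    ordered = sorted(names)
    return ordered[:-keep] if keep > 0 else ordered
```

Anyone in the middle of a download from "daily.20130324-183133" keeps reading the same snapshot until it is actually deleted, instead of having "daily.1" silently become a different day.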
Could be worse - Unity, Gnome 3, ...
I'm playing this on KDE 4, trying it out. All I really want to do is run Compiz and some other stuff in my highly tuned environment - I use the Desktop Cube, with a transparent desktop, and Cairo Dock. I left KDE back about 6-7 years ago, but right now it's closer to what I want and am used to than anything else. I have Bodhi/Enlightenment running on another machine. It's nice too, but right now I'm like a man without a country.
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
Replicated systems need regular backups too. No shit, sherlock...
If only Linux had a filesystem that checksummed all your data, and checked the checksum at every read. We could call it better FS, or something like that.
you ALWAYS have incremental backups on MULTIPLE MEDIUMS.
If you think your Git repositories are your backup, then you need to learn what the word Backup means.
Do not look at laser with remaining good eye.
"Anyway, you can adjust just your sync scripts to include the fsck and carry on."
And what if the corruption occurs after the fsck and before the sync?
The git sync shouldn't return OK if the commit object is corrupt. It's a bug, it needs to be fixed, no big deal, and there's no reason to defend a simple bug as though it's a feature! Adding an fsck call is a temporary workaround, but for solid faith in git this needs to be fixed.
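For reference, the fsck workaround being discussed could look roughly like this. It is a sketch under assumptions: `sync_if_healthy`, `run_command` and the `sync_argv` parameter are made-up names, and the real KDE sync scripts are not shown anywhere in this thread. It only uses `git -C <dir> fsck --full`, and it does nothing about the window between the fsck and the sync.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(argv), check=False).returncode


def sync_if_healthy(repo_dir: str, sync_argv: Sequence[str],
                    run: Runner = run_command) -> bool:
    """Run the sync command only if `git fsck --full` passes on repo_dir.

    `sync_argv` is whatever command the site uses to mirror (hypothetical).
    Returns True only when both the check and the sync exit with status 0.
    """
    if run(["git", "-C", repo_dir, "fsck", "--full"]) != 0:
        return False  # corrupt or unreadable repository: do not propagate it
    return run(list(sync_argv)) == 0
```

The runner is injectable so the logic can be exercised without touching a real repository. Corruption that appears after the check still goes through, which is the objection raised above.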
But also I think a healthy lack of faith in a backup software (even if git's making the backup) is important. How many of those nightly backups could be silently corrupted by a bug in the backup software! Your disk fails, you try the backups and ...
Most IT people would have said "Where are your backups?" When the programmers say "We're using mirrors", the IT person would say, "Where are your backups?" a second time.
$50 says that whoever handles IT for KDE said "Hey guys, we need backups" and the programmers all said "Nah, we've got mirroring."
Seriously: why doesn't an organization as large as KDE have backups? I understand if Save the Fuzzy Wuzzies doesn't have good IT, but a major open source project?
Always amazes me how I don't tell programmers how to do their job, yet I've had a decade and a half of programmers arguing with me about how to do mine. Which is particularly funny, since if the server under their desk dies, it's magically my fault/responsibility.
Please help metamoderate.
The article suggests using ZFS because of its protections against bad hardware.
It implies that ZFS protects against bad RAM but *this is not the case*. The ZFS developers recommend using ECC memory.
May I respectfully disagree? I've often seen such focus on what is "out of scope" used to limit cost and to limit the "turf" on which an employer or contractor needs access. But backup is _certainly_ a critical part of source control, just as security is. The ability to replicate a working source control system to other hardware or environments due to failure or corruption of the primary server is critical to any critical source tree. Calling it "out of scope" is like calling security "out of scope". By ignoring the consequences at the design stages of a source control system, very real risks are often taken without even thinking of the possible consequences, and the resources necessary to provide such critical features later can, and often do, multiply the cost of a project in unexpected ways.
A nightly mirror on low-cost hardware with snapshot capability, for example, can provide very useful fallback capability. Even hardlink-based software snapshots can work well. It requires thought to configure correctly, to schedule the mirrors so they don't conflict with other high-bandwidth operations such as tape backup, and to handle "churn" disk space requirements. And I've had some very good success with partners and clients who took such modest backup tools and saved enormous cost on high-speed tape backup systems and high-bandwidth connections for remote mirroring facilities, or who had difficulties meeting very short backup windows and solved them by using the mirror, or the snapshots, to do the tape backups for archival. It does inject a phase delay into the tape backups, and recovery from tape has to be tested, but it's been extremely effective.
Several times, I've found that the problem is a political one. The backup system is often a very expensive, high performance capital cost, or some kind of proprietary "turf" of a manager who is very comfortable with and enamored of it, and they're concerned that adding this layer will make them look foolish for spending the money, or cost them their job as a proprietary owner of critical infrastructure. They already had the political battle purchasing the hardware in the first place and don't care to rehash their previous work. But it's often amazing what staging the backups this way can do for performance and user access to their backed up data. Most restoration cases are due to accidental file deletion or editing, and the users no longer need access to the tape backup system or off-site archival, and only to the snapshots which have read-only access with the same privileges as the original source material.
Because, you know you are a redneck !!
Just use it. Write in place filesystems are obsolete from an integrity point of view.
you had me at #!
"Git does not have the magic "integrity check" on making mirrors"
Right, so, it returns OK (0), yet the commit may be corrupt, it hasn't walked the full tree, and it may corrupt all copies. Good job you warned me about this flaw! I know to stick with p4s!
"Stop blaming the tool. This is correct and documented behavior. Start blaming the people that messed up badly."
Your backup tool is taking backups of the corrupt archive. Keep it independent or not, it's corrupted when you come back to it.
The lesson here is not to trust one piece of software.
THIS.
But while we're at it - from TFA: "The root of both bugs was a design flaw: the decision that git.kde.org was always to be considered the trusted, canonical source. The rationale behind this decision is relatively obvious; it's a locked-down, authenticated resource that runs customized hooks to validate the code being pushed to it. It's perfectly reasonable to decide that it should be considered to be correct."
If, at the end of the day, we do what TFA suggests, and propose that one machine be considered "the" authoritative centralized source, we've just given the backup-dude/sysadmin his job back.
The elephant in the room here is back in that section of TFA that refers to "the trusted, canonical source."
Congratulations, now that you've migrated from git, you discover you still need something that functions as the "centralized" part of a centralized version control system. There are many reasons to argue for DVCS over centralized, but eliminating big iron central server and the concept of backups "because the source is on everybody's laptops!" isn't one of them.
I believe you are not talking about backup. A backup allows system recovery after a disaster and cannot ever be stored in the system itself. What you are talking about is availability improvement. That _can_ be part of the primary system. RAID, for example, exclusively serves this purpose (except RAID0). But backups must also protect against user and administrator error, software errors, the data-center burning down, sabotage, etc.
Replication is not the tool for that. The problem is that any data copy part of the system itself can be corrupted by the system as the system still has access to it. That is why a backup must be both removed from the system so it is independent, and allow full reconstruction, even if the original system is completely destroyed.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does. A backup makes sure you do not lose your data permanently. What uptime improvement does is to make it less likely that you need to go back to the backup.
Or to put it differently, backup is for Disaster Recovery. Uptime improvement is for reducing DR cost, by reducing the probability of it becoming necessary, and for reducing downtime cost.
I do agree to the political angle though.
Oh, and I should say that backup is very much in scope for a version control system installation! (We do nightly full and hourly incremental backups, for example.) It is just not in scope for the version control system software itself, as it solves a different problem.
Except they suspect the corruption was there a long time unnoticed, and so your rotation copies have the corruption too! Worse, because it's rotational, sooner or later the oldest one has gone....
Really, you're putting your faith in MAGICSOFTWAREBACKUP, and saying "well Git mirrors aren't proper mirrors", except they ARE proper mirrors and they do keep historic backups! That's what distributed server versioning software *IS*: it too never overwrites old versions, it too only stores differences, it too only syncs the differences, it too is physically distributed among many machines and locations!
The problem here, is git has a flaw, and your MAGICSOFTWAREBACKUP could equally have a flaw. Perhaps it's not copying files ending in _fred, who knows, software is software, bugs are bugs! Don't assume your software (whatever it is) that describes itself as backup software is somehow less problematic than a git sync!
I hate incremental backups (the kind you describe) particularly because I've had a corrupt root file and couldn't recover from a backup. I had 2 months of data back, even if the backup had worked, it would still have been a disaster to lose more than 2 months.
IMHO this is a simple git bug: the synced copies were not only corrupted BUT NOT EVEN IDENTICAL, so there's clearly a problem here. Oh well, software is software; find the bug, fix it, and don't rely on one type of backup ever again, even your rotational backups.
A git sync to multiple machines, plus a second type of backup, is the way to go. The git mirror counts as one type of backup; you need another type, some other software, some other way. It could be rotational backups, it could be as simple as a file copy on a cron job, it could be a second versioning server (e.g. a Perforce repo mirrored from git), but some *second* backup strategy.
But backup is _certainly_ a critical part of source control, just as security is.
Interesting example, given that git also doesn't do security or authentication (hence the need for gitolite)
It was, shall we say "surprising" to discover that having commit access to a git repository allowed you to delete the history of other peoples' work.
There are many reasons to argue for DVCS over centralized, but eliminating big iron central server and the concept of backups "because the source is on everybody's laptops!" isn't one of them.
Well, sort of. If they had done full repo updates on the "mirrors", this issue would likely not have happened. The core problem was that they did el-cheapo mirroring without understanding what the consequences are. They would still have to do full checkouts and detach them afterwards to make them proper backups. After all, the git software could have flaws. So while it does not need to be a "big iron central server", setting up several systems specifically doing backups is non-optional. In a sense they will be "central" systems then.
Doesn't everyone know about the file system corruption that happens often on Linux ext4-formatted systems?
Oh well.. I guess the neckbeards have been successful in blaming the victims instead of ext4 devs.
Especially backup software.
The real mirrors in my house are also too perfect. Reflecting precisely what I put in front of them, rather than what I want to see. What they need is a copy-on-write file system for their source code servers, not an adaptive mirror.
Do we have yet another case of someone who makes an IT-related product thinking they are IT? The mistake highlighted by the article, and the many comments thinking version control = backup, remind me of the many times some vendor tried to sell an IT product to a company while, the whole time the developer or consultant was talking, I kept yelling in my mind "you don't get IT, you are not IT, go talk to YOUR IT back at your company... you know, the guys that pull their hair out every time you trash your PC installing the dev tool du jour"
developer != IT .
AB HOC POSSUM VIDERE DOMUM TUUM
Jeff King at 2013-03-24 18:31:33 GMT
propagating repo corruption across clone
"So I think at the very least we should:
"
Jeff Mitchell tried to respond to the criticism in Hacker News (a bit similar to the criticism made in Slashdot) in this post on his blog. I don't think he's successfully answered everything said here, but it is good to read his rationale.
Now, improving uptime and reducing downtimes is important, but it is not what a backup does.
Well, a backup does contribute to reducing downtimes, albeit not so much as RAID/etc. Compared to doing a full reinstall/reconfiguration restoring from backup is likely to be much faster. That is why backup can be useful even on systems that do not contain unreproducible data. There are other strategies that have other advantages (like automatic builds/etc) which are also effective if there is no data involved.
I do agree that the primary purpose of a backup is to prevent the loss of data in as many failure modes as possible/practical. Mirrors are definitely not backups (or at least, not very good ones - there is a continuum when it comes to backups, just as there is a continuum when it comes to disasters).
"backup is _certainly_ a critical part of source control"
Well, no, it isn't. Backup and version control certainly share some attributes: a history line, the ability to extract snapshots along that history line... but they go apart on other things (or else there wouldn't be specialized version control software: we all would be using backups for that).
The most obvious thing needed for backups that is not needed for version control, despite the fact that you yourself seem not to understand it, is that in backups the historical snapshots need to be disconnected from one another, while that's not the case for version control. That's the case for backups because as soon as you have any link among history points you can't guarantee the integrity of any one of them, and so you lose one of the most needed abilities of a proper backup system: the ability to get to a snapshot as it was in the past. Version control expects a properly running system, and rightly so; that's called separation of concerns and it is a good thing.
That means, for instance, that no, hardlinking is not a proper backup policy (by itself only) nor is rsync, nor is filesystem-level snapshotting. Tapes, on the other hand, do allow for proper backups because the contents of one backup set are totally disconnected from the contents of another one so any failure, tamper, or disaster on one of them doesn't automatically affect others (but certainly tapes is not the only way to achieve that goal).
At the highest level it is not hard to come up with a proper backup design, no, really:
* At least two whole copy sets of the data to protect
* At least one of them totally disconnected from the system to be protected.
* At least one copy of the recovery procedure documentation outside the system to be protected.
* At least two persons in the know of the procedure and a third who knows where to find the documentation. Make sure they do not all work together at the same place.
* Finally, remember that if you didn't try to recover data from it, you don't have a backup.
Now, go back over that list and tell me if what the KDE people were doing fits the definition.
OK, I'll answer this for you: No, it doesn't. The mirroring script coupled each and every copy along their whole history path, so it isn't a backup.
See? Not so difficult.
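The last bullet in that list ("if you didn't try to recover data from it, you don't have a backup") can at least be partly automated. A minimal sketch, assuming you record a checksum manifest when the backup is taken and compare it after a trial restore; `manifest` and `restore_problems` are hypothetical names, and reading whole files into memory is fine only for a sketch.

```python
import hashlib
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_problems(expected: dict[str, str], restored_root: Path) -> list[str]:
    """List files that are missing or differ after a trial restore.

    `expected` is the manifest recorded from the original data at backup time.
    An empty result means every recorded file came back byte-identical.
    """
    actual = manifest(restored_root)
    problems = [f"missing: {name}" for name in expected if name not in actual]
    problems += [f"changed: {name}" for name in expected
                 if name in actual and actual[name] != expected[name]]
    return sorted(problems)
```

Keep the manifest with the recovery documentation, outside the system being protected, or it gets corrupted along with everything else.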
How often would you do complete backups of KDE? How many would you save? How much hardware would that require?
TFA says they have several GBs of data. Something like 89 GB. Since that's a rounding error to us, we volunteered to donate the necessary space. (EACH of our storage units for our backup service is at 14 TB, so donating 89 GB X 4 copies is nothing.)
You asked how often - most web servers we do daily. For their case, I'd probably do the same as my desktop - daily off-site, and a local snapshot four times per day.
It sounds like you have in mind rolling back to something very specific, something which is perhaps not a backup. What I'm talking about most certainly is a backup. I'm talking about rolling back to an offsite image made last month, last week, yesterday, or this morning.
" If they had bothered to look at the documentation they would have known"
So you read and familiarize yourself with the entire documentation on every piece of software you use?
Two years ago, Google erased the contents of hundreds of thousands of GMail accounts. It was caused by a bug, and corruption spread through their network even though it is normally highly redundant and fault tolerant.
The result: a few hours to a few days of downtime for the affected accounts and almost no data loss.
How did they manage to avert a disaster? They had proper backups, on tapes.
I'm wondering whether the people commenting here have actually read the article the post links to. It's well explained there why part of the fault definitely lies with git, and why making backups of a repository being changed all the time isn't as simple as just copying stuff with a cronjob. If the mistake with the repo list from the main server would not have been made (and this was the only real mistake they made), and git had actually worked as documented, then the mirrors would have been a perfectly reasonable backup solution in my opinion.
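On the repo list mistake: one cheap safeguard on the mirror side is to refuse mass deletions. This is only a sketch of the idea, not what the KDE scripts do or did (they are not shown here); `repos_to_delete`, `SuspiciousRepoList` and the 10% threshold are all assumptions.

```python
class SuspiciousRepoList(RuntimeError):
    """Raised when the advertised list would remove too many local repos."""


def repos_to_delete(local: set[str], advertised: set[str],
                    max_fraction: float = 0.1) -> set[str]:
    """Return local repos absent from the master's advertised list.

    Refuses (raises) if more than `max_fraction` of the local repos would be
    deleted, on the theory that an empty or truncated list is more likely a
    fault on the master than a real mass removal.
    """
    doomed = local - advertised
    if local and len(doomed) / len(local) > max_fraction:
        raise SuspiciousRepoList(
            f"{len(doomed)} of {len(local)} repositories would be deleted"
        )
    return doomed
```

A human can still override the check when a big cleanup is intended, but a mirror no longer has to treat whatever the master says as correct.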
I don't know, I think I'd respectfully disagree. Those are all backups. If you have an online mirror and a fire destroys your primary data source or it's stolen, you can restore from the online mirror. This, having at least one full copy of the data and being able to restore it after a loss, is the very definition of a backup.
The problem is that mirrors are not very good backups, and are prone to having the same problems as the original. Using a mirror as a backup is perfectly reasonable. Using a mirror as your only backup is foolish.
From my 34 years of constructing, coding and maintaining applications on computers I learned the hard way the 4 most important points:
1. Backup.
2. Backup.
3. Backup.
4. The rest.
Mundus Vult Decipi