Patch the Linux Kernel Without Reboots
evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'"
Update: 04/24 10:04 GMT by KD : Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.
If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
honestly how much downtime are we talking here? 30 seconds?
Sometimes, life itself is sarcasm...
Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.
The key sequence to access my Slashdot bookmark in Firefox is Alt-B-S. I don't believe this is a coincidence.
Theory of operation:
:-)
1. Build new_module
2. rmmod old_module
3. modprobe new_module
Gee, that was hard
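For what it's worth, the three steps above can be sketched as a small Python helper that only assembles the commands and does not run them. The function name `module_swap_commands` and the default build command `make` are assumptions made for illustration; only `rmmod` and `modprobe` come from the comment itself.

```python
def module_swap_commands(old_module, new_module, build_cmd=("make",)):
    """Return the three steps as argument lists, in order, without running them."""
    return [
        list(build_cmd),            # 1. build new_module (placeholder build command)
        ["rmmod", old_module],      # 2. unload the old module
        ["modprobe", new_module],   # 3. load the new module
    ]


# Print the sequence for the module names used in the comment.
for cmd in module_swap_commands("old_module", "new_module"):
    print(" ".join(cmd))
```

Note that this only applies to code built as a loadable module, and `rmmod` will normally refuse to unload a module that is still in use, so it does not cover patching code compiled into the running kernel image.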
I'd rather have at least two of anything important and have stateful failover between them.
If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?
I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.
If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.
And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade?
Once again, we have an over-engineered solution to a non-existent problem.
Any enterprise-level customer is going to have a VERY lengthy QA process before deploying anything into production. This includes testing kernels, hardware, networks, interaction, application, data and so on. One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period.
So you hot-patch a running Linux kernel. How do you QA that? How do you roll back if the patch fails? Where is your 'control'?
The answer? A duplicate machine. But wait, if you have two identical machines... isn't that... a cluster?
Exactly. And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades. You NEVER EVER touch a running, production system like that.
Well, not if you want any sort of data integrity or control and want to pass any level of quality validation on that physical environment.
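A minimal sketch of the 'round-robin' upgrade described above, assuming each node can be taken out of rotation, upgraded, verified, and put back through caller-supplied callbacks. The names `rolling_upgrade`, `drain`, `upgrade`, `verify`, and `restore` are hypothetical and stand in for whatever the cluster tooling actually provides.

```python
def rolling_upgrade(nodes, drain, upgrade, verify, restore):
    """Upgrade nodes one at a time, stopping at the first failed verification.

    Returns (done, failed): the nodes upgraded and restored so far, and the
    node that failed verification (None if every node passed).
    """
    done = []
    for node in nodes:
        drain(node)      # take the node out of the cluster
        upgrade(node)    # apply the new kernel and reboot, off-line
        if not verify(node):
            # Leave the failed node out of rotation; untouched nodes keep serving.
            return done, node
        restore(node)    # roll traffic back onto the verified node
        done.append(node)
    return done, None


# Example with stand-in callbacks: node "b" fails verification.
log = []
done, failed = rolling_upgrade(
    ["a", "b", "c"],
    drain=lambda n: log.append(("drain", n)),
    upgrade=lambda n: log.append(("upgrade", n)),
    verify=lambda n: n != "b",
    restore=lambda n: log.append(("restore", n)),
)
print(done, failed)
```

Stopping at the first failure is what provides the 'control': the nodes not yet touched are still running the old, known-good kernel.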
Lots of people are saying, "100% uptime of a particular machine is neither necessary nor desirable, full failover is better. Full failover is the only way to handle catastrophic hardware failures." Or something to that effect.
But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot.
It's like people saying, "I don't mind rebooting after installing Office, I don't expect 100% uptime from my workstation." Of course you don't need to be able to do software installs without rebooting. But isn't it nice to have that option available?
Same with this. When (and if) it gets stabilized and standardized, you'll use it. Not for 100% uptime, just because it's nice to not be required to reboot to enable a particular software install.
Stop-Prism.org: Opt Out of Surveillance
My bad, I meant to say,
"A remote attacker who successfully executes a privilege escalation exploit and gains root access will have an easier time taking control of your server and hiding their tracks".
Thanks for pointing that out
- Roey
And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades
Hmmm. I happen to live by your words in an environment where this is theoretically possible, but practically impossible. Why? Because when the cluster rolls to a passive node, the application times out on the existing connections. The timeouts have business ($$$$) implications. I wish it were okay to have infinite retries, but it's viewed as a violation of the service agreement. Telephony is like this too.
An academic ideal for sure, but please speak more humbly because it is no silver bullet.
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
I spent a decade in perimeter networking at a Fortune 50 US bank. My group didn't do the internal network, just the perimeter, and we still had dozens of network sites and thousands of pieces of equipment. The bank itself has hundreds of thousands of employees, millions of users. Online banking and brokerage are about as high availability as you can get save utilities (power, water, telephony, etc) or military. Seconds of online brokerage downtime equated to millions of dollars lost.
The idea that load balancing introduces inexplicable down time is completely unsupported by my experience.
"One-node clusters" seems like marketing speak for "single point of failure". A cluster by definition is two or more nodes.
Redundant routers, switches, firewalls, the works or you're not high-availability in my opinion. The fact that you're talking about PostgreSQL instead of Oracle or DB2 on mainframes makes me think that your idea of high availability is different than mine.
I would think that on top of the benefits of patching running high-uptime servers this would in the long run also result in yet another benefit to running Linux on your desktop instead of Windows. I don't see any reason RedHat, Ubuntu and everyone else wouldn't implement this type of kernel upgrade for convenience' sake.
I keep forgetting my place. Jesus is for losers. Why do I still play to the crowd?