Patch the Linux Kernel Without Reboots

Posted by kdawson on Thursday April 24, 2008 @03:00AM from the click-n-go dept.

evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'" Update: 04/24 10:04 GMT by KD: Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.

Already been used by caluml · 2008-04-24 03:10 · Score: 4, Informative

There was a kernel exploit recently where someone submitted a patch that modified the running kernel using this technology. It didn't work for me, so I had to resort to patching the .c that was affected - but a lot of people reported that it worked.

--
Get your own free personal location tracker
1. Re:Already been used by ThisNukes4u · 2008-04-24 04:05 · Score: 2, Informative
  
  IIRC, that code was actually a modified version of the exploit where the payload was changed to fix the exploit instead of spawning a root shell. Pretty fucking ingenious if you ask me.
  
  --
  thisnukes4u.net
Impressive hack by EriktheGreen · 2008-04-24 03:18 · Score: 4, Informative

For those that haven't read the paper, the technique used is straightforward in concept, but the devil is in the details.
He basically compiles a patched and unpatched kernel with the same compiler, compares the ELF output, and uses that to generate a binary file that corresponds to the change. That gets wrapped in a generic module for use, another module installs it along with JMPs to bypass the old code and use the new, and he performs the checks needed to make sure he can safely install the redirects.
He also has to differentiate real changes from incidental ones (the example given is changing the address of a function - all references to it will change, but they don't really need to be included in the binary diff).
The only human work required is to check whether a patch makes semantic changes to a data structure, e.g. whether an unsigned integer variable that was being used as a number is now a packed set of flags: the data declaration is the same, but it's being used differently.
Interesting paper. Also a useful new set of capabilities for any Linux user who can't handle downtime for quarterly patching... worth its weight in gold in some businesses.
Erik
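
To make the two ideas in the comment above concrete, here is a minimal Python sketch. It is a toy illustration, not ksplice's code: the function names (`mask_relocations`, `changed_functions`, `jmp_rel32`), the sample byte strings, and the addresses are all made up for the example. It assumes each function's machine code and relocation offsets have already been extracted from the two builds. The first part ignores bytes that differ only because a referenced address moved; the second part builds an x86 `JMP rel32` (opcode `0xE9` followed by a 32-bit little-endian displacement measured from the end of the jump instruction) that would redirect the old function to its replacement.

```python
import struct

JMP_REL32_OPCODE = 0xE9  # x86 near jump with a 32-bit relative displacement
JMP_REL32_LEN = 5        # 1 opcode byte + 4 displacement bytes


def mask_relocations(code, reloc_offsets, width=4):
    """Zero the bytes at each relocation site so that address-only
    differences between two builds do not count as real changes."""
    masked = bytearray(code)
    for off in reloc_offsets:
        if off < 0 or off + width > len(masked):
            raise ValueError("relocation site outside the function body")
        masked[off:off + width] = bytes(width)
    return bytes(masked)


def changed_functions(old, new):
    """Return names of functions whose code really changed.

    `old` and `new` map a function name to (code bytes, relocation offsets).
    Functions that only exist in `new` are reported as changed.
    """
    changed = []
    for name, (new_code, new_relocs) in new.items():
        if name not in old:
            changed.append(name)
            continue
        old_code, old_relocs = old[name]
        if mask_relocations(old_code, old_relocs) != mask_relocations(new_code, new_relocs):
            changed.append(name)
    return changed


def jmp_rel32(src_addr, dst_addr):
    """Encode `JMP rel32` placed at src_addr that lands on dst_addr.

    The displacement is relative to the address of the next instruction,
    i.e. src_addr + 5. struct.pack raises if it does not fit in 32 bits.
    """
    disp = dst_addr - (src_addr + JMP_REL32_LEN)
    return bytes([JMP_REL32_OPCODE]) + struct.pack("<i", disp)


# Hypothetical sample data: "f" differs only in a call target (a relocation
# site at offset 2), while "g" has a genuinely different instruction byte.
old_build = {
    "f": (b"\x55\xe8\x10\x00\x00\x00\xc3", [2]),
    "g": (b"\x31\xc0\xc3", []),
}
new_build = {
    "f": (b"\x55\xe8\x40\x00\x00\x00\xc3", [2]),
    "g": (b"\x31\xc9\xc3", []),
}

print(changed_functions(old_build, new_build))  # ['g']
print(jmp_rel32(0x1000, 0x2000).hex())          # e9fb0f0000
```

The safety checks the comment mentions (making sure the redirect can be installed safely) are deliberately left out; the sketch only covers "what changed" and "what bytes redirect a call".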
And Microsoft claims to have invented it by davecb · 2008-04-24 03:29 · Score: 3, Informative

Tomasz Chmielewski wrote on LKML: the idea seem to be patented by Microsoft, i.e. this patent from December 2002: http://www.google.com/patents?id=cVyWAAAAEBAJ&dq=hotpatching In essence, they patented kexec ;)
Andi Kleen promptly provided prior art: The basic patching idea is old and has been used many times, long predating kexec. e.g. it's a common way to implement incremental linkers too.

--
davecb@spamcop.net
Re:Wrong way to solve the uptime problem by N1ck0 · 2008-04-24 03:30 · Score: 3, Informative

Mainly why people in the telecom industry have been clamoring for it. It's very difficult to take over the termination of a circuit switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching.

Of course, much of the reason is that a lot of commercial telecom apps are badly implemented and need better management controls.
Re:Unless it fails. by UnknowingFool · 2008-04-24 03:33 · Score: 2, Informative

For your average computer and generic Linux servers, the downtime is small. But companies often have applications that they need to restart. That is the difference. Also, Linux is used on equipment other than generic servers: embedded systems, etc., where loading isn't optimized because the equipment should never go down.

--
Well, there's spam egg sausage and spam, that's not got much spam in it.
Re:Unless it fails. by Tychon · 2008-04-24 03:36 · Score: 4, Informative

A company that I once had dealings with was quite proud of their five nines. The motivation? It cost them $18,000 per second they were down. 30 seconds isn't just 30 seconds sometimes.
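
The arithmetic behind that comment is easy to work through. The sketch below is only a back-of-the-envelope calculation: the $18,000 per second figure and the 30-second reboot come from the thread, while the helper names and the 365-day year are assumptions made for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_budget_seconds(availability):
    """Seconds of downtime per year allowed at a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def outage_cost(seconds_down, cost_per_second):
    """Total cost of an outage at a flat per-second rate."""
    return seconds_down * cost_per_second


budget = downtime_budget_seconds(0.99999)  # "five nines"
print(round(budget, 2))                    # 315.36 seconds per year
print(outage_cost(30, 18_000))             # 540000 dollars for one 30 s reboot
print(round(30 / budget, 3))               # share of the yearly budget used
```

So a single 30-second reboot would cost over half a million dollars at that rate and use up roughly a tenth of the year's five-nines allowance.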
Re:The real test... by LinuxDon · 2008-04-24 03:39 · Score: 2, Informative

It's in the comment: "ksplice requires no kernel modifications"

So yes, ksplice can be installed/used without rebooting.
Re:Needed that bad? by Iphtashu Fitz · 2008-04-24 03:44 · Score: 4, Informative

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

If you have a load balanced environment then you have the ability to redirect new connections away from a given server. Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline. I've worked in a number of telephony environments and this was always the way we would patch systems. Stop accepting new connections, wait for existing ones to end, then perform the patch, reboot, verify, and start accepting connections again.
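
The drain-then-patch routine described above can be sketched as a small simulation. This is a hypothetical model, not any real load balancer's API: `Server`, `LoadBalancer`, `drain`, and `is_idle` are names invented for the example, and the least-connections choice is just one plausible policy.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    accepting: bool = True                     # may receive new connections
    active: set = field(default_factory=set)   # ids of open connections


class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)

    def connect(self, conn_id):
        """Send a new connection to the least-loaded accepting server."""
        candidates = [s for s in self.servers if s.accepting]
        if not candidates:
            raise RuntimeError("no server is accepting connections")
        target = min(candidates, key=lambda s: len(s.active))
        target.active.add(conn_id)
        return target

    def disconnect(self, conn_id):
        """Remove a connection from whichever server holds it."""
        for s in self.servers:
            s.active.discard(conn_id)


def drain(server):
    """Step 1: stop sending new connections to this server."""
    server.accepting = False


def is_idle(server):
    """Step 2: the server is safe to patch once nothing is connected."""
    return not server.accepting and not server.active


a, b = Server("a"), Server("b")
lb = LoadBalancer([a, b])
lb.connect("call-1")           # lands on a
lb.connect("call-2")           # lands on b

drain(a)
lb.connect("call-3")           # a is draining, so this goes to b
print(is_idle(a))              # False: call-1 is still up
lb.disconnect("call-1")
print(is_idle(a))              # True: patch, reboot, verify
a.accepting = True             # then start accepting connections again
```

Note that `is_idle` only becomes true once the last existing connection ends, so how long the patch has to wait depends entirely on how long those connections live.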

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There's no dns tricks or tcp/ip you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there are by necessity a single chokepoint.

Any mission critical hardware, switches, routers, servers, etc. should be set up in redundant pairs (or triplets, ...) so that if a hardware failure occurs the remaining hardware can keep the service up. Single points of failure are avoided like the plague in datacenters that require 100% uptime. Part of that is to deal with hardware failures but part is also to provide an ability to perform software/firmware upgrades when necessary. Once again, you migrate all traffic off the system you're upgrading then apply the upgrades offline. Upgrading a kernel, especially, in an online environment, is something virtually any sysadmin would want to avoid if at all possible.

Redundancy is key, and any commercial datacenter will offer it all the way from their connections to the outside world to the connections they provide their customers. Every datacenter used by every company I ever worked for (about 10) offered redundant power and redundant network drops (using HSRP, VRRP, etc) for our equipment. If the datacenter needed to upgrade a router they'd move all traffic off one router so they could upgrade and test it, then move traffic off the other and repeat the process. Similarly if we needed to upgrade our firewalls, switches, etc. we'd fail over to the second redundant device first. In some cases we had bonded interfaces right on the end servers so as long as one path remained active we could power down an entire switch, router, firewall, etc. In other cases we relied on load balancing across servers that were alternately connected to one or another switch.
Re:Wrong way to solve the uptime problem by Anonymous Coward · 2008-04-24 04:22 · Score: 1, Informative

Big Banks (tm) - like the one I currently work in - can afford to and do have even the largest systems installed in fully redundant configurations. It's part of standard BCM (business continuity management) practice - we need to, and can, survive an entire datacenter dropping off the network, for whatever reason, up to and including getting bombed off the face of the earth. In normal day-to-day practice these machines can be and are used for load-balancing, to allow primary boxes to get taken down for maintenance.

And as a sysadmin in a bank, the solution described in the story isn't that appealing. It strikes me as something inherently less reliable than doing a cold boot with a new kernel. Scheduled downtime is OK, unscheduled problems because someone wanted to do an upgrade on the fly are *bad*.
Re:Needed that bad? by Nkwe · 2008-04-24 04:51 · Score: 3, Informative

Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline.
This assumes that active connections will terminate in a timely fashion. I used to have internet service via an ISDN connection to my office. My ISDN calls would stay connected for a couple of months at a time. Yes, one connection lasting multiple months. There are other cases where a connection, context, or state between two systems would need to be maintained for extended periods of time. Many of these situations cannot be solved by load balancing and would benefit greatly from the ability to make kernel changes without interrupting current work, or waiting for it to complete.
Re:Needed that bad? by mr_mischief · 2008-04-24 05:02 · Score: 4, Informative

If you change something in a configuration that requires a change to the startup script, then you also change the startup script.

A patch to the kernel almost never requires changes to startup scripts. They're not talking about adding new functionality with user-space-addressable interfaces with this tool. They're talking about being able to install about 84% of security hotfixes in a hurry outside your scheduled reboots then rebooting on your regular maintenance schedule.
Re:Wrong way to solve the uptime problem by mr_mischief · 2008-04-24 05:30 · Score: 3, Informative

Can we please kill the 24/7/365 phrasing? Where do you people live that there are 365 weeks in a year?

Why not 24/7/52 or 24/7/4.3/12 or just 24/365 (or 24/365.242 for the pedants)?
Re:Needed that bad? by mopower70 · 2008-04-24 05:34 · Score: 2, Informative

A configuration change that renders a start up script ineffective is a sign that your sysadmin hasn't been doing his job, and that you have no concept of change control in your environment.

We have systems that run for years at a time because we have change management tools that guarantee that those systems are in the exact state of configuration they should be in, and these tools run every night. If you're running around making undocumented configuration changes that even have a ghost of a chance of affecting server operation, anyone that gave you root access needs to have their fingers shortened.
Re:Needed that bad? by mOdQuArK! · 2008-04-24 06:04 · Score: 4, Informative

There's a difference between what YOU as an end user consider to be an open connection, and what the telecom equipment consider as a connection.

For all you know, your apparent always-on connection was actually a virtual connection being frequently switched & reswitched over many different real physical connections. That would be a fairly standard architecture for having a network infrastructure which can have components being worked on while data is still flowing through the network.

When the telecom provider is "waiting for active connections to go away" on a particular device, it only means waiting until all of the virtual connections that are momentarily being switched through that device have been successfully switched to another device. It doesn't mean that any of those virtual connections have to be terminated.
Re:Needed that bad? by smallfries · 2008-04-24 08:02 · Score: 2, Informative

True, but I've been standing in switch rooms watching operators manually kill those circuits because they wanted to reboot a box. 5x 9s doesn't mean perfect service, and if anyone complained about it they were told that a ms interruption once every few months was in their SLA. By the time they reconnected they went through another box so how were they to know it was any longer than that.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:And Microsoft claims to have invented it by johannesg · 2008-04-24 08:04 · Score: 2, Informative

AmigaOS had its kernel in ROM, and could be patched on the fly. That was back in 1985, so even if it was patented, it isn't now.

The patching function was not an accident either; there was an OS function for this purpose. Originally it was intended to allow bug fixes to be installed without having to change the ROM, but it was quickly co-opted into a mechanism for enhancing the OS in various other ways as well.
This was the smallest part of the interview... by tytso · 2008-04-24 10:46 · Score: 3, Informative

Funny thing... this was the smallest part of my oh, hour and twenty minute interview with the reporter. The reason for the call was to hear about what was up with the 2.6.25 release; she probably spent more time talking with me about KVM and Xen; and I mentioned ksplice just as an aside, as an example of lots of really interesting and exciting work that doesn't necessarily happen as part of a mainline kernel release. I spent maybe 2-3 minutes tops talking to her about ksplice --- and that's what she ends up writing about and getting slashdotted!
Microsoft has NOT patented this! by Anonymous EPA · 2008-04-25 00:06 · Score: 2, Informative

Tomasz Chmielewski is wrong. Microsoft applied for a patent and their application was rejected by the examiner, as was their appeal in the USPTO. Check out the file history of application US 2004-0107416.
Their only resort is to appeal to court.
There are no applications in other countries.