Patch the Linux Kernel Without Reboots

Posted by kdawson on Thursday April 24, 2008 @03:00AM from the click-n-go dept.

evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'" Update: 04/24 10:04 GMT by KD: Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.

11 of 286 comments

Needed that bad? by MetalliQaZ · 2008-04-24 03:04 · Score: 5, Insightful

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?

--
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
1. Re:Needed that bad? by jelle · 2008-04-24 04:12 · Score: 5, Insightful
  
  "So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box."
  
  Methods like that usually suck in real life, because the very day before you want to 'take it out of rotation', a circuit is opened through it that requires five nines (so you can't drop it), and it will remain open for months...
  
  You will end up with 99 boxes waiting to 'get out of rotation' for every single box that you don't need to update...
  
  Murphy will make sure of that.
  
  --
  --- Hindsight is 20/20, but walking backwards is not the answer.
2. Re:Needed that bad? by Anonymous Coward · 2008-04-24 04:43 · Score: 5, Insightful
  
  "I have internal processing servers that have up times of over 3 years"
  
  I've never understood this boasting about uptime. Long uptimes are a bad thing! How do you know a configuration change hasn't rendered one of your startup scripts ineffective? If you have to reboot for some unexpected reason, you could be stuck debugging unrelated problems at very inopportune moments.
  
  You need to schedule regular reboots so that you can test that your servers can start up fine at a moment's notice. Long uptimes are a sign a sysadmin hasn't been doing his job.
3. Re:Needed that bad? by Kookus · 2008-04-24 05:31 · Score: 5, Insightful
  
  Production systems are not for testing purposes. You want to test rebooting? Do it on a test box.
4. Re:Needed that bad? by Kymermosst · 2008-04-24 05:39 · Score: 4, Insightful
  
  "How do you know a configuration change hasn't rendered one of your startup scripts ineffective?"
  
  Isn't that what QA systems and effective approaches to change management are supposed to handle?
  
  If I am planning a change, I should discover problems with the startup scripts in QA, not in production, especially if a production reboot is not required to implement the change.
  
  --
  "Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
5. Re:Needed that bad? by adrianbaugh · 2008-04-24 05:53 · Score: 4, Insightful
  
  "How do you know that your test boxes are configured precisely identically to the production boxes?"
  
  dd your production box's system filesystems to another hard drive, put in an identically specced machine, boot that?
  
  --
  "'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
  - JRR Tolkien.
Wrong way to solve the uptime problem by Anon+E.+Muss · 2008-04-24 03:07 · Score: 4, Insightful

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.

--
The key sequence to access my Slashdot bookmark in Firefox is Alt-B-S. I don't believe this is a coincidence.
1. Re:Wrong way to solve the uptime problem by trybywrench · 2008-04-24 03:31 · Score: 4, Insightful
  
  "Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers."
  
  People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars, you can't afford a spare sitting idle. From day 1 the server needs to be making money and never ever stop. For smaller general-purpose servers like you can buy at Dell.com, then yeah, having a fail-over makes sense.
  
  --
  I came to the datacenter drunk with a fake ID, don't you want to be just like me?
If it's that critical, shouldn't you have two? by Paul+Carver · 2008-04-24 03:20 · Score: 4, Insightful

I'd rather have at least two of anything important and have stateful failover between them.

If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.

And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade?
Re:Amazing by KeithJM · 2008-04-24 03:40 · Score: 5, Insightful

"someone with root access could slip a rootkit right under your nose"

Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.
Re:No, No, No and No again. by hab136 · 2008-04-24 05:17 · Score: 4, Insightful

"As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply for running one-node clusters who are using a single machine as a perimeter network device. (ex. firewall) I see lots of these in the racks at our NOC provider."

"1. We connect to several load balanced systems and the complexity introduced by load balancers translates to inexplicable down time. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no down time. A few minutes of down time costs more than the server hardware."

I spent a decade in perimeter networking at a Fortune 50 US bank. My group didn't do the internal network, just the perimeter, and we still had dozens of network sites and thousands of pieces of equipment. The bank itself has hundreds of thousands of employees, millions of users. Online banking and brokerage are about as high availability as you can get save utilities (power, water, telephony, etc) or military. Seconds of online brokerage downtime equated to millions of dollars lost.

The idea that load balancing introduces inexplicable down time is completely unsupported by my experience.

"One-node clusters" seems like marketing speak for "single point of failure". A cluster by definition is two or more nodes.

Redundant routers, switches, firewalls, the works, or you're not high-availability in my opinion. The fact that you're talking about PostgreSQL instead of Oracle or DB2 on mainframes makes me think that your idea of high availability is different than mine.