Patch the Linux Kernel Without Reboots

In Soviet Russia, by finalnight · 2008-04-24 03:02 · Score: 0, Offtopic

In Soviet Russia, the kernel reboots you!

Re:In Soviet Russia, by oodaloop · 2008-04-24 03:31 · Score: 2, Funny

Let's get the rest of the usual jokes out of the way while we're at it.

If there were no kernel, it would be necessary to create our non-rebooting robot overlords are belong to Chuck Norris.

--
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
Re:In Soviet Russia, by Anonymous Coward · 2008-04-24 03:55 · Score: 0

You forgot: "Imagine a Beowulf cluster of those" and "But does it run linux?"
Re:In Soviet Russia, by oodaloop · 2008-04-24 04:26 · Score: 5, Funny

"But does it run linux?" That's a joke? I thought that was just one dedicated user who kept asking on every article.

--
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
Re:In Soviet Russia, by A nonymous Coward · 2008-04-24 06:00 · Score: 1

In North Korea, only old kernels reboot only old people!

Or is that any kernels reboot only old people?

Or perhaps only old kernels reboot any people?

Or do any kernels reboot any people, but lately?

I am confoozed.

--
Infuriate left and right
Re:In Soviet Russia, by Anonymous Coward · 2008-04-24 08:54 · Score: 1

I, for one, welcome our hot-patchable kernel overlords!

Needed that bad? by MetalliQaZ · 2008-04-24 03:04 · Score: 5, Insightful

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?

--
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"

Re:Needed that bad? by tgatliff · 2008-04-24 03:22 · Score: 2, Funny

I guess a better way to put it would be "oh... Way Cool!!!!"... :)

Meaning, yes I agree that in most cases it is not needed, but I have internal processing servers that have up times of over 3 years, so if I had something like this probably all my servers would have up times of this long..
Re:Needed that bad? by Chris Burke · 2008-04-24 03:26 · Score: 3, Interesting

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching.

Two things:

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There's no dns tricks or tcp/ip you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is by necessity a single chokepoint.

As to how often these things collide, and how much of a pain it is to actually stop a server for some amount of time, I can't say. But I can see situations where being able to hot-swap a kernel would be useful.

--

The enemies of Democracy are
Re:Needed that bad? by garlicbready · 2008-04-24 03:28 · Score: 3, Interesting

I was about to say another idea might be virtualisation, via Xen for example:
start up a new virtual machine with the new kernel, then when you're sure it's working, just switch everything across from the old to the new, and shut down the old virtual instance
Re:Needed that bad? by diamondsw · 2008-04-24 03:43 · Score: 1

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine.

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

--
I don't know what kind of crack I was on, but I suspect it was decaf.
Re:Needed that bad? by Iphtashu Fitz · 2008-04-24 03:44 · Score: 4, Informative

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

If you have a load balanced environment then you have the ability to redirect new connections away from a given server. Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline. I've worked in a number of telephony environments and this was always the way we would patch systems. Stop accepting new connections, wait for existing ones to end, then perform the patch, reboot, verify, and start accepting connections again.
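The sequence described above (stop accepting new connections, wait for existing ones to end, patch, return to service) can be sketched roughly as follows. This is only an illustration: the `Server` class, `drain_and_patch`, and `simulated_hangup` are hypothetical names, and the "patch" is a flag standing in for the real patch, reboot, and verify work.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_connections: int
    accepting_new: bool = True
    patched: bool = False


def drain_and_patch(server, wait_for_one_to_end):
    """Quiesce one server, patch it offline, then return it to the pool."""
    server.accepting_new = False          # balancer stops sending new sessions
    while server.active_connections > 0:  # existing sessions finish on their own
        wait_for_one_to_end(server)
    server.patched = True                 # stands in for patch, reboot, verify
    server.accepting_new = True           # back into rotation
    return server


def simulated_hangup(server):
    # Hypothetical stand-in: pretend one caller hung up.
    server.active_connections -= 1


box = drain_and_patch(Server("sw1", active_connections=3), simulated_hangup)
print(box)
```

Note that the loop has no timeout: a connection that never ends would keep the box out of rotation indefinitely.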

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There's no dns tricks or tcp/ip you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is by necessity a single chokepoint.

Any mission critical hardware, switches, routers, servers, etc. should be set up in redundant pairs (or triplets, ...) so that if a hardware failure occurs the remaining hardware can keep the service up. Single points of failure are avoided like the plague in datacenters that require 100% uptime. Part of that is to deal with hardware failures but part is also to provide an ability to perform software/firmware upgrades when necessary. Once again, you migrate all traffic off the system you're upgrading then apply the upgrades offline. Upgrading a kernel, especially, in an online environment, is something virtually any sysadmin would want to avoid if at all possible.

Redundancy is key, and any commercial datacenter will offer it all the way from their connections to the outside world to the connections they provide their customers. Every datacenter used by every company I ever worked for (about 10) offered redundant power and redundant network drops (using HSRP, VRRP, etc) for our equipment. If the datacenter needed to upgrade a router they'd move all traffic off one router so they could upgrade and test it, then move traffic off the other and repeat the process. Similarly if we needed to upgrade our firewalls, switches, etc. we'd fail over to the second redundant device first. In some cases we had bonded interfaces right on the end servers so as long as one path remained active we could power down an entire switch, router, firewall, etc. In other cases we relied on load balancing across servers that were alternately connected to one or another switch.
Re:Needed that bad? by Paul Carver · 2008-04-24 03:45 · Score: 3, Insightful

If your load balancer can't take a server out of the pool while allowing current sessions to finish cleanly then you need to shop for a new load balancer.

A decent load balancer will obviously give you the choice of whether to take a server out of service immediately disrupting existing sessions or simply stop sending new sessions to it while allowing existing sessions to continue.

As for your comment about physical connections, that's what portchannels and multilink trunks are for. Or VRRP and HSRP depending on which level of "connected to" you mean.
Re:Needed that bad? by QuantumRiff · 2008-04-24 04:05 · Score: 2, Interesting

But what about the servers that are placed in remote sites like small cell towers, where space and backup power are critical issues?

--

What are we going to do tonight Brain?
Re:Needed that bad? by jelle · 2008-04-24 04:12 · Score: 5, Insightful

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

Methods like that usually suck in real-life, because right the day before you want to 'take it out of rotation', a circuit is opened through it that requires five nines (so you can't drop it), and it will remain open for months...

You will end up with 99 boxes waiting to 'get out of rotation' for every single box that you don't need to update...

Murphy will make sure of that.

--
--- Hindsight is 20/20, but walking backwards is not the answer.
Re:Needed that bad? by Colin Smith · 2008-04-24 04:31 · Score: 2, Interesting

The very fact that there is load balancing means that every server is likely to have active connections going through it

http://conntrack-tools.netfilter.org/

I hot-swap whole networks.

HTH.

--
Deleted
Re:Needed that bad? by Anonymous Coward · 2008-04-24 04:38 · Score: 1, Funny

what takes it out of rotation? it's turtles all the way down.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 04:43 · Score: 5, Insightful

I have internal processing servers that have up times of over 3 years

I've never understood this boasting about uptime. Long uptimes are a bad thing! How do you know a configuration change hasn't rendered one of your startup scripts ineffective? If you have to reboot for some unexpected reason, you could be stuck debugging unrelated problems at very inopportune moments.

You need to schedule regular reboots so that you can test that your servers can start up fine at a moment's notice. Long uptimes are a sign a sysadmin hasn't been doing his job.
Re:Needed that bad? by Nkwe · 2008-04-24 04:51 · Score: 3, Informative

Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline.

This assumes that active connections will terminate in a timely fashion. I used to have internet service via an ISDN connection to my office. My ISDN calls would stay connected for a couple of months at a time. Yes, one connection lasting multiple months. There are other cases where a connection, context, or state between two systems would need to be maintained for extended periods of time. Many of these situations cannot be solved by load balancing and would benefit greatly from the ability to make kernel changes without interrupting current work, or waiting for it to complete.
Re:Needed that bad? by mr_mischief · 2008-04-24 04:59 · Score: 1

One of the things telephony servers do these days is handle ports, traffic, or both for live calls.

Hey, as long as it's your calls that get dropped and not mine, it's fine with me if the servers drop calls. If you'd rather not have any calls dropped, then this is nice.

You could take the server to be rebooted out of the load balancer's control, in which case existing calls would eventually end and no new calls would get assigned. You could then reboot once no calls would be affected. This solution, though, leaves the kernels unpatched for (num_machines * time_to_idle) / 2 on average.

If you combine this hotfix capability with scheduled simmer down reboots, then you can fix all your servers right now, then reboot them one at a time later as well. That gets your security fixes rolled out faster than just waiting for idling down then rebooting one server at a time.
Re:Needed that bad? by harry666t · 2008-04-24 05:00 · Score: 1

OTOH, short uptimes are not KRIEG!
Re:Needed that bad? by mr_mischief · 2008-04-24 05:02 · Score: 4, Informative

If you change something in a configuration that requires a change to the startup script, then you also change the startup script.

A patch to the kernel almost never requires changes to startup scripts. They're not talking about adding new functionality with user-space-addressable interfaces with this tool. They're talking about being able to install about 84% of security hotfixes in a hurry outside your scheduled reboots then rebooting on your regular maintenance schedule.
Re:Needed that bad? by RiotingPacifist · 2008-04-24 05:08 · Score: 1

Ok, the real story is that he's doing it because it's cool. And that's the way Linux works. Sure, load-balanced servers are nice, but getting rid of all expected downtime means that you could theoretically run an up-to-date system with 99.9999% uptime. I was wondering when this would happen; I think I predicted it a while ago (damn non-subscriber limit on comment history). I was expecting a different approach, but it was just a matter of time until somebody implemented this.
Hopefully this will make it into server distros (a lot of small companies/groups can't afford load-balanced servers) and eventually even desktops (not much use other than boasting, but never having to reboot is better than always having to reboot as you do for Windows updates).

Also, if I understand this right, it means you can also install patches to your kernel more easily, no need to reboot to add reiser4 support. All somebody needs to do is work out some complex dependency resolution tool, and even binary distros can offer much more kernel variation. Obviously it won't hit the mainstream distros straight away, but I could see something like apt-kernel letting general distros get lighter/more secure/more features, as the user requires.

I mean everybody knows your uptime is directly proportional to your penis length, meaning all those extra nines are really going to help!

--
IranAir Flight 655 never forget!
Re:Needed that bad? by mr_mischief · 2008-04-24 05:10 · Score: 3, Interesting

Yes, but you end up taking one machine out at a time, giving it time to simmer down, and then bringing it back up. If you have 100 boxes and it takes 30 minutes to simmer down to idle, you have 30 * 100, or 3,000 minutes to do the upgrades. On average, your boxes go unpatched for 1500 minutes.

So you have this security hotfix you really want to apply, but it's going to take 25 hours on average to fix a box and 50 hours to fix them all.
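For anyone who wants to check the arithmetic, here is a small sketch of the one-box-at-a-time model. It assumes boxes are handled strictly in sequence and each takes the same time to go idle and be patched; the function name is made up for illustration. Under that model the exact mean comes out slightly above the round "half of the total" figure used above, because even the first box waits one full interval.

```python
def rolling_upgrade_minutes(num_boxes, minutes_per_box):
    """Return (total minutes, mean minutes until a box is patched)."""
    total = num_boxes * minutes_per_box
    # Box i (counting from 1) is back in service after i * minutes_per_box.
    finish_times = [i * minutes_per_box for i in range(1, num_boxes + 1)]
    mean_wait = sum(finish_times) / num_boxes
    return total, mean_wait


total, mean_wait = rolling_upgrade_minutes(100, 30)
print(f"total: {total / 60} hours, mean per box: {mean_wait / 60} hours")
```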

You could, with ksplice and a good concurrent control system, make your average time to fix 5 minutes in over 80% of kernel upgrade scenarios rated "Critical". Your boxes could still be rebooted on a regular basis later.

Which do you prefer?
Re:Needed that bad? by Kookus · 2008-04-24 05:31 · Score: 5, Insightful

Production systems are not for testing purposes. You want to test rebooting? Do it on a test box.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 05:33 · Score: 0

so... What is more important, verifying you haven't broken startup configs, or reaching 911 when you dial?

Clearly you haven't ever dealt with the carrier world. Let me tell you... you are lucky :P
Re:Needed that bad? by mopower70 · 2008-04-24 05:34 · Score: 2, Informative

A configuration change that renders a start up script ineffective is a sign that your sysadmin hasn't been doing his job, and that you have no concept of change control in your environment.

We have systems that run for years at a time because we have change management tools that guarantee that those systems are in the exact state of configuration they should be in, and these tools run every night. If you're running around making undocumented configuration changes that even have a ghost of a chance of affecting server operation, anyone that gave you root access needs to have their fingers shortened.
Re:Needed that bad? by Kymermosst · 2008-04-24 05:39 · Score: 4, Insightful

How do you know a configuration change hasn't rendered one of your startup scripts ineffective?

Isn't that what QA systems and effective approaches to change management are supposed to handle?

If I am planning a change, I should discover problems with the startup scripts in QA, not in production, especially if a production reboot is not required to implement the change.

--
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
Re:Needed that bad? by quanticle · 2008-04-24 05:40 · Score: 1

How do you know that your test boxes are configured precisely identically to the production boxes?

--
We all know what to do, but we don't know how to get re-elected once we have done it
Re:Needed that bad? by Anonymous Coward · 2008-04-24 05:41 · Score: 0

giving it time to simmer down

Simmah down now!

Actually, the technical term, or jargon if you prefer, is quiesce.
Re:Needed that bad? by Paul Carver · 2008-04-24 05:44 · Score: 1

That depends. What if out of 100 boxes there's one of them that is just a tiny bit flaky and although it was running fine the "live" upgrade tickles it in just the wrong way and it drops all its current connections on the floor.

What if out of 100 there are a couple that don't go smoothly?

With the method we use the box gets tested after the upgrade but before being returned to service. With your method it generates a post-outage review and root cause analysis.
Re:Needed that bad? by adrianbaugh · 2008-04-24 05:53 · Score: 4, Insightful

How do you know that your test boxes are configured precisely identically to the production boxes?

dd your production box's system filesystems to another hard drive, put in an identically specced machine, boot that?

--
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
Re:Needed that bad? by mOdQuArK! · 2008-04-24 06:04 · Score: 4, Informative

There's a difference between what YOU as an end user consider to be an open connection, and what the telecom equipment consider as a connection.

For all you know, your apparent always-on connection was actually a virtual connection being frequently switched & reswitched over many different real physical connections. That would be a fairly standard architecture for having a network infrastructure which can have components being worked on while data is still flowing through the network.

When the telecom provider is "waiting for active connections to go away" on a particular device, it only means that all of the virtual connections that are momentarily being switched through that device have been successfully switched to another device. It doesn't mean that any of those virtual connections have to be terminated.
Re:Needed that bad? by mr_mischief · 2008-04-24 06:05 · Score: 1

If there's something so subtly wrong with your system that it fails from this, what's to say it won't test fine after rebooting and then fail when placed under load?
Re:Needed that bad? by SanityInAnarchy · 2008-04-24 06:07 · Score: 1

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There's no dns tricks or tcp/ip you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is by necessity a single chokepoint.

To summarize: Who load-balances the load-balancers? Linux makes a pretty decent load-balancer, but this would make it that much better.

--
Don't thank God, thank a doctor!
Re:Needed that bad? by Splab · 2008-04-24 06:16 · Score: 1

Taking one out of rotation might (along with what others have noted) cause the load on other servers to become just high enough that you risk maximizing the load should a sudden rush occur. Or even better, some part of the system doesn't respect the DNS load balancer(or whatever you have) and tries to keep on connecting to the box currently down.

You also got situations where a client will only talk to the machine it's registered with (even in a load balanced situation) - so taking the server out means calls going for that client can't be routed until the client re-registers with some other part of the cluster. Not having to reboot is a nice feature.
Re:Needed that bad? by mr_mischief · 2008-04-24 06:19 · Score: 1

Yeah, the jargon really is necessary to getting the point across. Everyone knows "quiesce", but nobody understands "quieten" or "simmer down". WTG!
Re:Needed that bad? by profplump · 2008-04-24 06:22 · Score: 1

If you don't verify your startup configs, you still can't dial 911 next time the system reboots. And if that reboot occurs unexpectedly 23 months later you'll probably have forgotten whatever change broke the configs, so 911 will be offline much longer.

You can't honestly expect to provide 100% uptime with only one system -- it's simply not possible. Even if you never made mistakes and all your hardware worked perfectly for its entire expected life, you'd still have to replace it from time to time. If you haven't already configured the system with redundant capabilities you're gonna have a hard time installing the replacement and switching over without disrupting service.

And if you do have redundant systems you can simply mark one offline, wait for currently processing jobs (calls, whatever) to end, and then reboot it. Other than the potential for a failure of the second device while the first is down, there's really no risk (and if you're worried about that you can simply add a third system). Clearly you haven't ever dealt with actual high-availability systems...if you're representative of the carrier world we're all screwed.

And that's not to mention the fact that dialtone service, including 911, does go out from time to time. In my LATA there have been two separate multi-hour, multi-city outages in the past 18 months.
Re:Needed that bad? by necrogram · 2008-04-24 06:40 · Score: 1

Long uptimes are a bad thing! How do you know a configuration change hasn't rendered one of your startup scripts ineffective? If you have to reboot for some unexpected reason, you could be stuck debugging unrelated problems at very inopportune moments.

You need to schedule regular reboots so that you can test that your servers can start up fine at a moment's notice. Long uptimes are a sign a sysadmin hasn't been doing his job.

You're right. While you're on the phone with hazmat explaining that you have an issue with green goo, how about i test the reboots of my PBX before you give your address?

yeah, I run mission critical systems. yes, i have proper redundancy and resiliency systems. Think I'm going to disrupt operations to test my reboots? Hell no. When it comes to public safety, 5 nines is the *only* option.
Re:Needed that bad? by rossz · 2008-04-24 07:00 · Score: 1

Long uptimes are a bad thing!

How right you are. How many systems remain unpatched because some geek doesn't want to lose his humongous uptime penis?

--
-- Will program for bandwidth
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:02 · Score: 0

I've got to agree. Long uptimes are a sign that a sysadmin knew what he was doing when he set the machine up, and has kept it working properly.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:03 · Score: 1, Funny

no need to reboot to add reiser4 support. All somebody needs to do is work out some complex dependency resolution tool

I thought Hans Reiser had already developed a highly effective method for resolving his dependents.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:03 · Score: 0

Production systems are not for testing purposes.

You want to test rebooting? Do it on a test box.

You forgot to add "In a perfect world run by robots."

Simple example: A sys admin may forget, ignore, or simply not know to update a configuration file after an app update in production. The problem isn't going to pop up until the server reboots.

(If you are a sys admin and take offense, replace "sys admin" with "developer" or "intern")
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:06 · Score: 0

You must be a Windows administrator.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:35 · Score: 2, Interesting

Like that. Kookus' point applies to the Telco environment quite well. The grandparent's post asks:

>How do you know a configuration change hasn't rendered one of your startup scripts ineffective?

And the answer is: Because every configuration change that happens in the production environment happened many times in the lab on an identical machine, and if there was any danger of this occurring then an image of the production machine would be made "hot" during the maintenance window, applied to the lab system, and the config change applied there, and the system thoroughly tested, including reboots.

For those of us with a dozen boot options on our grub screen, the danger of a "change that makes the system reboot into an odd state" may seem very real, but in production Telco environments it's been factored out of the equation.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 07:39 · Score: 0

Yes, but you end up taking one machine out at a time, giving it time to simmer down, and then bringing it back up. If you have 100 boxes and it takes 30 minutes to simmer down to idle, you have 30 * 100, or 3,000 minutes to do the upgrades. On average, your boxes go unpatched for 1500 minutes.

I would think if you had that many machines, you could afford to take more than one down at a time.
Re:Needed that bad? by 9InchRails · 2008-04-24 07:54 · Score: 1

I wouldn't have agreed to this without your reasoning, but you've won me over.
Re:Needed that bad? by smallfries · 2008-04-24 08:02 · Score: 2, Informative

True, but I've been standing in switch rooms watching operators manually kill those circuits because they wanted to reboot a box. 5x 9s doesn't mean perfect service, and if anyone complained about it they were told that a ms interruption once every few months was in their SLA. By the time they reconnected they went through another box, so how were they to know it was any longer than that?

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:Needed that bad? by mr_mischief · 2008-04-24 08:04 · Score: 1

Probably, but how many? 10%? Your average upgrade time per box is 2.5 hours and the whole process takes 5 hours that way (assuming the same time to get the boxes idle and upgraded). It's an improvement, but it's still not as good as running the patch against all of them at once without rebooting.

Sure, there are drawbacks either way, but being able to patch a running kernel is an option that may make sense. If the kernels so patched can be kept up a few weeks, then you have plenty of time to schedule other maintenance in the meantime.

Let me be clear that I'm not advocating this for every patch. Most machines don't have anything exploitable facing the world directly anyway thanks to firewalls, intrusion detection systems, closed ports on the systems themselves, good routing policies, etc. If you have machines that are being used as firewalls, routers, switches, or other infrastructure, though, you might be less comfortable with a long turnaround time for a critical security patch than with patching the kernel while it's running.
Re:Needed that bad? by rawler · 2008-04-24 08:49 · Score: 1

I'm a bit interested in this. Have a more thorough explanation on how, or a link describing such setup?

Is it possible/viable with "curious" software where the engineer thought with his ass and typed with his foot? (A serious question, I happen to admin a few such systems)
Re:Needed that bad? by Anonymous Coward · 2008-04-24 08:56 · Score: 1, Insightful

If you're running around making undocumented configuration changes that even have a ghost of a chance of affecting server operation, anyone that gave you root access needs to have their fingers shortened.

Mistakes happen. Your attitude seems to be "well don't make mistakes then", while mine is "verify that you haven't made any from time to time when you have time to fix things if you discover that you have". Your system falls down as soon as you realise that yes, you are capable of making mistakes.
Re:Needed that bad? by afabbro · 2008-04-24 09:10 · Score: 1

How do you know a configuration change hasn't rendered one of your startup scripts ineffective?

Tripwire.

--
Advice: on VPS providers
Re:Needed that bad? by afabbro · 2008-04-24 09:12 · Score: 1

You can't honestly expect to provide 100% uptime with only one system

With one typical Unix system. Tandem NonStop? Sure. Mainframe? Done all the time.

--
Advice: on VPS providers
Re:Needed that bad? by Anonymous Coward · 2008-04-24 09:32 · Score: 0

It's clear you still manage "a server or two".

Things are a little different where the big boys play.

You can't just reboot to "test whether the system boots".

QA much?
Re:Needed that bad? by insane_machine · 2008-04-24 09:34 · Score: 1

dd and identical hardware
Re:Needed that bad? by doonboggle · 2008-04-24 11:11 · Score: 1

OK, but if there's any chance that the hot-patch won't go smoothly, then you're probably going to want to do the update only against 10% of the machines at once (so that you can handle it if they all barf). Which would put you back to square 1...
Re:Needed that bad? by Anonymous Coward · 2008-04-24 11:29 · Score: 0

i have proper redundancy and resiliency systems. Think I'm going to disrupt operations to test my reboots? Hell no.

If a single machine rebooting disrupts operations, then no, you don't have proper redundancy or resiliency systems.
Re:Needed that bad? by Anonymous Coward · 2008-04-24 15:21 · Score: 0

Do you have any idea how to achieve five nines reliability? If you have a single piece that can not be dropped then you can not possibly achieve five nines. The only way to have five nines is to have enough redundancy and flexibility to drop any given component at any time.
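To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a 365-day year and, for the redundancy part, that components fail independently and the service stays up as long as any one copy is up. Both are simplifying assumptions, and the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability):
    """Downtime budget per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(single, copies):
    """Availability of `copies` independent redundant components,
    where the service is up if at least one of them is up."""
    return 1 - (1 - single) ** copies


print(allowed_downtime_minutes(0.99999))  # five nines: a little over 5 minutes a year
print(combined_availability(0.99, 2))     # two "two nines" boxes together
```

With a budget that small, a single component that has to be rebooted and cannot be bypassed can use up the whole year's allowance in one maintenance window.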
Re:Needed that bad? by WNight · 2008-04-24 15:23 · Score: 1

Really? If you run any sort of useful facility you should be able to unplug *any* one thing and not disrupt operations. Ideally up to and including an entire facility.
Re:Needed that bad? by scott_karana · 2008-04-24 18:04 · Score: 2, Insightful

You could also virtualize over a network file system. Removes the need for 1:1 identical machines. :)
Re:Needed that bad? by nettdata · 2008-04-24 19:23 · Score: 1

Because you have an automated installation/deployment tool that does a bare-metal install of everything on a box.

(something like Sun's SPS http://www.sun.com/software/products/sunmanagementcenter/index.xml)

At least, if you're a pro you do.

We support 9 different large-scale, global environments for various development and testing activities (never mind production), and they all run the same installation/update process.

Every night, the ENTIRE system is built, tested, deployed, stress/load/performance tested, etc... all automagically.

If you don't do that, then yes, you are very likely to be prone to such errors.

--

$0.02 (CDN)
Re:Needed that bad? by tepples · 2008-04-25 00:34 · Score: 1

right the day before you want to 'take it out of rotation', a circuit is opened through it that requires five nines (so you can't drop it), and it will remain open for months

Your post expresses the same concern as this post. The solution is to switch the connection to another box while it is still open, as described in the reply to that post.
Re:Needed that bad? by WNight · 2008-04-25 02:31 · Score: 1

The audit trail was messy.
Re:Needed that bad? by Anonymous Coward · 2008-04-25 02:48 · Score: 0

Sounds like a Microsoft guy. Remember the bug in Windows 98 that would crash the system after 30 days of uptime but it was not discovered until 2 or 3 years after the release because Win98 could not stay running for 30 days anyway? That was a good time.

To be fair, Win98 was not designed for long uptime. Oops, I think that is actually another dig. :/

Seriously, test reboot a production system? I think others here have said it all.
Re:Needed that bad? by mr_mischief · 2008-04-25 04:52 · Score: 1

Well, if the hot patch went smoothly on all of the first 10%, then you could probably roll it out to the other 90% in a second step.

One could also have a four to ten node testing cluster that gets regularly load tested and intrusion tested which receives any updates before the big production cluster. That way, you test it on the testing cluster first, and can roll out to as many machines as you can afford to have down as the first part of the production update. Then, the rest of the systems could get the update.

You're still looking at a much faster turnaround time doing three steps with no lag as servers go idle than ten steps with that delay. I wouldn't suggest making a habit of hot-patching a running kernel just because it's faster, though. Weigh the benefits and risks.
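
A minimal sketch of the two-step rollout described above, in Python. Everything here is hypothetical: `staged_rollout`, `apply_patch` and the 10% default are illustrative names and values, not part of any real hot-patching tool, and `apply_patch` stands in for whatever actually patches and health-checks a host.

```python
import math


def staged_rollout(hosts, apply_patch, canary_fraction=0.10):
    """Patch a small first slice; touch the rest only if that slice is clean.

    apply_patch(host) is a caller-supplied (hypothetical) function that
    returns True when the host was patched and still passes its checks.
    """
    hosts = list(hosts)
    if not hosts:
        return {"patched": [], "failed": [], "untouched": []}

    # Step 1: the first ~10% of machines (at least one).
    canary_count = max(1, math.ceil(len(hosts) * canary_fraction))
    canary, rest = hosts[:canary_count], hosts[canary_count:]

    patched, failed = [], []
    for host in canary:
        (patched if apply_patch(host) else failed).append(host)

    # Any failure in the first slice stops the rollout before step 2.
    if failed:
        return {"patched": patched, "failed": failed, "untouched": rest}

    # Step 2: the remaining ~90%.
    for host in rest:
        (patched if apply_patch(host) else failed).append(host)
    return {"patched": patched, "failed": failed, "untouched": []}
```

A separate testing cluster, as suggested above, would simply be one more call to the same function before the production one.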
Re:Needed that bad? by jelle · 2008-04-26 12:02 · Score: 1

Perhaps you should start a telecom carrier company. Apparently you know easier ways to reach high reliability than the people in the industry today.

--
--- Hindsight is 20/20, but walking backwards is not the answer.
Re:Needed that bad? by ultranova · 2008-04-27 05:57 · Score: 1

Every night, the ENTIRE system is built, tested, deployed, stress/load/performance tested, etc... all automagically.

How do you know this automagic system doesn't have a bug in it which makes it skip the tests and mark them as succeeded ?-)

Of course, you could make another system to watch the first one, but then that system could have a bug in it...

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Needed that bad? by DarkOx · 2008-04-27 08:16 · Score: 1

Exactly. Usually the best thing you can do without spending an unjustifiable amount of money is an active/passive system. If you are super worried about it, active/passive/passive. When you do maintenance, you apply your changes, after testing them in your lab, to the passive node in production. Next you do what you can to test that the node is functioning. At some planned time, with the proper parties aware, like 2am Sunday morning, you fail over to the passive node. Then you update the previously active node.

It's a good idea to reboot the passive node from time to time at off-peak (lower-risk) times, such as the middle of the night for a system primarily used during the day. Even in the case of 911 there are probably many fewer calls at 2am Sunday morning than at other times because most people will be in bed. This way you know your passive node is working OK, so that in a failover it will be ready.

Yes, there is a risk the primary will fail while you are testing the passive. If that happens you are right there to take care of it. It's bad, but not as bad as a failover happening some other time, when you have to be paged and then find out the passive node is not in the ready-to-go state you thought it was; especially because Murphy's law mandates such an event will be in the middle of peak use.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:Needed that bad? by SlashDev · 2008-04-28 05:45 · Score: 1

Am I mistaken by saying that you come from a Windows OS world? Uptime (which actually means the system is serving its purpose at 100%) is a sign that the system is: - Stable - Secure - Well administered.

--

TOP DSLR Cameras Reviews of the top DSLRs
Re:Needed that bad? by Anonymous Coward · 2008-04-29 10:42 · Score: 0

Am I mistaken by saying that you come from a Windows OS world?

Yes, you are mistaken. I haven't worked with Windows in a long time.

Uptime (which actually means the system is serving its purpose at 100%) is a sign that the system is: - Stable - Secure - Well administered.

Why did you feel the need to explain what uptime was to me? It's readily apparent from my comment that I know perfectly well what it is. My point is that a very long uptime means that a critical code path hasn't been exercised in a very long time.

Unless it fails. by Joe+Snipe · 2008-04-24 03:05 · Score: 2, Insightful

honestly how much downtime are we talking here? 30 seconds?

--
Sometimes, life itself is sarcasm...

Re:Unless it fails. by Anonymous Coward · 2008-04-24 03:09 · Score: 4, Funny

honestly how much downtime are we talking here? 30 seconds?

well, think about the fsck that happens after 180 days or 30+ mounts ?
Re:Unless it fails. by geekoid · 2008-04-24 03:31 · Score: 1

It's more than the time. Management and interruption of even a second of downtime can be costly in a large organization.
All work comes to a halt, all connections need to be reestablished, work momentum is lost, etc.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Unless it fails. by UnknowingFool · 2008-04-24 03:33 · Score: 2, Informative

For your average computer and generic Linux servers, the downtime is small. But companies often have applications that they need to restart. That is the difference. Also, Linux is used on equipment other than generic servers: embedded systems, etc., where loading isn't optimized because the equipment should never go down.

--
Well, there's spam egg sausage and spam, that's not got much spam in it.
Re:Unless it fails. by shamer · 2008-04-24 03:36 · Score: 1

lol it takes more than 30 seconds just to scan for SCSI devices, on my server anyway.

Total boot time is in the 3 minute range, most of that is server scanning for devices / POST'ing.

of course I'm in no need of true 24/7/365 uptime, but as stated above "Oh Cool!"
Re:Unless it fails. by Tychon · 2008-04-24 03:36 · Score: 4, Informative

A company that I once had dealings with was quite proud of their five nines. The motivation? It cost them $18,000 per second they were down. 30 seconds isn't just 30 seconds sometimes.
Re:Unless it fails. by m50d · 2008-04-24 04:01 · Score: 2, Insightful

Uh, if you actually need that, then you needed it anyway. And if you don't need it but don't know how to disable it, you shouldn't be running a system.

--
I am trolling
Re:Unless it fails. by ACMENEWSLLC · 2008-04-24 04:21 · Score: 1

Damn, that's the reason I get killed off in Eve. Right as I attack, my connection drops. 2AM, 6 months of work, and 30 seconds of downtime to ruin it all.
Re:Unless it fails. by Random+Destruction · 2008-04-24 04:40 · Score: 2, Interesting

Well, care of google:
100 - (((30 seconds) / (1 year)) * 100) = 99.9999049

So if you're trying to keep up 6 9s for some super critical system, you've just used a years worth of downtime.

Even for lower numbers of nines, you still don't get many minutes per year for patching, assuming no hardware failures ever.
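
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes a plain 365-day year (Google's calculator uses a slightly longer one, which is why the trailing digits differ a little), and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap days


def availability_percent(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Availability over the period, as a percentage."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)


def downtime_budget_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Allowed downtime for a given number of nines (5 means 99.999%)."""
    return period_seconds * 10.0 ** (-nines)


if __name__ == "__main__":
    print("30 s of downtime per year: %.7f%% available" % availability_percent(30))
    for nines in range(3, 7):
        print("%d nines: %.1f s of downtime per year"
              % (nines, downtime_budget_seconds(nines)))
```

Under these assumptions six nines allows roughly 31.5 seconds of downtime per year and five nines roughly 5.3 minutes, so a single 30-second reboot eats nearly the whole six-nines budget.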

--
:x
Re:Unless it fails. by hacker · 2008-04-24 05:27 · Score: 1

Conversely, you should be doing the fsck at upgrade time anyway, while the box is already down.
tune2fs -C400 /dev/sdXX
Re:Unless it fails. by smallfries · 2008-04-24 08:07 · Score: 1

I know of a certain transatlantic link that would fail once a day (turned out to be a missing free that caused the heap to become exhausted). The customer screamed that every 30s reboot cost them $50,000. The bug went unfixed for nine months because it couldn't be replicated in a test environment, only on their live link and for some reason they wouldn't let us debug it there.

Once a day their CEO called ours and shouted for five minutes about the 50 grand that they'd just lost.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:Unless it fails. by Lennie · 2008-04-25 02:08 · Score: 1

Using kexec (no BIOS involved: just load a new kernel into memory and boot straight into it) and passing the proper options to tune2fs (-C and -c) on all (even active) filesystems (so no filesystem check is triggered) does help a lot already.

--
New things are always on the horizon

Amazing by cromar · 2008-04-24 03:05 · Score: 4, Interesting

That is truly amazing tech, right there. It would be interesting to know the security implications of being able to hot-patch the kernel, however.

Re:Amazing by katz · 2008-04-24 03:35 · Score: 5, Funny

Considering that you don't need to prepare the kernel in any way--just execute the program and bang, it's patched--means that someone with root access could slip a rootkit right under your nose (i.e., without the system administrator being aware of this).

- Roey
Re:Amazing by KeithJM · 2008-04-24 03:40 · Score: 5, Insightful

someone with root access could slip a rootkit right under your nose

Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.
Re:Amazing by swillden · 2008-04-24 03:44 · Score: 1

someone with root access could slip a rootkit right under your nose Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.
Barring a carefully-implemented Mandatory Access Control system, anyway.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:Amazing by FooAtWFU · 2008-04-24 03:47 · Score: 1

A small (but nonzero) implication upgrade over them "only" having root on the server. (think "new spot to deploy a rootkit"). But at that point, you're already in deep trouble, so better avoid getting to that point to begin with.

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:Amazing by Abcd1234 · 2008-04-24 03:51 · Score: 1

As opposed to slipping a rootkit into the kernel image on-disk, and then waiting for/forcing a reboot?
Re:Amazing by katz · 2008-04-24 04:03 · Score: 3, Insightful

My bad, I meant to say,

"A remote attacker who successfully executes a privilege escalation exploit and gains root access will have an easier time taking control of your server and hiding their tracks".

Thanks for pointing that out

- Roey
Re:Amazing by CompMD · 2008-04-24 04:31 · Score: 1

The SuckIt Rootkit can already patch the kernel on the fly without root access I believe.
Re:Amazing by Anonymous Coward · 2008-04-24 06:46 · Score: 0

Considering that you don't need to prepare the kernel in any way--just execute the program and bang, it's patched--means that someone with root access could slip a rootkit right under your nose (i.e., without the system administrator being aware of this).

- Roey

I'll take extra care to not give the root user enough privileges to do that ;)
Re:Amazing by Anonymous Coward · 2008-04-24 08:09 · Score: 0

Netware has been doing it since 3.1x without any troubles..
Re:Amazing by thetartanavenger · 2008-04-24 08:50 · Score: 1

SOMEONE PLEASE MOD PARENT UP!! (and mod great-grandparent interesting/insightful instead of funny)

I read what he first said and noticed it was modded funny. It isn't funny, it's true!! Just because you've interpreted it as you've given someone root access instead of someone has gained root access does not make what he said any less relevant. It was exactly what I was thinking the moment I read the title of this article...

--
Who need's speling and grammar?
Re:Amazing by xZgf6xHx2uhoAj9D · 2008-04-24 09:22 · Score: 1

If you want to slip an exploit into the kernel, couldn't you accomplish it through a kernel module? I don't think patching the kernel buys you too much.
Re:Amazing by DAharon · 2008-04-24 10:17 · Score: 1

While the parent was modded Funny, consider the implications if this was implemented on a desktop linux distro.
Installing Adobe AIR? They are going to have you gksudo or something for the install.
What about MS Office 2015 Suse Edition? They are going to have you gksudo for the install.
Re:Amazing by holyspidoo · 2008-04-25 01:41 · Score: 0

just execute the program and bang, it's patched--means that someone with root access could slip a rootkit right under your nose (i.e., without the system administrator being aware of this)
Pfff, I've been able to do that on windows for years! :)
Re:Amazing by Anonymous Coward · 2008-04-25 03:05 · Score: 0

Remote syslogs are essential to the security of any system. They don't stop the attack, but you know what, when, where from, and how. As long as your syslog server does not have any remote access other than incoming syslog, you should be good. Good idea to turn on logging of shell commands as well.

Maybe... by Anonymous Coward · 2008-04-24 03:06 · Score: 0

...this will spur Microsoft to at least implement updates that don't require reboots. Hmmm, I think I may have stumbled on MS WIN 7's marketing slogan...

Re:Maybe... by CogDissident · 2008-04-24 03:11 · Score: 5, Funny

I thought their working slogan was:

Windows 7, it's not awful like Vista!
Re:Maybe... by Alsee · 2008-04-24 08:26 · Score: 1

It's awful in different and previously unimagined ways!

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

No more reboots - FTW! by Anonymous Coward · 2008-04-24 03:06 · Score: 0

NT

Wrong way to solve the uptime problem by Anon+E.+Muss · 2008-04-24 03:07 · Score: 4, Insightful

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.

--
The key sequence to access my Slashdot bookmark in Firefox is Alt-B-S. I don't believe this is a coincidence.

Re:Wrong way to solve the uptime problem by Qzukk · 2008-04-24 03:10 · Score: 5, Funny

Trust me, that was the first thing they thought of, then the CEO came in and said "Why are you ordering more equipment when we have half of our machines sitting there and doing nothing? We could be doing twice the work/traffic/whatever without paying more money!"

--
If I have been able to see further than others, it is because I bought a pair of binoculars.
Re:Wrong way to solve the uptime problem by N1ck0 · 2008-04-24 03:30 · Score: 3, Informative

Mainly why people in the telecom industry have been clamoring for it. It's very difficult to take over the termination of a circuit-switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching.

Of course, much of the reason is that a lot of commercial telecom apps are badly implemented and need better management controls.
Re:Wrong way to solve the uptime problem by trybywrench · 2008-04-24 03:31 · Score: 4, Insightful

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle. From day 1 the server needs to be making money and never ever stop. For smaller general purpose servers like you can buy at Dell.com then yeah having a fail-over makes sense.

--
I came to the datacenter drunk with a fake ID, don't you want to be just like me?
Re:Wrong way to solve the uptime problem by Anonymous Coward · 2008-04-24 03:31 · Score: 1, Insightful

yes, but if the CEO knew anything, he'd know that clustered computing is part of the job (or not) and he (maybe she?) wouldn't ask stupid questions.
Re:Wrong way to solve the uptime problem by geekoid · 2008-04-24 03:32 · Score: 1

or get a mainframe.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Wrong way to solve the uptime problem by cellmaker · 2008-04-24 03:34 · Score: 1

You're thinking about the wrong type of equipment. Don't think about typical data room servers, think VERY specialized telephony equipment. Something you don't have redundant racks for. Instead, it is usual that the rack has redundant cards for the specialized functions.

'course, in this case, you would think that we can swap to a redundant card and reload the now inactive one with a pre-patched image. But in reality, this depends on the software management on the box. Some will not allow card-by-card updates and force the entire box to reboot if the software is updated. Those boxes that require a system boot to update the software could benefit from this. But then, there can be company policies about applying "patches". My company got bitten a few too many times by patching live equipment, so patches were suspended unless you got signoff by a number of managers for extraordinary cases.

I remember one time many moons ago, I needed to patch some object on disk & restart a board. I had honed my procedure in a lab all day long. The night of the patch, I had managers & project managers watching over my shoulder and the customer on the speaker phone. So I cranked up the disk editor and went to work. CLICK CLICK CLIKETY CLICK.... You know how key strokes sound over a speaker phone, right? CLICK CLICK ... "Oh!".. Tap Tap Tap... CLICK CLICK CLICK. I've always wondered if the people on the other end of the phone took a moment to look at each other about then. :)
Re:Wrong way to solve the uptime problem by Anonymous Coward · 2008-04-24 03:35 · Score: 4, Funny

If he knew anything, he wouldn't be the CEO.
Re:Wrong way to solve the uptime problem by bjourne · 2008-04-24 03:38 · Score: 1

Correct, but then you should never fix crasher bugs either. Because it is a mistake and you will never achieve 100% uptime. Use distributed computing instead.... Your argument is flawed. What happens if you have a dual node system and one node suffers a critical software failure while the other is rebooting due to a patched kernel? Your system suffers downtime that it otherwise wouldn't have if it had hot patching.

--
Football Odds
Re:Wrong way to solve the uptime problem by MrMunkey · 2008-04-24 03:44 · Score: 1

I'd mod you up if I had points. It's hard to have fail-over systems when a cable has to be plugged in somewhere, and on top of that the channels have to be synced with the end user.
Re:Wrong way to solve the uptime problem by diamondsw · 2008-04-24 03:45 · Score: 1

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle.

Active-Active clustering or load balancing. Sure, it can be a bitch to get working with all of the data synchronization required (especially for things like databases, which are traditionally active-passive), but if you want real reliability and the efficiency of using both boxes, it's what you do.

Anything less is asking for trouble.

--
I don't know what kind of crack I was on, but I suspect it was decaf.
Re:Wrong way to solve the uptime problem by Explodicle · 2008-04-24 03:47 · Score: 1

Just how frequent are your critical software failures, and how long does it take you to patch a kernel? I agree that in theory this could happen, but the probability seems extremely low.
Re:Wrong way to solve the uptime problem by Abcd1234 · 2008-04-24 03:53 · Score: 1

And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?
Re:Wrong way to solve the uptime problem by Rich0 · 2008-04-24 04:06 · Score: 1

Why - that's no excuse for not clustering!

Just tell each phone customer to have two sets of phones at home, so that when one line is down they can just use the other. Be sure to charge them for both.

Hmm - that actually is starting to sound like the sort of business model the wired phone company around my area might actually propose...
Re:Wrong way to solve the uptime problem by poot_rootbeer · 2008-04-24 04:09 · Score: 1

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

If you own a piece of Big Iron and run Linux on it, it's going to be virtualized. Hundreds of virtual Linux boxes that can be arbitrarily failed over, patched, and rebooted, the physical hardware carrying on uninterrupted all the while.
Re:Wrong way to solve the uptime problem by Anon+E.+Muss · 2008-04-24 04:14 · Score: 2, Insightful

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

I doubt there are many people running Linux on true Big Iron. I'm not saying it doesn't happen, I'm saying that most Big Iron runs something else. I know many financial institutions and telecom operators use HP NonStop systems. These can stay up 24/7/365/25years, and you pay millions of dollars for that. They have full redundant hardware inside the box, run a proprietary OS, and proprietary applications.

--
The key sequence to access my Slashdot bookmark in Firefox is Alt-B-S. I don't believe this is a coincidence.
Re:Wrong way to solve the uptime problem by Anonymous Coward · 2008-04-24 04:21 · Score: 2, Insightful

Now is not the time to claim banks know what they are doing.
Re:Wrong way to solve the uptime problem by Anonymous Coward · 2008-04-24 04:22 · Score: 1, Informative

Big Banks (tm) - like the one I currently work in - can afford to and do have even the largest systems installed in fully redundant configurations. It's part of standard BCM (business continuity management) practice - we need to, and can, survive an entire datacenter dropping off the network, for whatever reason up to and including getting bombed off the face of the earth. In normal day to day practice these machines can be and are used for load-balancing, to allow primary boxes to get taken down for maintenance.

And as a sysadmin in a bank, the solution described in the story isn't that appealing. It strikes me as something inherently less reliable than doing a cold boot with a new kernel. Scheduled downtime is OK, unscheduled problems because someone wanted to do an upgrade on the fly are *bad*.
Re:Wrong way to solve the uptime problem by Ed+Avis · 2008-04-24 04:22 · Score: 2, Insightful

Who cares about servers? I want my Linux desktop to stay up-to-date with security fixes without having to reboot it every few days.

--
-- Ed Avis ed@membled.com
Re:Wrong way to solve the uptime problem by guruevi · 2008-04-24 04:25 · Score: 1

And how do bigiron servers do it? Trust me, I've worked with bigiron and there are several solutions:

Some type of virtualization, partitioning or jails, and you can emulate a cluster of machines with minimal performance impact. The 'host' doesn't necessarily need to be upgraded frequently since it's very minimal in function (load a kernel into a processor).

You have your monthly/yearly maintenance that takes everything offline at 3 am and upgrades it if necessary. It's not unusual to see those things 3-5 major versions behind though depending on the work. Just like in Linux, a lot of it is modularized so much, that you don't have to take the whole thing offline to upgrade parts of it. If you have a somewhat decent vendor, they'll backport recent patches to kernel modules to your version and you can update.

100% is not possible with a single machine, even if you want it, there is no way that you will foresee any updates, patches or just plain people doing something stupid or stuff breaking. Any modern single server (mainframe) costing more than a mere $100,000 is most likely a machine consisting out of several machines already.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:Wrong way to solve the uptime problem by diamondsw · 2008-04-24 04:34 · Score: 1

And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?

Having seen bank systems (and credit card companies, and pharma, etc), yes, I can damn well confidently say I do. They handle money for a living. I design networks and datacenters for a living.

You do NOT want to know the things I've seen - you'll never use a credit card or fill a prescription again. I could write TheDailyWTF for a month based on one specific credit card provider alone.

--
I don't know what kind of crack I was on, but I suspect it was decaf.
Re:Wrong way to solve the uptime problem by tzanger · 2008-04-24 04:37 · Score: 1

Every few days? Which distro are you running that a) has security fixes every few days, and b) requires you to reboot after them?
Re:Wrong way to solve the uptime problem by ToasterMonkey · 2008-04-24 04:38 · Score: 1

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.
If you own a piece of Big Iron and run Linux on it, it's going to be virtualized.
And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?

First, how are you trying to say big banks have been running for 30, or 40 years? The last two posters were specifically talking about Linux, which obviously hasn't been used in big banks or anywhere for 30 years.
Second, IF a big bank is running Linux on this "BigIron", you can almost bet your ass it's an IBM mainframe we're talking about. That being the case, it would be running 'virtualized', PERIOD. They most likely even have multiple physical mainframes to fail over to, regardless of what the first poster thinks about having expensive idle hardware sitting around. Sure they're expensive, this is why - big financial institutions & the govt have TONS of money.

I'm sorry, all this talk of a "big bank" running Linux of all things all by itself 24/7/365 on some big piece of hardware that's soooo expensive you couldn't buy a second one is just too hysterical.
Re:Wrong way to solve the uptime problem by Abcd1234 · 2008-04-24 04:49 · Score: 1

So, what you're saying is kernel hot patching would be exactly the kind of thing they'd be interested in (well, assuming banks ran Linux)? :)
Re:Wrong way to solve the uptime problem by mr_mischief · 2008-04-24 05:30 · Score: 3, Informative

Can we please kill the 24/7/365 phrasing? Where do you people live that there are 365 weeks in a year?

Why not 24/7/52 or 24/7/4.3/12 or just 24/365 (or 24/365.242 for the pedants).
Re:Wrong way to solve the uptime problem by sconeu · 2008-04-24 06:03 · Score: 1

Mod parent up. Financial types *LOVE* the NonStops.

--
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Re:Wrong way to solve the uptime problem by trybywrench · 2008-04-24 07:02 · Score: 1

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.
If you own a piece of Big Iron and run Linux on it, it's going to be virtualized.
And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?
First, how are you trying to say big banks have been running for 30, or 40 years?

i said "bank on" as in "count on" or "bet on" i didn't mean "bank" as in Bank Of America. sheesh

--
I came to the datacenter drunk with a fake ID, don't you want to be just like me?
Re:Wrong way to solve the uptime problem by Perf · 2008-04-24 07:09 · Score: 1

Seems programmer and computer operator nerds like to operate on the assumption that the hardware is always correct and will never fail. I guess it makes their life easier.

A few maxims:

Hardware does fail. (Although modern hardware can be very reliable, it can and does fail.)
Hardware requires maintenance.
Software requires maintenance.
Next year, the same pointy haired boss who wanted system A installed because it looks cool will want system B installed because it is even cooler.
Engineers have figured out how to repair freeways with minimum stoppages. Building a dam on a big river requires even more foresight.

BTW, the first Certified Netware Engineer I ever met set up a server with disk mirroring. Later, I had the server down for maintenance. I had a look at the disks and found the backup HDD wouldn't boot. Seems he set it up to mirror partitions, but forgot to set up the boot sector.
Re:Wrong way to solve the uptime problem by N1ck0 · 2008-04-24 07:23 · Score: 1

Rich0, you can't cluster timings very easily. You could cluster based on individual calls, but take the instance where your circuit (call) is open for hours at a time... that's a long time to wait for it to fail over.

Digital phone transmissions (and I use that in a light sense because ISDN and SS7 are not as digital as newer stuff) are stateful and highly dependent on a channel's timing signals. If it was true analog back to the carrier you could do such a thing pretty seamlessly, but it's not (it's digital + encapsulated analog).

Essentially what you are talking about in the circuit switched world is running two sets of equipment in tandem. And unfortunately milliseconds do count here so the equipment becomes very specialized and very specific.

On one of my most recent pilot projects for a carrier which I probably shouldn't name (if they could hear me now), we used channel-banking/re-muxing equipment to insulate the circuits from individual trunk lines, then we translated from TDM cards to VoIP to our application servers. That provided enough of a buffer to logically switch calls over in real time (of course pulling call data from one location to another was kind of tough because we needed to record it all), but hey, that's why it wasn't cheap to do.

In a lot of mid to low size telecom deployments you are usually going direct from the carrier to a telecom card in a server (like a Dialogic digital T1 card). And unfortunately, because of bad design either at the hardware/driver/API or at some software levels, you usually lose the ability to control or sense the timing slips on individual channels, so one of the items in the stack (software/driver/card) leaves the line open for a little too long, not allowing the carrier to sense a busy out and fail the circuit over to a different one. Not to mention that most telecom standards for long haul trunking do not have a method for retransmitting the metadata (DNIS/CLI, clocking, etc) to a failover circuit (meaning you need to reinitialize the call...or hang up).

But yes if you have some of the newer stuff and some of the more expensive equipment you can do it.
Re:Wrong way to solve the uptime problem by xenocide2 · 2008-04-24 08:52 · Score: 1

On the other hand, several people are paid to make Linux efficient for very large scale computers. It doesn't take many customers of million dollar computers to justify one kernel hacker's salary. If you have a thousand core system running linux, being able to upgrade the kernel without a reboot means a lot of saved CPU time.

--
I Browse at +4 Flamebait
Open Source Sysadmin
Re:Wrong way to solve the uptime problem by afabbro · 2008-04-24 09:21 · Score: 2, Insightful

I doubt there are many people running Linux on true Big Iron.

And you would be wrong. Sure, most mainframes are running z/OS, but a goodly number of them are also running Linux images. I don't know the percentages but the IBM "run Linux on your mainframe" training classes are usually full.

--
Advice: on VPS providers
Re:Wrong way to solve the uptime problem by insane_machine · 2008-04-24 09:49 · Score: 1

24 hours a day / 7 days a week / 365 days a year

I really doubt people believe there are 365 weeks in a year. People won't change because they are used to it. I'm sure I even heard it on TV at one point, not on a tech channel.
Re:Wrong way to solve the uptime problem by Ed+Avis · 2008-04-24 10:02 · Score: 1

I run Fedora. Recently they seem to put out a new kernel package every week or so. Of course you need to reboot to install a kernel security fix, which is what TFA is all about.

--
-- Ed Avis ed@membled.com
Re:Wrong way to solve the uptime problem by jd · 2008-04-24 12:33 · Score: 1

I suggest downloading one of the solar system simulators and figuring that out. We may yet be able to obtain proof that TV commercial writers are, in fact, from another world.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Wrong way to solve the uptime problem by tzanger · 2008-04-24 13:20 · Score: 1

I don't know of any kernel "security" fixes that have happened in the last 6 weeks, let alone 6 days... wow they're really screwin' you. :-(
Re:Wrong way to solve the uptime problem by DemingBuiltMyHotRod · 2008-04-24 14:19 · Score: 1

Can we please kill the 24/7/365 phrasing? Where do you people live that there are 365 weeks in a year?
Not to mention leap years! 24/7/365 implies 1 day of downtime every 4 years.
Re:Wrong way to solve the uptime problem by Rich0 · 2008-04-24 14:36 · Score: 1

Sorry - I was actually joking there.

While I'm certainly not a telecom expert, I do realize that solutions like clustering aren't always practical for everything. If you're an upstream provider you can't exactly dictate the equipment that all your customers use, and yet they'd probably appreciate it if you didn't drop their connection once a month for patches.

Your post only served to remind me that while I find such things interesting there is a reason I don't go looking for jobs working with serious telecom equipment. Sometimes I wonder at the rather arcane designs that are used, but I guess that is mostly due to history and when I look at the uptimes you see in the telecom world I have to wonder if they aren't doing something right. For me an ambitious project would be setting up VoIP at home using Asterisk with either a voice modem or Skype connection to the outside world... :)
Re:Wrong way to solve the uptime problem by erice · 2008-04-25 11:37 · Score: 1

Not to mention leap years! 24/7/365 implies 1 day of downtime every 4 years.

Beats the 24/7/52 suggestion which implies 1 day of downtime every year. (52*7= 364 days)
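
The two slogans can be compared with a quick calculation. This is only a sketch: it assumes a plain four-year cycle with one leap day and ignores the century rules, and the function name is made up for the example.

```python
# Days in a four-year cycle containing one leap day (century rules ignored).
DAYS_IN_FOUR_YEARS = 365 * 4 + 1

def uncovered_days(days_claimed_per_year: int) -> int:
    """Days in the four-year cycle that the slogan does not cover."""
    return DAYS_IN_FOUR_YEARS - days_claimed_per_year * 4

gap_365 = uncovered_days(365)     # "24/7/365": 365 days a year
gap_52 = uncovered_days(52 * 7)   # "24/7/52": 52 weeks of 7 days
print(gap_365, gap_52)
```

Under those assumptions 24/7/365 leaves one day uncovered per cycle, while 24/7/52 leaves five: one per ordinary year plus the leap day.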

Unnecessary by isj · 2008-04-24 03:08 · Score: 1

> "this is pure gold"

It is also a waste of time. Instead of spending time hot-patching a kernel, jotting down which patch it was, verifying that it actually installed, and remembering that you cannot change the layout of structures in a hot-patch anyway, the time would be better spent designing protocols that can handle a hot-standby switchover.

Yes, there are a few scenarios where the hardware is so expensive that you cannot afford redundancy, but that is rare.

Re:Unnecessary by nexex · 2008-04-24 05:30 · Score: 1

But, it's gold, Jerry! Gold!

--
Winter 2010: With Glowing Hearts

Already been used by caluml · 2008-04-24 03:10 · Score: 4, Informative

There was a kernel exploit recently where someone submitted a patch that modified the running kernel using this technology. It didn't work for me, so I had to resort to patching the .c that was affected - but a lot of people reported that it worked.

--
Get your own free personal location tracker

Re:Already been used by ThisNukes4u · 2008-04-24 04:05 · Score: 2, Informative

IIRC, that code was actually a modified version of the exploit where the payload was changed to fix the exploit instead of spawn a root shell. Pretty fucking ingenious if you ask me.

--
thisnukes4u.net
Re:Already been used by mr_mischief · 2008-04-24 05:36 · Score: 1

It depends on who your boxes are serving. If you're running boxes for an internal IT department, that's fine. If you drop a port with a 99.9999% uptime guarantee to your client who's using you as his phone or Internet provider, you're either causing him to be down or you're forcing him to switch all of his load over to his other provider.
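
For a sense of scale, here is a rough calculation of how much downtime a 99.9999% guarantee leaves per year. It assumes a non-leap year and that the guarantee is measured over a full year; the function name is just for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year (assumption)

def downtime_budget_seconds(uptime_percent: float) -> float:
    """Seconds per year a service may be down under the given guarantee."""
    return SECONDS_PER_YEAR * (100.0 - uptime_percent) / 100.0

budget = downtime_budget_seconds(99.9999)
print(f"{budget:.1f} seconds per year")
```

That works out to roughly half a minute a year, which a single reboot would likely use up.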

at root it's just trampolining by norbac · 2008-04-24 03:14 · Score: 1

The way it identifies what to patch is cool, but the 'hot' part of the patch is ultimately just simple trampolining -- replacing the start of the patched function in the code segment with a jmp to your new code. I did similar work in the linux kernel for a masters project.
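
The effect can be imitated in Python, though only as an analogy: a real trampoline overwrites the first machine instructions of the old function with a jmp, whereas this sketch swaps the code object behind a function that callers already hold a reference to. The function names are invented for the example.

```python
def checksum(data: bytes) -> int:
    """Original (buggy) function: forgets to wrap the sum at 8 bits."""
    return sum(data)

def checksum_fixed(data: bytes) -> int:
    """Replacement body that callers should be diverted to."""
    return sum(data) & 0xFF

# A caller took a reference before the patch, like code that already
# holds the old function's address.
saved_reference = checksum

# "Trampoline": keep the function object in place and redirect its body.
checksum.__code__ = checksum_fixed.__code__

result = saved_reference(b"\xff\x02")
print(result)
```

The point of the design is that nothing which already points at the old function has to be found and updated; every existing reference lands in the new code.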

Re:at root it's just trampolining by tinkerghost · 2008-04-24 03:43 · Score: 1

Hmm, when I was doing 8bit assembly, we called it a wedge .... crazy kids ... and get off my lawn
Re:at root it's just trampolining by davidnicol · 2008-04-28 06:38 · Score: 1

quick, protest the patent!

replace modules by hey · 2008-04-24 03:16 · Score: 2, Interesting

Rather than a source code level system I'd prefer a way of replacing loadable kernel modules without a reboot. Then push more code into modules -- eg file system. (Hey sounds like a micro-kernel).

Re:replace modules by Anonymous Coward · 2008-04-24 03:20 · Score: 1, Insightful

Theory of operation:
1. Build new_module
2. rmmod old_module
3. modprobe new_module

Gee, that was hard :-)
Re:replace modules by Uncle+Focker · 2008-04-24 03:35 · Score: 1

Don't worry, in 50 years you'll be able to do it in Hurd. That is if it ever gets out of alpha state by then.
Re:replace modules by petermgreen · 2008-04-24 03:59 · Score: 1

You can already replace loadable modules without a reboot, as long as they aren't doing anything critical to your kernel's operation.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:replace modules by Anonymous Coward · 2008-04-24 04:29 · Score: 0

now try that with your disk driver, or network driver, or that specialized hardware interface to that fiber transceiver, or anything else that you can't re-initialize, without downtime, eh?
Re:replace modules by mr_mischief · 2008-04-24 05:41 · Score: 1

You can do that with many filesystem drivers. It's a bit harder to do it for the filesystem driver for your boot drive, but that's what initrd images are for.

A microkernel can theoretically do more of this sort of thing, but Linux does a fair amount already. Type "make menuconfig" and poke around a bit, and your eyes might be opened past your dogma and chanting.
Re:replace modules by mr_mischief · 2008-04-24 05:43 · Score: 1

You mean how my ATA, Serial ATA, Ethernet, and filesystem modules work now? When's the last time you configured a Linux kernel? Sure, there are some things you can't do. Try Minix with a faulty video driver sometime, though, if you think a microkernel is a cure-all.
Re:replace modules by Anders · 2008-04-24 05:57 · Score: 1

Rather than a source code level system I'd prefer a way of replacing loadable kernel modules without a reboot. Then push more code into modules -- eg file system. (Hey sounds like a micro-kernel).
It does not sound like a microkernel at all, it sounds like dynamic linking. A microkernel would have the modules in separate address spaces.
Besides, file systems are modules already.

Does this mean... by Thelasko · 2008-04-24 03:17 · Score: 1

I can now install hypervisors without rebooting the victim's... I mean... client's computer?

[strokes handlebar mustache deviously]

--
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".

The real test... by hal2814 · 2008-04-24 03:17 · Score: 4, Funny

Can ksplice be installed without rebooting?

Re:The real test... by LinuxDon · 2008-04-24 03:39 · Score: 2, Informative

It's in the comment: "ksplice requires no kernel modifications"

So yes, ksplice can be installed/used without rebooting.

Impressive hack by EriktheGreen · 2008-04-24 03:18 · Score: 4, Informative

For those that haven't read the paper, the technique used is straightforward in concept, but the devil is in the details.

He basically compiles a patched and unpatched kernel with the same compiler, compares the ELF output, and uses that to generate a binary file that corresponds to the change. That gets wrapped in a generic module for use, another module installs it along with JMPs to bypass the old code and use the new, and he performs the checks needed to make sure he can safely install the redirects.

He also has to differentiate real changes from incidental ones (the example given is changing the address of a function - all references to it will change, but they don't really need to be included in the binary diff).
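
The comparison step might be pictured like this. It is a toy sketch, not the tool's actual code: the function names and byte strings are made up, each build is modelled as a plain mapping from function name to compiled bytes, and no attempt is made to filter out the incidental, address-only differences mentioned above.

```python
from typing import Dict, List

def changed_functions(pre: Dict[str, bytes], post: Dict[str, bytes]) -> List[str]:
    """Names whose compiled bytes differ, or that exist only in the patched build."""
    return sorted(name for name, code in post.items() if pre.get(name) != code)

# Hypothetical compiled output of the unpatched and patched kernels.
pre_build = {"sys_open": b"\x55\x89\xe5\x90", "sys_read": b"\x55\x89\xe5\xc3"}
post_build = {"sys_open": b"\x55\x89\xe5\x90", "sys_read": b"\x55\x31\xc0\xc3"}

to_replace = changed_functions(pre_build, post_build)
print(to_replace)
```

Only the functions in that list would need replacement code and a redirect; everything else stays as it is in the running kernel.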

The only human work required is to check whether a patch makes semantic changes to a data structure... whether eg. an unsigned integer variable that was being used as a number is now a packed set of flags - the data declaration is the same, but it's being used differently.

Interesting paper. Also a useful new set of capabilities for any Linux user who can't handle downtime for quarterly patching... worth its weight in gold in some businesses.

Erik

Re:Impressive hack by Vectronic · 2008-04-24 03:24 · Score: 1

I was just about to do the calculation to see how much it would be worth, but I forgot how much a bit weighs...
Re:Impressive hack by EriktheGreen · 2008-04-24 03:44 · Score: 3, Funny

Well, let's see.
A silver dollar, from which bits were commonly cut, weighs about .77 troy ounces.
Today's gold price as of posting is about $889.95 US per troy ounce.
A silver dollar was typically cut into 8 bits, which gives us a weight per bit of 0.096 ounces. That translates to about $85.66 per bit weight in gold. Remember, this is per system being patched.
Since the patches being applied ranged from 1 line to 285 lines per the paper, and a reasonable estimate of compiled average bytes per line is something like 20, we get a value of $13,700 per line of patch in gold. Even for the smaller patches, this is significant. The largest patch would be worth nearly $4,000,000 USD in gold.
Of course, for 64 bit systems vs. 32 bit, the value would be twice as much :)
Erik
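
For anyone who wants to check the sums, the figures above can be reproduced in a few lines. All inputs are the ones quoted in the post; the variable names are only for illustration.

```python
SILVER_DOLLAR_TROY_OZ = 0.77     # weight of a silver dollar, per the post
GOLD_USD_PER_TROY_OZ = 889.95    # gold price quoted in the post
BITS_PER_DOLLAR = 8              # a dollar cut into 8 bits
BYTES_PER_LINE = 20              # the post's estimate of compiled bytes per line
LARGEST_PATCH_LINES = 285        # largest patch mentioned in the paper

usd_per_bit = SILVER_DOLLAR_TROY_OZ / BITS_PER_DOLLAR * GOLD_USD_PER_TROY_OZ
usd_per_line = usd_per_bit * 8 * BYTES_PER_LINE
usd_largest_patch = usd_per_line * LARGEST_PATCH_LINES
print(round(usd_per_bit, 2), round(usd_per_line), round(usd_largest_patch))
```

The results agree with the post: about $85.66 per bit, about $13,700 per line, and a little over $3.9 million for the largest patch.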
Re:Impressive hack by EriktheGreen · 2008-04-24 03:45 · Score: 0, Redundant

Move that "Remember, this is per system..." down a paragraph. Slashdot needs a post edit function.
Re:Impressive hack by Vectronic · 2008-04-24 04:38 · Score: 1

Indeed it does, especially for a post that's so pertinent.

But that's why they added the (Preview) button before the (Post)...

Although I find nothing wrong with your math... considering that there are more bits traveling faster in a 64-bit system, wouldn't that depreciate the value of the bits and, as long as each bit is being converted to money, likewise depreciate the value of the gold, as well as saturate the market?

I guess 128bit processors will have to come soon to cope with the inflation...
Re:Impressive hack by EriktheGreen · 2008-04-24 04:40 · Score: 1

I knew there was a good reason to like inflation! As well as SSE instructions :)
Re:Impressive hack by Anonymous Coward · 2008-04-24 07:06 · Score: 0

I was doing this 20 years ago on embedded systems when it took a day to tear down the boxes and reload the EEPROMS (pre FlashROM days).

I believe that all procedure calls went through RAM first, that way you could add or change code and test it without re-burning the whole system.

One of the more harrowing moments of my career was doing this during a customer demo, where it had to be done after first executing the EEPROM version and then executing the patched-in RAM version. Later on, I saw this same type of thing being used in the semiconductor industry, where the chip was manufactured with ~60k FPGA-type gates which could be used to overcome some initial design flaws within the chip. An instance that I remember is a countdown register that didn't work, so we just made another as a workaround. It was fixed in later units, but the workaround allowed significant testing that would not otherwise have been possible.
Re:Impressive hack by Anonymous Coward · 2008-04-24 08:55 · Score: 0

hmm this is quite ugly.

And what do they do if the patch needs a change to the data format? Reboot :)

And if the human misses it? Corrupt data :(

If it's that critical, shouldn't you have two? by Paul+Carver · 2008-04-24 03:20 · Score: 4, Insightful

I'd rather have at least two of anything important and have stateful failover between them.

If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.

And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade.

Re:If it's that critical, shouldn't you have two? by Akatosh · 2008-04-24 03:42 · Score: 1

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

You must work for one of those telephone companies with infinite time, money and no legacy equipment. Must be nice.
Re:If it's that critical, shouldn't you have two? by mpapet · 2008-04-24 04:01 · Score: 1

There are applications where this is simply not possible and I happen to admin some applications like that. This is what active-passive clustering is all about. Even then, minor updates of any kind are long, carefully practiced, high-anxiety events.

Another informative post mentioned telephony as the perfect application; I copied it as an FYI:

"It's very difficult to take over the termination of a circuit switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching."

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
Re:If it's that critical, shouldn't you have two? by EriktheGreen · 2008-04-24 04:02 · Score: 2

In some engineered systems, it just isn't possible to have redundancy in the way you mean.
Extreme example: Try to design a fail-over for the space shuttle's solid rocket boosters :)
Interestingly, I've found that the skill needed (and the pay gathered) to deal with systems that can't be made redundant is much higher than that needed to work on "grid" or cluster systems where multiple cheap pieces of hardware are used.
And they tend to be more reliable too.
Re:If it's that critical, shouldn't you have two? by noidentity · 2008-04-24 04:15 · Score: 1

I think one point made several times is that you will have multiple servers where taking one down wouldn't interrupt services, just that the cost of taking one down is so great that you'd rather replace the kernel live. You can't solve that by adding even more super-expensive servers either.
Re:If it's that critical, shouldn't you have two? by mr_mischief · 2008-04-24 05:47 · Score: 1

"what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter"

Well, that's when you use your 0.0001% of downtime, because you didn't use it for rebooting when you didn't need to. ;-)
Re:If it's that critical, shouldn't you have two? by mcmonkey · 2008-04-24 06:50 · Score: 1

I'd rather have at least two of anything important and have stateful failover between them.
I'll try that logic on my wife. "Honey, I'm going out Saturday night. I need to test the failover to my girlfriend."
Re:If it's that critical, shouldn't you have two? by mrv20 · 2008-04-29 08:10 · Score: 1

Although as the failover is stateful you still won't get any until you take out the trash.

--
"Algebraical symbols are used when you don't know what you are talking about" - BCS
Re:If it's that critical, shouldn't you have two? by mrv20 · 2008-04-29 08:13 · Score: 1

Even with such resources, I am interested to know how the GP proposes one should shift one's local loop termination to a remote location.

--
"Algebraical symbols are used when you don't know what you are talking about" - BCS

Year of the linux desktop ... again? by Anonymous Coward · 2008-04-24 03:21 · Score: 0

Maybe this new tech will spur the year of the linux desktop computer! ...

Over-engineered solution to a non-existent problem by hacker · 2008-04-24 03:23 · Score: 3, Insightful

Once again, we have an over-engineered solution to a non-existent problem.

Any enterprise-level customer is going to have a VERY lengthy Q&A process before deploying anything into production. This includes testing kernels, hardware, networks, interaction, application, data and so on. One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period.

So you hot-patch a running Linux kernel. How do you Q&A that? How do you roll back if the patch fails? Where is your 'control'?

The answer? A duplicate machine. But wait, if you have two identical machines... isn't that... a cluster?

Exactly. And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades. You NEVER EVER touch a running, production system like that.

Well, not if you want any sort of data integrity or control and want to pass any level of quality validation on that physical environment.
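
The round-robin procedure described above could be sketched roughly as follows. Everything here is hypothetical: the node names, the version strings and the `verify` callback stand in for whatever validation a real site would run, and taking a node out of rotation is reduced to a comment.

```python
from typing import Callable, Dict, List

def rolling_upgrade(nodes: Dict[str, str], new_version: str,
                    verify: Callable[[str], bool]) -> List[str]:
    """Upgrade one node at a time; roll back and stop at the first failed check.

    `nodes` maps node name to running version. Returns the nodes upgraded.
    """
    upgraded = []
    for name in sorted(nodes):
        previous = nodes[name]
        nodes[name] = new_version   # node is out of rotation while this happens
        if not verify(name):
            nodes[name] = previous  # roll back; remaining nodes stay untouched
            break
        upgraded.append(name)       # node rejoins the cluster
    return upgraded

cluster = {"node-a": "2.6.24", "node-b": "2.6.24"}
done = rolling_upgrade(cluster, "2.6.25", verify=lambda name: True)
print(done, cluster)
```

Stopping at the first failure keeps at least one node on the known-good version, which is the 'control' the post asks for.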

And Microsoft claims to have invented it by davecb · 2008-04-24 03:29 · Score: 3, Informative

Tomasz Chmielewski wrote on LKML: the idea seem to be patented by Microsoft, i.e. this patent from December 2002: http://www.google.com/patents?id=cVyWAAAAEBAJ&dq=hotpatching In essence, they patented kexec ;)

Andi Kleen promptly provided prior art: The basic patching idea is old and has been used many times, long predating kexec. e.g. it's a common way to implement incremental linkers too.

--
davecb@spamcop.net

Imagine by MortenMW · 2008-04-24 03:31 · Score: 0

Imagine a Beowulf cluster of hot-patching Linux-servers

Not only the CEO by Moraelin · 2008-04-24 03:34 · Score: 4, Interesting

Not only the CEO. I lived to see even a hardline IT guy (admittedly, one whose goal in life seems to be to be against whatever you want, and to avoid doing any extra work... actually, make that just: any work) argue along the lines of "nooo, you can't have the servers only 60% loaded! It's a waste of valuable hardware! Why, back in my day (of batch jobs on punched cards, presumably) we had the mainframe used at least an average of 95% before asking for an extra server!"

It always irks me to see people just not understand concepts like "peak" vs "average", or "failing over".

- Take a cluster of, say, 4 machines (small application, really) which are each loaded to 90% of capacity. If one dies, the other 3 are now at 120% of capacity each. If you're lucky, it just crawls; if you're unlucky, Java clutches its chest and keels over with an "OutOfMemoryError" or such.

- If you're at 90% most of the time, then fear Monday 9:00 AM, when every single business partner on that B2B application comes to work and opens his browser. Or fear the massive year-end batch jobs, when that machine/cluster, sized barely enough to finish the normal midnight jobs by 9 AM so those users can see their new offers and orders in their browsers, now has to do 20 times as much in a burst.

Basically it amazes me how many people just don't seem to get that simple rule of thumb of clusters: you're either getting nearly 100% uptime and nearly guaranteed response times, _or_ you're getting that extra hardware fully used to support a bigger load. Not both. Or not until that cluster is so large that 1-2 servers failing add negligible load to the remaining machines.
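
The rule of thumb is easy to put into numbers. A small sketch, assuming identical machines and load that spreads evenly over the survivors; the function names are made up for the example.

```python
def load_after_failures(nodes: int, load_percent: float, failed: int) -> float:
    """Per-node load (percent of capacity) once `failed` nodes drop out."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no nodes left to carry the load")
    return load_percent * nodes / survivors

def max_safe_load(nodes: int, failed: int) -> float:
    """Highest steady load at which the survivors stay at or below 100%."""
    return 100.0 * (nodes - failed) / nodes

after = load_after_failures(4, 90.0, 1)
safe = max_safe_load(4, 1)
print(after, safe)
```

For the 4-machine example this gives 120% after one failure, and says the cluster can only ride out a single failure if it normally runs at 75% or less. With 100 machines the same limit rises to 99%, which is the "large enough cluster" case.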

--
A polar bear is a cartesian bear after a coordinate transform.

Re:Not only the CEO by DraconPern · 2008-04-24 05:35 · Score: 1, Funny

You are running Java. If you port your application to another language, you will automatically get a 10% boost in efficiency. Oh, and save money to buy 5 machines instead of 4.
Re:Not only the CEO by Anonymous Coward · 2008-04-24 06:03 · Score: 0

You are a moron. If you go jump off a cliff, the world will automatically get a boost in intelligence.
Re:Not only the CEO by Cal+Paterson · 2008-04-24 06:12 · Score: 1

Clearly the money saved by only having to purchase 4 machines instead of 5 is not worth the time spent on the rewrite.
Re:Not only the CEO by Jesus_666 · 2008-04-24 07:10 · Score: 1

Thank you! My business just ditched its old J2ME applications and replaced them with Visual Basic applications! Everything is so much better now! The sun is shining and little bunnies are hopping through the grass underneath a radiant rainbow! And all of that in the server room, which now also features a large nude beach for beautiful female exhibitionists! Also, everyone in the company now makes six digits - a month! Even the janitor!

Well, gotta go; our load balancer (now also written in VB) just wrote me an IM to inform me that it has finished replicating a cask of 50 year old Scotch and it's time for the tasting.

--
USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
Re:Not only the CEO by Lao-Tzu · 2008-04-24 08:53 · Score: 1

... got any job openings?

kexec by kondor6c · 2008-04-24 03:37 · Score: 1

Wasn't this possible before with kexec?

Re:kexec by Enderandrew · 2008-04-24 03:57 · Score: 1

Kexec allows you to boot another kernel from your kernel without a reboot. I think ksplice allows you to just put in a patch to your existing kernel, however, I almost have to assume they use a kexec-like implementation.

--
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.

Re:Over-engineered solution to a non-existent prob by ROBOKATZ · 2008-04-24 03:38 · Score: 1

Once again, we have an over-engineered solution to a non-existent problem.

Welcome to academia. I think it's an interesting start, and maybe someday we'll have solved the additional problems you've listed. And let's face it, rebooting for updates is annoying, mission critical or not.

No, No, No and No again. by Anonymous Coward · 2008-04-24 03:40 · Score: 5, Interesting

As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply for running one-node clusters who are using a single machine as a perimeter network device. (ex. firewall) I see lots of these in the racks at our NOC provider.

1. We connect to several load balanced systems and the complexity introduced by load balancers translates to inexplicable down time. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no down time. A few minutes of down time costs more than the server hardware.

2. High availability translates more roughly into nodes that can fail (ex. power off) and not take the cluster down. This boils down to active-passive application architecture more than just using heartbeat.

As an FYI, PostgreSQL clustering is a killer application for me. Erlang is also great in many ways, but requires application architecture with active-passive node awareness. Which isn't present in things like Yaws, or even my other favorite non-erlang app nginx. Heartbeat is the solution there, but I'd like to see yaws be cluster aware on its own. http://yaws.hyber.org/

Re:No, No, No and No again. by 0racle · 2008-04-24 04:11 · Score: 1

If I may ask, what PostgreSQL clustering solution do you use?

--
"I use a Mac because I'm just better than you are."
Re:No, No, No and No again. by hab136 · 2008-04-24 05:17 · Score: 4, Insightful

As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply for running one-node clusters who are using a single machine as a perimeter network device. (ex. firewall) I see lots of these in the racks at our NOC provider.

1. We connect to several load balanced systems and the complexity introduced by load balancers translates to inexplicable down time. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no down time. A few minutes of down time costs more than the server hardware.

I spent a decade in perimeter networking at a Fortune 50 US bank. My group didn't do the internal network, just the perimeter, and we still had dozens of network sites and thousands of pieces of equipment. The bank itself has hundreds of thousands of employees, millions of users. Online banking and brokerage are about as high availability as you can get save utilities (power, water, telephony, etc) or military. Seconds of online brokerage downtime equated to millions of dollars lost.

The idea that load balancing introduces inexplicable down time is completely unsupported by my experience.

"One-node clusters" seems like marketing speak for "single point of failure". A cluster by definition is two or more nodes.

Redundant routers, switches, firewalls, the works or you're not high-availability in my opinion. The fact that you're talking about Postgresql instead of Oracle or DB2 on mainframes makes me think that your idea of high availability is different than mine.
Re:No, No, No and No again. by Anonymous Coward · 2008-04-24 05:51 · Score: 2, Funny

one-node clusters

Of all the industry oxymorons out there, this is one of the most annoying.

One-goose gaggle.

Single-wheel bicycle.
Re:No, No, No and No again. by Splab · 2008-04-24 06:18 · Score: 1

He is most likely running some sort of warm failover and thinking he is safe (or the data isn't time critical enough to care that it's only warm data).

Re:Over-engineered solution to a non-existent prob by kortex · 2008-04-24 03:42 · Score: 1

Thank you. I was getting depressed at what I was reading. Hot patching production kernels = amateur. Never take a *needless* risk. Ever. Hot Patching a running non-production kernel "because-you-can", well then that's a pretty neat thing, high on the geek scale. But don't even come near my prod cluster neophyte or I'll have your limbs removed.

--
-- kortex "Not everything that counts can be counted, and not everything that can be counted counts"

Sorry... by PJ+The+Womble · 2008-04-24 03:43 · Score: 2, Funny

This is old news down in the South.

They don't bother splicing. Them good ol' boys been big on Kernel Sanders for years now.

Re:Sorry... by Anonymous Coward · 2008-04-24 05:43 · Score: 0

I read this post and made disgusted sounds for 15 seconds.
Good show.
Re:Sorry... by mrv20 · 2008-04-29 08:19 · Score: 1

made disgusted sounds for 15 seconds

Yep, that sounds like the Colonel's secret recipe alright.

--
"Algebraical symbols are used when you don't know what you are talking about" - BCS

It's Not For 100% Uptime by Bob9113 · 2008-04-24 03:44 · Score: 2, Insightful

Lots of people are saying, "100% uptime of a particular machine is neither necessary nor desirable, full failover is better. Full failover is the only way to handle catastrophic hardware failures." Or something to that extent.

But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot.

It's like people saying, "I don't mind rebooting after installing Office, I don't expect 100% uptime from my workstation." Of course you don't need to be able to do software installs without rebooting. But isn't it nice to have that option available?

Same with this. When (and if) it gets stabilized and standardized, you'll use it. Not for 100% uptime, just because it's nice to not be required to reboot to enable a particular software install.

--
Stop-Prism.org: Opt Out of Surveillance

Re:It's Not For 100% Uptime by Enderandrew · 2008-04-24 03:59 · Score: 1

If you have a critical system that needs to be up, you better have backup servers.

You fail-over to the backup, patch the first, fail-back, patch the second, etc.

--
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
Re:It's Not For 100% Uptime by mr_mischief · 2008-04-24 05:58 · Score: 1

That still takes longer than just patching in place. Meanwhile, your systems have a critical security bug.
Re:It's Not For 100% Uptime by felipekk · 2008-04-24 07:50 · Score: 1

Meh, way to miss what he just said:

"But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot."

AIX by Anonymous Coward · 2008-04-24 03:45 · Score: 0

Hasn't this kind of feature been available on AIX for quite some time? I'm told you have to "unroll" patch installations if you need to insert one somewhere into the existing patch-chain, which sucks, but you can do it all on a live system.

Re:Over-engineered solution to a non-existent prob by Anonymous Coward · 2008-04-24 03:46 · Score: 0

Q&A doesn't prove the absence of bugs. Also, the less you spend the more your shareholders will thank you (or ravage you).

pure gold for rootkit writers by Anonymous Coward · 2008-04-24 03:52 · Score: 0

If you are a carrier in telephony and don't want downtime, this stuff is pure gold.

If you're a darkhat writing rootkits, this is priceless :)

Re:Over-engineered solution to a non-existent prob by Anonymous Coward · 2008-04-24 04:05 · Score: 0

Your process of testing servers involves asking them questions and getting answers?

Imagine a Beowulf cluster of... by Anonymous Coward · 2008-04-24 04:06 · Score: 0

Systems continuously cycling the kernel version: downgrading to 0.99, upgrading back to git head and then back to 0.99...

kdawson by obsolete1349 · 2008-04-24 04:06 · Score: 0, Troll

Oh goody another KDawson post. Isn't there a way to filter them out?

You are Wrong by mpapet · 2008-04-24 04:10 · Score: 3, Insightful

And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades

Hmmm. I happen to live by your words in an environment where this is theoretically possible, but practically impossible. Why? Because when the cluster rolls to a passive node, the application times out on the existing connections. The time outs have business ($$$$) implications. I wish it were okay to have infinite retries, but it's viewed as a violation of the service agreement. Telephony is like this too.

An academic ideal for sure, but please speak more humbly because it is no silver bullet.

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html

Re:You are Wrong by hacker · 2008-04-24 04:26 · Score: 1

Frankly, if you roll to another node and you lose connections, then your cluster is misconfigured.
I've built and deployed clusters where I'm actively playing a streaming video across the cluster from a mounted drive, physically yank the power cable from the active node of the cluster, there's about a 1-2 second lag in the video, and then it continues to play right where it was, without any disconnects or interruptions.
In fact, I use this as a way to demonstrate that there is ZERO loss of connectivity when nodes are downed or recycled.
You might want to look into how your cluster is (mis)configured and fix it.
Re:You are Wrong by mpapet · 2008-04-24 05:26 · Score: 1

1-2 second lag

Oh if only I were streaming video.

Ignoring the fact that there are applications where a session cannot be interrupted does not validate your opinion. Telephony applications are also this way.

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
Re:You are Wrong by hacker · 2008-04-24 05:41 · Score: 1

I'm sure if my client's requirements were to have no lag, the cluster would allow for that. I've never personally had the need to test it in that fashion, and since I wasn't working in a telephony environment, it wouldn't matter.
Besides, the telephony space has mirrored and triplet'd hardware anyway, from trunking switches to fibre backplanes. All of this is possible and done already in the production telephony space. They already have processes in place to do rolling upgrades of hardware. Patching a live kernel for telephony customers is completely irrelevant.
Re:You are Wrong by Anonymous Coward · 2008-04-24 11:49 · Score: 0

This sounds very interesting. What platform is this? Java? CGI/FastCGI? Or is it one of the vendor-specific streaming servers (MS, Real, QuickTime)? How does it keep track of the connections?

Gates Was Right by BigBlueOx · 2008-04-24 04:11 · Score: 1

"open source creates a license so that nobody can ever improve the software" according to Bill Gates. Therefore this is not an improvement. QED.

Re:Gates Was Right by DragonTHC · 2008-04-24 04:38 · Score: 2, Funny

I love anything that makes a billionaire whine.

--
They're using their grammar skills there.
Re:Gates Was Right by mr_mischief · 2008-04-24 06:00 · Score: 1

I hear Steve Jobs gets bitchy whenever someone says Woz was the brains. Oh, and I bet Trump gets pretty upset whenever people ask if the curtains match the Venetian blinds...

Re:Over-engineered solution to a non-existent prob by t0rkm3 · 2008-04-24 04:35 · Score: 1

For you.

There are some systems that:

1. Have zero support for a cluster. Maybe duplicate controller cards in the same chassis, but not clustered.

2. Are financially infeasible to cluster.

3. Are of such a high dollar per minute value that management is willing to run the risk of inserting code while operating live (provided it has been tested in another system in a mock-up lab) and having a catastrophic failure or a singing success.

Re:Over-engineered solution to a non-existent prob by Mantaar · 2008-04-24 04:36 · Score: 1

Look, that's just how open source works. It's not only demand driven - most of the time it's just hackers getting interested in stuff they wanna try out - whether it makes sense or not... This way interesting things can happen and if they find someone to use 'em, they may become popular. It's not like the Linux kernel was aimed at becoming the most popular open source OS - actually people may very well have thought of Linux as an "over-engineered solution to non-existent problem" back then! I'm thinking of the micro kernel debate and the HURD system here.

Actually I don't know why carrier systems would want to be able to upgrade on the run. Maybe their boot-up procedures are so complicated and lengthy that they just don't want to reboot... this way they can still do a round-robin upgrade, but keep the machines alive and thus save a lot of time (I imagine they have quite a few clusters/machines...)
Maybe Google could use that stuff, too. Instead of rebooting every machine in a cluster in order to upgrade it, they'll write a script that visits each worker in order and upgrades the system on the fly. That's still less down-time. And Google's clusters are really large. If you can reduce the downtime per worker machine by, let's say 50% (I suppose you could do a lot more. Maybe you could even go without testing on all of the machines if they are built the same way (they probably are) and just have one test subject) that's still a lot of time (and money) if you multiply it by those 2000-6000 machines...

--
I'm an infovore...

Re:Over-engineered solution to a non-existent prob by geekoid · 2008-04-24 04:41 · Score: 1

"Any enterprise-level customer is going to have a VERY lengthy Q&A process before deploying anything into production."

BWAHAHAHAHAhahahahah..

"One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period."

Yes, federally mandated. Most companies aren't, and in fact companies that do that are the exception.

I have seen and read about too many CFO's pushing out enterprise level software against technical advice and have it fail.

I have seen for important and large pieces of software put into production with almost no testing.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:Over-engineered solution to a non-existent prob by Anonymous Coward · 2008-04-24 04:47 · Score: 0

So wrong - You do your testing in a test environment not on one half of your production cluster!

It also makes a difference for the end user by Anonymous Coward · 2008-04-24 04:53 · Score: 0

For end users, live updates of the kernel would mean that you don't need to reboot your machine to apply those daily security updates. Yeah, unlike your server pool, my laptop is not load balanced even though it has multiple cores. That would be a tremendous advantage over Windows and Mac machines, where the "updated the DRM management software: you need to reboot" routine is so frustrating...

Traditional telephone switches had year+ uptimes by davidwr · 2008-04-24 05:00 · Score: 1

This applies to early digital city telephone switches, circa early 1990s:

These switches were designed to be up for years at a time.

One of the ways they did it was by splitting the switch in half, an "A side" and a "B side."

Typically it was either in a 50/50 or 100/0 with failover, meaning half or all of the phone lines did their processing on the "a" side and the remainder on the "b" side.

When patches were ready to install, which didn't happen very often, everyone but the test lines got moved to the A side, the B side got patched, the B side got tested, then everyone moved to the B side and then the A side got updated and tested.

There was hardly ever any scheduled downtime for the switch as a whole during the life of the switch. Thanks to battery power and diesel generators, there was hardly any unscheduled downtime barring a major disaster like a fire in the building or a major earthquake. Downtime meant everyone's phones went dead for the duration of the outage.

The modern analog is a server pool.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

What!? Unpossible! by Linux_ho · 2008-04-24 05:10 · Score: 3, Funny

This is GPL'd software. Bill Gates told me nobody could improve it. These Linux developers are truly renegades!

--
include $sig;
1;

not new technology by lophophore · 2008-04-24 05:12 · Score: 1

I worked with a genius engineer at DEC in the late 80s who could patch a running VMS system on the fly, no reboot required.

He had a small program that made the whole thing happen.

--
there are 3 kinds of people:
* those who can count
* those who can't

Re:not new technology by Anonymous Coward · 2008-04-24 09:32 · Score: 0

Wow. That is so interesting.
Re:not new technology by Majik+Sheff · 2008-04-24 12:21 · Score: 1

Not only could you swap VMS kernels on the fly, if you had an Alpha system you could actually hot-swap processors.

VMS and the Vaxen were remarkable systems; almost as remarkable as the men and women who kept them running.

--
Women are like electronics: you don't know how damaged they are until you try to turn them on.

Already been there by pirap · 2008-04-24 05:22 · Score: 1

Long ago there were two different unofficial kernel patches, which allowed swapping the kernel during runtime. Think it was in time of the 2.2 series or so, quick googling sadly didn't come up with the stuff. If only my memory wasn't that bad :(

Re:And Microsoft claims to have invented it by AndersDahlberg · 2008-04-24 05:30 · Score: 1

Patent? Similar technology is present in 50% of today's telephony (e.g. POTS, GSM, WCDMA) servers. PLEX & emergency corrections on the executive side have enabled this since around the late 1980s. http://en.wikipedia.org/wiki/AXE_telephone_exchange Yes, AXE nodes are *usually* patched by updating the hot-standby side while it is separated from the executive side - but if you really want to live on the wild side... ;-)

redhat & centos by t35t0r · 2008-04-24 05:49 · Score: 1

Maybe RHAT/CENTOS will catch onto this and put it into production. This would probably work well since the actual kernel minor version never changes for the life of a RHEL/CENTOS version and I would guess kernel data structures don't change much either as long as the kernel is kept at the same minor version.

Re:Over-engineered solution to a non-existent prob by mr_mischief · 2008-04-24 05:54 · Score: 2, Interesting

You'd roll back much the same way, or even perhaps by rebooting into the previous kernel image from disk.

Every production environment I've ever administered had a smaller version set aside for testing. We'd configure the machines identically and just make the cluster smaller. Then we'd test on the test machines any action that was to be made part of the admin process of the production machines. If it passes on the test machine and fails in production, then you didn't make the machines sufficiently similar.

Round robin upgrades take ( ( (time_to_idle + time_to_upgrade + time_to_reboot) * machines ) / 2) on average to get a machine upgraded. If you have a "Critical" upgrade, that might be longer than you want.
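To put rough numbers on that formula, here is a small Python sketch. The function name and the sample timings are made up for illustration; only the formula itself comes from the post above.

```python
def average_wait_seconds(time_to_idle, time_to_upgrade, time_to_reboot, machines):
    """Average time until a given machine is upgraded in a one-at-a-time rollout.

    Uses the estimate from the post: (per-machine time * machines) / 2.
    """
    per_machine = time_to_idle + time_to_upgrade + time_to_reboot
    return per_machine * machines / 2


# Hypothetical figures: 5 min to drain, 2 min to patch, 3 min to reboot, 200 machines.
wait = average_wait_seconds(300, 120, 180, 200)
print(f"average wait: {wait / 3600:.1f} hours")
```

With those assumed figures the average machine waits most of a day for a "Critical" fix. Dropping time_to_idle and time_to_reboot to zero, which is roughly what hot patching promises, leaves only the patch time in the sum.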

Not everyone has the exact same QA requirements you do, either. Some of us are happy with proving that it works, then proving that it worked on the production machine, then resuming our normally scheduled maintenance.

Re:And Microsoft claims to have invented it by Anonymous Coward · 2008-04-24 06:17 · Score: 0

I find it interesting that Microsoft should have this patent, especially considering that windows machines nearly always need rebooting after doing system updates (not to mention after installing extra software, crashes, and when doing advanced stuff like using two programs at the same time.). Think of their 'expertise' in this area, and now try to imagine the Pandora's Box opened by attempting a 'live' patch on a MS kernel...

Re:And Microsoft claims to have invented it by Nuitari+The+Wiz · 2008-04-24 06:30 · Score: 1

Not only that, but there are countless buggy C/C++ programs which can be used as prior art. Including some of mine.

Re:last of yOUR working capital being squandered.. by blackjackshellac · 2008-04-24 06:32 · Score: 0

wtf?

--
Salut,

Jacques

If MS holds the patent, then why don't they use it by Anonymous+Freak · 2008-04-24 06:35 · Score: 1

I mean, every minor little Windows Update makes my machine reboot. I am so sick of starting up Parallels, having updates immediately run and require a reboot. (But there's no way I'm letting my machine go without the updates.) Yeah, it probably doesn't help that I only load Windows once a month, so there are invariably a bunch of updates waiting. But still...

--
Another non-functioning site was "uncertainty.microsoft.com."
The purpose of that site was not known.

RIP: The last real reason to require a reboot. by dkh2 · 2008-04-24 06:39 · Score: 1

I've been asking since MS-DOS 3.0 - "Why the heck do I have to reboot after installing an application?" Windows has finally reached the point that, most of the time, you don't. But this goes far beyond those improvements.

--
My office has been taken over by iPod people.

rebooting is not an option by decsnake · 2008-04-24 06:48 · Score: 1

This basic technique has been used to patch the running code on spacecraft as long as there have been computers on spacecraft. As you say, the concept is simple, the devil is in the details.

that was a little different than this by decsnake · 2008-04-24 06:55 · Score: 1

the concept was the same.

Allocate a chunk of non-paged pool
load your code into it.
set IPL to 31
patch the code that you want to change to jump to the code you loaded into pool.
Lower IPL
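Those five steps can be sketched as a toy simulation in Python. Nothing here is VMS code: the "memory" list, the two made-up opcodes, and the lock standing in for IPL 31 are all assumptions chosen only to show the order of operations.

```python
import threading

# Toy "memory": each slot holds one instruction as a tuple.
# ("RET", value) returns a value; ("JMP", addr) continues at another address.
memory = [("RET", "old behaviour")] + [None] * 7   # address 0 is the routine to patch
interrupts_blocked = threading.Lock()               # stands in for raising IPL to 31


def call(addr):
    """Run the routine at addr, following jumps until a RET."""
    while True:
        with interrupts_blocked:        # an instruction fetch cannot overlap a patch
            op, arg = memory[addr]
        if op == "JMP":
            addr = arg
        else:
            return arg


def hot_patch(old_addr, new_code):
    """Load new_code into free memory, then redirect old_addr to it."""
    new_addr = memory.index(None)               # 1. allocate a chunk of "pool"
    memory[new_addr] = new_code                 # 2. load the new code into it
    with interrupts_blocked:                    # 3. "set IPL to 31"
        memory[old_addr] = ("JMP", new_addr)    # 4. patch in the jump
    return new_addr                             # 5. leaving the block "lowers IPL"


print(call(0))
hot_patch(0, ("RET", "new behaviour"))
print(call(0))
```

The new code is fully in place before the jump is written, so a caller sees either the old routine or the complete new one, never a half-written patch.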

It would probably work on Windows too. Most of the actual Windows kernel is very similar to VMS, even down to having paged (not NON-paged) pool. You know someone is copying when they even copy the stupid ideas.

Erlang/OTP by Kupfernigk · 2008-04-24 06:57 · Score: 1

I thought this was supposed to be News for Nerds?

Erlang/OTP (Open Telecom Platform) already has the ability to roll out updates onto server nodes without a reboot. To everybody above who has said this is a nonexistent problem, or has pontificated about how simple it is to load balance telephony servers (it isn't): if you are right, then why did Ericsson need to invent it?

The ejabberd IM server is written in Erlang, and it repays study. This is a small language which can do things like extract patterns from bit arrays in a single line, and at the other end of the scale can drive large server networks with uptime measured in years.

Fortunately, any Microsoft patent taken out in 2002 on 1980s Swedish technology is unlikely to work in Europe.

--
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."

Comment removed by account_deleted · 2008-04-24 07:07 · Score: 1

Comment removed based on user account deletion

Replying to myself by Kupfernigk · 2008-04-24 07:09 · Score: 1

I should add before anybody objects that I KNOW that updating a running kernel is different from updating a higher level layer. My point is that telephony servers do need to be up for years, therefore the ability to patch the kernel as well as the OTP is valuable. Although telephony mainframes may be up for years, Linux boxes typically aren't (not if they want to stay current.) This technology can presumably project Linux into the six-nines availability range. But, if the MS patent comment is right, not in the US.

--
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."

Re:And Microsoft claims to have invented it by davecb · 2008-04-24 07:12 · Score: 1

Perhaps it's an evil plot to make
better OSs have to reboot (;-))

--dave

--
davecb@spamcop.net

Linux just gets better. by bannerman · 2008-04-24 07:43 · Score: 2, Insightful

I would think that on top of the benefits of patching running high-uptime servers this would in the long run also result in yet another benefit to running Linux on your desktop instead of Windows. I don't see any reason RedHat, Ubuntu and everyone else wouldn't implement this type of kernel upgrade for convenience' sake.

--
I keep forgetting my place. Jesus is for losers. Why do I still play to the crowd?

Brilliant by TheNetAvenger · 2008-04-24 07:49 · Score: 0, Troll

Brilliant...

Let all of us go through the MS patents, and write an article or tech paper on it pertaining to Linux so we can pretend we invented the concept.

Hell, most Linux users ignore Microsoft, so they would buy it 99% of the time, until they do it and MS knocks on the door and goes, "Nope..."

If we are going to apply this stuff to Linux, we need to at least give Microsoft credit or pretend it is not our 'brainchild', and maybe Microsoft will continue to leave us alone.

Geesh...

Re:Brilliant by Anonymous Coward · 2008-04-24 09:06 · Score: 0

If we are going to apply this stuff to Linux, we need to at least give Microsoft credit or pretend it is not our 'brainchild', and maybe Microsoft will continue to leave us alone.

Hey Microsoft Asshat. Guess What? Apple did this first with GS/OS:
Q In some of my recent work, I've found it necessary to patch the Apple II GS GS/OS vectors in order to monitor OS calls. My patch works without interfering, but it disappears when the user switches to ProDOS 8 and back to GS/OS.

Go Astroturf for Microsoft somewhere else.

Love,
From somebody who knows more about computers than you do.
Re:Brilliant by TheNetAvenger · 2008-05-03 19:43 · Score: 1

SO you are arguing the *linux* innovation is an Apple rip off. Fine, I'll go with that too, but this still holds my freaking point, it is NOT NEW!!!

GOT IT, Ass-masticator?

PS Comparing the kernel shufflings of a monitored process of GS/OS to something hotpatching portions of a kernel like NT is like comparing your Green Machine you had when you were three to your Hummer and claiming the technology is the same and fits the example.

Re:And Microsoft claims to have invented it by johannesg · 2008-04-24 08:04 · Score: 2, Informative

AmigaOS had its kernel in ROM, and could be patched on the fly. That was back in 1985, so even if it was patented, it isn't now.

The patching function was not an accident either; there was an OS function for this purpose. Originally it was intended to allow bug fixes to be installed without having to change the ROM, but it was quickly co-opted into a mechanism for enhancing the OS in various other ways as well.

Microsoft Idea? by Anonymous Coward · 2008-04-24 08:06 · Score: 0

So you're trying to tell me that Microsoft came up with a way to patch an OS with no reboot necessary?

Gotta go, someone just saw a flying pig....

Prior Art? by fahrbot-bot · 2008-04-24 08:11 · Score: 1

Hotpatching ... the idea seems to be patented by Microsoft.

Hasn't NASA been doing this with satellites and probes since whenever (well, since they started probing things anyway)? Haven't computer viruses (and organic ones, for that matter) been doing this since the first infection?

--
It must have been something you assimilated. . . .

Patented by Microsoft? by dreamchaser · 2008-04-24 08:27 · Score: 1

That must be why almost every version of Windows requires a reboot for even the most trivial of changes or updates. Vista is a bit better in this regard but it still needs a reboot when kernel level patches are applied.

Re:Patented by Microsoft? by Shados · 2008-04-24 16:13 · Score: 1

Windows requires reboots for kernel-level patches, but it sure as hell doesn't (even pre-Vista, it really didn't change in Vista) require a reboot for even the most trivial of changes or updates.

The only reason someone may think so is because installers have a "You must reboot now" screen added to them by default and no one seems to remember to take 'em out. Kernel patches aside, it's -extremely- rare to have to reboot a Windows machine. Virtually on the same level as a *nix box.
Re:Patented by Microsoft? by Lennie · 2008-04-25 02:27 · Score: 1

The NT-kernel is much more like a micro-kernel as I understand it, so it would make sense to be able to swap a module for a new version.

--
New things are always on the horizon

Lisp has been patching running images for years by Anonymous Coward · 2008-04-24 08:33 · Score: 0

Lisp has been patching running images for quite a while (much earlier than '02), so MS' patent is vulnerable to prior art claims.

I'm sure there are other examples of patching running systems too.

You don't understand telephony. by Ungrounded+Lightning · 2008-04-24 08:41 · Score: 1

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching.

Servers be damned.

If you are a carrier in telephony, virtually all your subscribers have a connection consisting of a single line terminated by a single box at the edge of your network (sometimes a series string of single boxes, each doing different parts of the job). When anything makes that box unavailable, even for moments, all the customers whose sole connection is through it are down. That might be tens of thousands or hundreds of thousands of them.

Such boxes are designed for "six-nines" uptime targets. Redundant power supplies, redundant control processors, redundant interconnects between cards, redundant uplinks, redundant FANS, backplanes with no active components, software designed with separately replaceable and restartable modules, etc.

And yes they DO run for years without a reboot. (In fact even servers do. For instance: The servers at the baby bells that record the billing records generated at the connect, answer, and end of a call. Some of those have been up for years - with minor patches queued for the next time some mishap makes them reboot.)

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

pannus.sf.net had kernel hot updates in 2006 by tech-law-ny · 2008-04-24 09:21 · Score: 1

Why doesn't the Related Work section discuss kpannus from sourceforge.net/projects/pannus? 'Another command of the PANNUS is the "kpannus" for kernel live patching. ... the PANNUS controls kernel by using "stop_machine_run()" for safety ensuring, which creates threads for each CPU to execute a function without any interruption. ... The PANNUS for kernel patch("kpannus") is tested for some functions in the kernel such as sched_clock, do_gettimeofday, filesystems_read_proc, cmdline_read_proc, or init_timers.'

Prior art: telephony already solved this by KeithH · 2008-04-24 09:34 · Score: 1

In the telephony world, uninterrupted service is a basic requirement. Much of that is accomplished through redundant hardware where upgrades take place on the standby side first, the service switches over, and then the other side upgrades. This is typically done at night during a maintenance window due to the temporary loss of redundancy.

However, hot patching has been done for decades.

In the case of Nortel's DMS, this is a trivial operation thanks to the clever design of the language and the operating system. A designer can code and build a patch in minutes.

Systems running C/C++ can accomplish the same thing by having a loader which recognizes "patch vectors" and updates the symbol table on the fly.
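As a rough illustration of that loader idea, here is a Python sketch rather than real C/C++ loader code. The symbol_table dict, the load_patch helper, and the route_call functions are all invented for the example; the only point is that callers resolve names through a table the loader is allowed to rewrite.

```python
# Hypothetical symbol table: callers look functions up by name on every call,
# so replacing an entry redirects all future calls.
symbol_table = {}


def route_call_v1(number):
    return "trunk-A:" + number


def route_call_v2(number):
    # The "patch": strip spaces before routing.
    return "trunk-A:" + number.replace(" ", "")


def invoke(name, *args):
    """Call through the symbol table rather than binding the function directly."""
    return symbol_table[name](*args)


def load_patch(patch_vector):
    """Apply a patch vector: a mapping of symbol name -> replacement function."""
    previous = {name: symbol_table[name] for name in patch_vector}
    symbol_table.update(patch_vector)
    return previous   # kept so the patch can be backed out


symbol_table["route_call"] = route_call_v1
before = invoke("route_call", "555 0100")
undo = load_patch({"route_call": route_call_v2})
after = invoke("route_call", "555 0100")
print(before, "->", after)
```

Passing the returned undo mapping back to load_patch restores the previous entries, which is one simple way to back a patch out.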

There are limitations but there is nothing magical about "Ksplice" and I'm sure that Microsoft's "patent" adds nothing new either.

Is this REALLY a concern? by Anonymous Coward · 2008-04-24 09:35 · Score: 0

Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.

Since when has Teh Lunix ever concerned itself with using MS code or patents? No reason to start now.

Re:Is this REALLY a concern? by grantek · 2008-04-24 10:57 · Score: 1

It's OK - I believe IBM has got this technology in AIX 6, so Microsoft can spread all the patent FUD they want, if they get sue-happy they'll have to take on a corporation with good experience slapping patent trolls down.
Re:Is this REALLY a concern? by dgatwood · 2008-04-24 11:25 · Score: 1

How do you patent something that has been done by computer worms for years?

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Is this REALLY a concern? by Lennie · 2008-04-25 02:10 · Score: 1

Didn't Solaris also have binary kernel patches ?

--
New things are always on the horizon
Re:Is this REALLY a concern? by WebCowboy · 2008-04-25 05:15 · Score: 1

Well, it might be of concern if the patent was valid. It was filed in 2002. The practice of performing kernel updates on a running system predates this patent by many many years. I personally witnessed PCs running QNX have kernel updates applied without rebooting or service interruption in 1996.

Re:And Microsoft claims to have invented it by puppet10 · 2008-04-24 09:37 · Score: 1

With the number of reboots required to simply install apps on an MS OS machine do they have a successful implementation of this patent hiding somewhere?

--
-------- This space intentionally left blank --------

Re:Over-engineered solution to a non-existent prob by rcamans · 2008-04-24 09:55 · Score: 1

If there is a new-found security exploit in the wild, which requires an immediate patch, or you essentially die, you sure as sh*t better install that patch as fast as possible on a machine, and see that it does not hurt. No waiting for QA to sign off. I worked in one of the largest, and the QA process was very long, and we were regularly going down because of virus, etc. No email can kill a company, for example. If the company is doing 10 million a day business, totally on the internet, to have the internet go down is deadly.
We had servers that had multiple processors, and you could shut down one proc and upgrade its OS, etc, while the whole system kept running. we paid a lot for that. This is gold. Ignore the bs put out by so-called know-it-alls, and decide if this fits your needs for yourself. Anytime you get an increase in the number of choices of how to do something, you are on the upside of life.

--
wake up and hold your nose

Re:If MS holds the patent, then why don't they use by dakrin9 · 2008-04-24 10:03 · Score: 1

From the Microsoft patent:

"Not all code changes can be installed via a hotpatch, at least not safely or easily. For example fixes that affect multiple functions and cannot be broken up into independent changes, and fixes that cannot be run while unpatched versions of the code are running, cannot be applied with hotpatching"

"In other words, a particular fix needs to be able to be broken into independent changes, each affecting a single routine, and the system has to function correctly even if some threads execute an unpatched routine after another thread executes the patched version"

They sum it up saying that hotpatching is basically limited to "relatively small, single function fixes, like adding a parameter check or fixing a leak, etc."

Re:If MS holds the patent, then why don't they use by dakrin9 · 2008-04-24 10:06 · Score: 1

You may be able to implement a system where every function knows what "version" it is and also keeps the old functional code so that if a patched function ends up being called by an old version then it can just execute the old code.

You still have to make sure that execution of an old version thread, and a new version thread at the same time doesn't break anything though.
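A minimal Python sketch of that "versioned function" idea follows. The VersionedFunction class and the check_len example are hypothetical, not anything taken from the patent or from ksplice; they only show old callers staying on old code while new callers get the patch.

```python
class VersionedFunction:
    """Keeps every version of a function so old callers can keep using theirs."""

    def __init__(self, original):
        self.versions = [original]

    def patch(self, replacement):
        self.versions.append(replacement)
        return self.latest              # version number of the new code

    @property
    def latest(self):
        return len(self.versions) - 1

    def call(self, caller_version, *args):
        # A caller pinned to an older version gets the old code, not the patch.
        return self.versions[caller_version](*args)


check_len = VersionedFunction(lambda buf: True)      # v0: no bounds check at all
old_thread_version = check_len.latest                # a thread started before the patch
check_len.patch(lambda buf: len(buf) <= 16)          # v1: adds the parameter check
new_thread_version = check_len.latest

print(check_len.call(old_thread_version, b"x" * 32))   # still runs the unpatched code
print(check_len.call(new_thread_version, b"x" * 32))   # runs the patched code
```

The sketch does nothing about the harder problem noted above: both versions can run at the same time, so any data they share still has to make sense to each of them.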

Of course coldpatching doesn't have to worry about any of this, so the whole "version" functionality and keeping old code only need apply to the hotpatching, not the coldpatching

This was the smallest part of the interview... by tytso · 2008-04-24 10:46 · Score: 3, Informative

Funny thing... this was the smallest part of my oh, hour and twenty minute interview with the reporter. The reason for the call was to hear about what was up with the 2.6.25 release; she probably spent more time talking with me about KVM and Xen; and I mentioned ksplice just as an aside, as an example of lots of really interesting and exciting work that doesn't necessarily happen as part of a mainline kernel release. I spent maybe 2-3 minutes tops talking to her about ksplice --- and that's what she ends up writing about and getting slashdotted!

Comment removed by account_deleted · 2008-04-24 11:56 · Score: 1

Comment removed based on user account deletion

Re:And Microsoft claims to have invented it by Anonymous Coward · 2008-04-24 12:07 · Score: 0

The AS/400 has had the ability to load any object file generated from any compilation unit into a running kernel for a long time. Kernel patches are applied this way all the time on that system.

Depends on your install by jd · 2008-04-24 12:39 · Score: 1

If you use LinuxBIOS, your reboot time is about 3 seconds. If you're looking at the telco market and want embedded Linux systems on literally millions of sites, you might even consider dumping the use of FLASH for the kernel and going to ASIC. At which point, your boot time might easily be around 0.3 seconds.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:Depends on your install by Allador · 2008-04-24 15:52 · Score: 1

That would only affect the BIOS post time for the mainboard though, right?

You've still got SCSI card bios startups, RAID card bios startups, RAC/ILO/BMC bios startups, etc.

Those each take at least 3-4 seconds in my experience, and most servers have a few.
Re:Depends on your install by Lennie · 2008-04-25 02:16 · Score: 1

I say, use kexec instead (it's in Debian stable/Ubuntu: kexec-tools).

It loads the new kernel into memory and reboots into it.

--
New things are always on the horizon

Irony by kbolino · 2008-04-24 17:19 · Score: 1

Microsoft did such a good job with their patent that I only have to reboot Windows XP on Patch Tuesdays!

Re:If MS holds the patent, then why don't they use by darkpixel2k · 2008-04-24 17:46 · Score: 1

I mean, every minor little Windows Update makes my machine reboot. I am so sick of starting up Parallels, having updates immediately run and require a reboot. (But there's no way I'm letting my machine go without the updates.) Yeah, it probably doesn't help that I only load Windows once a month, so there are invariably a bunch of updates waiting. But still...

What's even worse is installing Windows Updates on a production machine, rebooting a few hours later after all the users have gone home--and then having it pop the hell up again with NEW updates.

--
There's no place like ::1 (I've completed my transition to IPv6)

Microsoft patent? by Arancaytar · 2008-04-24 20:05 · Score: 1

Microsoft patented a method of avoiding system reboots?

WHY ARE THEY NOT USING IT? :P

Like patching code in X.25 and frame switches... by Anonymous Coward · 2008-04-24 20:14 · Score: 0

I used to install patches for Telecom grade switches and basically we altered op-codes. In some cases it was a matter of adding a new section of op-codes to some spare memory location and then inserting a jump in the old area. Each line we entered did have checksums. This was late 1980's, early 1990's with the Hughes Network Systems packet and frame switching INS 9000. Fairly self-explanatory. Obviously nothing at all like the innovative Microsoft Patent.

ericsson by schamarty · 2008-04-24 20:35 · Score: 1

I have no idea when MS patented it, but I do know that Ericsson switches have long had the capability to accept OS patches while running. I am *not* an expert on that and I don't recall the exact terminology etc., but it shouldn't be hard to find. I also seem to recall that this was one of the features of Erlang, but am less sure of that.

Re:And Microsoft claims to have invented it by ranulf · 2008-04-24 21:26 · Score: 1

Solaris also supported in-place kernel patching a long while before this patent - I remember learning about it in 1997, but I think it had existed for quite some time before this.

Re:And Microsoft claims to have invented it by ranulf · 2008-04-24 21:33 · Score: 1

Bad form to reply to my own post, but having just looked at the patent, it is also somewhat different from the Linux patch mechanism anyway. Claim 20 involves checking the filename of currently running modules, claim 21 involves checking a hash, etc... That said, the patent basically describes simple patching of function entry points with a few consistency checks. There is a ton of prior art on this: as someone else mentioned, the Amiga OS was designed for this, it was common in Mac extensions in the 80s, it's the basic mechanism for DOS-based TSRs and viruses, etc... It's also clearly not non-obvious.

Re:And Microsoft claims to have invented it by davecb · 2008-04-24 23:50 · Score: 1

And, to be fair, OS/360 had patch
space compiled/assembled in.

--dave

--
davecb@spamcop.net

Brilliant indeed. Old concept too... by Fjodor42 · 2008-04-25 00:01 · Score: 1

Erm, maybe, just maybe, could there be a slight chance that you didn't read the comments written earlier than yours?

Check http://en.wikipedia.org/wiki/Prior_Art and do a search over the comments for examples of this concept, and explain this "we need to at least give Microsoft credit"-thingy in a little more detail?

It would be grand, though, if we could muster a rally against defunct PTOs like the US one...

--
"The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."

Microsoft has NOT patented this! by Anonymous+EPA · 2008-04-25 00:06 · Score: 2, Informative

Tomasz Chmielewski is wrong. Microsoft applied for a patent and their application was rejected by the examiner, as was their appeal in the USPTO. Check out the file history of application US 2004-0107416.

Their only resort is to appeal to court.

There are no applications in other countries.

A

Re: by clint999 · 2008-04-25 02:18 · Score: 0

True, but I've been standing in switch rooms watching operators manually kill those circuits because they wanted to reboot a box. 5x 9s doesn't mean perfect service, and if anyone complained about it they were told that a ms interruption once every few mon

--
College-Pages.com - Online Colleges, Degrees, and Programs

Obvious? by Yfrwlf · 2008-04-25 04:14 · Score: 1

Not only does this show how ludicrous software patents are, not only does this show how ludicrous our patent system is, not only do patent workers clearly need to be fired, but the definition of obvious clearly escapes the USPTO. The next logical step and feature to add to prevent having to reboot the kernel to apply updates is making it so you don't have to.

I wish Obama would eliminate the USPTO after he's president.

--
Promote true freedom - support standards and interoperability.

Hotpatch in Linux predates the filing by DrYak · 2008-04-25 11:55 · Score: 1

The Microsoft patent covers a special technique that Microsoft claims to have developed and says it needs to add to its future systems (and that we haven't yet seen in mass production). It basically consists of a messy kludge: the unpatched code is overwritten with a jump instruction that forces execution to proceed to another place, which contains the patched code.
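
To make the jump trick concrete, here is a toy model of it. This is only a sketch under loud assumptions: the three-instruction machine (`ADD`, `JMP`, `RET`), the `run` interpreter, and `hot_patch` are all invented for illustration and have nothing to do with real x86 code or with any actual Microsoft or Linux implementation.

```python
def run(memory, pc):
    """Interpret the toy program in 'memory' starting at address 'pc' until RET."""
    acc = 0
    while True:
        op, arg = memory[pc]
        if op == "ADD":
            acc += arg
            pc += 1
        elif op == "JMP":
            pc = arg  # continue execution somewhere else
        elif op == "RET":
            return acc
        else:
            raise ValueError(f"unknown instruction {op!r} at address {pc}")


# Address 0 is the entry point of a buggy function: it adds 1 but should add 2.
memory = [
    ("ADD", 1),     # 0: buggy body
    ("RET", None),  # 1
    ("ADD", 2),     # 2: patched body, loaded elsewhere in memory
    ("RET", None),  # 3
]


def hot_patch(memory, old_entry, new_entry):
    """Overwrite the first instruction of the old code with a jump to the new code."""
    memory[old_entry] = ("JMP", new_entry)
```

Every caller still calls address 0; after `hot_patch(memory, 0, 2)` the first thing executed there is the jump, so the old body is never reached again.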

It's complicated, has to be specially implemented and hasn't seen widespread use yet.

Besides, as another /.er has said, I don't see how that's patentable, because *ALMOST ALL MS-DOS VIRUSES* in the 90s used this exact technique. In fact, a "jump from the beginning of the code to the viral block at the end" was one of the heuristic criteria used by some antivirus programs to assess the risk of suspected code (Thunderbyte's antivirus, for example). That constitutes massive prior art.

All the various hot patching projects that exist on Linux use a completely different approach. They leverage the module facility present in the plain vanilla kernel to add or replace kernel functions or stubs with new code present in the module. Except that instead of adding a function to control some hardware, as a driver module does, a patch module replaces a function with a new, bug-free version.

So there's nothing new conceptually. It's just a slightly different usage of a mechanism that has been in the kernel for a long time, several years before the patent was even filed. The different projects are, in typical F/LOSS spirit, making tools that leverage this existing technology to facilitate the patching process.

But the technology itself has been there since the late 90s, whereas the Microsoft patent was filed in 2002.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

power pc port by mehemiah · 2008-04-26 16:46 · Score: 1

Is anyone planning to port this to the PowerPC architecture or any other? What part of this is arch-specific anyway?

286 comments