"No Reboot" Kernel Patching - And Why You Should Care

Linux version 4.0 was released on 12 April, and one of its most discussed new features is "no reboot" kernel patching. With the major distros committing to support the 4.0 kernel and its features (including "no reboot" patching) at some point this year, it's a good time to take a look at what this feature actually does and what difference it will make for you.

First of all, what does it actually mean? Well, for once, this is a feature with a name that describes what it does pretty well. With versions of Linux before 4.0, when the kernel is updated via a patch, the system needs to reboot.

Kernel patches are released for a number of reasons, but fixing security holes is the most frequent one. This is why it's important to install a patch as soon as possible.

Unlike other operating systems, Linux is able to update many different parts of the system without a reboot, but the kernel is different. Every running process integrates with the kernel intimately, so switching out parts of the kernel while it is running is quite risky.

On the other hand, rebooting the computer is irksome, and in some cases, where uptime is important, it can be a real issue. This is why "no reboot" kernel patching has been a priority for many administrators.

Recognizing this need, two companies have been hard at work on two different solutions. Red Hat has been working on kpatch, and SUSE has been working on kGraft. Both of these programs are designed to accomplish the same task, but they take different approaches and have different strengths.

Kpatch freezes every process and then reroutes calls from the old kernel functions to the new, patched functions, before removing the old code. Because it handles every running process in one sweeping move, it runs quite fast - one to forty milliseconds and it's done. However, during this time the processes are frozen, which means there is some downtime - a mere fraction of a second, but in certain situations, that may be unacceptable.
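The freeze-then-switch idea can be pictured with a toy model. This is only a sketch of the description above, not kpatch's actual code: `FreezeKernel`, `old_checksum`, `new_checksum` and `patch_all_at_once` are hypothetical names made up for illustration.

```python
# Toy model of "freeze everything, swap the function, resume".
# Every name here is hypothetical; this is not how kpatch is implemented.

def old_checksum(data):
    # Stand-in for the buggy kernel function.
    return sum(data) % 255

def new_checksum(data):
    # Stand-in for the patched replacement.
    return sum(data) % 256

class FreezeKernel:
    def __init__(self):
        self.functions = {"checksum": old_checksum}
        self.frozen = False

    def call(self, name, *args):
        # While frozen, no task may enter the kernel: this is the downtime.
        if self.frozen:
            raise RuntimeError("all tasks are frozen while patching")
        return self.functions[name](*args)

    def patch_all_at_once(self, name, replacement):
        # One global switch: every caller sees the new code afterwards.
        self.frozen = True
        try:
            self.functions[name] = replacement
        finally:
            self.frozen = False

kernel = FreezeKernel()
before = kernel.call("checksum", [200, 55])
kernel.patch_all_at_once("checksum", new_checksum)
after = kernel.call("checksum", [200, 55])
```

The point of the model is the trade-off: the switch is a single step that applies to everyone at once, but nothing can run while `frozen` is set.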

kGraft, on the other hand, handles threads one by one as they make system calls (without forcing them to freeze first) until all of the threads are running the patched code. At this point, the patch is fully installed and the old code is replaced. The patch takes longer to complete this way, but it completes without any downtime.
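The per-thread approach can be sketched in the same toy style. Again, this only illustrates the description above and is not kGraft's real mechanism: `LazyKernel`, `ToyThread`, `start_patch` and `patch_complete` are hypothetical names.

```python
# Toy model of "migrate each thread when it next makes a system call".
# Every name here is hypothetical; this is not how kGraft is implemented.

class ToyThread:
    def __init__(self, name):
        self.name = name
        self.migrated = False  # still using the old code

class LazyKernel:
    def __init__(self, threads):
        self.threads = threads
        self.patch_pending = False

    @staticmethod
    def _old_impl(x):
        return x + 1  # stand-in for the unpatched function

    @staticmethod
    def _new_impl(x):
        return x + 2  # stand-in for the patched function

    def start_patch(self):
        # Nothing is frozen; threads keep running the old code for now.
        self.patch_pending = True

    def syscall(self, thread, x):
        # Kernel entry is treated as the safe point to switch this thread.
        if self.patch_pending and not thread.migrated:
            thread.migrated = True
        impl = self._new_impl if thread.migrated else self._old_impl
        return impl(x)

    def patch_complete(self):
        # The patch is finished only once every thread has switched.
        return self.patch_pending and all(t.migrated for t in self.threads)

a, b = ToyThread("a"), ToyThread("b")
kernel = LazyKernel([a, b])
result_before = kernel.syscall(a, 10)     # old code
kernel.start_patch()
result_a = kernel.syscall(a, 10)          # thread a switches here
done_early = kernel.patch_complete()      # b has not made a call yet
result_b = kernel.syscall(b, 10)          # thread b switches here
done_late = kernel.patch_complete()
```

Here nothing ever stops, but the patch stays half-applied until the slowest thread makes its next call, which is why this approach takes longer to finish.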

Having solved the same problem separately, from two different angles, the two companies then came together in October last year. They looked at how their different approaches could be fused together, and the result of this merge has been pushed into version 4.0 of the kernel.

So, having described what "no reboot" kernel patching is, and how it works, the next question most users will have is "what difference does it make?"

For desktop users, the difference is relatively trivial. For users without 4.0, installing a kernel patch means rebooting the system, which means you must save your work and interrupt your workflow. This is irritating, and can cause a small hiccup in your productivity. If everyone in a medium or large office has to install a patch on the same day, it hits productivity a bit harder. However, this is a relatively small cost and is worthwhile to ensure security.

On the other hand, some servers and critical real-time applications must not be taken down without advance scheduling, even for a few minutes. This can be a pain when administrators need to keep the system secure and a patch is released to repair a newly discovered security hole. In this case, no-reboot patching becomes a real boon.

But this doesn't mean that system reboots are gone forever. Even on a system with the Linux 4.0 kernel, there will be security updates that still require a reboot, because there are other non-kernel components that can require patching, and some of these require a reboot as part of the process.

Some critics are therefore claiming that focusing so much effort and time on no-reboot patching is missing the real target that needs fixing - the reason why this feature was developed was to avoid the cost of rebooting a system. Maybe developers should be trying to make it less expensive to reboot a Linux system instead?

This is an introduction of a vulnerability.

Servers can avoid reboots for long periods of time, but not forever.

I wouldn't call it a 'vulnerability', but it does bring up something that can sometimes cause problems: the longer a system runs, the less confidence administrators tend to have in it coming back up cleanly after a reboot. I've seen servers go years between reboots even without this feature because they weren't being religiously patched. (They were fairly stable systems that weren't externally facing.) The longer they'd go, the less confidence you'd have that you actually knew about any changes that had been made to the systems. Additionally, having long uptimes could occasionally mask hardware issues. I've seen AIX servers that pretty much ran continuously from the time the OS was installed until the next update, which in the case of these systems was about 4-5 years. For some, once the hard drives spun down, they just wouldn't spin back up, so they were fine as long as they were chugging along, but the moment you tried to reboot, you were in serious trouble.

If your change control procedures are good, you can stay on top of any configuration changes that have occurred, but sometimes it's hard to remember stuff from more than a year back.

Yes, I agree with all.

Except the conventional wisdom of applying updates without fully testing them in test systems exhaustively and letting them “age”.

But on that practice I am pretty much contrarian to all admins.

Admins, I get it, have the job of applying system updates.

But they don’t have the authority to order exhaustive system tests... which would be whole projects in and of themselves that would involve business users and other IT folks, significantly... with no perceived benefit for the business user community and use of tons of their time and effort.

So, in lieu of comprehensive system testing... sysadmins simply have to apply system updates on their own schedule.

The “conventional wisdom” that has been pushed on everyone is to continuously apply every update as fast as possible, i.e., keep up.

Skip no updates, apply them all as they come out as soon as possible.

Unfortunately, not every update is secure.

Therefore, the apply-all-as-fast-as-you-can practice actually guarantees that every insecure update will get applied at some point in time.

So the admin does not avoid any vulnerabilities, he installs every one, followed by its fix at some point, along with succeeding new vulnerabilities.

The best strategy for security, of course, would be to search for, test for, and build configurations that had a combination of updates that was secure, as much as can possibly be determined by researching the vulnerabilities of each piece of software and its updates. Every server configuration deemed ready for production would have a combination of updates applied such that the system either had no vulnerabilities or had workarounds that were properly implemented for those that were known.

But alas, this is too much work.

... for little perceived benefit by the business owners. It is really hard to prove a negative, i.e., if you force the BU to fully test patches (and other updates), you won't have as many errors.

Sometimes you can, but folks who aren't serious nerds don't generally understand how computers do their various magics in the first place, so explanations are lost on them.

Many moons ago, I worked for MCI. We had a really awesome lab facility that had copies of every bit of hardware installed on the network, so we could do full integration testing of all patches, updates and upgrades. It was freaking excellent. The "as-built" docs were astoundingly detailed.

So, MCI was bought by a criminal organization known as "Worldcom" so as to keep a ponzi scheme by the mastermind of the criminal organization afloat.

I recall an email conversation that followed a rather large outage that occurred on the Worldcom side of the house. The WC guy asked why the MCI side hadn't experienced the particular outage when patch "X" was applied to their switches. The MCI guy replied, well, when we loaded it in the test systems in the lab it broke stuff, so we sent it back to the vendor and held off deploying it. The WC guy's response was essentially "you tested it first?" Uh, yeah bud. This is a multi-billion dollar corporation, we test our stuff.

Personally, I really like the idea of not having to reboot for kernel updates. There will always be exceptions. From what I read, there is some stuff that just can't safely be hot-patched because of dependencies. However, for routine stuff, it's a Godsend IMO. Computers should almost never have to be rebooted. The concept of monthly reboots is an artifact of the shoddy code produced by Microsoft. Real computers don't need monthly reboots, and IMO, anyone who recommends them is not someone I'm inclined to listen closely to.

Remember the floating-point accumulating error on the old Patriot missile system?

Their workaround was to reboot every so often. Otherwise, according to reports, accuracy was not good.

*Completely* different situation from kernel patching and memory leaks. That was a numerical issue: error propagating and accumulating because of the limited precision of the floating point representation of numbers.
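To show how a tiny representation error adds up over uptime, here is a rough sketch of the arithmetic. The numbers are assumptions taken from commonly cited public accounts of the Patriot problem, not from this thread: a clock counting tenths of a second, the constant 0.1 chopped to 23 binary fractional digits, and 100 hours of continuous running. Those accounts describe a fixed-point register rather than true floating point, but the effect is the same kind of accumulation.

```python
from fractions import Fraction

# Assumptions (from commonly cited public accounts, not from this thread):
FRACTION_BITS = 23        # binary digits kept after the point
TICK = Fraction(1, 10)    # the clock counts tenths of a second
HOURS_UP = 100            # continuous uptime without a reboot

# 0.1 has no exact binary representation, so chopping it loses a little.
scale = 2 ** FRACTION_BITS
stored_tick = Fraction(int(TICK * scale), scale)  # int() truncates
error_per_tick = TICK - stored_tick

# The same small error is added once per tick, so it grows with uptime.
ticks = HOURS_UP * 3600 * 10
drift_seconds = float(error_per_tick * ticks)

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"clock drift after {HOURS_UP} h: {drift_seconds:.4f} s")
```

Under those assumptions the drift comes out at roughly a third of a second after 100 hours, and a reboot resets the tick count to zero, which is why rebooting "every so often" worked as a workaround.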
