Livepatch Related Work
From KShot1
Existing live patching focuses on open-source operating systems, mainly Linux.
For example, Ksplice 2, kpatch 3, and kGraft 4 can effectively patch security vulnerabilities without causing a significant downtime.
Kpatch and Ksplice both stop the running OS and ensure that none of the processes are affected by changes induced by patched functions.
Specifically, kpatch replaces whole functions with patched ones, while Ksplice patches individual instructions instead of entire functions.
kGraft patches vulnerabilities at function level, but does not need to stop the running processes.
It maintains the original and patched function simultaneously and decides which one to execute by monitoring the state of processes, potentially inducing incorrect behavior or consuming additional storage.
These methods cannot address changes to data structures.
To address this limitation, KUP 5 replaces the whole kernel with a patched version, but uses checkpoint-and-restore to maintain application state consistency.
However, it checkpoints all the user processes, leading to large CPU and memory overhead.
KARMA 6 uses a kernel module to replace vulnerable instructions that it identifies from a given patch diff.
In addition, several live updating methods have been integrated into operating systems, like Canonical Livepatch Service 7 in Ubuntu, and Proteos 8 on MINIX 3, which can update new components if the patch is small. However, these methods still rely on the trustworthy operation of the target OS, so potential kernel-level attacks may tamper with the live patching operation, leading to system failure. KSHOT addresses this by leveraging a TEE to reliably patch the target kernel with a smaller TCB and low total overhead.
Preliminary
Hot Patch history 9
DDJ: And when you had to hot-patch in flight?
GR: That's standard procedure. You always build in the ability to change it.
DDJ: Just in case.
GR: Just in case, but JPL and a bunch of decent-sized companies have had the problem where you can't get all the software done in time for launch. You always make sure you build the capability to change things.
I. kprobe
A kernel probe is a set of handlers placed on a certain instruction address. There are two types of probes in the kernel as of now, called “KProbes” and “JProbes.”
KProbes is a debugging mechanism for the Linux kernel which can also be used for monitoring events inside a production system.
A KProbe is defined by a pre-handler and a post-handler. When a KProbe is installed at a particular instruction and that instruction is executed, the pre-handler is executed just before the execution of the probed instruction. Similarly, the post-handler is executed just after the execution of the probed instruction.
Users can insert their own probe inside a running kernel by writing a kernel module which implements the pre-handler and the post-handler for the probe. In case a fault occurs while executing a probe handler function, the user can handle the fault by defining a fault-handler and passing its address in struct kprobe.
KProbes depends heavily on processor-architecture-specific features and uses slightly different mechanisms depending on the architecture on which it is being executed.
KProbes is currently available on the following architectures: ppc64, x86_64, sparc64, and i386.
JProbes are used to get access to a kernel function’s arguments at runtime. A JProbe is defined by a JProbe handler with the same prototype as that of the function whose arguments are to be accessed. When the probed function is executed the control is first transferred to the user-defined JProbe handler, followed by the transfer of execution to the original function.
[Figure: KProbes architecture — https://static.lwn.net/images/ns/kernel/KProbesArchitecture.png]
The figure above describes the architecture of KProbes. On the x86, KProbes makes use of the exception handling mechanisms and modifies the standard breakpoint, debug and a few other exception handlers for its own purpose. Most of the handling of the probes is done in the context of the breakpoint and the debug exception handlers which make up the KProbes architecture dependent layer. The KProbes architecture independent layer is the KProbes manager which is used to register and unregister probes. Users provide probe handlers in kernel modules which register probes through the KProbes manager.
Interface
- Data Structure
/*<linux/kprobes.h>*/
struct kprobe {
struct hlist_node hlist; /* Internal */
kprobe_opcode_t addr; /* Address of probe */
kprobe_pre_handler_t pre_handler; /* Address of pre-handler */
kprobe_post_handler_t post_handler; /* Address of post-handler */
kprobe_fault_handler_t fault_handler; /* Address of fault handler */
kprobe_break_handler_t break_handler; /* Internal */
kprobe_opcode_t opcode; /* Internal */
kprobe_opcode_t insn[MAX_INSN_SIZE]; /* Internal */
};
struct jprobe {
struct kprobe kp;
kprobe_opcode_t *entry; /* user-defined JProbe handler address */
};
- Register Method
typedef int (*kprobe_pre_handler_t)(struct kprobe*, struct pt_regs*);
typedef void (*kprobe_post_handler_t)(struct kprobe*, struct pt_regs*,
unsigned long flags);
typedef int (*kprobe_fault_handler_t)(struct kprobe*, struct pt_regs*,
int trapnr);
- Kprobes Management
void lock_kprobes(void)
Locks KProbes and records the CPU on which it was locked
void unlock_kprobes(void)
Resets the recorded CPU and unlocks KProbes
struct kprobe *get_kprobe(void *addr)
Using the address of the probed instruction, returns the probe from hash table
int register_kprobe(struct kprobe *p)
This function registers a probe at a given address. Registration involves copying the instruction at the probe address into a probe-specific buffer. On x86 the maximum instruction size is 16 bytes, hence 16 bytes are copied from the given address. It then replaces the instruction at the probed address with the breakpoint instruction.
void unregister_kprobe(struct kprobe *p)
This function unregisters a probe. It restores the original instruction at the address and removes the probe structure from the hash table.
int register_jprobe(struct jprobe *jp)
This function registers a JProbe at a function address. JProbes use the KProbes mechanism. In the KProbe pre_handler it stores its own handler setjmp_pre_handler and in the break_handler stores the address of longjmp_break_handler. Then it registers struct kprobe jp->kp by calling register_kprobe()
void unregister_jprobe(struct jprobe *jp)
Unregisters the struct kprobe used by this JProbe
Trigger
- KProbes

After the probes are registered, the addresses at which they are active contain the breakpoint instruction (`int3` on x86). As soon as execution reaches a probed address the `int3` instruction is executed, causing control to reach the breakpoint handler `do_int3()` in arch/i386/kernel/traps.c. `do_int3()` is called through an interrupt gate, therefore interrupts are disabled when control reaches it. This handler notifies KProbes that a breakpoint occurred; KProbes checks whether the breakpoint was set by its registration function. If no probe is present at the address at which the breakpoint was hit, it simply returns 0. Otherwise the registered probe function is called.
[Figure: KProbe execution flow — https://static.lwn.net/images/ns/kernel/KProbeExecution.png]

- JProbes
In the first step, when the breakpoint is hit, control reaches `kprobe_handler()`, which calls the JProbe pre-handler (`setjmp_pre_handler()`). This saves the stack contents and the registers before changing the eip to the address of the user-defined function. Then it returns 1, which tells `kprobe_handler()` to simply return instead of setting up single-stepping as for a KProbe. On return, control reaches the user-defined function, which can access the arguments of the original function. When the user-defined function is done it calls `jprobe_return()` instead of doing a normal return.
In the second step, `jprobe_return()` truncates the current stack frame and generates a breakpoint which transfers control to `kprobe_handler()` through `do_int3()`. `kprobe_handler()` finds that the generated breakpoint address (the address of the `int3` instruction in `jprobe_return()`) does not have a registered probe, but KProbes is active on the current CPU. It assumes that the breakpoint must have been generated by JProbes and hence calls the break_handler of the current_kprobe which it saved earlier. The break_handler restores the stack contents and the registers that were saved before control was transferred to the user-defined function, and returns.
In the third step, `kprobe_handler()` sets up single-stepping of the instruction at which the JProbe was set, and the rest of the sequence is the same as for a KProbe.
[Figure: JProbe execution flow — https://static.lwn.net/images/ns/kernel/JProbeExecution.png]
Weakness
Several problems can occur while a probe is handled by KProbes. The first possibility is that several probes are handled in parallel on an SMP system.
There is a common hash table shared by all probes which needs to be protected against corruption in such a case; kprobe_lock serializes probe handling across processors.
Another problem occurs if a probe is placed inside KProbes code, causing KProbes to enter probe handling code recursively.
This problem is taken care of in kprobe_handler() by checking if KProbes is already running on the current CPU.
In this case the recursing probe is disabled silently and control returns back to the previous probe handling code.
If preemption occurs when KProbes is executing it can context switch to another process while a probe is being handled.
The other process could cause another probe to fire which will cause control to reach kprobe_handler() again while the previous probe was not handled completely.
This may result in disarming the new probe when KProbes discovers it’s recursing.
To avoid this problem, preemption is disabled when probes are handled.
Similarly, interrupts are disabled by causing the breakpoint handler and the debug handler to be invoked through interrupt gates rather than trap gates.
This disables interrupts as soon as control is transferred to the breakpoint or debug handler.
These changes are made in the file arch/i386/kernel/traps.c.
A fault might occur during the handling of a probe.
In this case, if the user has defined a fault handler for the probe, control is transferred to the fault handler.
If the user-defined fault handler returns 0 the fault is handled by the kernel.
Otherwise, it’s assumed that the fault was handled by the fault handler and control reaches back to the probe handlers.
KProbes however cannot be used directly for these purposes.
In the raw form a user can write a kernel module implementing the probe handlers.
However higher level tools are necessary for making it more convenient to use.
Such tools could contain standard probe handlers implementing the desired features or they could contain a means to produce probe-handlers given simple descriptions of them in a scripting language like DProbes.
II. ftrace
Interface
III. strace
Interface
IV. ebpf
Interface
V. State of the Art
Ksplice2
Introduction
See pdf10
In particular, Ksplice starts with a call to stop_machine_run(), which dumps a high-priority thread onto each processor, thus taking control of all processors in the system.
It then examines all threads in the system to ensure that none of them are running in the functions to be replaced; if none are, trampoline jumps are patched into the beginning of each replaced function (they “bounce” calls to the old code into the replacement code) and life continues.
Otherwise Ksplice backs off and tries again later.
Limitations
- Code only; data structures are out of scope.
Solution: personal preparation, via ksplice_apply(void (*func)()).
While Ksplice is applying the changes - and while the rest of the system is still stopped - the given func will be called. It can then go rooting through the kernel’s data structures, changing things as needed.
- Retry-based approach to ensuring that no threads are running in the patched functions. It might wait forever.
Additional Talking
It is simple to satisfy the demand of removing a structure field. However, things become complicated when a new field must be added. A “shadow” mechanism is applied to solve this: it allocates a separate structure to hold the new field. This brings some challenges as well:
- the original patch must be changed in a number of places.
- Code which allocates the affected structure must be modified to allocate the shadow as well, and code which frees the structure must be changed in similar ways.
- Any reference to the new field(s) must, instead, look up the shadow structure and use that version of the field. All told, it looks like a tiresome procedure which has a significant chance of introducing new bugs.
- There is also the potential for performance issues caused by the linear linked list search performed to find the shadow structures.
KReplace
Based on JProbes, but function-oriented only and even harder to use than Ksplice.
see 11 12
KGraft
Introduction
KGraft is the work of Jiří Kosina and Jiří Slaby, both working at SUSE. The approach they have taken is simpler than Ksplice, and lacks some of the capabilities that Ksplice offers (adding shadow members to structures, for example). On the other hand, the basic kGraft code is only a 600-line patch, and the process of applying a patch is quite a bit more lightweight, with less impact on the system.
KGraft works by replacing entire functions in the kernel. Using the tools supplied with the patch set, a developer can turn a patch into a list of changed functions; the new versions of those functions are then compiled into a separate kernel module. When that module is loaded into the kernel, kGraft takes care of the task of replacing the existing, buggy functions with the new, fixed versions.
Hazards
- The chief hazard is this: what happens if a process is running in the kernel while the patch is being applied?
Solution: a marker in `thread_info` separates processes into an “old universe” and a “new universe”.
- What about processes that don’t make this transition in a reasonable period of time? For example, a process stuck waiting for I/O on a network socket.
Solution: a flag under /proc allows the system administrator to identify processes that are gumming up the works.
- What about kernel threads, which have no user space to return to?
Solution: most threads call `kthread_should_stop()` to exit; the rest get `kgr_task_safe()` inserted (old->new universe).
- Interrupts block globally?
Solution: kGraft can block interrupts on the local CPU while it’s making its changes, but cannot do so globally. kGraft adds a per-CPU array to track whether each processor has run in process (non-interrupt) context. That flag is initially set to false, and `schedule_on_each_cpu()` is called to run a kGraft function, in process context, on each processor. That function, which cannot run until any pending interrupts on a given CPU have been serviced, will set the per-CPU flag. The function-replacement stub, meanwhile, will force interrupt code to run in the old universe on any CPU that has not yet set its per-CPU “new universe” flag.
- Inner data
Do a “flip and forget” switch between functions that expect different formats of in-memory data, without performing a non-trivial all-memory lookup to find the structures in question and performing the corresponding transformations.13
Two steps:
(1) Redirect to a temporary band-aid function that can handle both semantics of the data (presumably in a highly sub-optimal way).
(2) Once step (1) succeeds completely (kGraft claims victory), start a new round of patching with a redirect to the final function, which expects only the new semantics.
kPatch
Like kGraft, kpatch replaces entire functions within a running kernel. A kernel patch is processed to determine which functions it changes; the kpatch tools (not included with the patch, but available in this repository) then use that information to create a loadable kernel module containing the new versions of the changed functions.
A call to the new `kpatch_register()` function within the core kpatch code will use the ftrace function-tracing mechanism to intercept calls to the old functions, redirecting control to the new versions instead.
It starts by calling stop_machine() to bring all other CPUs in the system to a halt.
Then, kpatch examines the stack of every process running in kernel mode to ensure that none are running in the affected function(s); should one of the patched functions be active, the patch-application process will fail.
If things are OK, instead, kpatch patches out the old functions completely (or, more precisely, it leaves an ftrace handler in place that routes around the old function).
There is no tracking of whether processes are in the “old” or “new” universe; instead, everybody is forced to the new universe immediately if possible. (`stop_machine()` is a massive sledgehammer14 and its use should be avoided.)
FATAL ISSUE: if kernel code is running inside one of the target functions, kpatch will simply fail.
Ingo15 suggested achieving an even ‘cleaner’ state for all tasks in the system: freeze all tasks, as the suspend and hibernation code (and kexec) does, via freeze_processes().
Frederic Weisbecker15 suggested the kernel thread parking mechanism.16
Inner-data: The plan for the near future is to add a callback that can be packaged with a live patch; its job would be to search out and convert all affected data structures while the system is stopped and the patch is being applied. This approach has the potential to work without the need for maintaining the ability to cope with older data structures, but only if all of the affected structures can be located at patching time — a tall order, in many cases. And they claim that only a few patches make changes to kernel data structures.17
Livepatch
The code merged for 4.0 is a common core that is able to support patching with both kpatch and kGraft. It provides an API that allows patch-containing modules to be inserted into the kernel; it also allows the listing and removal of patches if need be. This API performs the low-level redirection needed to replace patched functions.
But it’s missing an important component, called the “consistency model”, that ensures the safety of switching between versions of a function in a running kernel.
Unified Consistency Model18
This approach retains the two-universe model from kGraft, but it uses the stack-trace checking from kpatch to accelerate the task of switching processes to the new code.
In theory, this technique increases the chances of successfully applying patches while doing away with kpatch’s disruptive stop_machine() call and much of kGraft’s higher code complexity.
When patching, tasks are carefully transitioned from the old universe to the new universe. A task can only be switched to the new universe if it’s not using a function that is to be
patched or unpatched. After all tasks have moved to the new universe, the
patching process is complete.
How it transitions various tasks to the new universe:
- The stacks of all sleeping tasks are checked. Each task that is not sleeping on a to-be-patched function is switched.
- Other user tasks are handled by do_notify_resume() (see patch 9/9). If a task is I/O bound, it switches universes when returning from a system call. If it’s CPU bound, it switches when returning from an interrupt. If it’s sleeping on a patched function, the user can send SIGSTOP and SIGCONT to force it to switch upon return from the signal handler.
- Idle “swapper” tasks which are sleeping on a to-be-patched function can be switched from within the outer idle loop.
- An interrupt handler will inherit the universe of the task it interrupts.
- kthreads which are sleeping on to-be-patched functions are not yet handled (more on this below).
Advantages vs. kpatch:
- no stop_machine() latency
- higher patch success rate (can patch in-use functions)
- patching failures are more predictable (the primary failure mode is attempting to patch a kthread which is sleeping forever on a patched function; more on this below)

Advantages vs. kGraft:
- less code complexity (don’t have to hack up the code of all the different kthreads)
- less impact on processes (don’t have to signal all sleeping tasks)

Disadvantages vs. kpatch:
- no system-wide switch point (not really a functional limitation, just forces the patch author to be more careful, but that’s probably a good thing anyway)
My biggest concerns and questions related to this patch set are:
- To safely examine the task stacks, the transition code locks each task’s rq struct, which requires using the scheduler’s internal rq locking functions. It seems to work well, but I’m not sure if there’s a cleaner way to safely do stack checking without stop_machine().
- As mentioned above, kthreads which are always sleeping on a patched function will never transition to the new universe. This is really a minor issue (less than 1% of patches). It’s not necessarily something that needs to be resolved with this patch set, but it would be good to have some discussion about it regardless. To overcome this issue, I have half an idea: we could add some stack-checking code to the ftrace handler itself to transition the kthread to the new universe after it re-enters the function it was originally sleeping on, if the stack doesn’t already have any other to-be-patched functions. Combined with klp_transition_work_fn()'s periodic stack checking of sleeping tasks, that would handle most of the cases (except when trying to patch the high-level thread_fn itself). But then how do you make the kthread wake up? As far as I can tell, wake_up_process() doesn’t seem to work on a kthread (unless I messed up my testing somehow). What does kGraft do in this case?
Objections
It comes down to the fact that getting a reliable stack trace out of a process running in kernel space is not as easy as one might expect, so stack unwinding is not absolutely safe.
Broken traceback code stays out of sight until some distributor issues a live patch, at which point things will go badly wrong. This risk will always be hard to avoid, since the correct functioning of the kernel does not otherwise depend on perfectly accurate stack traces.
Alternatives to stack traceback:
- Force every process in the system into a quiescent, non-kernel state before applying a patch.19 (But kernel threads cannot be pushed out of kernel space.)
- kGraft’s two-universe model. (The downside of this approach is that the process of trapping every process in a safe place can take an unbounded period of time, during which the system is in a weird intermediate state.)
Alternative Solution
CRIU-seamless
The developers working on CRIU (checkpoint-restore in user space) have had
seamless kernel upgrades20 in their list of use cases for a while, and
they evidently have it working for some workloads.
They just save the entire state of the system, boot into an entirely new kernel,
then restore the previous state on top of the new kernel.
For some use cases, it might make sense to checkpoint all of the user-space processes,
kexec() to the new kernel, then restore all of user space.
That would allow changing to a completely new kernel, but it would not be
immediate (or live). It also would reinitialize the hardware, which may not be desirable.
[AT&T/Lucent Unix RTR System]:https://lwn.net/Articles/735186/
[Nortel’s wireline switch]:https://lwn.net/Articles/802295/
A branch (to the new code) was inserted as the first instruction of the old function, while keeping a copy of the old function around just in case you wanted to unpatch.
This worked so long as there was no alteration to global variables, etc., and the changes were restricted to code within the old function.
If you needed to fully replace a module binary, you had that option too, but depending on the type of changes, you might have needed a restart to complete the operation.
Later work
There were competing solutions, so a meeting was held at the 2014 LPC in Düsseldorf to
discuss the matter.
Each solution was presented and the developers came up with a plan to try to merge one unified scheme.
It would start with a minimal base on top of Ftrace, with a simple API.
Live patches could be registered with a list of functions to be replaced,
and it only supported a limited set of patch types that could be applied.
That was merged into the mainline in February 2015.
Since then, ideas have been cherry-picked from kpatch and kGraft to be added to the kernel under the CONFIG_LIVEPATCH option.
There is now a combined, hybrid consistency model that uses lazy migration by
default, but falls back to stack examination for long-sleeping processes and kthreads.
Originally, the feature was x86-only, but it has been added to s390 and PowerPC-64, with ARM64 in the works.
Update of stack examination
- Earlier efforts at getting reliable stack traces either used frame pointers, which had a severe performance penalty, or DWARF debugging records, which turned out to be unreliable and slow. ORC is effectively a stripped-down version of DWARF that has nothing more than is needed for reliable stack unwinding. The ORC unwinder was merged into 4.14 and will also be used for oops and panic output. So far, it is only available for x86_64, but is in progress for other architectures; the main work is on objtool, Kosina said, as the ORC unwinder is straightforward to port.
- Ensure that assembly-language pieces of the kernel will also produce a valid stack trace.
Objective
- Competent
- Compatible
- Secure
More to do 21
- Modifications of data structures.
- Optimizations like GCC’s -fipa-ra change the ABI for functions the compiler knows about, so such patches cannot be handled.
- Kprobes cannot switch an existing kprobe to a new function. There is an inability to patch hand-written assembly code; ftrace is not able to work with that code.
- User-space applications might be built with tools other than GCC, which compounds the problem; ftrace is not able to deal with this situation.
- It’s also harder to define a checkpoint where consistency can be assured.
References
J. Poimboeuf and S. Jennings, “Introducing kpatch: dynamic kernel patching,” Red Hat Enterprise Linux Blog, 2014. ↩︎
SUSE, “Live Patching the Linux Kernel Using kGraft,” https://www.suse.com/documentation/sles-15/book_sle_admin/data/cha_kgraft.html, 2018. ↩︎
S. Kashyap, C. Min, B. Lee, T. Kim, and P. Emelyanov, “Instant OS updates via userspace checkpoint-and-restart,” in USENIX Annual Technical Conference, 2016. ↩︎
Y. Chen, Y. Zhang, Z. Wang, L. Xia, C. Bao, and T. Wei, “Adaptive Android kernel live patching,” in Proceedings of the 26th USENIX Security Symposium, 2017. ↩︎
Ubuntu, “Canonical Livepatch Service,” https://www.ubuntu.com/livepatch, 2018. ↩︎
C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum, “Safe and automatic live update for operating systems,” in ACM SIGARCH Computer Architecture News, vol. 41, no. 1. ACM, 2013, pp. 279–292. ↩︎
https://www.drdobbs.com/a-conversation-with-glenn-reeves/184411097 ↩︎
http://web.mit.edu/ksplice/doc/ksplice.pdf ↩︎
https://lwn.net/Articles/308236/ ↩︎
https://lwn.net/Articles/308421/ ↩︎
https://lwn.net/Articles/597426/ ↩︎
https://lwn.net/Articles/597407/ ↩︎
https://lwn.net/Articles/500338/ ↩︎
https://lwn.net/Articles/597430/ ↩︎
https://lwn.net/Articles/632582/ ↩︎
https://lwn.net/Articles/634660/ ↩︎
http://criu.org/Usage_scenarios#Seamless_kernel_upgrade ↩︎
https://lwn.net/Articles/734765/ ↩︎