The current macro and task stack setup in the Linux kernel
Getting the task handle in kernel context
In the Linux kernel, a running process or kernel thread is usually represented by a pointer to a task_struct structure, which holds almost all of the information related to that process or kernel thread. When a user-space application runs in kernel context, its task_struct has to be fetched in order to access process-related data. The Linux kernel has an ingenious solution to this problem, which lets running code acquire a pointer to task_struct at lightning speed: a C macro named current. Taking Linux/ARMv7-a as an example, the related macros and functions are defined as:
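In user space, the same lookup can be modeled as follows. This is a sketch rather than the kernel’s own code: the sp parameter stands in for the ARM stack pointer register, the single-field thread_info and the pid field are simplifications assumed for illustration, and thread_info_from_sp, current_from_sp, demo_stack and demo_lookup are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THREAD_SIZE 8192u /* kernel stack size: 2^13 bytes */

_Static_assert((THREAD_SIZE & (THREAD_SIZE - 1u)) == 0,
               "masking only works for a power-of-two stack size");

struct task_struct {
    int pid; /* hypothetical field, just to have something to read back */
};

struct thread_info {
    struct task_struct *task; /* back-pointer to the owning task */
};

/* thread_info and the stack share the same memory block. */
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

/* Clear the low 13 bits of a stack address to reach the base of the stack. */
struct thread_info *thread_info_from_sp(uintptr_t sp)
{
    union thread_union *base =
        (union thread_union *)(sp & ~(uintptr_t)(THREAD_SIZE - 1u));
    return &base->thread_info;
}

/* Model of the current macro: follow thread_info->task. */
struct task_struct *current_from_sp(uintptr_t sp)
{
    return thread_info_from_sp(sp)->task;
}

/* One simulated kernel stack, aligned to its own size. */
static _Alignas(8192) union thread_union demo_stack;

/* Returns 1 if an address in the middle of the stack resolves to tsk. */
int demo_lookup(struct task_struct *tsk)
{
    assert(tsk != NULL);
    demo_stack.thread_info.task = tsk;
    uintptr_t sp = (uintptr_t)&demo_stack.stack[1000]; /* arbitrary depth */
    return current_from_sp(sp) == tsk;
}
```

Passing sp as a parameter keeps the model portable; in the kernel the value comes straight from the stack pointer register, which is why the lookup costs only a couple of instructions.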
Wherever kernel code needs a pointer to the current task, just current will do. On a Linux/ARMv7-a system, the lower 13 bits of the current stack pointer are cleared, and the result is treated as a pointer to a thread_info structure; its task member points to the task_struct of the current process or kernel thread, that is, the task handle. Let’s disassemble a system call, close, to find out how the macros and inline functions are translated into machine code:
We can verify from the disassembly above that the current stack pointer, sp, is indeed involved in acquiring the current task handle. An interesting fact is that a file descriptor is treated as an unsigned integer in the kernel. This works because a negative integer passed into the kernel becomes an unsigned integer so large that it cannot be used to index into the process file table, so an invalid file descriptor is detected and EBADF is returned. Next, we have two questions to ask:
- On a Linux/ARMv7-a system, when switching between user-space and kernel-space contexts, how does the kernel decide which stack to use? And are there two distinct stacks, used separately by the user-space and kernel-space contexts?
- When creating a new user-space process or thread, how does the kernel set up the thread_info and task_struct pointers, so that kernel code can easily obtain the task_struct pointer by invoking the current macro?
The first question is easy to answer with a moment’s thought. User-space and kernel-space contexts have to use two distinct stacks: because a user-space application cannot access kernel memory directly, a different stack must be used when switching between contexts. To answer the second question, we wrote a simple application that creates a sub-thread once it starts, and used it to debug the Linux kernel and deepen our understanding of it.
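The test application itself is not listed here. A minimal sketch of such a program, assuming POSIX threads are used to create the sub-thread, could look like this; worker and spawn_sub_thread are hypothetical names, and the real test program may differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Body of the sub-thread: record that it ran by setting the flag passed in. */
static void *worker(void *arg)
{
    int *ran = arg;
    assert(ran != NULL);
    *ran = 1;
    return NULL;
}

/* Create one sub-thread, wait for it, and return 1 if it ran, 0 on failure. */
int spawn_sub_thread(void)
{
    pthread_t tid;
    int ran = 0;

    if (pthread_create(&tid, NULL, worker, &ran) != 0)
        return 0;
    if (pthread_join(tid, NULL) != 0)
        return 0;
    return ran; /* safe to read: pthread_join synchronizes with the thread */
}
```

Calling spawn_sub_thread() from main is enough to exercise the thread-creation path. On Linux, the C library typically implements pthread_create with the clone system call (newer glibc versions may try clone3 first), which is where the kernel-side tracing in the next section starts. Older toolchains may need the -pthread flag when compiling.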
Creating a new kernel stack for a new task
The debugging session was carried out with QEMU, which loads a kernel zImage and runs it. After the user-space application was invoked, the Linux kernel stopped at a breakpoint placed at the very beginning of the clone system call:
Note that the clone system call can also be used to create a new process, so we double-checked that clone_flags tells the kernel to create a new thread for the user-space application. Reading through the kernel sources, we found that the newly created thread’s stack is assigned at kernel/fork.c, line 871, so another breakpoint was added there:
Registers r6 and r8 hold the task_struct pointer and the newly created kernel stack pointer respectively: 0x9e63ae00 and 0x9df0a000. One more word about kernel stack allocation: on Linux/ARMv7-a systems, kernel stacks are usually 8192 bytes in size and aligned to 8192 bytes. The alignment is there for a technical reason. Recall that when acquiring the task_struct pointer, the lower 13 bits of the stack pointer are cleared; this only yields the base of the stack if the stack is aligned to 8192 bytes (2^13). We can infer from the assembly instruction str r8, [r6, #4] that the offset of the stack member in the task_struct structure is 4 bytes, which is confirmed by the kernel source code:
/* include/linux/sched.h */
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
    /*
     * For reasons of header soup (see current_thread_info()), this
     * must be the first element of task_struct.
     */
    struct thread_info thread_info;
#endif
    /* -1 unrunnable, 0 runnable, >0 stopped: */
    volatile long state;
    /*
     * This begins the randomizable portion of task_struct. Only
     * scheduling-critical items should be added above here.
     */
    randomized_struct_fields_start
    void *stack; /* CONFIG_THREAD_INFO_IN_TASK is not defined, current offset is 4 bytes */
    ...
From the kernel source code we can infer that the lower end of the newly created stack is actually treated as a thread_info structure, and that the task_struct pointer has to be stored in thread_info. Let’s find out where the store happens:
The debugging results show that the new task_struct pointer, 0x9e63ae00, is stored 12 bytes past the beginning of the new stack. Here is the corresponding kernel source, an inline function defined in include/linux/sched/task_stack.h:
Now we know how the Linux kernel creates a new kernel stack for a new task, and that the two structures refer to each other (from the kernel source code, tsk->stack = stack and task_thread_info(p)->task = p). More importantly, the new kernel stack is aligned to 8192 bytes, so the address obtained when the current macro masks the stack pointer is always the lower end of the new kernel stack (which stores the thread_info structure).
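The cross-linking of the two structures can be sketched in user space as well. This is an illustration under assumptions, not the kernel’s allocator: aligned_alloc from C11 stands in for the kernel’s stack allocation, both structures are reduced to the one field that matters here, and attach_stack is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u /* stack size and alignment, in bytes */

struct task_struct;

struct thread_info {
    struct task_struct *task; /* points back at the owning task */
};

struct task_struct {
    void *stack; /* lowest address of this task's kernel stack */
};

/* Allocate a size-aligned stack and cross-link it with tsk.
 * Returns 0 on success, -1 if the allocation fails. */
int attach_stack(struct task_struct *tsk)
{
    assert(tsk != NULL);

    void *stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    if (stack == NULL)
        return -1;

    tsk->stack = stack;                        /* tsk->stack = stack */
    ((struct thread_info *)stack)->task = tsk; /* task_thread_info(p)->task = p */
    return 0;
}

/* thread_info sits at the very start (lower end) of the stack. */
struct thread_info *task_thread_info(struct task_struct *tsk)
{
    return (struct thread_info *)tsk->stack;
}
```

Because the block is aligned to its own size, any address inside it masks back to the start of the block, which is where the back-pointer was just written. The caller releases the stack with free(tsk->stack).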
Setting up contexts for new task
The new kernel stack has now been allocated and the new task_struct brought into existence, but the new task cannot run immediately. Some architecture-specific setup has to be carried out before the new task is ready to run. We now turn our attention to a function named copy_thread(...), which gets called whenever an application forks a child process or creates a sub-thread, or when the kernel creates a kernel thread. The main purpose of copy_thread(...) is to set up the entry function and the top of the stack for the newly created task, by writing the structures that represent ARMv7-a core registers (notably struct pt_regs for the user-space context, and struct cpu_context_save for the kernel-space context):
After careful calculation, the stack pointer for the new task is 0x9df0bfb0 when execution reaches the very entry of the new task, which is in fact an assembly function defined in arch/arm/kernel/entry-common.S. Note that on Linux/ARMv7-a systems kernel stacks are usually 8192 bytes, the lower end of the stack stores the thread_info structure, and the stack grows down. The kernel stack is very small compared to a user-space stack, and as kernel developers we should always keep this in mind. For the new task’s kernel-space context, registers are written to a struct cpu_context_save structure; this is distinct from struct pt_regs, which is used to store the registers from user-space. Lastly, by adding a breakpoint at the first machine instruction of the function ret_from_fork, we can verify our calculation:
The beautiful assembly code above takes the newly created task to user-space, where, as mentioned earlier, it runs happily as a sub-thread of our test application. So far we have followed roughly the whole dance of the Linux kernel creating a new task and setting up its kernel stack, which is what lets the current macro expand correctly and fetch the current task handle. However, how does the Linux kernel keep the kernel stack pointer while the application is running in user-space? The answer is that an ARMv7-a core has several stack pointer registers (R13), banked according to CPU execution mode. When an application switches from user-space into kernel-space, the user-space registers are pushed onto the kernel stack (accessed via the struct pt_regs structure), always near the top of the stack; when switching back to user-space, the saved registers are popped off the kernel stack, which keeps the kernel stack balanced.
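The stack pointer value quoted earlier, 0x9df0bfb0, can be reproduced from the stack base 0x9df0a000 with a few lines of arithmetic. The sketch below rests on two assumptions that are not shown in this article: that 8 bytes are kept free at the very top of the stack (THREAD_START_SP = THREAD_SIZE - 8), and that struct pt_regs on 32-bit ARM holds 18 words. Both match the ARM kernel headers of that generation as far as we know, but treat them as assumptions; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE     8192u
#define THREAD_START_SP (THREAD_SIZE - 8u)     /* assumed: 8 bytes reserved at the top */
#define PT_REGS_WORDS   18u                    /* assumed: 18 saved 32-bit words */
#define PT_REGS_SIZE    (PT_REGS_WORDS * 4u)   /* 72 bytes */

/* Address of the saved user-space registers (pt_regs) for a stack
 * whose lower end is stack_base. Addresses are modeled as 32-bit values. */
uint32_t task_pt_regs_addr(uint32_t stack_base)
{
    assert((stack_base & (THREAD_SIZE - 1u)) == 0); /* base must be size-aligned */
    return stack_base + THREAD_START_SP - PT_REGS_SIZE;
}

/* Initial kernel stack pointer of a new task, assuming copy_thread points
 * the saved sp at the pt_regs frame so the stack grows down from there. */
uint32_t initial_kernel_sp(uint32_t stack_base)
{
    return task_pt_regs_addr(stack_base);
}
```

With these constants, initial_kernel_sp(0x9df0a000) gives 0x9df0a000 + 8192 - 8 - 72 = 0x9df0bfb0. Clearing the lower 13 bits of that value leads back to 0x9df0a000, the thread_info at the lower end of the stack.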
Conclusion
- For a thread of an application, running in user-space and kernel-space requires two different stacks. The user-space stack can be chosen by the application (via the clone system call), but the kernel stack is allocated and freed by the Linux kernel.
- The current macro in the Linux kernel deserves special attention: the lower end of the kernel stack stores the thread_info structure, which holds a pointer to the task_struct handle (for Linux/ARMv7-a systems, of course).
- The entry of an application process/thread, and the entry of a kernel thread, is always ret_from_fork.
- Kernel stacks are small, and not all of their space is available.