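Today I started reading Linux Kernel Development.
Judging from the table of contents, the book covers a lot of ground and goes beyond LDD: LDD focuses on device drivers, while LKD focuses on the kernel itself.
The first two chapters, Introduction and Getting Started, cover Linux history, operating system concepts, the kernel development environment, and how to download and build the kernel source. They are background reading, so I take no notes on them here.
I start directly from Chapter 3: Process Management.
This chapter explains the process, introduces the related concept of the thread, and describes how the kernel manages processes through their life cycle. Since the kernel exists to serve applications, its process management matters a great deal to user-space programs.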
The Process
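Operating systems courses often say that a process is a program in execution. That is not quite accurate: besides the program code, a process also includes the resources it needs to run, such as open files, pending signals, internal kernel data, process state, a memory address space, one or more threads of execution, and a data section containing global variables. A program by itself has none of these.
These resources are transparent to the user-space process; the kernel manages all of them.
Threads are similar to processes but not the same. In operating systems theory, the thread is the basic unit of scheduling and the process is the basic unit of resource management, so what actually executes code and gets scheduled is the thread, not the process. Each thread has its own program counter, process stack, and set of processor registers. Interestingly, the Linux kernel does not distinguish between the two: a thread is just a special kind of process, described by the same structure.
A process provides two virtualizations: a virtualized processor and virtual memory. While it runs, a process can behave as if it had the whole CPU and all of memory to itself, without considering other processes. In reality the physical CPU and memory are shared among many processes.
A process's life begins when it is created. On Linux this is done with the fork system call, which creates a child by duplicating the calling process. fork returns twice from the call site, once in the parent and once in the child. After the child is created, it typically calls exec to start running a new program.
A process's life ends with the exit system call, which terminates the process and frees its resources. The parent can wait for the child to finish with the wait system call. A child that has exited stays a zombie until its parent waits for it.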
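The fork, exec, and wait life cycle described above can be sketched from user space. This is a minimal sketch, not anything from the book: run_and_wait is a hypothetical helper name, and the example assumes a POSIX system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with `argv` in a child process and
 * return its exit status, or -1 if it did not exit normally. */
int run_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();   /* returns twice: 0 in the child, the child's PID in the parent */
    if (pid < 0)
        return -1;        /* fork failed */

    if (pid == 0) {
        execv(path, argv);   /* replace the child's image with a new program */
        _exit(127);          /* only reached if execv failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)   /* reap the child so it does not stay a zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling run_and_wait("/bin/sh", args) with args set to {"sh", "-c", "exit 7", NULL} should return the child's exit status.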
Process Descriptor and the Task Structure
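The kernel uses struct task_struct to manage processes. The process descriptor is simply a struct task_struct (kernel code usually handles it through a pointer). It holds everything the kernel needs to know about a process: open files, the virtual address space, pending signals, the process state, and much more. The structure is therefore large, around 1.7 KB on a 32-bit machine according to the book.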
Allocating the Process Descriptor
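In kernel 2.6, the memory for struct task_struct is allocated by the slab allocator. Before 2.6, task_struct was stored at the end of each process's kernel stack, so it could be located from the stack pointer without dedicating a register to it, which suits register-poor architectures such as x86. From 2.6 on, task_struct is allocated dynamically and no longer sits at the end of the stack; instead a new structure, struct thread_info, took its place there. The thread_info listing below is from kernel 4.15; as its own comment suggests, it is the IA-64 version: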
/*
* On IA-64, we want to keep the task structure and kernel stack together, so they can be
* mapped by a single TLB entry and so they can be addressed by the "current" pointer
* without having to do pointer masking.
*/
struct thread_info {
struct task_struct *task; /* XXX not really needed, except for dup_task_struct() */
__u32 flags; /* thread_info flags (see TIF_*) */
__u32 cpu; /* current CPU */
__u32 last_cpu; /* Last CPU thread ran on */
__u32 status; /* Thread synchronous flags */
mm_segment_t addr_limit; /* user-level address space limit */
int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
__u64 utime;
__u64 stime;
__u64 gtime;
__u64 hardirq_time;
__u64 softirq_time;
__u64 idle_time;
__u64 ac_stamp;
__u64 ac_leave;
__u64 ac_stime;
__u64 ac_utime;
#endif
};
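How thread_info is stored (figure from the book, not reproduced here):
If the stack grows down (toward lower addresses), thread_info lives at the lowest address of the stack; if it grows up, thread_info lives at the highest address. The figure has a typo: the structure at the low address should be struct thread_info, not struct thread_struct. The two are different things.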
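Placing thread_info at the base of an aligned stack is what makes it reachable from the stack pointer alone: as far as I understand the 2.6-era x86 code, the kernel masks off the low bits of the stack pointer to get the base address. The following user-space sketch only illustrates that arithmetic. THREAD_SIZE of 8 KB is an assumption, and demo_thread_info, info_from_sp, and alloc_demo_stack are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u   /* assumed stack size: two 4 KB pages */

struct demo_thread_info {   /* stand-in for struct thread_info */
    void *task;             /* would point at the task_struct */
};

/* Round any address inside the stack down to the stack's base,
 * which is where the thread_info stand-in lives. */
struct demo_thread_info *info_from_sp(uintptr_t sp)
{
    return (struct demo_thread_info *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Allocate a THREAD_SIZE-aligned block standing in for a kernel stack. */
void *alloc_demo_stack(void)
{
    return aligned_alloc(THREAD_SIZE, THREAD_SIZE);
}
```

The masking only works because the block is aligned to its own size, which is why the stack is allocated with aligned_alloc here.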
Storing the Process Descriptor
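The kernel identifies each process by a PID of type pid_t, which is typically an int. For backward compatibility the default maximum is 32768, which can be changed through /proc/sys/kernel/pid_max.
Kernel code usually operates on a process through its task_struct, so being able to find the task_struct of a given process matters. For the currently running process there is the current macro. The macro is architecture-specific, which makes sense given that it was traditionally implemented through the thread_info on the stack: some architectures keep the task_struct pointer in a register, while architectures like x86 read it from the task member of thread_info. A quick look at the implementation used on x86:
First, the definition of current in include/asm-generic/current.h: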
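To check the PID limit on a running system, the file mentioned above can simply be read. A small sketch, assuming procfs is mounted at /proc; read_pid_max is a hypothetical helper name.

```c
#include <stdio.h>

/* Returns the value in /proc/sys/kernel/pid_max, or -1 if it cannot be read. */
long read_pid_max(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```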
#include <linux/thread_info.h>
#define get_current() (current_thread_info()->task)
#define current get_current()
So current is current_thread_info()->task. Next, current_thread_info():
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
* definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
* including <asm/current.h> can cause a circular dependency on some platforms.
*/
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif
Kernel 4.15 defines CONFIG_THREAD_INFO_IN_TASK, so current_thread_info() is just current cast to a thread_info pointer. Next, the current used here:
DECLARE_PER_CPU(struct task_struct *, current_task);
static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(current_task);
}
#define current get_current()
The current macro is get_current(), which in turn calls this_cpu_read_stable:
/*
* this_cpu_read() makes gcc load the percpu variable every time it is
* accessed while this_cpu_read_stable() allows the value to be cached.
* this_cpu_read_stable() is more efficient and can be used if its value
* is guaranteed to be valid across cpus. The current users include
* get_current() and get_thread_info() both of which are actually
* per-thread variables implemented as per-cpu variables and thus
* stable for the duration of the respective task.
*/
#define this_cpu_read_stable(var) percpu_stable_op("mov", var)
this_cpu_read_stable is also a macro; it expands directly to percpu_stable_op:
#define percpu_stable_op(op, var) \
({ \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
case 1: \
asm(op "b "__percpu_arg(P1)",%0" \
: "=q" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 2: \
asm(op "w "__percpu_arg(P1)",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 4: \
asm(op "l "__percpu_arg(P1)",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 8: \
asm(op "q "__percpu_arg(P1)",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
default: __bad_percpu_size(); \
} \
pfo_ret__; \
})
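percpu_stable_op is implemented in inline assembly. current_task is a pointer, so it matches case 4 on a 32-bit system and case 8 on a 64-bit system, and the macro simply loads the per-CPU variable current_task. No stack lookup is involved: with CONFIG_THREAD_INFO_IN_TASK, thread_info is embedded as the first member of task_struct, so the task_struct pointer can be cast to a thread_info pointer directly.
All of the above is based on the 4.15 source, which differs from the 2.6 implementation described in the book. 4.15 has CONFIG_THREAD_INFO_IN_TASK=y, which means thread_info is stored inside task_struct.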
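The cast in current_thread_info() relies on a plain C property: a pointer to a structure also points at its first member. The sketch below shows that idea with stand-in types; demo_thread_info, demo_task, and demo_current_thread_info are hypothetical names, not kernel definitions.

```c
#include <stddef.h>

struct demo_thread_info {
    unsigned long flags;
};

struct demo_task {
    struct demo_thread_info thread_info;   /* must stay the first member */
    int pid;
};

/* Mirrors the idea of ((struct thread_info *)current): the cast is
 * valid only because thread_info sits at offset 0 of the task. */
struct demo_thread_info *demo_current_thread_info(struct demo_task *task)
{
    return (struct demo_thread_info *)task;
}
```

If another member were ever placed in front of thread_info, the cast would silently point at the wrong data, so the ordering is a hard requirement of this scheme.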
Process State
The state field of task_struct describes the current condition of the process. There are five states:
TASK_RUNNING
The process is runnable: it is either running or on a run queue waiting to run.
TASK_INTERRUPTIBLE
The process is sleeping, waiting for some condition. It can also be woken by a signal. It is not on a run queue.
TASK_UNINTERRUPTIBLE
The process is sleeping, waiting for some condition. It wakes only when the condition occurs; signals do not wake it. It is not on a run queue.
__TASK_TRACED
The process is being traced by another process, for example a debugger such as gdb using ptrace.
__TASK_STOPPED
Execution has stopped. This happens when the process receives SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, or when it receives any signal while being debugged.
Manipulating the Current Process State
The kernel often needs to change a process's state. For the current process it uses:
#define set_current_state(state_value) \
smp_store_mb(current->state, (state_value))
Note that set_task_state no longer exists in kernel 4.15.
Process Context
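The main job of a process is to execute, within its user-space address space, the instructions loaded from a program. When the process makes a system call or triggers an exception, it enters the kernel. The kernel is then said to be running in process context, executing on behalf of that process, and current is valid. When the system call or exception handler finishes, the kernel returns and the process resumes in user space, unless a higher-priority process has become runnable in the meantime.
A process can begin executing in kernel space only through these two interfaces: system calls and exception handlers.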
The Process Family Tree
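Processes form a clear hierarchy: every process is a descendant of the init process. init starts at the end of the boot sequence and, through the initscripts, creates and starts the other processes.
Every process except init has exactly one parent and zero or more children. Processes with the same parent are called siblings. These relationships are stored in task_struct through the parent pointer and the children list, which can be used to reach the corresponding parent or child processes: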
/* access the parent */
struct task_struct *my_parent = current->parent;
/* iterate over all children */
struct task_struct *task;
struct list_head *list;
list_for_each(list, &current->children) {
	task = list_entry(list, struct task_struct, sibling);
	/* task now points to one of current's children */
}
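The traversal above depends on the kernel's intrusive list: the list node is embedded in the task, and list_entry recovers the enclosing structure from the node's address. Below is a simplified user-space re-creation of that mechanism, not the kernel's actual list.h; demo_task and the demo_* helpers are hypothetical names.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

struct demo_task {
    int pid;
    struct list_head children;   /* head of this task's child list */
    struct list_head sibling;    /* link in the parent's child list */
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

void demo_task_init(struct demo_task *t, int pid)
{
    t->pid = pid;
    list_init(&t->children);
    list_init(&t->sibling);
}

void demo_add_child(struct demo_task *parent, struct demo_task *child)
{
    list_add_tail(&child->sibling, &parent->children);
}

/* Sum the PIDs of all children, walking the list the way the kernel snippet does. */
int demo_sum_child_pids(struct demo_task *parent)
{
    struct list_head *pos;
    int sum = 0;

    list_for_each(pos, &parent->children) {
        struct demo_task *child = list_entry(pos, struct demo_task, sibling);
        sum += child->pid;
    }
    return sum;
}
```

Note that the parent's list head is children while each child is linked through its sibling member, which is why list_entry is given sibling rather than children.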
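The task_struct of the first task in the system is statically allocated as init_task. Strictly speaking, init_task is the boot-time task with PID 0 (the idle or swapper task), from which the init process (PID 1) is later created: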
/*
* Set up the first task table, touch at your own risk!. Base=0,
* limit=0x1fffff (=2MB)
*/
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
__init_task_data
#endif
= {
#ifdef CONFIG_THREAD_INFO_IN_TASK
.thread_info = INIT_THREAD_INFO(init_task),
.stack_refcount = ATOMIC_INIT(1),
#endif
.state = 0,
.stack = init_stack,
.usage = ATOMIC_INIT(2),
.flags = PF_KTHREAD,
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_allowed = CPU_MASK_ALL,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
.restart_block = {
.fn = do_no_restart_syscall,
},
.se = {
.group_node = LIST_HEAD_INIT(init_task.se.group_node),
},
.rt = {
.run_list = LIST_HEAD_INIT(init_task.rt.run_list),
.time_slice = RR_TIMESLICE,
},
.tasks = LIST_HEAD_INIT(init_task.tasks),
#ifdef CONFIG_SMP
.pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
.real_parent = &init_task,
.parent = &init_task,
.children = LIST_HEAD_INIT(init_task.children),
.sibling = LIST_HEAD_INIT(init_task.sibling),
.group_leader = &init_task,
RCU_POINTER_INITIALIZER(real_cred, &init_cred),
RCU_POINTER_INITIALIZER(cred, &init_cred),
.comm = INIT_TASK_COMM,
.thread = INIT_THREAD,
.fs = &init_fs,
.files = &init_files,
.signal = &init_signals,
.sighand = &init_sighand,
.nsproxy = &init_nsproxy,
.pending = {
.list = LIST_HEAD_INIT(init_task.pending.list),
.signal = {{0}}
},
.blocked = {{0}},
.alloc_lock = __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.pids = {
[PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID),
[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID),
},
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDITSYSCALL
.loginuid = INVALID_UID,
.sessionid = (unsigned int)-1,
#endif
#ifdef CONFIG_PERF_EVENTS
.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
#endif
#ifdef CONFIG_PREEMPT_RCU
.rcu_read_lock_nesting = 0,
.rcu_read_unlock_special.s = 0,
.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
.rcu_blocked_node = NULL,
#endif
#ifdef CONFIG_TASKS_RCU
.rcu_tasks_holdout = false,
.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
.rcu_tasks_idle_cpu = -1,
#endif
#ifdef CONFIG_CPUSETS
.mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
#endif
#ifdef CONFIG_RT_MUTEXES
.pi_waiters = RB_ROOT_CACHED,
.pi_top_task = NULL,
#endif
INIT_PREV_CPUTIME(init_task)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
.vtime.seqcount = SEQCNT_ZERO(init_task.vtime_seqcount),
.vtime.starttime = 0,
.vtime.state = VTIME_SYS,
#endif
#ifdef CONFIG_NUMA_BALANCING
.numa_preferred_nid = -1,
.numa_group = NULL,
.numa_faults = NULL,
#endif
#ifdef CONFIG_KASAN
.kasan_depth = 1,
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
.softirqs_enabled = 1,
#endif
#ifdef CONFIG_LOCKDEP
.lockdep_recursion = 0,
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
.ret_stack = NULL,
#endif
#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPT)
.trace_recursion = 0,
#endif
#ifdef CONFIG_LIVEPATCH
.patch_state = KLP_UNDEFINED,
#endif
#ifdef CONFIG_SECURITY
.security = NULL,
#endif
};
EXPORT_SYMBOL(init_task);
/*
* Initial thread structure. Alignment of this is handled by a special
* linker map entry.
*/
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info init_thread_info __init_thread_info = INIT_THREAD_INFO(init_task);
#endif
By comparing a task_struct pointer with &init_task while walking up the parent pointers, you can tell when you have reached the root of the process tree.
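The same upward walk can be approximated from user space by following the PPid field that procfs exposes. This is only an analogy to the in-kernel loop over task->parent: it assumes /proc is mounted, it stops at PID 1 rather than at init_task, and parent_of, hops_to_init, and my_hops_to_init are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the parent PID recorded in /proc/<pid>/status, or -1 on failure. */
pid_t parent_of(pid_t pid)
{
    char path[64], line[256];
    int ppid = -1;

    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "PPid:", 5) == 0) {
            sscanf(line + 5, "%d", &ppid);
            break;
        }
    }
    fclose(f);
    return (pid_t)ppid;
}

/* Count the hops from pid up the parent chain until PID 1 (or 0) is
 * reached; returns -1 if the chain cannot be followed. */
int hops_to_init(pid_t pid)
{
    int hops = 0;

    while (pid > 1) {
        pid = parent_of(pid);
        if (pid < 0)
            return -1;
        hops++;
    }
    return hops;
}

/* Convenience wrapper for the calling process. */
int my_hops_to_init(void)
{
    return hops_to_init(getpid());
}
```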
Process Creation
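Process creation on Linux is split into two steps: fork and exec. fork copies the current process into a child (which parts are copied or shared can be controlled), and exec then loads a new program into the child and starts executing it.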
Copy-on-Write
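fork does not actually copy the parent's contents into the child up front. It uses copy-on-write: as long as pages are only read, parent and child share the same copy; when one of them writes to a page, that page is duplicated so that the writer gets its own private copy. Even with copy-on-write, fork still has to allocate at least a new set of page tables and a new task_struct for the child.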
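Copy-on-write itself is invisible to a program, but its visible consequence is easy to check: a write made by the child after fork never shows up in the parent. The sketch below demonstrates only that observable behavior, not the page-level mechanism; child_write_is_private is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int demo_value = 1;

/* Returns 1 if the child's write stayed private to the child,
 * 0 if it leaked into the parent, and -1 on error. */
int child_write_is_private(void)
{
    demo_value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        demo_value = 2;       /* first write: the child gets its own copy of the page */
        _exit(demo_value);    /* report what the child sees through its exit status */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    /* The child saw 2, while the parent must still see 1. */
    return WEXITSTATUS(status) == 2 && demo_value == 1;
}
```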
Forking
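The user-space fork is implemented on top of the clone() system call, which takes a set of flags telling the kernel which resources parent and child should share. Besides fork(), the user-space vfork() and __clone() are also implemented with clone(). Inside the kernel, clone() calls do_fork(), where the real work of creating the new process happens, so do_fork() is what we look at next.
Note that in kernel 4.15 the fork implementation differs somewhat from 2.6, for example through the addition of HAVE_COPY_THREAD_TLS; the walkthrough below still follows the older structure. Patch: http://lkml.iu.edu/hypermail/linux/kernel/1504.2/03324.html.
do_fork() is defined in kernel/fork.c and calls _do_fork(), so we go straight to _do_fork():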
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
unsigned long tls)
{
struct task_struct *p;
int trace = 0;
long nr;
...
p = copy_process(clone_flags, stack_start, stack_size,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
if (!IS_ERR(p)) {
struct completion vfork;
struct pid *pid;
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, parent_tidptr);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
} else {
nr = PTR_ERR(p);
}
return nr;
}
_do_fork() creates the new process with copy_process() and then starts it with wake_up_new_task(). Here is copy_process():
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
*
* It copies the registers, and all the appropriate
* parts of the process environment (as per the clone
* flags). The actual kick-off is left to the caller.
*/
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
int retval;
struct task_struct *p;
/* ... some checks on the clone flags ... */
retval = -ENOMEM;
p = dup_task_struct(current, node);
if (!p)
goto fork_out;
//...
}
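Inside copy_process():
1. It calls dup_task_struct(), which creates a new kernel stack, thread_info structure, and task_struct for the new process. Their contents come from the parent, so at this point the child's descriptor is identical to the parent's.
2. It checks that the new child does not push the current user over the resource limit on the number of processes: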
retval = -EAGAIN;
if (atomic_read(&p->real_cred->user->processes) >=
task_rlimit(p, RLIMIT_NPROC)) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_free;
}
current->flags &= ~PF_NPROC_EXCEEDED;
retval = copy_creds(p, clone_flags);
if (retval < 0)
goto bad_fork_free;
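3. Various members of the new descriptor are cleared or reset. Most of these are statistics; the bulk of task_struct is left unchanged.
4. sched_fork() sets the new process's state to TASK_NEW so that it cannot be scheduled yet.
5. Some flags of the new process are updated, for example: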
p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
p->flags |= PF_FORKNOEXEC;
6. A PID is allocated for the new process:
if (pid != &init_struct_pid) {
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
}
}
7. Depending on the clone flags, it copies or shares the remaining resources, such as open files, filesystem information, signal handlers, and the process address space:
retval = copy_semundo(clone_flags, p);
retval = copy_files(clone_flags, p);
retval = copy_fs(clone_flags, p);
retval = copy_sighand(clone_flags, p);
retval = copy_signal(clone_flags, p);
retval = copy_mm(clone_flags, p);
retval = copy_namespaces(clone_flags, p);
retval = copy_io(clone_flags, p);
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
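8. Finally, copy_process() returns a pointer to the new process.
Back in _do_fork(): once the new process has been created successfully, it is woken with wake_up_new_task() and becomes runnable: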
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
* - cpus_allowed can change in the fork path
* - any previously selected CPU might disappear through hotplug
*
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
post_init_entity_util_avg(&p->se);
activate_task(rq, p, ENQUEUE_NOCLOCK);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
if (p->sched_class->task_woken) {
/*
* Nothing relies on rq->lock after this, so its fine to
* drop it.
*/
rq_unpin_lock(rq, &rf);
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, &rf);
}
#endif
task_rq_unlock(rq, p, &rf);
}
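activate_task() puts the new process on the run queue, ready to be scheduled.
vfork() differs from fork() in that the parent's page tables are not copied, and the parent is blocked until the child calls exec or exits; only then does the parent continue.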
The Linux Implementation of Threads
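The Linux kernel has no separate concept of a thread. A thread is just a process: there is no difference in data structures or scheduling, and the only distinction is that a thread is a process that shares many resources with other processes. This approach is simple and elegant.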
Creating Threads
Creating a thread is much like creating an ordinary process: both use clone(), and only the clone flags differ. A thread might be created with flags like these:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
With these flags, parent and child share the address space, filesystem information, open files, and signal handlers. Commonly used clone flags:
CLONE_FILES
Parent and child share open files.
CLONE_FS
Parent and child share filesystem information.
CLONE_IDLETASK
Set PID to zero (used only by the idle tasks).
CLONE_NEWNS
Create a new namespace for the child.
CLONE_PARENT
Child is to have same parent as its parent.
CLONE_PTRACE
Continue tracing child.
CLONE_SETTID
Write the TID back to user-space.
CLONE_SIGHAND
Parent and child share signal handlers and blocked signals.
CLONE_SYSVSEM
Parent and child share System V SEM_UNDO semantics.
CLONE_THREAD
Parent and child are in the same thread group.
CLONE_VFORK
vfork() was used and the parent will sleep until the child wakes it.
CLONE_UNTRACED
Do not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOP
Start process in the TASK_STOPPED state.
CLONE_SETTLS
Create a new TLS (thread-local storage) for the child.
CLONE_CHILD_CLEARTID
Clear the TID in the child.
CLONE_CHILD_SETTID
Set the TID in the child.
CLONE_PARENT_SETTID
Set the TID in the parent.
CLONE_VM
Parent and child share address space.
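The effect of CLONE_VM can be seen from user space with glibc's clone() wrapper. The sketch below is an illustration under assumptions: it targets Linux with glibc, uses a 1 MB malloc'ed stack, assumes the stack grows down, and adds SIGCHLD so that a plain waitpid() works; child_fn and run_shared_vm_child are hypothetical names.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    *(int *)arg = 42;   /* visible to the parent because of CLONE_VM */
    return 0;
}

/* Returns the value the child wrote into the parent's variable, or -1 on error. */
int run_shared_vm_child(void)
{
    int value = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (!stack)
        return -1;

    /* Assumes a downward-growing stack, so the top of the block is passed. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                      &value);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    int rc = waitpid(pid, &status, 0);   /* wait before freeing the child's stack */
    free(stack);
    return rc < 0 ? -1 : value;
}
```

Without CLONE_VM the child would write to its own copy and the parent would still see 0, which is the fork behavior described earlier.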
Kernel Threads
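A kernel thread is a process too. The difference from a user process is that a kernel thread has no address space: its task_struct has mm set to NULL, whereas for a user-space process mm points to the mm_struct describing its address space. Without an mm, a kernel thread runs only in kernel space and never switches to user space, but it is schedulable and preemptible like any other process.
Linux has many kernel threads, such as the flush tasks and ksoftirqd; ps -ef shows the ones currently present. A kernel thread can only be created by another kernel thread. The kernel does this by forking all new kernel threads off the kthreadd thread.
The interfaces for creating kernel threads are declared in <linux/kthread.h>. The two main ones are kthread_create and kthread_run. A thread created with kthread_run starts running automatically, while one created with kthread_create does not run until wake_up_process is called on it. Their definitions: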
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data,
int node,
const char namefmt[], ...);
/**
* kthread_create - create a kthread on the current node
* @threadfn: the function to run in the thread
* @data: data pointer for @threadfn()
* @namefmt: printf-style format string for the thread name
* @arg...: arguments for @namefmt.
*
* This macro will create a kthread on the current node, leaving it in
* the stopped state. This is just a helper for kthread_create_on_node();
* see the documentation there for more details.
*/
#define kthread_create(threadfn, data, namefmt, arg...) \
kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
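kthread_create is a macro that calls kthread_create_on_node directly. threadfn is the thread's entry function, data is the argument passed to threadfn, node is the NUMA node (it can be ignored here), and namefmt is a printf-style format for the thread's name. Now kthread_run: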
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
* @data: data ptr for @threadfn.
* @namefmt: printf-style name for the thread.
*
* Description: Convenient wrapper for kthread_create() followed by
* wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
*/
#define kthread_run(threadfn, data, namefmt, ...) \
({ \
struct task_struct *__k \
= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
if (!IS_ERR(__k)) \
wake_up_process(__k); \
__k; \
})
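As shown, kthread_run just adds a wake_up_process call on top of kthread_create. wake_up_process does not necessarily run the kthread right away; it makes the thread runnable and puts it on a run queue to wait for the scheduler.
A kernel thread runs until it calls do_exit itself or until another part of the kernel calls kthread_stop on it.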
Process Termination
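When a process terminates, the kernel must release the resources it holds and notify its parent. Usually termination is voluntary: the process calls the exit system call (for a user-space C program, a call to exit is placed after the return from main), or returns from its main routine (for example a kernel thread returning from threadfn). A process can also terminate involuntarily, for example when it receives a signal it cannot handle or ignore (such as SIGKILL) or triggers an exception it cannot handle (such as a segmentation fault). Either way, do_exit ends up doing the cleanup. do_exit is defined in kernel/exit.c: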
void __noreturn do_exit(long code)
{
...
}
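do_exit mainly does the following:
1. Calls exit_signals, which sets the PF_EXITING flag in the task's flags.
2. The book says del_timer_sync is called here to remove the process's kernel timers; I did not find that code in kernel 4.15.
3. Calls acct_update_integrals to write out accounting information.
4. Calls exit_mm to release the process's mm; if no other process shares it, the address space is destroyed.
5. Calls exit_sem; if the process is queued waiting for an IPC semaphore, it is dequeued here.
6. Calls exit_files and exit_fs to drop the reference counts on the file descriptors and filesystem data; if a count reaches zero, the object is freed.
7. Stores the exit code (tsk->exit_code = code;) so that the parent can later retrieve the reason the child exited.
8. Calls exit_notify to notify the parent and to find a new parent for the exiting process's children (another thread in the thread group, or the init process), then sets exit_state in task_struct to EXIT_ZOMBIE.
9. Finally calls schedule from do_task_dead and never returns, because the process is no longer schedulable.
After do_exit completes, the only memory the process still occupies is its kernel stack, the thread_info structure, and the task_struct structure. They remain so that information can be passed to the parent. Once the parent has retrieved that information, or has indicated that it is not interested, this remaining memory is released as well.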
Removing the Process Descriptor
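As described above, after do_exit the process descriptor is kept so that the parent can collect information about the child. The parent does this with the wait4 system call. Calling it blocks the parent until one of its children exits, at which point it returns the child's PID and exit code. After that, the child's remaining structures are finally freed, which is done by release_task. release_task does the following:
1. Calls __exit_signal, which calls __unhash_process, which in turn calls detach_pid to remove the process from the pidhash and from the task list.
2. __exit_signal also releases the remaining resources held by the process and finalizes statistics.
3. If the exiting task is a non-leader member of a thread group and the group leader is a zombie, the leader's parent is notified.
4. Finally, release_task schedules delayed_put_task_struct, which calls put_task_struct and ultimately frees the memory occupied by task_struct.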
The Dilemma of the Parentless Task
If a process exits before its children do, the children must be given a new parent; otherwise, when they exit, nobody would wait for them and they would remain zombies forever. As mentioned for do_exit, exit_notify is where a new parent is found. The call chain is do_exit -> exit_notify -> forget_original_parent -> find_new_reaper:
/*
* When we die, we re-parent all our children, and try to:
* 1. give them to another thread in our thread group, if such a member exists
* 2. give it to the first ancestor process which prctl'd itself as a
* child_subreaper for its children (like a service manager)
* 3. give it to the init process (PID 1) in our pid namespace
*/
static struct task_struct *find_new_reaper(struct task_struct *father,
struct task_struct *child_reaper)
{
struct task_struct *thread, *reaper;
thread = find_alive_thread(father);
if (thread)
return thread;
if (father->signal->has_child_subreaper) {
unsigned int ns_level = task_pid(father)->level;
/*
* Find the first ->is_child_subreaper ancestor in our pid_ns.
* We can't check reaper != child_reaper to ensure we do not
* cross the namespaces, the exiting parent could be injected
* by setns() + fork().
* We check pid->level, this is slightly more efficient than
* task_active_pid_ns(reaper) != task_active_pid_ns(father).
*/
for (reaper = father->real_parent;
task_pid(reaper)->level == ns_level;
reaper = reaper->real_parent) {
if (reaper == &init_task)
break;
if (!reaper->signal->is_child_subreaper)
continue;
thread = find_alive_thread(reaper);
if (thread)
return thread;
}
}
return child_reaper;
}
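The code above picks the new parent: another live thread in the exiting process's thread group if there is one, otherwise the nearest ancestor marked as a child subreaper, otherwise the init process of the PID namespace. Once the reaper is found, all children are reparented to it: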
list_for_each_entry(p, &father->children, sibling) {
for_each_thread(p, t) {
t->real_parent = reaper;
BUG_ON((!t->ptrace) != (t->parent == father));
if (likely(!t->ptrace))
t->parent = t->real_parent;
if (t->pdeath_signal)
group_send_sig_info(t->pdeath_signal,
SEND_SIG_NOINFO, t);
}
/*
* If this is a threaded reparent there is no need to
* notify anyone anything has happened.
*/
if (!same_thread_group(reaper, father))
reparent_leader(father, p, dead);
}
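Note how ptrace is handled here. real_parent is always set to the new reaper, but parent is updated only when the task is not being ptraced: a traced task (for example one that gdb has attached to) keeps its tracer as parent. As I understand the book, what 2.6 added was a separate list of ptraced children, so that reparenting no longer needs to walk every process in the system.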
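Step 2 of find_new_reaper (the child subreaper) can be exercised from user space with prctl(PR_SET_CHILD_SUBREAPER). In the sketch below the caller marks itself as a subreaper, creates a middle process that forks a grandchild and exits, and the grandchild reports the parent it sees afterwards. This is an illustration assuming Linux 3.4 or later; orphan_new_parent is a hypothetical helper name.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the PID that an orphaned grandchild sees as its parent after
 * the middle process has exited, or -1 on error. */
pid_t orphan_new_parent(void)
{
    int fds[2];
    pid_t report[2] = { -1, -1 };   /* [0] grandchild PID, [1] its new parent */

    if (pipe(fds) < 0)
        return -1;
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t middle = fork();
    if (middle == 0) {
        pid_t middle_pid = getpid();
        close(fds[0]);
        if (fork() == 0) {
            /* Grandchild: wait (up to about 5 s) until the middle process is gone. */
            for (int i = 0; i < 5000 && getppid() == middle_pid; i++)
                usleep(1000);
            report[0] = getpid();
            report[1] = getppid();
            if (write(fds[1], report, sizeof(report)) != (ssize_t)sizeof(report))
                _exit(1);
            _exit(0);
        }
        _exit(0);   /* the middle process exits first, orphaning the grandchild */
    }

    close(fds[1]);
    ssize_t n = -1;
    if (middle > 0) {
        waitpid(middle, NULL, 0);
        n = read(fds[0], report, sizeof(report));
    }
    close(fds[0]);
    if (n == (ssize_t)sizeof(report))
        waitpid(report[0], NULL, 0);   /* we adopted the grandchild, so reap it */
    else
        report[1] = -1;
    prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);
    return report[1];
}
```

With the subreaper flag set, the value returned should be the caller's own PID; without it, the grandchild would be handed to the init process of the PID namespace instead.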