Process Management [LKD 03]

scutth

Article link: https://blog.csdn.net/scutth/article/details/105934769

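Today I started reading Linux Kernel Development.

Judging from the table of contents, the book covers a fairly broad range, with more material than LDD. That makes sense: LDD focuses on device drivers, while LKD focuses on the kernel itself.

The first two chapters, Introduction and Getting Started, mostly cover the history of Linux, operating system concepts, the kernel development environment, and how to download and build the kernel source. They are background reading, so I am not taking notes on them here.

I start directly with Chapter 3: Process Management.

This chapter explains processes, introduces the related concept of threads, and describes how the kernel manages processes over their life cycle. Since the kernel exists to serve applications, its process management matters a great deal to user-space programs.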

The Process

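Operating systems courses used to say that a process is a program in execution. That is not quite accurate. Besides the program code, a process also includes the resources it needs to run: open files, pending signals, internal kernel data, process state, a memory address space, one or more threads of execution, and a data section containing global variables. A program on disk has none of these.

These resources are transparent to the user-space process, though; the kernel manages them all.

Threads are similar to processes but not the same. As OS textbooks put it, the thread is the basic unit of scheduling and the process is the basic unit of resource management. In other words, what actually executes code and gets scheduled by the kernel is a thread, not a process. Each thread has its own program counter, stack, and set of processor registers. Interestingly, the Linux kernel does not distinguish between processes and threads: a thread is just a special kind of process, described by the same structure.

A process provides two virtualizations: a virtualized processor and virtual memory. While running, a process can behave as if it had the whole CPU and all of memory to itself, and the application does not need to think about other processes. In reality, these physical resources are shared among many processes.

A process's life begins when it is created. Linux creates processes with the fork system call, which makes the child a copy of the current process. The call returns twice, once in the parent and once in the child. After the child is created, it typically calls exec to start running a brand-new program.

A process's life ends with the exit system call, which terminates the process and frees its resources. The parent can wait for a child to finish with the wait system call. A child that has exited remains a zombie until its parent waits for it.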
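To make the fork/exit/wait life cycle concrete, here is a minimal user-space sketch. It is my own illustration, not code from the book; spawn_and_wait is a hypothetical helper name, and the child simply exits instead of calling exec.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with `code`, then wait for it.
 * Returns the child's exit status, or -1 on error.
 * (spawn_and_wait is a hypothetical helper name, not from the book.)
 */
int spawn_and_wait(int code)
{
    assert(code >= 0 && code <= 255);   /* exit statuses are 8 bits wide */

    pid_t pid = fork();   /* returns twice: 0 in the child, the child's PID in the parent */
    if (pid < 0)
        return -1;        /* fork failed */

    if (pid == 0) {
        /* Child: a real program would usually call one of the exec*() functions here. */
        _exit(code);
    }

    /* Parent: block until the child terminates; this also reaps the zombie. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait(7) from a test program should hand back 7, the status the child passed to _exit.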

Process Descriptor and the Task Structure

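The kernel manages processes with struct task_struct. The process descriptor is this task_struct itself, and kernel code passes it around by pointer. The structure holds everything the kernel knows about a process, such as its open files, virtual address space, pending signals, process state, and much more, so it is quite large: about 1.7 KB on a 32-bit machine according to the book.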

Allocating the Process Descriptor

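In kernel 2.6, the memory for struct task_struct is allocated by the slab allocator. Before 2.6, task_struct was stored directly at the end of each process's kernel stack, so it could be located from the stack pointer alone, without a dedicated register. That was convenient on register-starved architectures such as x86. Since 2.6, task_struct is allocated dynamically and no longer lives at the end of the stack; instead, a new structure, struct thread_info, took its place there. A thread_info definition from kernel 4.15 is shown below (judging by its own comment this is the IA-64 variant; the structure is architecture-specific):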

/*
 * On IA-64, we want to keep the task structure and kernel stack together, so they can be
 * mapped by a single TLB entry and so they can be addressed by the "current" pointer
 * without having to do pointer masking.
 */
struct thread_info {
	struct task_struct *task;	/* XXX not really needed, except for dup_task_struct() */
	__u32 flags;			/* thread_info flags (see TIF_*) */
	__u32 cpu;			/* current CPU */
	__u32 last_cpu;			/* Last CPU thread ran on */
	__u32 status;			/* Thread synchronous flags */
	mm_segment_t addr_limit;	/* user-level address space limit */
	int preempt_count;		/* 0=premptable, <0=BUG; will also serve as bh-counter */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
	__u64 utime;
	__u64 stime;
	__u64 gtime;
	__u64 hardirq_time;
	__u64 softirq_time;
	__u64 idle_time;
	__u64 ac_stamp;
	__u64 ac_leave;
	__u64 ac_stime;
	__u64 ac_utime;
#endif
};

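Diagram of where thread_info is stored:

If the stack grows down (toward lower addresses), thread_info sits at the lowest address of the stack area; if it grows up, thread_info sits at the highest address. The diagram above has a typo: the structure at the low address should be struct thread_info, not struct thread_struct. The two are different structures.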
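The reason a structure at the bottom of the stack can be found from the stack pointer alone is that the stack area is aligned to its own size, so masking off the low bits of any address inside it yields the base. The following user-space sketch imitates that classic 2.6-era trick. It is only an illustration: all demo_ names and the 8 KB size are my assumptions, and a heap block stands in for the kernel stack.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_THREAD_SIZE 8192u   /* assumed stack size; must be a power of two */

struct demo_thread_info {        /* stand-in for the kernel's thread_info */
    int task_id;                 /* stand-in for the task pointer */
};

/* Round any address inside the stack area down to the base of the area. */
struct demo_thread_info *demo_thread_info_of(const void *addr_in_stack)
{
    uintptr_t base = (uintptr_t)addr_in_stack & ~((uintptr_t)DEMO_THREAD_SIZE - 1);
    return (struct demo_thread_info *)base;
}

/* Allocate a fake stack aligned to its own size, with thread_info at its base. */
void *demo_alloc_stack(int task_id)
{
    void *stack = aligned_alloc(DEMO_THREAD_SIZE, DEMO_THREAD_SIZE);
    assert(stack != NULL);
    ((struct demo_thread_info *)stack)->task_id = task_id;
    return stack;
}
```

Any pointer into the block, such as the address of a local variable on a real stack, maps back to the same demo_thread_info, which is why no extra register is needed.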

Storing the Process Descriptor

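The kernel identifies each process by a PID of type pid_t, which is typically an int. For compatibility with older versions, the default maximum is 32768 (so PIDs go up to 32767), but it can be raised through /proc/sys/kernel/pid_max.

Inside the kernel, a process is normally manipulated through its task_struct, so being able to find a process's task_struct quickly is important. For the currently running process, just use the current macro. The macro is architecture-specific, which is easy to understand given that it was traditionally built on the thread_info stored on the stack. (Some architectures keep the task_struct pointer in a dedicated register, while x86 classically reached it through the task field of the thread_info on the stack.) Let's take a quick look at the implementation on x86.

First, the generic definition of current, in include/asm-generic/current.h: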

#include <linux/thread_info.h>
#define get_current() (current_thread_info()->task)
#define current get_current()

So current is simply current_thread_info()->task. Now look at current_thread_info():

#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif

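On x86, kernel 4.15 enables CONFIG_THREAD_INFO_IN_TASK, so current_thread_info() is just current cast to a thread_info pointer. Next, the definition of current that is used here (the x86 one, from <asm/current.h>):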

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}

#define current get_current()

The current macro expands to get_current(), which in turn uses this_cpu_read_stable:

/*
 * this_cpu_read() makes gcc load the percpu variable every time it is
 * accessed while this_cpu_read_stable() allows the value to be cached.
 * this_cpu_read_stable() is more efficient and can be used if its value
 * is guaranteed to be valid across cpus.  The current users include
 * get_current() and get_thread_info() both of which are actually
 * per-thread variables implemented as per-cpu variables and thus
 * stable for the duration of the respective task.
 */
#define this_cpu_read_stable(var)	percpu_stable_op("mov", var)

this_cpu_read_stable is also a macro; it expands directly to percpu_stable_op:

#define percpu_stable_op(op, var)			\
({							\
	typeof(var) pfo_ret__;				\
	switch (sizeof(var)) {				\
	case 1:						\
		asm(op "b "__percpu_arg(P1)",%0"	\
		    : "=q" (pfo_ret__)			\
		    : "p" (&(var)));			\
		break;					\
	case 2:						\
		asm(op "w "__percpu_arg(P1)",%0"	\
		    : "=r" (pfo_ret__)			\
		    : "p" (&(var)));			\
		break;					\
	case 4:						\
		asm(op "l "__percpu_arg(P1)",%0"	\
		    : "=r" (pfo_ret__)			\
		    : "p" (&(var)));			\
		break;					\
	case 8:						\
		asm(op "q "__percpu_arg(P1)",%0"	\
		    : "=r" (pfo_ret__)			\
		    : "p" (&(var)));			\
		break;					\
	default: __bad_percpu_size();			\
	}						\
	pfo_ret__;					\
})

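The percpu_stable_op macro is implemented with inline assembly. current_task is a pointer, so it matches case 4 on a 32-bit system and case 8 on a 64-bit system. At first I could not see how this ever reaches thread_info. The answer is that it no longer goes through the stack at all: current_task is a per-CPU variable holding the task_struct pointer of the task running on that CPU, and the asm is a single mov that reads it from the per-CPU area.

All of the above is based on the kernel 4.15 code, which looks rather different from 2.6. With CONFIG_THREAD_INFO_IN_TASK=y, as in 4.15, thread_info is stored inside task_struct (as its first member, which is why the cast in current_thread_info() works).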
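The cast from a task pointer to a thread_info pointer relies on a plain C rule: a pointer to a structure, suitably converted, points to its first member. Here is a small user-space sketch of that idea. The demo_ types are hypothetical stand-ins of mine, not the kernel's real definitions.

```c
#include <assert.h>
#include <stddef.h>

struct demo_thread_info {
    unsigned long flags;
};

struct demo_task {
    struct demo_thread_info thread_info;   /* must stay the first member */
    int pid;
};

/* Mirrors the idea behind "((struct thread_info *)current)". */
struct demo_thread_info *demo_current_thread_info(struct demo_task *task)
{
    /* The cast is only valid because thread_info sits at offset 0. */
    assert(offsetof(struct demo_task, thread_info) == 0);
    return (struct demo_thread_info *)task;
}
```

If someone moved thread_info away from the front of the structure, the cast would silently point at the wrong data, which is presumably why the kernel insists on it being the first element.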

Process State

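The state field of task_struct describes the current condition of the process. There are five states:

TASK_RUNNING

The process is runnable: it is either currently running or on a run queue waiting to run.

TASK_INTERRUPTIBLE

The process is sleeping, waiting for some condition. It can also be woken up early by a signal. It is not on a run queue.

TASK_UNINTERRUPTIBLE

The process is sleeping, waiting for some condition, and only that condition can wake it; signals do not. It is not on a run queue.

__TASK_TRACED

The process is being traced by another process, for example through ptrace by a debugger such as gdb.

__TASK_STOPPED

Execution of the process has stopped. This happens when the process receives SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, or when it receives any signal while being debugged.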

Manipulating the Current Process State

The kernel often needs to change a process's state. For the current process it uses:

#define set_current_state(state_value)					\
	smp_store_mb(current->state, (state_value))

Note that set_task_state no longer exists in kernel 4.15.

Process Context

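Process context: the main job of a process is to execute the instructions loaded from its program, within its user-space address space. When the process makes a system call or triggers an exception, it traps into the kernel. The kernel is then said to be running in process context, executing on behalf of that user process, and current is valid. When the system call or exception handler finishes, control leaves kernel space and the process resumes in user space, unless a higher-priority process has become runnable in the meantime, in which case the scheduler runs that one first.

A process can begin executing in kernel space only through these two interfaces: system calls and exceptions.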

The Process Family Tree

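All processes in the system form a clear hierarchy, a tree: every process is a descendant of the init process. init starts running at the end of the boot process and then, via the initscripts, creates and starts the other processes.

Every process except init has exactly one parent and zero or more children. Processes that share the same parent are called siblings. These relationships are stored in task_struct: parent points to the parent's task_struct, and children is a list of the children. Through them you can reach the parent or the children of any process: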

/* access the parent */
struct task_struct *my_parent = current->parent;

/* iterate over all children */
struct task_struct *task;
struct list_head *list;

list_for_each(list, &current->children) {
    task = list_entry(list, struct task_struct, sibling);
    /* task now points to one of current's children */
}
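The snippet above only compiles inside the kernel, so here is a self-contained user-space imitation of the same pattern: the list node is embedded in the structure, and a container_of-style macro recovers the enclosing structure from the node. Everything prefixed demo_ is a simplified stand-in I made up for illustration, not the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal imitation of the kernel's circular doubly linked list. */
struct demo_list { struct demo_list *next, *prev; };

#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_task {
    int pid;
    struct demo_list children;   /* head of this task's list of children */
    struct demo_list sibling;    /* this task's link in its parent's children list */
};

void demo_task_init(struct demo_task *t, int pid)
{
    t->pid = pid;
    t->children.next = t->children.prev = &t->children;
    t->sibling.next = t->sibling.prev = &t->sibling;
}

/* Link child at the tail of parent's children list. */
void demo_add_child(struct demo_task *parent, struct demo_task *child)
{
    struct demo_list *head = &parent->children;

    assert(parent != child);
    child->sibling.prev = head->prev;
    child->sibling.next = head;
    head->prev->next = &child->sibling;
    head->prev = &child->sibling;
}

/* Walk the children list the way list_for_each + list_entry would; sum the PIDs. */
int demo_sum_child_pids(struct demo_task *parent)
{
    int sum = 0;

    for (struct demo_list *pos = parent->children.next;
         pos != &parent->children; pos = pos->next) {
        struct demo_task *task = demo_container_of(pos, struct demo_task, sibling);
        sum += task->pid;
    }
    return sum;
}
```

Note that the list head is the parent's children field while each element is linked through its own sibling field, which is exactly why list_entry is given sibling as the member name.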

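As the first process in the system, the initial task has a statically allocated task_struct, init_task (strictly speaking this is PID 0, the idle task named "swapper", from which PID 1 is created later):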

/*
 * Set up the first task table, touch at your own risk!. Base=0,
 * limit=0x1fffff (=2MB)
 */
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
	__init_task_data
#endif
= {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	.thread_info	= INIT_THREAD_INFO(init_task),
	.stack_refcount	= ATOMIC_INIT(1),
#endif
	.state		= 0,
	.stack		= init_stack,
	.usage		= ATOMIC_INIT(2),
	.flags		= PF_KTHREAD,
	.prio		= MAX_PRIO - 20,
	.static_prio	= MAX_PRIO - 20,
	.normal_prio	= MAX_PRIO - 20,
	.policy		= SCHED_NORMAL,
	.cpus_allowed	= CPU_MASK_ALL,
	.nr_cpus_allowed= NR_CPUS,
	.mm		= NULL,
	.active_mm	= &init_mm,
	.restart_block	= {
		.fn = do_no_restart_syscall,
	},
	.se		= {
		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
	},
	.rt		= {
		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
		.time_slice	= RR_TIMESLICE,
	},
	.tasks		= LIST_HEAD_INIT(init_task.tasks),
#ifdef CONFIG_SMP
	.pushable_tasks	= PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
#endif
#ifdef CONFIG_CGROUP_SCHED
	.sched_task_group = &root_task_group,
#endif
	.ptraced	= LIST_HEAD_INIT(init_task.ptraced),
	.ptrace_entry	= LIST_HEAD_INIT(init_task.ptrace_entry),
	.real_parent	= &init_task,
	.parent		= &init_task,
	.children	= LIST_HEAD_INIT(init_task.children),
	.sibling	= LIST_HEAD_INIT(init_task.sibling),
	.group_leader	= &init_task,
	RCU_POINTER_INITIALIZER(real_cred, &init_cred),
	RCU_POINTER_INITIALIZER(cred, &init_cred),
	.comm		= INIT_TASK_COMM,
	.thread		= INIT_THREAD,
	.fs		= &init_fs,
	.files		= &init_files,
	.signal		= &init_signals,
	.sighand	= &init_sighand,
	.nsproxy	= &init_nsproxy,
	.pending	= {
		.list = LIST_HEAD_INIT(init_task.pending.list),
		.signal = {{0}}
	},
	.blocked	= {{0}},
	.alloc_lock	= __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
	.journal_info	= NULL,
	INIT_CPU_TIMERS(init_task)
	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
	.timer_slack_ns = 50000, /* 50 usec default slack */
	.pids = {
		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),
		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),
		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),
	},
	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDITSYSCALL
	.loginuid	= INVALID_UID,
	.sessionid	= (unsigned int)-1,
#endif
#ifdef CONFIG_PERF_EVENTS
	.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
#endif
#ifdef CONFIG_PREEMPT_RCU
	.rcu_read_lock_nesting = 0,
	.rcu_read_unlock_special.s = 0,
	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
	.rcu_blocked_node = NULL,
#endif
#ifdef CONFIG_TASKS_RCU
	.rcu_tasks_holdout = false,
	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
	.rcu_tasks_idle_cpu = -1,
#endif
#ifdef CONFIG_CPUSETS
	.mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq),
#endif
#ifdef CONFIG_RT_MUTEXES
	.pi_waiters	= RB_ROOT_CACHED,
	.pi_top_task	= NULL,
#endif
	INIT_PREV_CPUTIME(init_task)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
	.vtime.seqcount	= SEQCNT_ZERO(init_task.vtime_seqcount),
	.vtime.starttime = 0,
	.vtime.state	= VTIME_SYS,
#endif
#ifdef CONFIG_NUMA_BALANCING
	.numa_preferred_nid = -1,
	.numa_group	= NULL,
	.numa_faults	= NULL,
#endif
#ifdef CONFIG_KASAN
	.kasan_depth	= 1,
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
	.softirqs_enabled = 1,
#endif
#ifdef CONFIG_LOCKDEP
	.lockdep_recursion = 0,
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	.ret_stack	= NULL,
#endif
#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPT)
	.trace_recursion = 0,
#endif
#ifdef CONFIG_LIVEPATCH
	.patch_state	= KLP_UNDEFINED,
#endif
#ifdef CONFIG_SECURITY
	.security	= NULL,
#endif
};
EXPORT_SYMBOL(init_task);

/*
 * Initial thread structure. Alignment of this is handled by a special
 * linker map entry.
 */
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info init_thread_info __init_thread_info = INIT_THREAD_INFO(init_task);
#endif

By comparing a task_struct pointer against &init_task, you can tell whether you have reached the very first process at the root of the tree.
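As a sketch of how that comparison is typically used, the loop below follows parent pointers until it hits the root. It is a user-space model under my own assumptions: demo_task is a hypothetical two-field structure, and the root points to itself just as init_task.parent does in the initializer above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
    int pid;
    struct demo_task *parent;   /* the root task points to itself */
};

/* Count how many parent links separate `task` from `root`. */
int demo_depth(const struct demo_task *task, const struct demo_task *root)
{
    int depth = 0;

    assert(task != NULL && root != NULL);
    while (task != root) {      /* same test as "task != &init_task" */
        task = task->parent;
        depth++;
    }
    return depth;
}
```

Because every chain of parents eventually reaches the root, the loop terminates for any task that is part of the tree.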

Process Creation

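Linux creates processes in two steps, with two functions: fork and exec. fork copies the current process into a child (which parts are copied or shared can be controlled), and exec then makes the child load and run a new program.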

Copy-on-Write

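When fork creates a child, it does not actually copy the parent's memory into the child. It uses copy-on-write instead: as long as the pages are only read, parent and child share them; when either side writes to a page, that page is copied so the writer gets its own private copy. Even with copy-on-write, fork still has to allocate at least a new set of page tables and a new task_struct for the child.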
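Copy-on-write itself is invisible from user space, but its guarantee is easy to observe: a write in the child never shows up in the parent. The sketch below demonstrates only that visible semantics, not the lazy page copying behind it; the function and variable names are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int demo_value = 1;   /* lives in a page that is shared copy-on-write after fork */

/*
 * Fork a child that overwrites demo_value, wait for it, and return the
 * value the parent sees afterwards. Returns -1 if fork or waitpid fails.
 */
int demo_parent_value_after_child_write(void)
{
    demo_value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        demo_value = 99;     /* the write lands in the child's private copy of the page */
        _exit(demo_value == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    return demo_value;       /* the parent's copy is untouched */
}
```

The parent still reads 1 after the child has written 99, because the child's write triggered a private copy of the page instead of modifying the shared one.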

Forking

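The user-space fork is implemented on top of the clone() system call, which takes a set of flags telling the kernel which resources parent and child should share. Besides fork(), the user-space vfork() and __clone() are also implemented with clone(). Inside the kernel, clone() calls do_fork(), where the real work of creating a new process happens. Let's look at do_fork() next.

Note that in kernel 4.15 the fork implementation differs somewhat from 2.6, for example through the addition of HAVE_COPY_THREAD_TLS. The walkthrough here still follows the structure of the older code. Patch: http://lkml.iu.edu/hypermail/linux/kernel/1504.2/03324.html.

do_fork() is defined in kernel/fork.c and simply calls _do_fork, so let's go straight to _do_fork: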

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long _do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr,
	      unsigned long tls)
{
	struct task_struct *p;
	int trace = 0;
	long nr;

        ...

	p = copy_process(clone_flags, stack_start, stack_size,
			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
	add_latent_entropy();
	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	if (!IS_ERR(p)) {
		struct completion vfork;
		struct pid *pid;

		trace_sched_process_fork(current, p);

		pid = get_task_pid(p, PIDTYPE_PID);
		nr = pid_vnr(pid);

		if (clone_flags & CLONE_PARENT_SETTID)
			put_user(nr, parent_tidptr);

		if (clone_flags & CLONE_VFORK) {
			p->vfork_done = &vfork;
			init_completion(&vfork);
			get_task_struct(p);
		}

		wake_up_new_task(p);

		/* forking complete and child started to run, tell ptracer */
		if (unlikely(trace))
			ptrace_event_pid(trace, pid);

		if (clone_flags & CLONE_VFORK) {
			if (!wait_for_vfork_done(p, &vfork))
				ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
		}

		put_pid(pid);
	} else {
		nr = PTR_ERR(p);
	}
	return nr;
}

_do_fork creates the new process with copy_process and then starts it running with wake_up_new_task. Here is copy_process:

/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static __latent_entropy struct task_struct *copy_process(
					unsigned long clone_flags,
					unsigned long stack_start,
					unsigned long stack_size,
					int __user *child_tidptr,
					struct pid *pid,
					int trace,
					unsigned long tls,
					int node)
{
	int retval;
	struct task_struct *p;

	/* ... some clone flag checks ... */

	retval = -ENOMEM;
	p = dup_task_struct(current, node);
	if (!p)
		goto fork_out;

        //...
}

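Inside copy_process:

1. It calls dup_task_struct, which creates a kernel stack, a thread_info structure, and a task_struct for the new process. Their contents come from the parent, so at this point the child's descriptor is identical to the parent's.

2. It checks that the new child does not push the current user over the resource limit on the number of processes.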

	retval = -EAGAIN;
	if (atomic_read(&p->real_cred->user->processes) >=
			task_rlimit(p, RLIMIT_NPROC)) {
		if (p->real_cred->user != INIT_USER &&
		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
			goto bad_fork_free;
	}
	current->flags &= ~PF_NPROC_EXCEEDED;

	retval = copy_creds(p, clone_flags);
	if (retval < 0)
		goto bad_fork_free;

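3. Various fields of the new process are cleared or reset. Most of these are statistics; the bulk of task_struct stays unchanged.

4. The child's state is set to TASK_NEW in sched_fork, so that it cannot be scheduled yet.

5. Some flags of the new process are updated, for example: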

p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
p->flags |= PF_FORKNOEXEC;

6. A PID is allocated for the new process:

	if (pid != &init_struct_pid) {
		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
		if (IS_ERR(pid)) {
			retval = PTR_ERR(pid);
			goto bad_fork_cleanup_thread;
		}
	}

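7. Depending on the clone flags, it copies or shares everything else the new process needs: open files, filesystem information, signal handlers, the process address space, and so on: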

retval = copy_semundo(clone_flags, p);
retval = copy_files(clone_flags, p);
retval = copy_fs(clone_flags, p);
retval = copy_sighand(clone_flags, p);
retval = copy_signal(clone_flags, p);
retval = copy_mm(clone_flags, p);
retval = copy_namespaces(clone_flags, p);
retval = copy_io(clone_flags, p);
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);

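8. Finally, copy_process returns a pointer to the new process's task_struct.

Back in _do_fork: once the new process has been created successfully, it is woken up with wake_up_new_task and can start running: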

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
	/*
	 * Fork balancing, do it here and not earlier because:
	 *  - cpus_allowed can change in the fork path
	 *  - any previously selected CPU might disappear through hotplug
	 *
	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
	 * as we're not fully set-up yet.
	 */
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
	rq = __task_rq_lock(p, &rf);
	update_rq_clock(rq);
	post_init_entity_util_avg(&p->se);

	activate_task(rq, p, ENQUEUE_NOCLOCK);
	p->on_rq = TASK_ON_RQ_QUEUED;
	trace_sched_wakeup_new(p);
	check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
	if (p->sched_class->task_woken) {
		/*
		 * Nothing relies on rq->lock after this, so its fine to
		 * drop it.
		 */
		rq_unpin_lock(rq, &rf);
		p->sched_class->task_woken(rq, p);
		rq_repin_lock(rq, &rf);
	}
#endif
	task_rq_unlock(rq, p, &rf);
}

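activate_task puts the new process on the run queue, ready to be scheduled.

vfork differs from fork: it does not copy the parent's page tables, and the parent is blocked until the child either calls exec or exits; only then does the parent continue.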

The Linux Implementation of Threads

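The Linux kernel has no separate concept of a thread. A thread is just a process: there is no difference in data structures or scheduling. The only distinction is that a thread is a process that shares many of its resources with other processes. This approach is simple and elegant.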

Creating Threads

Creating a thread is not much different from creating an ordinary process. Both go through clone(); only the clone flags differ. For example, a thread might be created with flags like these:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

With these flags, parent and child share the same address space, filesystem information, open files, and signal handlers. Commonly used clone flags include:

CLONE_FILES
    Parent and child share open files.
CLONE_FS
    Parent and child share filesystem information.
CLONE_IDLETASK
    Set PID to zero (used only by the idle tasks).
CLONE_NEWNS
    Create a new namespace for the child.
CLONE_PARENT
    Child is to have same parent as its parent.
CLONE_PTRACE
    Continue tracing child.
CLONE_SETTID
    Write the TID back to user-space.
CLONE_SIGHAND
    Parent and child share signal handlers and blocked signals.
CLONE_SYSVSEM
    Parent and child share System V SEM_UNDO semantics.
CLONE_THREAD
    Parent and child are in the same thread group.
CLONE_VFORK
    vfork() was used and the parent will sleep until the child wakes it.
CLONE_UNTRACED
    Do not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOP
    Start process in the TASK_STOPPED state.
CLONE_SETTLS
    Create a new TLS (thread-local storage) for the child.
CLONE_CHILD_CLEARTID
    Clear the TID in the child.
CLONE_CHILD_SETTID
    Set the TID in the child.
CLONE_PARENT_SETTID
    Set the TID in the parent.
CLONE_VM
    Parent and child share address space.
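Since the clone flags are just bits in a mask, the difference between "process" and "thread" comes down to which bits are set. The following sketch decodes a mask into the list of shared resources. It does not call clone(); the DEMO_ constants are local copies of the values I believe the kernel's uapi <linux/sched.h> uses, and the function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Local copies of a few flag values from the kernel's uapi <linux/sched.h>. */
#define DEMO_CLONE_VM      0x00000100UL
#define DEMO_CLONE_FS      0x00000200UL
#define DEMO_CLONE_FILES   0x00000400UL
#define DEMO_CLONE_SIGHAND 0x00000800UL
#define DEMO_CLONE_VFORK   0x00004000UL

/* Write a comma-separated list of what the flags make parent and child share. */
void demo_describe_sharing(unsigned long flags, char *buf, size_t len)
{
    static const struct { unsigned long bit; const char *name; } table[] = {
        { DEMO_CLONE_VM, "vm" },
        { DEMO_CLONE_FS, "fs" },
        { DEMO_CLONE_FILES, "files" },
        { DEMO_CLONE_SIGHAND, "sighand" },
    };

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(flags & table[i].bit))
            continue;                      /* this resource is copied, not shared */
        if (buf[0] != '\0')
            strncat(buf, ",", len - strlen(buf) - 1);
        strncat(buf, table[i].name, len - strlen(buf) - 1);
    }
}
```

With the thread-style mask from the example above, the result lists all four resources; with no sharing flags at all, as in a plain fork, the list is empty.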

Kernel Threads

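A kernel thread is also just a process. The difference from a user process is that a kernel thread has no address space of its own: the mm pointer in its task_struct is NULL, whereas for a user-space process mm points to the mm_struct describing that process's address space (its VMAs). Without an mm, a kernel thread runs entirely in kernel space and never switches to user space. Like any other process, though, it is schedulable and preemptible.

The Linux kernel has many kernel threads, for example the flush tasks and ksoftirqd. You can see the kernel threads currently present with ps -ef. A kernel thread can only be created by another kernel thread; the kernel does this by forking all new kernel threads off the kthreadd thread.

The interfaces for creating kernel threads are declared in <linux/kthread.h>. There are two: kthread_create and kthread_run. The difference is that a thread created with kthread_run starts running automatically, while one created with kthread_create does not run until you call wake_up_process on it. Here are their definitions: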

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
					   void *data,
					   int node,
					   const char namefmt[], ...);

/**
 * kthread_create - create a kthread on the current node
 * @threadfn: the function to run in the thread
 * @data: data pointer for @threadfn()
 * @namefmt: printf-style format string for the thread name
 * @arg...: arguments for @namefmt.
 *
 * This macro will create a kthread on the current node, leaving it in
 * the stopped state.  This is just a helper for kthread_create_on_node();
 * see the documentation there for more details.
 */
#define kthread_create(threadfn, data, namefmt, arg...) \
	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

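kthread_create is a macro that calls kthread_create_on_node directly. threadfn is the thread's entry function, data is the argument passed to threadfn, node is the NUMA node (which we need not care about here), and namefmt is the thread's name. Now kthread_run: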

/**
 * kthread_run - create and wake a thread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @namefmt: printf-style name for the thread.
 *
 * Description: Convenient wrapper for kthread_create() followed by
 * wake_up_process().  Returns the kthread or ERR_PTR(-ENOMEM).
 */
#define kthread_run(threadfn, data, namefmt, ...)			   \
({									   \
	struct task_struct *__k						   \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						   \
		wake_up_process(__k);					   \
	__k;								   \
})

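As you can see, kthread_run just does one more step than kthread_create: wake_up_process. Even wake_up_process does not make the kthread run immediately; it puts the task on a run queue to wait for scheduling.

A kernel thread keeps running until it calls do_exit itself (or returns from threadfn), or until another part of the kernel stops it with kthread_stop.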

Process Termination

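When a process terminates, the kernel has to release the resources it holds and notify its parent. Usually a process ends itself voluntarily, for example by calling the exit system call (for a user-space C program, the C runtime calls exit after main returns), or by returning from some routine (a kernel thread returning from threadfn). A process can also be terminated involuntarily, for example when it receives a signal such as SIGKILL or triggers an exception it cannot handle (a segmentation fault). Whatever the path, do_exit ends up being called to clean up the process. do_exit is defined in kernel/exit.c, with this prototype: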

void __noreturn do_exit(long code)
{
...
}

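do_exit mainly does the following:

1. It calls exit_signals, which sets the PF_EXITING flag in the task's flags field.

2. According to the book, del_timer_sync is called here to remove the process's kernel timers. I could not find that code in kernel 4.15.

3. It calls acct_update_integrals to update accounting information.

4. It calls exit_mm to release the process's mm_struct; if no other process shares it, the kernel destroys it.

5. It calls exit_sem. If the process was queued waiting on an IPC semaphore, it is dequeued here.

6. It calls exit_files and exit_fs to drop the references to the file descriptors and filesystem data. If a reference count reaches zero, the corresponding resource is freed.

7. It records the exit code (tsk->exit_code = code;), so that the parent can later learn why the child exited.

8. It calls exit_notify, which notifies the parent, finds a suitable new parent for the exiting process's children (another thread in the same thread group, or the init process), and sets exit_state in the task_struct to EXIT_ZOMBIE.

9. Finally, do_task_dead calls schedule and never returns, because the current process will never run again.

After do_exit completes, the only memory the process still holds is its kernel stack, its thread_info structure, and its task_struct. They exist solely to pass information to the parent. Once the parent has retrieved that information, or has indicated that it is not interested, this remaining memory is freed as well.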

Removing the Process Descriptor

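As mentioned above, after do_exit a few structures are kept around for the parent. How does the parent retrieve the information? Through the wait4 system call. After creating a child, the parent calls wait4 (usually through one of the wait() library functions) to wait for the child. The call blocks the parent until one of its children exits, and then returns the child's PID together with its exit code. After wait4, the child's remaining structures are finally freed, which is done by release_task. release_task does the following:

1. It calls __exit_signal, which calls __unhash_process, which in turn calls detach_pid to remove the process from the pidhash and also removes it from the task list.

2. __exit_signal also releases any remaining resources used by the process and finalizes statistics and bookkeeping.

3. If the exiting task is a non-leader member of a thread group and the group leader is a zombie, the leader's parent is notified.

4. Finally, release_task invokes delayed_put_task_struct, which calls put_task_struct and thereby frees the memory occupied by the task_struct.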

The Dilemma of the Parentless Task

If a process exits before its children do, the children must be given a new parent; otherwise, once they terminate, nobody would wait for them and they would remain zombies forever. As mentioned in the do_exit walkthrough, exit_notify finds a suitable parent for the children. The call chain is: do_exit -> exit_notify -> forget_original_parent -> find_new_reaper.

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
					   struct task_struct *child_reaper)
{
	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);
	if (thread)
		return thread;

	if (father->signal->has_child_subreaper) {
		unsigned int ns_level = task_pid(father)->level;
		/*
		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
		 * We can't check reaper != child_reaper to ensure we do not
		 * cross the namespaces, the exiting parent could be injected
		 * by setns() + fork().
		 * We check pid->level, this is slightly more efficient than
		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
		 */
		for (reaper = father->real_parent;
		     task_pid(reaper)->level == ns_level;
		     reaper = reaper->real_parent) {
			if (reaper == &init_task)
				break;
			if (!reaper->signal->is_child_subreaper)
				continue;
			thread = find_alive_thread(reaper);
			if (thread)
				return thread;
		}
	}

	return child_reaper;
}

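The code above looks for a suitable new parent: if another thread in the exiting process's thread group is still alive, it is returned; otherwise the nearest ancestor that registered itself as a child subreaper is tried; failing that, the init process of the PID namespace (child_reaper) is returned. Once the function has returned the new parent, all the children are reparented to it: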

	list_for_each_entry(p, &father->children, sibling) {
		for_each_thread(p, t) {
			t->real_parent = reaper;
			BUG_ON((!t->ptrace) != (t->parent == father));
			if (likely(!t->ptrace))
				t->parent = t->real_parent;
			if (t->pdeath_signal)
				group_send_sig_info(t->pdeath_signal,
						    SEND_SIG_NOINFO, t);
		}
		/*
		 * If this is a threaded reparent there is no need to
		 * notify anyone anything has happened.
		 */
		if (!same_thread_group(reaper, father))
			reparent_leader(father, p, dead);
	}

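Traced tasks need special handling. While a task is being ptraced (for example, attached by gdb), its parent field temporarily points at the tracer, while real_parent still records the original parent. As the code above shows, reparenting always updates real_parent, but parent is only updated for children that are not being traced. Since 2.6 the kernel also keeps a separate list of ptraced children (the ptraced list seen in init_task), so that these tasks can be found and reparented when their parent exits.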
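Reparenting can also be watched from user space. In the sketch below, a middle process forks a grandchild and exits right away; the grandchild polls getppid() until the kernel has handed it to a new parent and reports that PID through a pipe. This is a hedged illustration only: the function name is hypothetical, and which process becomes the new parent (PID 1 or a subreaper) depends on the environment.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create parent -> middle -> grandchild, let "middle" exit first, and return
 * the parent PID the grandchild sees once it has been reparented.
 * `middle_pid_out` receives the PID of the process that died. Returns -1 on error.
 */
pid_t demo_new_parent_of_orphan(pid_t *middle_pid_out)
{
    int fds[2];

    assert(middle_pid_out != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid_t middle = fork();
    if (middle < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (middle == 0) {
        pid_t my_pid = getpid();
        pid_t grandchild = fork();
        if (grandchild == 0) {
            /* Grandchild: poll until the kernel has given us a new parent. */
            while (getppid() == my_pid)
                usleep(1000);
            pid_t new_parent = getppid();
            ssize_t n = write(fds[1], &new_parent, sizeof(new_parent));
            _exit(n == (ssize_t)sizeof(new_parent) ? 0 : 1);
        }
        _exit(grandchild < 0 ? 1 : 0);   /* middle exits, orphaning the grandchild */
    }

    close(fds[1]);                        /* keep only the read end in the parent */
    int status = 0;
    waitpid(middle, &status, 0);          /* reap the middle process */
    *middle_pid_out = middle;

    pid_t new_parent = -1;
    ssize_t n = read(fds[0], &new_parent, sizeof(new_parent));
    close(fds[0]);
    return n == (ssize_t)sizeof(new_parent) ? new_parent : -1;
}
```

The returned PID is whatever find_new_reaper picked for the orphan; the one thing that can be relied on is that it is no longer the PID of the middle process.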