Linux内核框架之内核进程

水乡夜航

已于 2022-10-23 23:12:14 修改

阅读量2.3k

点赞数 7

分类专栏： Linux内核结构学习文章标签：开发语言 linux 系统架构

于 2022-05-29 20:15:46 首次发布

本文链接：https://blog.csdn.net/weixin_39077305/article/details/125020350

版权

Linux内核结构学习专栏收录该内容

5 篇文章 5 订阅

订阅专栏

2.5.1 clone fork vfork系统调用

kernel/sched/sched.h:

一、Linux的进程介绍

1.进程线程和轻量级进程

进程是资源管理的最小单位，线程是程序执行的最小单位。在操作系统设计上，从进程演化出线程，最主要的目的就是减小多进程上下文切换开销。

最初的进程定义都包含程序、资源及其执行三部分，其中程序通常指代码，资源在操作系统层面上通常包括内存资源、IO资源、信号处理等部分，而程序的执行通常理解为执行上下文，包括对CPU的占用，后来发展为线程。在线程概念出现以前，为了减小进程切换的开销，操作系统设计者逐渐修正进程的概念，逐渐允许将进程所占有的资源从其主体剥离出来，允许某些进程共享一部分资源，例如文件、信号，数据内存，甚至代码，这就发展出轻量进程的概念。

Linux早期没有对多线程进行支持，后来才加上了轻量级进程的概念：轻量级进程可以是进程，也可以是线程。我们所说的线程，在Linux中，其实就是是轻量级进程之间共享代码段，文件描述符，信号处理，全局变量时；如果不共享，就是我们所说的独立进程。Linux内核在2.0.x版本就已经实现了轻量进程，应用程序可以通过一个统一的clone()或者fork()系统调用接口，用不同的参数指定创建轻量进程还是普通进程。

比如内核初始化的时候调用的：

.//init/main.c:371: pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

2.进程调度的结构组成

1. Scheduling Policy，实现进程调度的策略，它决定哪个（或哪几个）进程将拥有CPU。

2. Architecture-specific Schedulers，体系结构相关的部分，用于将对不同CPU的控制，抽象为统一的接口。这些控制主要在suspend和resume进程时使用，牵涉到CPU的寄存器访问、汇编指令操作等。

3. Architecture-independent Scheduler，体系结构无关的部分。它会和“Scheduling Policy模块”沟通，决定接下来要执行哪个进程，然后通过“Architecture-specific Schedulers模块”resume指定的进程。

4. System Call Interface，系统调用接口。进程调度子系统通过系统调用接口，将需要提供给用户空间的接口开放出去，同时屏蔽掉不需要用户空间程序关心的细节

二、进程的静态描述

2.1.进程描述符

先看看他的整体示意图有个大概了解：

再看代码

struct task_struct {    volatile long state;   /* -1 unrunnable, 0 runnable, >0 stopped 运行时状态，-1不可运行，0代表可运行，>0代表已停止*/

   void *stack;

/* 指向内核堆栈：

* 对每个进程，Linux内核都把两个不同的数据结构紧凑的存放在一个单独为进程分配的内存区域中

* ·一个是内核态的进程堆栈，

* ·另一个是紧挨着进程描述符的小数据结构thread_info，叫做线程描述符，由下图可知他是和体系结构相关的，描述的是进程的运行CPU、抢占性、进程上下文指针cpu_context，进程描述符指针等基本信息。

* Linux把thread_info（线程描述符）和内核态的线程堆栈存放在一起，这块区域通常是8192K（占两个页框），其实地址必须是8192的整数倍。

这样做的好处就是方便内核栈快速通过esp栈顶指针找到thread_info结构体，进一步有利于通过thread_info定位task_struct的位置：

        atomic_t usage; 进程描述符使用计数，被置为2时，表示进程描述符正在被使用而且其相应的进程处于活动状态
   unsigned int flags;   /* per process flags, defined below */

        /*

        flags是进程当前的状态标志，具体的如：

        0x00000002表示进程正在被创建；

        0x00000004表示进程正准备退出；

        0x00000040 表示此进程被fork出，但是并没有执行exec；

        0x00000400表示此进程由于其他进程发送相关信号而被杀死。

        */

   unsigned int ptrace; /* 它主要用于实现断点调试。*/
#ifdef CONFIG_SMP
    struct task_struct *wake_entry;  /*用于多cpu核的时候 唤醒空闲核进行reshedule负载均衡 ttwu_queue_remote()*/
    int on_cpu;   //当前在哪一个CPU上运行
#endif
/*调度优先级调度相关*/

   int prio, static_prio, normal_prio;    unsigned int rt_priority; /*实时任务优先级*/

   const struct sched_class *sched_class; /*调度策略类，用于进程切换时决定使用那种策略详见3.4.1 schedule()的实现*/

从代码可以看出到不同的调度策略有不同的调度函数组合：

   struct sched_entity se; 调度实体    struct sched_rt_entity rt; 实时进程调度实体
/*
policy 保存了对该进程应用的调度策略。
进程的调度策略有6种
SCHED_NORMAL SCHED_FIFO SCHED_RR SCHED_BATCH SCHED_IDLE
普通进程调度策略: SCHED_NORMAL、SCHED_BATCH、SCHED_IDLE，这些都是通过完全公平调度器来处理的
实时进程调度策略: SCHED_RR、SCHED_FIFO 这些都是通过实时调度器来处理的
*/

   unsigned int policy; //保存调度策略的具体类型

   cpumask_t cpus_allowed;

        cpumask_t cpus_allowed;

目前内核实现的调度策略有五种，对应的调度类有四种，这个我们放到后面再说

目前系統中,Scheduling Class的优先级顺序为StopTask > RealTime > Fair > IdleTask

开发者可以根据己的设计需求,來把所属的Task配置到不同的Scheduling Class中.

/*RCU 同步源语*/

#ifdef CONFIG_PREEMPT_RCU    int rcu_read_lock_nesting;    char rcu_read_unlock_special;    struct list_head rcu_node_entry; #endif /* #ifdef CONFIG_PREEMPT_RCU */ #ifdef CONFIG_TREE_PREEMPT_RCU    struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ #ifdef CONFIG_RCU_BOOST    struct rt_mutex *rcu_boost_mutex; #endif /* #ifdef CONFIG_RCU_BOOST */

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)    struct sched_info sched_info; #endif

   struct list_head tasks; /*使用链表组织进程*/

/*除了内核线程（Kernel Thread），每个进程都拥有自己的地址空间（也叫虚拟空间），用mm_struct 来描述。active_mm是为内核线程而引入的。因为内核线程没有自己的地址空间，为了让内核线程与普通进程具有统一的上下文切换方式，当内核线程进行上下文切换时，让切换进来的线程的active_mm指向刚被调度出去的进程的active_mm（如果进程的mm 域不为空，则其active_mm 域与mm 域相同）*/

   struct mm_struct *mm, *active_mm;

#ifdef CONFIG_COMPAT_BRK    unsigned brk_randomized:1; #endif #if defined(SPLIT_RSS_COUNTING)    struct task_rss_stat   rss_stat; #endif

/* 进程退出的相关状态信息 */ int exit_state; int exit_code, exit_signal; int pdeath_signal; /* The signal sent when the parent dies */ unsigned int jobctl; /* JOBCTL_*, siglock protected */ /* ??? */

   unsigned int personality;    unsigned did_exec:1;    unsigned in_execve:1;   /* Tell the LSMs that the process is doing an                * execve */    unsigned in_iowait:1;

   /* Revert to default priority/policy when forking */    unsigned sched_reset_on_fork:1;    unsigned sched_contributes_to_load:1;

#ifdef CONFIG_GENERIC_HARDIRQS    /* IRQ handler threads */    unsigned irq_thread:1; #endif

   pid_t pid;    pid_t tgid;

/*堆栈保护编译加参数-fstack-protector*/

#ifdef CONFIG_CC_STACKPROTECTOR    /* Canary value for the -fstack-protector gcc feature */    unsigned long stack_canary; #endif

/*    * pointers to (original) parent process, youngest child, younger sibling,    * older sibling, respectively. (p->father can be replaced with    * p->real_parent->pid)    */    struct task_struct __rcu *real_parent; /* real parent process */    struct task_struct __rcu *parent; /* recipient of SIGCHLD, wait4() reports */    /*    * children/sibling forms the list of my natural children    */

/*父子关系*/    struct list_head children;   /* list of my children */    struct list_head sibling;   /* linkage in my parent's children list */    struct task_struct *group_leader;   /* threadgroup leader */

   /*    * ptraced is the list of tasks this task is using ptrace on.    * This includes both natural children and PTRACE_ATTACH targets.    * p->ptrace_entry is p's link on the p->parent->ptraced list.    */    struct list_head ptraced;    struct list_head ptrace_entry;

   /* PID/PID hash table linkage. */    struct pid_link pids[PIDTYPE_MAX];    struct list_head thread_group;

   struct completion *vfork_done;       /* for vfork() */    int __user *set_child_tid;       /* CLONE_CHILD_SETTID */    int __user *clear_child_tid;       /* CLONE_CHILD_CLEARTID */

   cputime_t utime, stime, utimescaled, stimescaled;    cputime_t gtime; #ifndef CONFIG_VIRT_CPU_ACCOUNTING    cputime_t prev_utime, prev_stime; #endif    unsigned long nvcsw, nivcsw; /* context switch counts 上下文切换次数，说明发生了调度 */    struct timespec start_time;        /* monotonic time */    struct timespec real_start_time;   /* boot based time */ /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */    unsigned long min_flt, maj_flt;

   struct task_cputime cputime_expires;    struct list_head cpu_timers[3];

/* process credentials */    const struct cred __rcu *real_cred; /* objective and real subjective task                    * credentials (COW) */    const struct cred __rcu *cred;   /* effective (overridable) subjective task                    * credentials (COW) */    struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */

   char comm[TASK_COMM_LEN]; /* executable name excluding path                - access with [gs]et_task_comm (which lock                it with task_lock())                - initialized normally by setup_new_exec */ /* file system info */    int link_count, total_link_count; #ifdef CONFIG_SYSVIPC /* ipc stuff */    struct sysv_sem sysvsem; #endif #ifdef CONFIG_DETECT_HUNG_TASK /* hung task detection */    unsigned long last_switch_count; /* 内核线程发生上下文切换次数 */ #endif

/* CPU-specific state of this task */    struct thread_struct thread; /* filesystem information */    struct fs_struct *fs; /* open file information */    struct files_struct *files; /* namespaces */    struct nsproxy *nsproxy;

/* signal handlers */    struct signal_struct *signal;    struct sighand_struct *sighand;

   sigset_t blocked, real_blocked;    sigset_t saved_sigmask;   /* restored if set_restore_sigmask() was used */    struct sigpending pending;

   unsigned long sas_ss_sp;    size_t sas_ss_size;

   int (*notifier)(void *priv);    void *notifier_data;    sigset_t *notifier_mask;    struct audit_context *audit_context; #ifdef CONFIG_AUDITSYSCALL    uid_t loginuid;    unsigned int sessionid; #endif    seccomp_t seccomp;

/* Thread group tracking */     u32 parent_exec_id;     u32 self_exec_id; /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, * mempolicy */    spinlock_t alloc_lock;

   /* Protection of the PI data structures: */    raw_spinlock_t pi_lock;

#ifdef CONFIG_RT_MUTEXES    /* PI waiters blocked on a rt_mutex held by this task */    struct plist_head pi_waiters;    /* Deadlock detection and priority inheritance handling */    struct rt_mutex_waiter *pi_blocked_on; #endif

#ifdef CONFIG_DEBUG_MUTEXES    /* mutex deadlock detection */    struct mutex_waiter *blocked_on; #endif

#ifdef CONFIG_TRACE_IRQFLAGS

中断相关    unsigned int irq_events;    unsigned long hardirq_enable_ip;    unsigned long hardirq_disable_ip;    unsigned int hardirq_enable_event;    unsigned int hardirq_disable_event;    int hardirqs_enabled;    int hardirq_context;    unsigned long softirq_disable_ip;    unsigned long softirq_enable_ip;    unsigned int softirq_disable_event;    unsigned int softirq_enable_event;    int softirqs_enabled;    int softirq_context; #endif

#ifdef CONFIG_LOCKDEP

死锁检测 # define MAX_LOCK_DEPTH 48UL    u64 curr_chain_key;    int lockdep_depth;    unsigned int lockdep_recursion;    struct held_lock held_locks[MAX_LOCK_DEPTH];    gfp_t lockdep_reclaim_gfp; #endif

/* journalling filesystem info */    void *journal_info;

/* stacked block device info */    struct bio_list *bio_list;

#ifdef CONFIG_BLOCK /* stack plugging */    struct blk_plug *plug; #endif

/* VM state */
   struct reclaim_state *reclaim_state;

   struct backing_dev_info *backing_dev_info;

   struct io_context *io_context;

   unsigned long ptrace_message;    siginfo_t *last_siginfo; /* For ptrace use. */    struct task_io_accounting ioac;

#if defined(CONFIG_TASK_XACCT)    u64 acct_rss_mem1;   /* accumulated rss usage */    u64 acct_vm_mem1;   /* accumulated virtual memory usage */    cputime_t acct_timexpd;   /* stime + utime since last update */ #endif #ifdef CONFIG_CPUSETS    nodemask_t mems_allowed;   /* Protected by alloc_lock */    seqcount_t mems_allowed_seq;   /* Seqence no to catch updates */    int cpuset_mem_spread_rotor;    int cpuset_slab_spread_rotor; #endif #ifdef CONFIG_CGROUPS    /* Control Group info protected by css_set_lock */    struct css_set __rcu *cgroups;    /* cg_list protected by css_set_lock and tsk->alloc_lock */    struct list_head cg_list; #endif #ifdef CONFIG_FUTEX    struct robust_list_head __user *robust_list; #ifdef CONFIG_COMPAT    struct compat_robust_list_head __user *compat_robust_list; #endif    struct list_head pi_state_list;    struct futex_pi_state *pi_state_cache; #endif #ifdef CONFIG_PERF_EVENTS    struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];    struct mutex perf_event_mutex;    struct list_head perf_event_list; #endif #ifdef CONFIG_NUMA

非一致性内存访问    struct mempolicy *mempolicy;   /* Protected by alloc_lock */    short il_next;    short pref_node_fork; #endif    struct rcu_head rcu;

   /*    * cache last used pipe for splice    */    struct pipe_inode_info *splice_pipe; #ifdef   CONFIG_TASK_DELAY_ACCT    struct task_delay_info *delays; #endif #ifdef CONFIG_FAULT_INJECTION    int make_it_fail; #endif    /*    * when (nr_dirtied >= nr_dirtied_pause), it's time to call    * balance_dirty_pages() for some dirty throttling pause    */    int nr_dirtied;    int nr_dirtied_pause;    unsigned long dirty_paused_when; /* start of a write-and-pause period */

#ifdef CONFIG_LATENCYTOP    int latency_record_count;    struct latency_record latency_record[LT_SAVECOUNT]; #endif    /*    * time slack values; these are used to round up poll() and    * select() etc timeout values. These are in nanoseconds.    */    unsigned long timer_slack_ns;    unsigned long default_timer_slack_ns;

   struct list_head   *scm_work_list; #ifdef CONFIG_FUNCTION_GRAPH_TRACER    /* Index of current stored address in ret_stack */    int curr_ret_stack;    /* Stack of return addresses for return function tracing */    struct ftrace_ret_stack   *ret_stack;    /* time stamp for last schedule */    unsigned long long ftrace_timestamp;    /*    * Number of functions that haven't been traced    * because of depth overrun.    */    atomic_t trace_overrun;    /* Pause for the tracing */    atomic_t tracing_graph_pause; #endif #ifdef CONFIG_TRACING    /* state flags for use by tracers */    unsigned long trace;    /* bitmask and counter of trace recursion */    unsigned long trace_recursion; #endif /* CONFIG_TRACING */ #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */    struct memcg_batch_info {        int do_batch;   /* incremented when batch uncharge started */        struct mem_cgroup *memcg; /* target memcg of uncharge */        unsigned long nr_pages;   /* uncharged usage */        unsigned long memsw_nr_pages; /* uncharged mem+swap usage */    } memcg_batch; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT    atomic_t ptrace_bp_refcnt; #endif };

2.2.进程链表的维护

2.2.1运行队列

内核根据不同的优先级维护了不同优先级的待执行链表，称之为运行队列。他们都被prio_array_t描述

static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
	update_rq_clock(rq);
	sched_info_queued(p);
	p->sched_class->enqueue_task(rq, p, flags);
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
	update_rq_clock(rq);
	sched_info_dequeued(p);
	p->sched_class->dequeue_task(rq, p, flags);
}

2.2.2 等待队列

对于等待状态的进程，内核将其归类到了等待队列中：

等待队列链表中的元素就是等待同一类资源的进程，要根据是否资源互斥对他们做分类(flag)，有选择性的唤醒相应的队列(func):

他的操作函数在/kernel/wait.c中

2.3.进程间关系

描述进程间关系的结构体参数是：

/* * pointers to (original) parent process, youngest child, younger sibling, * older sibling, respectively. (p->father can be replaced with * p->real_parent->pid) */ struct task_struct __rcu *real_parent; /* real parent process */ struct task_struct __rcu *parent; /* recipient of SIGCHLD, wait4() reports */ /* * children/sibling forms the list of my natural children */

/*父子关系*/ struct list_head children; /* list of my children */ struct list_head sibling; /* linkage in my parent's children list */ struct task_struct *group_leader; /* threadgroup leader */

2.4.进程切换主要内容

4.1 硬件上下文

进程恢复执行时，CPU 寄存器装载的值叫硬件上下文，这些值一部分放在TSS段，其他的存在进程堆栈中。（TSS 全称task state segment，是指在操作系统进程管理的过程中，任务（进程）切换时的任务现场信息。但是现在已经不使用TSS进行切换了）

进程切换全部发生在内核态，在发生切换之前，用户态堆栈已经全部保存在内核态堆栈上。

4.2 thread字段

在进程描述符thread_info结构体中可以找到cpu_context变量，

他应该就是发生切换时候，用于存储CPU的大部分寄存器值的，从注释可以看出使用__switch_to()接口切换，硬件上下文紧邻在cpu_domain字段后面：


/*
 * low level task data that entry.S needs immediate access to.
 * __switch_to() assumes cpu_context follows immediately after cpu_domain.
 */
struct thread_info {
	unsigned long		flags;		/* low level flags */
	int			preempt_count;	/* 0 => preemptable, <0 => bug */
	mm_segment_t		addr_limit;	/* address limit */
	struct task_struct	*task;		/* main task structure */
	struct exec_domain	*exec_domain;	/* execution domain */
	__u32			cpu;		/* cpu */
	__u32			cpu_domain;	/* cpu domain */
	struct cpu_context_save	cpu_context;	/* cpu context */
	__u32			syscall;	/* syscall number */
	__u8			used_cp[16];	/* thread used copro */
	unsigned long		tp_value;
	struct crunch_state	crunchstate;
	union fp_state		fpstate __attribute__((aligned(8)));
	union vfp_state		vfpstate;
#ifdef CONFIG_ARM_THUMBEE
	unsigned long		thumbee_state;	/* ThumbEE Handler Base register */
#endif
	struct restart_block	restart_block;
};

struct cpu_context_save结构内容如下：


struct cpu_context_save {
	__u32	r4;
	__u32	r5;
	__u32	r6;
	__u32	r7;
	__u32	r8;
	__u32	r9;
	__u32	sl;
	__u32	fp;
	__u32	sp;
	__u32	pc;
	__u32	extra[2];		/* Xscale 'acc' register, etc */
};

可以看出他不包含r0 r1 r2前三个寄存器，这是用来传参用的，分别代表切换涉及到的三个进程结构体指针。

4.3 执行进程切换

1️⃣ 切换全局页目录，安装新的地质空间

2️⃣切换内核堆栈，切换硬件上下文 ------> switch_to()----->__switch_to()

__switch_to()中 r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info


#define switch_to(prev,next,last)					\
do {									\
	last = __switch_to(prev,task_thread_info(prev), task_thread_info(next));	\
} while (0)

/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info
 * previous and next are guaranteed not to be the same.
 */
ENTRY(__switch_to)
 UNWIND(.fnstart	)
 UNWIND(.cantunwind	)
	add	ip, r1, #TI_CPU_SAVE
	ldr	r3, [r2, #TI_TP_VALUE]
 ARM(	stmia	ip!, {r4 - sl, fp, sp, lr} )	@ Store most regs on stack
 THUMB(	stmia	ip!, {r4 - sl, fp}	   )	@ Store most regs on stack
 THUMB(	str	sp, [ip], #4		   )
 THUMB(	str	lr, [ip], #4		   )
#ifdef CONFIG_CPU_USE_DOMAINS
	ldr	r6, [r2, #TI_CPU_DOMAIN]
#endif
	set_tls	r3, r4, r5
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
	ldr	r7, [r2, #TI_TASK]
	ldr	r8, =__stack_chk_guard
	ldr	r7, [r7, #TSK_STACK_CANARY]
#endif
#ifdef CONFIG_CPU_USE_DOMAINS
	mcr	p15, 0, r6, c3, c0, 0		@ Set domain register
#endif
	mov	r5, r0
	add	r4, r2, #TI_CPU_SAVE
	ldr	r0, =thread_notify_head
	mov	r1, #THREAD_NOTIFY_SWITCH
	bl	atomic_notifier_call_chain
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
	str	r7, [r8]
#endif
 THUMB(	mov	ip, r4			   )
	mov	r0, r5
 ARM(	ldmia	r4, {r4 - sl, fp, sp, pc}  )	@ Load all regs saved previously
 THUMB(	ldmia	ip!, {r4 - sl, fp}	   )	@ Load all regs saved previously
 THUMB(	ldr	sp, [ip], #4		   )
 THUMB(	ldr	pc, [ip]		   )
 UNWIND(.fnend		)
ENDPROC(__switch_to)

为什么进程切换要用到三个进程指针？

现在假设只有两个参数, 考虑有三个进程ABC，如果只有两个参数的话，从进程A->进程B, 后来要从另一个进程通常不会是B，现假定为C再切回到A，在A获得处理器开始运行, 这时候它的prev指向A, next指向B，这样我们失去了进程C的信息了，而实际上在schedule函数调用了switch_to后面还调用_schedul_tail来收尾，这时需要用到切换回A之前进程C的信息，可惜已经丢掉了, 进程A的prev是指向它自己。

2.5.进程创建

2.5.1 clone fork vfork系统调用

Unix系统通过复制父进程的所有资源来创建进程，但是这样的效率太低，因此引入了三种机制：

1️⃣ 写时复制：允许父子进程读取相同的物理页，单只有子进程准备写的时候才会去复制新的物理页

2️⃣轻量级进程允许父子进程共享很多内核数据结构：页表打开文件表信号处理等

3️⃣vfork允许子进程共享父进程的内存地址空间

从下面的代码中可以看出，clone fork vfork系统调用最终都是调用的do_fork()，只是参数不一样。

·SIGCHLD，在一个进程终止或者停止时，将SIGCHLD信号发送给其父进程，按系统默认将忽略此信号，如果父进程希望被告知其子系统的这种状态，则应捕捉此信号。

·CLONE_VM 共享所有内存描述符和页表； CLONE_VFORK ：vfork专用

/* Fork a new task - this creates a new program thread.
 * This is called indirectly via a small wrapper
 */
asmlinkage int sys_fork(struct pt_regs *regs)
{
#ifdef CONFIG_MMU
	return do_fork(SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);
#else
	/* can not support in nommu mode */
	return(-EINVAL);
#endif
}

/* Clone a task - this clones the calling program thread.
 * This is called indirectly via a small wrapper
 */
asmlinkage int sys_clone(unsigned long clone_flags, unsigned long newsp,
			 int __user *parent_tidptr, int tls_val,
			 int __user *child_tidptr, struct pt_regs *regs)
{
	if (!newsp)
		newsp = regs->ARM_sp;

	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}

asmlinkage int sys_vfork(struct pt_regs *regs)
{
	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);
}

·long do_fork (

unsigned long clone_flags, 低1字节表示子进程结束给父进程发送的信号，高字节用于CLONE_XXX标志
  unsigned long stack_start, 把父进程用户态的堆栈指针赋值给子进程的esp寄存器
   struct pt_regs *regs, 指向通用寄存器的指针，他们是从用户态切换到内核态时保存到内核堆栈中的
   unsigned long stack_size, 未使用，设置为0
   int __user *parent_tidptr, ptid
   int __user *child_tidptr) ctid

很明显，

p = copy_process(clone_flags, stack_start, regs, stack_size,child_tidptr, NULL, trace);

是关键步骤，我们后面再看

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          struct pt_regs *regs,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Do some preliminary argument and permissions checking before we
     * actually start allocating stuff
     */
    if (clone_flags & CLONE_NEWUSER) {
        if (clone_flags & CLONE_THREAD)
            return -EINVAL;
        /* hopefully this check will go away when userns support is
         * complete
         */
        if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
                !capable(CAP_SETGID))
            return -EPERM;
    }

    /*
     * Determine whether and which event to report to ptracer.  When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (likely(user_mode(regs)) && !(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(clone_flags, stack_start, regs, stack_size,
             child_tidptr, NULL, trace);
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {
        struct completion vfork;

        trace_sched_process_fork(current, p);

        nr = task_pid_vnr(p);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        wake_up_new_task(p);

        /* forking complete and child started to run, tell ptracer */
        if (unlikely(trace))
            ptrace_event(trace, nr);

        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
        }
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}

2.5.2 内核线程

内核线程在内核初始化的最后阶段通过kernel_thread()创建，实质上也是调用do_fork()。

static noinline void __init_refok rest_init(void)
{
   int pid;

   rcu_scheduler_starting();
   /*
   * We need to spawn init first so that it obtains pid 1, however
   * the init task will end up wanting to create kthreads, which, if
   * we schedule it before we create kthreadd, will OOPS.
   */
   kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
   numa_default_policy();
   pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
   rcu_read_lock();
   kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
   rcu_read_unlock();
   complete(&kthreadd_done);

   /*
   * The boot idle thread must execute schedule()
   * at least once to get things moving:
   */
   init_idle_bootup_task(current);
   schedule_preempt_disabled();
   /* Call into cpu_idle with preempt disabled */
   cpu_idle();
}

下面的示意图很好的解释了上面的代码所处的位置和做的事情：

Linux下有3个特殊的进程，idle进程(PID = 0), init进程(PID = 1)和kthreadd(PID = 2)

idle进程由系统自动创建, 运行在内核态
idle进程其pid=0，其前身是系统创建的第一个进程，也是唯一一个没有通过fork或者kernel_thread产生的进程。完成加载系统后，演变为进程调度、交换；

init进程由idle通过kernel_thread创建，在内核空间完成初始化后, 加载init程序
并最终用户空间由0进程创建，完成系统的初始化. 是系统中所有其它用户进程的祖先进程
Linux中的所有进程都是有init进程创建并运行的。首先Linux内核启动，然后在用户空间中启动init进程，再启动其他系统进程。在系统启动完成完成后，init将变为守护进程监视系统其他进程。

kthreadd进程由idle通过kernel_thread创建，并始终运行在内核空间, 负责所有内核线程的调度和管理它的任务就是管理和调度其他内核线程kernel_thread, 会循环执行一个kthreadd的函数，该函数的作用就是运行kthread_create_list全局链表中维护的kthread, 当我们调用kernel_thread创建的内核线程会被加入到此链表中，因此所有的内核线程都是直接或者间接的以kthreadd为父进程；

2.6 多线程

实际工作中，我们一般不会使用单个进程去完成所有任务，一个进程下面还有相应的子线程，这块涉及到多线程编程，先挖个坑。

三、进程调度

终于到了进程管理的重点部分。进程调度主要关心什么时候调度以及调度哪个进程来运行。

3.1 调度策略

3.1.1 调度策略目的

由上可知，调度需要要识别程序属于哪一类。

3.1.2 系统调用

3.1.3 内核抢占

内核比较正在运行的程序和进入RUNNING状态的程序，谁的动态优先级高，如果高于正在运行的程序，则会发生抢占。

抢占时伴随着schedule()的执行。内核提供了一个TIF_NEED_RESCHED标志来表明是否要用schedule()调度一次,这样的设置牺牲了内核切换上下文的开销，但是使得内核变得更加灵活，也拥有了更大的后台吞吐量。

3.1.4 时间片长度

时间片长度决定了内核发生调度的时间，不能太长也不能太短，太短上下文切换高，太长会让人觉得不够实时。

3.1.5 调度算法类型

可以看出实时进程和普通进程调度差异很大。

3.1.6 普通进程调度

·静态优先级 [100-139]：数值越大，优先级越低，子进程从父进程集成静态优先级，但是可以通过nice值传递或者setpriority()接口修改这个优先级

·基本时间片：很明显，优先级越高的进程获得的时间片长度也越长

·动态优先级和睡眠时间

进程睡眠时间越长，bonus越大，导致动态优先级越小。

3.1.7 实时进程调度

优先级范围[0-99], 实时进程运行期间，禁止低优先级进程运行，而且他总是被系统当成没有用完时间片的活动进程，除非发生以下情况，他才会让出CPU：

1️⃣调用sched_yield()主动让出CPU

2️⃣被另外一个优先级更高的实时进程抢占

3️⃣执行了阻塞操作，进入了睡眠

4️⃣进程停止或者被杀死

5️⃣进程是基于时间片轮转的，且用完了时间片

从上面可知，实时进程如果是基于5️⃣让出，其执行时间片长度是影响系统性能的关键因素：

一般来说，我们的网络收发包，高精度时钟等机制都是在软中断中去实现，而且是在各个核上的ksoftirqd进程中去调度，ps可以看到这个线程的优先级是80，属于实时进程，如果有一个优先级更高的实时进程，没次使用的时间片都很高，可以预想会对软中断任务产生性能上的影响。

同时可以看到，内核存在大量优先级一样的实时进程，他们之间遵守的就是1️⃣3️⃣4️⃣规则了：

更高优先级的实时进程则是一些更加重要的任务：

看门狗的优先级高达9，可见无论何时，其都是第一优先执行的任务；[writeback]负责回写脏页；kworker/1:0H负责处理内核高优先级的实际任务处理，是一个占位进程。

3.2 调度程序使用的数据结构

3.2.1 runqueue结构

kernel/sched/sched.h:

/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
struct rq {
   /* runqueue lock: */
   raw_spinlock_t lock;

   /*
   * nr_running and cpu_load should be in the same cacheline because
   * remote CPUs use both these fields when doing load calculation.
   */
   unsigned long nr_running;
   #define CPU_LOAD_IDX_MAX 5
   unsigned long cpu_load[CPU_LOAD_IDX_MAX]; //负载均衡算法负载计算相关
   unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ
   u64 nohz_stamp;
   unsigned long nohz_flags;
#endif
   int skip_clock_update;

   /* capture load from *all* tasks on this cpu: */
   struct load_weight load;         //负载值当前cpu上运行的所有进程数量
   unsigned long nr_load_updates;
   u64 nr_switches;

   struct cfs_rq cfs; //CFS调度器调度队列
   struct rt_rq rt;      //   实时进程调度队列

#ifdef CONFIG_FAIR_GROUP_SCHED
   /* list of leaf cfs_rq on this cpu: */
   struct list_head leaf_cfs_rq_list;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
   struct list_head leaf_rt_rq_list;
#endif

   /*
   * This is part of a global counter where only the total sum
   * over all CPUs matters. A task can increase this counter on
   * one CPU and if it got migrated afterwards it may decrease
   * it on another CPU. Always updated under the runqueue lock:
   */
   unsigned long nr_uninterruptible;

   struct task_struct *curr, *idle, *stop;
   unsigned long next_balance;
   struct mm_struct *prev_mm;

   u64 clock;
   u64 clock_task;

   atomic_t nr_iowait;

#ifdef CONFIG_SMP
   struct root_domain *rd;
   struct sched_domain *sd;

   unsigned long cpu_power;

   unsigned char idle_balance;
   /* For active balancing */
   int post_schedule;
   int active_balance;
   int push_cpu;
   struct cpu_stop_work active_balance_work;
   /* cpu of this runqueue: */
   int cpu;
   int online;

   struct list_head cfs_tasks;

   u64 rt_avg;
   u64 age_stamp;
   u64 idle_stamp;
   u64 avg_idle;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
   u64 prev_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
   u64 prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
   u64 prev_steal_time_rq;
#endif

   /* calc_load related fields */
   unsigned long calc_load_update;
   long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
   int hrtick_csd_pending;
   struct call_single_data hrtick_csd;
#endif
   struct hrtimer hrtick_timer;
#endif

#ifdef CONFIG_SCHEDSTATS
   /* latency stats */
   struct sched_info rq_sched_info;
   unsigned long long rq_cpu_time;
   /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

   /* sys_sched_yield() stats */
   unsigned int yld_count;

   /* schedule() stats */
   unsigned int sched_count;
   unsigned int sched_goidle;

   /* try_to_wake_up() stats */
   unsigned int ttwu_count;
   unsigned int ttwu_local;
#endif

#ifdef CONFIG_SMP
   struct llist_head wake_list;
#endif
};

在每个 CPU 中都有一个自身的运行队列 rq，每个活动进程只出现在一个运行队列中，在多个 CPU 上同时运行一个进程是不可能的, 因此他是一个每cpu变量：

其操作方法如下：

系统中所有的运行队列都在 runqueues 数组中，该数组的每个元素分别对应于系统中的一个 CPU。在单处理器系统中，由于只需要一个就绪队列，因此数组只有一个元素。

3.3调度时机

何时发生调度？

·时机1，进程要调用 sleep() 或 exit() 等函数进行状态转换，这些函数会主动调用调度程序进行进程调度。

·时机2，由于进程的时间片是由时钟中断来更新的，因此，这种情况和时机4 是一样的。

·时机3，当设备驱动程序执行长而重复的任务时，直接调用调度程序。在每次反复循环中，驱动程序都检查 need_resched 的值，如果必要，则调用调度程序 schedule() 主动放弃 CPU。

·时机4 ，不管是从中断、异常还是系统调用返回，最终都调用 ret_from_sys_call()，由这个函数进行调度标志的检测，如果必要，则调用调用调度程序。那么，为什么从系统调用返回时要调用调度程序呢？这当然是从效率考虑。从系统调用返回意味着要离开内核态而返回到用户态，而状态的转换要花费一定的时间，因此，在返回到用户态前，系统把在内核态该处理的事全部做完。

3.4调度程序使用的函数

1.try_to_wakeup() 唤醒处于睡眠或者停止的进程将其插入本地CPU的运行队列

2.recalc_task_prio() 更新动态优先级

3.schedule() 从运行队列中取出一个进程分配给cpu运行

4.load_balance() 负载均衡相关

5.scheduler_tick() 更新时间片顾名思义针对基于时间片轮转的进程有效，如果是实时进程（FIFO）不产生效果

3.4.1 schedule()实现

./kernel//sched/core.c

asmlinkage void __sched schedule(void)
{
   struct task_struct *tsk = current;

   sched_submit_work(tsk);
   __schedule();
}

EXPORT_SYMBOL(schedule);

/*
* __schedule() is the main scheduler function.
*/
static void __sched __schedule(void)
{

    /*prev 表示调度之前的进程, next 表示调度之后的进程 */
   struct task_struct *prev, *next;
   unsigned long *switch_count;
   struct rq *rq;
   int cpu;

need_resched:
   preempt_disable(); //调度期间禁止抢占
   cpu = smp_processor_id(); //获取当前需要调度的CPU
   rq = cpu_rq(cpu); //获取当前CPU的运行队列
   rcu_note_context_switch(cpu); //rcu同步机制，参考：Linux RCU机制_风雨夕的博客-CSDN博客_linux rcu1. 简介RCU (Read-copy update)是2002年10月添加到Linux内核中的一种同步机制。作为数据同步的一种方式，在当前的Linux内核中发挥着重要的作用。RCU主要针对的数据对象是链表，目的是提高遍历读取数据的效率，为了达到目的使用RCU机制读取数据的时候不对链表进行耗时的加锁操作。这样在同一时间可以有多个线程同时读取该链表，并且允许一个线程对链表进行修改（修改的时候，需要加锁）。RCU适用于需要频繁的读取数据，而相应修改数据并不多的情景，例如在文件系统中，经常需要查找定位目https://blog.csdn.net/qq_33095733/article/details/123708142      prev = rq->curr;        //prev curr分别代表之前和当前运行线程，这里当前运行线程即将被调度出去成为上一个线程

   schedule_debug(prev);

   if (sched_feat(HRTICK))
       hrtick_clear(rq);

   raw_spin_lock_irq(&rq->lock);

   switch_count = &prev->nivcsw;
   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
       if (unlikely(signal_pending_state(prev->state, prev))) {
           prev->state = TASK_RUNNING;
       } else {
           deactivate_task(rq, prev, DEQUEUE_SLEEP);
           prev->on_rq = 0;

           /*
           * If a worker went to sleep, notify and ask workqueue
           * whether it wants to wake up a task to maintain
           * concurrency.
           */
           if (prev->flags & PF_WQ_WORKER) {
               struct task_struct *to_wakeup;

               to_wakeup = wq_worker_sleeping(prev, cpu);
               if (to_wakeup)
                   try_to_wake_up_local(to_wakeup);
           }
       }
       switch_count = &prev->nvcsw;
   }

   pre_schedule(rq, prev);

   if (unlikely(!rq->nr_running))        //不太可能没有任务运行了
       idle_balance(cpu, rq); //没有任务则调用idle_balance，盲猜使用IDLE角度策略

   put_prev_task(rq, prev);         //调用到prev->sched_class->put_prev_task(rq, prev）

这其实是调度策略中注册的策略接口，具体使用实时策略还是CFS策略要看prev使用的类型：

     next = pick_next_task(rq); //从运行队列pick一个新的队列也是调度策略中注册的接口

   clear_tsk_need_resched(prev);
   rq->skip_clock_update = 0;

   if (likely(prev != next)) {
       rq->nr_switches++; // 切换次数+1
       rq->curr = next; //进程指针指向新的进程
       ++*switch_count;

       context_switch(rq, prev, next); /* unlocks the rq */ 上下文切换，保存老进程上文，设置加载好新进程下文，这是进程运行环境的设置
       /*
       * The context switch have flipped the stack from under us
       * and restored the local variables which were saved when
       * this task called schedule() in the past. prev == current
       * is still correct, but it can be moved to another cpu/rq.
       */
       cpu = smp_processor_id();
       rq = cpu_rq(cpu);
   } else
       raw_spin_unlock_irq(&rq->lock);

   post_schedule(rq);

   sched_preempt_enable_no_resched();
   if (need_resched())
       goto need_resched;//调度的进程和之前的进程一样重新调度
}

可见 schedule()主要做了以下几件事：

（1）清理当前运行中的进程

（2）选择下一个要运行的进程（pick_next_task）

（3）设置新进程的运行环境

（4）进程上下文切换

3.5多CPU的负载均衡

调度域：Linux内核并不在单个cpu之间做负载平衡，而是提出了调度域概念并以此为单位金恒负载。

3.5 调度相关的系统调用

nice（）用于降低的基本优先级只影响调用它的进程

setpriority、getpriority 可以作用于给定进程组的所有进程

sched_getaffinity/sched_setaffinity 设置进程的CPU亲和力

四、参考文章

1. 《深入理解Linux内核第三版》

2. Linux内核的整体架构1. 前言本文是“Linux内核分析”系列文章的第一篇，会以内核的核心功能为出发点，描述Linux内核的整体架构，以及架构之下主要的软件子系统。之后，会介绍Linux内核源文...http://www.wowotech.net/linux_kenrel/11.html

水乡夜航

关注

7
点赞
踩
25

收藏

觉得还不错? 一键收藏
0
评论
Linux内核框架之内核进程

前言内核主要架构由五部分构成：内存管理，进程调度和管理，文件系统、设备管理和驱动，网络驱动。本系列文希望通过代码实践和参考文章的方式力争对这几个部分做出深入的了解。目录前言一、Linux的进程介绍1.进程线程和轻量级进程2.进程调度的结构组成二、进程的静态描述1.进程描述符2.进程链表的维护2.1运行队列2.2 等待队列3.进程间关系4.进程切换一、Linux的进程介绍1.进程线程和轻量级进程进.............
复制链接

扫一扫