深入Linux内核架构第二章学习笔记

最新推荐文章于 2022-09-02 20:29:09 发布

置顶 hongrui

最新推荐文章于 2022-09-02 20:29:09 发布

阅读量3.8k

点赞数

分类专栏： Linux 文章标签： linux内核 struct linux 编译器 null user

本文链接：https://blog.csdn.net/hongrui/article/details/5860897

版权

Linux 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

深入linux内核架构是本根据Linux源代码讲述内核的书，比深入理解linux内核更加贴近代码，讲的更深入浅出一些。

本书第二章主要讲述Linux进程的调度，linux的进程都可以描述为task_struct的结构，这个结构包含进程运行的所有信息。

task_struct的有一个字段是

volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */

来跟踪线程的运行状态，这个volatile 关键字很重要，它告诉编译器定义的变量随时都有可能改变，程序每次需要存储或读取这个变量的时候，直接从变量地址中读取数据。如果没有 volatile关键字，则编译器可能优化读取和存储，可能使用寄存器中的值，如果这个变量由别的程序更新了的话，将出现不一致的现象。

task_struct另外一个重要的字段是类型为pid_t的pid，它默认的最大值是32768。和windows 类似的是linux的线程id和进程id肯定不同，从内核来看，一个用户层创建的线程就是一个普通进程，从源码来看他们的实现非常类似，都调用了do_fork 函数，这是最后的两个参数不同。IA32 linux fork内核实现，

arch/x86/kernel/process_32.c

asmlinkage int sys_fork(struct pt_regs regs)

{

return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);

}

线程的实现

arch/x86/kernel/process_32.c

asmlinkage int sys_clone(struct pt_regs regs)

{

unsigned long clone_flags;

unsigned long newsp;

int __user *parent_tidptr, *child_tidptr;

clone_flags = regs.ebx;

newsp = regs.ecx;

parent_tidptr = (int __user *)regs.edx;

child_tidptr = (int __user *)regs.edi;

if (!newsp)

newsp = regs.esp;

return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);

}

(parent_tidptr和child_tidptr是指向用户空间的指针，对应linux的线程实现，目前通用的linux实现有LinuxThreads和NPTL(Native Posix Thread Library)。

NPTL的支持是在glibc中实现的，与内核无关。

大家输入 getconf GNU_LIBPTHREAD_VERSION 便可以看到本机NPTL的版本。

如果是早期的glibc，使用以下命令查看

$(ldd /bin/ls | grep libc.so | awk '{print $3}') | grep 'Threads|threads'

Linux 2.4 早期版本中使用linuxpthreads库,那么在多线程程序中,会产生一个用户线程来维护当前的线程,例如在main()中创建了2个线程,那么用 ps axm 查看进程表,程序应有4个线程。使用NPTL后大家就会看到一个进程在运行，通过/proc/pid/status产看当前进程的线程数。

由于每个task_struct都嵌入了一个schedule entity的实例，因此进程是可以调度的。

内核版本2.6.24之后的kernel在runqueue中以新增的cfs_rq红黑树结构存放schedule entity(SE)，记录每个task执行相关的时间数据。Kernel在runqueue新增task的時候以SE中的vruntime作为比较依据，和红黑树各SE的vruntime相比，vruntime比较小的SE往左靠拢，反之大vruntime大的SE向右靠拢，直到这个SE成为最左边或最右边的节点为止。SE中的vruntime的类型为u64。

当新的task创建出来之后task_new_fair函数通过place_entity取得新task的vruntime。

static void

place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)

{

u64 vruntime = cfs_rq->min_vruntime;

* The 'current' period is already promised to the current tasks,

* however the extra weight of the new task will slow them down a

* little, place the new task so that it fits in the slot that

* stays open at the end.

if (initial && sched_feat(START_DEBIT))

vruntime += sched_vslice(cfs_rq, se);

/* sleeps up to a single latency don't count. */

if (!initial && sched_feat(FAIR_SLEEPERS)) {

unsigned long thresh = sysctl_sched_latency;

* Convert the sleeper threshold into virtual time.

* SCHED_IDLE is a special sub-class. We care about

* fairness only relative to other SCHED_IDLE tasks,

* all of which have the same weight.

if (sched_feat(NORMALIZED_SLEEPER) && (!entity_is_task(se) ||

task_of(se)->policy != SCHED_IDLE))

thresh = calc_delta_fair(thresh, se);

* Halve their sleep time's effect, to allow

* for a gentler effect of sleepers:

if (sched_feat(GENTLE_FAIR_SLEEPERS))

thresh >>= 1;

vruntime -= thresh;

}

/* ensure we never gain time by being placed backwards. */

vruntime = max_vruntime(se->vruntime, vruntime);

se->vruntime = vruntime;

}

cfs_rq->min_vruntime就是整个红黑树最左边schedule entity的vruntime，在kernel调用__update_curr时更新，整个红黑树中所有的schedule entity的vruntime会随着时间而增加。

取得新task的vruntime之后task_new_fair函数间接调用enqueue_task_fair函数调用__enqueue_entity函式把新task的schedule entity放入cfs_rq，vruntime在__enqueue_entity函式中被用来決定新task该放在cfs_rq红黑树中的那个位置。

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;

struct rb_node *parent = NULL;

struct sched_entity *entry;

s64 key = entity_key(cfs_rq, se);

int leftmost = 1;

* Find the right place in the rbtree:

while (*link) {

parent = *link;

entry = rb_entry(parent, struct sched_entity, run_node);

* We dont care about collisions. Nodes with

* the same key stay together.

if (key < entity_key(cfs_rq, entry)) {

link = &parent->rb_left;

} else {

link = &parent->rb_right;

leftmost = 0;

}

* Maintain a cache of leftmost tree entries (it is frequently

* used):

if (leftmost)

cfs_rq->rb_leftmost = &se->run_node;

rb_link_node(&se->run_node, parent, link);

rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);

}

时钟中断通过update_process_times调用scheduler_tick，在2.6.24以前的版本中 scheduler_tick通过减少当前task的time slice直到为0，或者到某个值(这个值可以通过TIMESLICE_GRANULARITY得到)。但是在2.6.24以后不再使用time slice。

void scheduler_tick(void)

{

int cpu = smp_processor_id();

struct rq *rq = cpu_rq(cpu);

struct task_struct *curr = rq->curr;

u64 next_tick = rq->tick_timestamp + TICK_NSEC;

spin_lock(&rq->lock);

__update_rq_clock(rq);

* Let rq->clock advance by at least TICK_NSEC:

if (unlikely(rq->clock < next_tick))

rq->clock = next_tick;

rq->tick_timestamp = rq->clock;

update_cpu_load(rq);

if (curr != rq->idle) /* FIXME: needed? */

curr->sched_class->task_tick(rq, curr);

spin_unlock(&rq->lock);

#ifdef CONFIG_SMP

rq->idle_at_tick = idle_cpu(cpu);

trigger_load_balance(rq, cpu);

#endif

}

scheduler_tick调用task_tick函数，task_tick的实现方式取决于底层的实现。

SE的load是load_weight的结构体，struct load_weight {

unsigned long weight, inv_weight;

};

weight和inv_weight分別来自prio_to_weight和prio_to_wmult两个数组，如果task为real-time task或idle task，kernel直接把weight和inv_weight设为固定值，调用函数为set_load_weight。

static void set_load_weight(struct task_struct *p)

{

if (task_has_rt_policy(p)) {

p->se.load.weight = prio_to_weight[0] * 2;

p->se.load.inv_weight = prio_to_wmult[0] >> 1;

return;

}

* SCHED_IDLE tasks get minimal weight:

if (p->policy == SCHED_IDLE) {

p->se.load.weight = WEIGHT_IDLEPRIO;

p->se.load.inv_weight = WMULT_IDLEPRIO;

return;

}

p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];

p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];

}

当task是普通task，它的load取决于static priority，static priority在创建task时就创建好了，在100到140之间。

static const int prio_to_weight[40] = {

/* -20 */ 88761, 71755, 56483, 46273, 36291,

/* -15 */ 29154, 23254, 18705, 14949, 11916,

/* -10 */ 9548, 7620, 6100, 4904, 3906,

/* -5 */ 3121, 2501, 1991, 1586, 1277,

/* 0 */ 1024, 820, 655, 526, 423,

/* 5 */ 335, 272, 215, 172, 137,

/* 10 */ 110, 87, 70, 56, 45,

/* 15 */ 36, 29, 23, 18, 15,

};

static const u32 prio_to_wmult[40] = {

/* -20 */ 48388, 59856, 76040, 92818, 118348,

/* -15 */ 147320, 184698, 229616, 287308, 360437,

/* -10 */ 449829, 563644, 704093, 875809, 1099582,

/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,

/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,

/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,

/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,

/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,

};

sched_slice根据runqueue中task的个数决定每个task的合理执行时间，当current task使用CPU超出合理执行时间时，kernel将current task置为重新调度。

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

u64 slice = __sched_period(cfs_rq->nr_running);

slice *= se->load.weight;

do_div(slice, cfs_rq->load.weight);

return slice;

}

kernel判断每个task执行时间的取决于task的static priority和cfs_rq->nr_running(runqueue的task数量)。当runqueue中task小于等于5时，task的priority和执行时间由task的static priority决定，当runqueue中的task5時，則runqueue中的task也成为影响因素之一。