linux分配固定cou资源,linux调度器（三）- fair class入门

最新推荐文章于 2022-07-06 11:09:58 发布

Luna Knight

最新推荐文章于 2022-07-06 11:09:58 发布

阅读量276

点赞数

文章标签： linux分配固定cou资源

在看这一篇之前先看一下中的普通进程的模型。关于vruntime以及weight。

这两点对于fair进程至关重要。

CFS(completely fair

scheduler)并不是完全公平的，而且也不可能是完全公平的，因为进程会更加其重要性而有优先级之分。

在pick_next_entity中，仅仅是从每一个cfs rq中去选择vruntime最小的进程。所以vruntime成为fair class进程也就是普通进程的唯一的标识符。

下面就从pick_next_entity函数进入cfs吧。

static struct sched_entity *

pick_next_entity(struct cfs_rq *cfs_rq,

struct sched_entity *curr)

{

struct sched_entity *left = __pick_first_entity(cfs_rq);

这个函数看起来似乎有点的复杂，但是最主要的其实就是第一句。

Linux将所有的进程以vruntime为key，以红黑树的形式组织在cfs

rq中。显然__pick_first_entity就是去获取vruntime最小的entity。这边顺便提一句，由于cgroup的存在。Pick_next_entity获得entity并不一定是一个实际的task，而有可能是一个cgroup。

所以在pick_next_task_fair中是一个do while的形式。

do {

struct sched_entity *curr = cfs_rq->curr;

* Since we got here without doing put_prev_entity() we also

* have to consider cfs_rq->curr. If it is still a runnable

* entity, update_curr() will update its vruntime, otherwise

* forget we've ever seen it.

if (curr) {

if (curr->on_rq)

update_curr(cfs_rq);

else

curr = NULL;

* This call to check_cfs_rq_runtime() will do the

* throttle and dequeue its entity in the parent(s).

* Therefore the 'simple' nr_running test will indeed

* be correct.

if (unlikely(check_cfs_rq_runtime(cfs_rq)))

goto simple;

}

se = pick_next_entity(cfs_rq, curr);

cfs_rq = group_cfs_rq(se);

} while (cfs_rq);

属于同一个cgroup的task，都以类似的形式组织在se->my_q中。My_q也是一个cfs rq。用来表示属于一个cgroup的不同task。这些task也跟cpu

runqueue上的进程类似，在cgroup内部竞争CPU的使用权。

有点远了，再次回到Pick_next_entity函数，除了上面提到的第一句代码之外，其余的代码都用于处理特殊情况。即set_next_entity/set_skip_buddy/set_last_buddy，这部分后面再说。

上面说道了CFS在vruntime层面上是公平的，weight影响着task的实际运行时间。

实时进程的优先级从0到99，普通进程的优先级从100开始到140.为了计算简单，linux维护了一个数组用来表示不同优先级的进程对于的weight值。

static const int prio_to_weight[40] = {

/* -20 */ 88761, 71755, 56483, 46273, 36291,

/* -15 */ 29154, 23254, 18705, 14949, 11916,

/* -10 */ 9548, 7620, 6100, 4904, 3906,

/* -5 */ 3121, 2501, 1991, 1586, 1277,

/* 0 */ 1024, 820, 655, 526, 423,

/* 5 */ 335, 272, 215, 172, 137,

/* 10 */ 110, 87, 70, 56, 45,

/* 15 */ 36, 29, 23, 18, 15,

}

同时维护了另外一个数组，里面是2^32/x的预先计算的值。

static const u32 prio_to_wmult[40] = {

/* -20 */ 48388, 59856, 76040, 92818, 118348,

/* -15 */ 147320, 184698, 229616, 287308, 360437,

/* -10 */ 449829, 563644, 704093, 875809, 1099582,

/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,

/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,

/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,

/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,

/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,

};

其实就weight以及1/weight的预先算好的值，用于后续的计算。但是由于linux kernel对于浮点的支持不好，

所以这边的1/weight用2^32/weight来代替(__update_inv_weight函数)。

我们看一下update_curr中一段关于vruntime的计算。

static void update_curr(struct cfs_rq *cfs_rq)

{

struct sched_entity *curr = cfs_rq->curr;

u64 now = rq_clock_task(rq_of(cfs_rq));

u64 delta_exec;

if (unlikely(!curr))

return;

delta_exec = now - curr->exec_start;

if (unlikely((s64)delta_exec <= 0))

return;

curr->exec_start = now;

schedstat_set(curr->statistics.exec_max,

max(delta_exec, curr->statistics.exec_max));

curr->sum_exec_runtime += delta_exec;

schedstat_add(cfs_rq, exec_clock, delta_exec);

curr->vruntime += calc_delta_fair(delta_exec, curr);

update_min_vruntime(cfs_rq);

比较关心的是其中使用calc_delta_fair函数去根据当前task的weight值计算vruntime。

* delta /= w

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)

{

if (unlikely(se->load.weight != NICE_0_LOAD))

delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

return delta;

}

其最终结果为delta*1024/weight.

优先级高的进程其weight值大(参考prio_to_weight数组)，所以其vruntime相对就增加的比较慢，实际运行时间就相对长一些。

接着回到关键函数pick_next_task_fair。

if (prev != p) {

struct sched_entity *pse = &prev->se;

while (!(cfs_rq = is_same_group(se, pse))) {

int se_depth = se->depth;

int pse_depth = pse->depth;

if (se_depth <= pse_depth) {

put_prev_entity(cfs_rq_of(pse), pse);

pse = parent_entity(pse);

}

if (se_depth >= pse_depth) {

set_next_entity(cfs_rq_of(se), se);

se = parent_entity(se);

}

put_prev_entity(cfs_rq, pse);

set_next_entity(cfs_rq, se);

}

上面这段代码看起来有点费解，还是关于cgroup的。上面讲到了一个cgroup有一个对于的sched_entity结构，其中成员my_q维护这这个cgroup里面的task。其原理跟task竞争CPU的使用权一样，cgourp内部的task竞争代表这个cgroup运行的权利，就像CPU上每次只能运行一个task一样，cgroup内部每次只能选择一个task来运行。同样的原理，选择vruntime最小的进程。

下面继续select_task_rq_fair，用于给一个新创建的task或者新wakeup的task选择新的CPU。其中prev_cpu为进程p上一次运行的cpu。

static int

select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)

{

struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;

int cpu = smp_processor_id();

int new_cpu = prev_cpu;

int want_affine = 0;

int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)

want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();

for_each_domain(cpu, tmp) {

if (!(tmp->flags & SD_LOAD_BALANCE))

break;

* If both cpu and prev_cpu are part of this domain,

* cpu is a valid SD_WAKE_AFFINE target.

if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&

cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {

affine_sd = tmp;

break;

}

在这里面涉及到两个新的结构体struct

sched_domain,sched_group.这是调度器相关的两个结构体。下面是我花的一个典型的8核ARM B-L架构的sched_domain/sched_group的相关示意图。

由于这颗SOC有两个cluster，每个cluster有4个CPU。所以sched_domain分为两个level，第一个level有一个sched_domain的mask为0XFF，这个sched_domain一共有两个sched_group，分别代表两个cluster，两个sched_group的cou mask分别为0X0F跟0XF0.

第二个level有两个sched_domain，他们的cpu mask分别为0xF0,0x0F.每个sched_domain下面有4个sched_group。第一个sched_domain的4个sched_group的cpu mask分别为0x1,0x2,0x4,0x8.分别用来表示第一个cluster上的4个cpu。第二个sched_domain的4个sched_group的cpu mask分别为0x10，0x20，0x40，0x80，用来表示第二个cluster上的4个CPU。如上图所示。

这上面有一个sync =

wake_flags & WF_SYNC的判断，其中WF_SYNC用于表示是否需要sync wakeup。什么叫sync？就是进程A唤醒进程B，然后进程A就进入休眠状态。由于sync的特殊性，所以在满足一定的条件下，倾向于将进程B放在原来进程A运行的CPU上。

这些sched_domian有从属关系，显然cpu

mask为0x0F的domain从属于0xFF的domain。

他们之间用parent，child指针来表示这种关系。

除去wake affine相关的代码之外，其余核心代码如下。

} else while (sd) {

struct sched_group *group;

int weight;

if (!(sd->flags & sd_flag)) {

sd = sd->child;

continue;