Linux Scheduling Management (3): The CFS Scheduler in Detail

Latest recommended article published 2025-02-11 21:53:37

wenqiang.zhao

Tags: linux, operations, servers

Article link: https://blog.csdn.net/z20230508/article/details/143703918

I. History of the Scheduler

Scheduler	Version
O(n) scheduler	Linux 0.11 - 2.4
O(1) scheduler	Linux 2.6.0 - 2.6.22
CFS scheduler	Linux 2.6.23 onward

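The O(n) scheduler was used in kernel 2.4 and earlier. Its algorithm is simple and direct: the run queue is a single global list, and the scheduler searches it for the best next task. Because every pick has to traverse all tasks in the system (the global list), the time complexity is O(n), hence the name.
Kernel 2.6 introduced the O(1) scheduler, which gives each CPU its own run queue and so reduces lock contention. Each run queue consists of two priority arrays, the active array and the expired array. Each priority array holds 140 priority queues, one per priority level: the first 100 are for real-time processes and the last 40 are for normal processes.
[Figure: a per-CPU run queue with its active and expired priority arrays]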
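The benefit of this design is that picking the next task becomes simple and efficient: the scheduler only has to find the highest-priority queue in the active array that contains a runnable task. A bitmap records whether each queue has runnable tasks; if a queue is non-empty, its bit is set to 1. Picking the next task is therefore reduced to a bitmap lookup whose cost does not depend on the number of tasks.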
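The bitmap lookup can be sketched in user-space C. This is only an illustration under simplifying assumptions: every `toy_` name is hypothetical, and the real kernel uses its own bitmap helpers rather than the plain loops used here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_PRIO  140                 /* priority levels 0..139 */
#define TOY_WORD_BITS 64
#define TOY_NUM_WORDS ((TOY_NUM_PRIO + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

/* One bit per priority level; a set bit means "this queue is non-empty". */
struct toy_prio_bitmap {
    uint64_t words[TOY_NUM_WORDS];
};

/* Mark the queue for priority level prio as non-empty. */
void toy_bitmap_set(struct toy_prio_bitmap *bm, int prio)
{
    assert(prio >= 0 && prio < TOY_NUM_PRIO);
    bm->words[prio / TOY_WORD_BITS] |= (uint64_t)1 << (prio % TOY_WORD_BITS);
}

/* Mark the queue for priority level prio as empty. */
void toy_bitmap_clear(struct toy_prio_bitmap *bm, int prio)
{
    assert(prio >= 0 && prio < TOY_NUM_PRIO);
    bm->words[prio / TOY_WORD_BITS] &= ~((uint64_t)1 << (prio % TOY_WORD_BITS));
}

/*
 * Return the lowest-numbered (that is, highest-priority) non-empty level,
 * or -1 if every queue is empty. The work is bounded by the fixed bitmap
 * size, not by how many tasks are queued.
 */
int toy_bitmap_first(const struct toy_prio_bitmap *bm)
{
    for (int w = 0; w < TOY_NUM_WORDS; w++) {
        if (bm->words[w] == 0)
            continue;
        for (int b = 0; b < TOY_WORD_BITS; b++)
            if (bm->words[w] & ((uint64_t)1 << b))
                return w * TOY_WORD_BITS + b;
    }
    return -1;
}
```

With levels 50 and 120 marked non-empty, `toy_bitmap_first()` returns 50; after clearing level 50 it returns 120.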
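The O(1) algorithm has a problem, though: a high-priority multithreaded application gets more resources than a low-priority single-threaded one, so within a scheduling cycle the low-priority application may get no chance to respond until the high-priority one finishes. CFS solves this by treating all tasks even-handedly: it guarantees that every task gets a chance to run within one scheduling period, and how long each task runs depends on its weight. The rest of this article looks at how CFS dynamically adjusts each task's run time to achieve fair scheduling.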

II. Scheduler Classes in the Current Kernel

Linux defines 5 scheduler classes: stop, deadline, realtime, cfs and idle. They are chained together, from highest to lowest priority, through their next pointers.

const struct sched_class stop_sched_class = {
    .next            = &dl_sched_class,
...
};

const struct sched_class dl_sched_class = {
    .next            = &rt_sched_class,
...
};

const struct sched_class rt_sched_class = {
    .next            = &fair_sched_class,
...
};

const struct sched_class fair_sched_class = {
    .next            = &idle_sched_class,
...
};

const struct sched_class idle_sched_class = {
    /* .next is NULL */
...
};
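To illustrate what the `next` chain is for, here is a minimal sketch of walking the classes from highest to lowest priority and stopping at the first one that has something to run. `struct toy_sched_class`, its `nr_runnable` field and `toy_pick_class()` are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct sched_class. */
struct toy_sched_class {
    const char *name;
    const struct toy_sched_class *next;  /* next lower-priority class */
    int nr_runnable;                     /* toy field: runnable tasks in this class */
};

/*
 * Walk the chain starting at the highest-priority class and return the
 * first class that has a runnable task, or NULL if none has one.
 */
const struct toy_sched_class *toy_pick_class(const struct toy_sched_class *highest)
{
    assert(highest != NULL);
    for (const struct toy_sched_class *c = highest; c != NULL; c = c->next)
        if (c->nr_runnable > 0)
            return c;
    return NULL;
}
```

Because the walk always starts at the head of the chain, a runnable real-time task is found before any CFS task, which matches the priority order stop, deadline, realtime, cfs, idle.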

/*
 * Scheduling policies
 */
#define SCHED_NORMAL        0
#define SCHED_FIFO        1
#define SCHED_RR        2
#define SCHED_BATCH        3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE        5
#define SCHED_DEADLINE        6

The kernel also defines 6 scheduling policies and 3 kinds of scheduling entity. The table below shows how they relate:

Scheduler class	Scheduling policy	Scheduling entity	Priority
stop_sched_class	-	-	-
dl_sched_class	SCHED_DEADLINE	sched_dl_entity	below 0 (fixed at -1)
rt_sched_class	SCHED_FIFO, SCHED_RR	sched_rt_entity	[0, 100)
fair_sched_class	SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE	sched_entity	[100, 140)
idle_sched_class	- (per-CPU idle task only)	-	-

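When a process is created, its scheduler class is initialized according to its scheduling policy. The real-time scheduler class is covered in a later article.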

All three scheduling entities are embedded in struct task_struct, which makes it easy to place a process in any one of the scheduler classes.

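III. Key Algorithms in the CFS Scheduler

1. Weight Calculation

1.1 Computing the Priority

Before computing priorities, it helps to understand what each priority-related member of struct task_struct means: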

struct task_struct {
...
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
...
    unsigned int policy;
...
};

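prio: the process's dynamic priority. The system selects the scheduling class based on prio, and in some situations the priority has to be raised temporarily.
static_prio: the static priority, assigned when the process starts. The kernel does not store the nice value; it derives it from task_struct->static_prio with PRIO_TO_NICE. This value can be changed with nice/renice or setpriority().
normal_prio: the priority computed from static_prio and the scheduling policy. A newly created process inherits its parent's normal_prio. For a normal process, normal_prio equals static_prio; for a real-time process, normal_prio is recomputed from rt_priority.
rt_priority: the real-time priority, equivalent to the sched_param.sched_priority parameter set for the process.

rt_priority is 0 for normal processes and ranges from 1 to 99 for real-time processes.

For normal processes normal_prio equals static_prio; for real-time processes normal_prio = 99 - rt_priority.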

This raises a few questions:

First: how does the kernel compute normal_prio? The function that does it is normal_prio().

static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_dl_policy(p))
        prio = MAX_DL_PRIO-1;                    /* DEADLINE tasks: fixed value of -1 */
    else if (task_has_rt_policy(p))
        prio = MAX_RT_PRIO-1 - p->rt_priority;   /* real-time tasks: normal_prio = 100 - 1 - rt_priority */
    else
        prio = __normal_prio(p);                 /* normal tasks: normal_prio = static_prio */
    return prio;
}

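normal_prio() computes the priority from the scheduling policy, so when user space calls sched_setscheduler(), the priority it passes must match the policy it passes.

Normal processes: prio equals static_prio.

Real-time processes: prio and rt_priority satisfy prio + rt_priority = 99.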
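The three cases can be checked with a small user-space model. This is a sketch that mirrors the logic of normal_prio() quoted above, not the kernel function itself: the `toy_` names are hypothetical, and `TOY_MAX_DL_PRIO` is assumed to be 0, which is what the fixed value of -1 implies.

```c
#include <assert.h>

/* Policy numbers copied from the definitions quoted earlier in this article. */
enum toy_policy {
    TOY_SCHED_NORMAL   = 0,
    TOY_SCHED_FIFO     = 1,
    TOY_SCHED_RR       = 2,
    TOY_SCHED_BATCH    = 3,
    TOY_SCHED_IDLE     = 5,
    TOY_SCHED_DEADLINE = 6
};

#define TOY_MAX_DL_PRIO 0     /* assumption: deadline priorities lie below 0 */
#define TOY_MAX_RT_PRIO 100

/* User-space model of normal_prio(): the result depends on the policy. */
int toy_normal_prio(enum toy_policy policy, int static_prio, int rt_priority)
{
    if (policy == TOY_SCHED_DEADLINE)
        return TOY_MAX_DL_PRIO - 1;                /* always -1 */
    if (policy == TOY_SCHED_FIFO || policy == TOY_SCHED_RR) {
        assert(rt_priority >= 1 && rt_priority <= 99);
        return TOY_MAX_RT_PRIO - 1 - rt_priority;  /* 99 - rt_priority */
    }
    return static_prio;                            /* normal tasks */
}
```

For example, a SCHED_FIFO task with rt_priority 99 gets normal_prio 0 (the highest real-time priority), while a SCHED_NORMAL task with static_prio 120 keeps 120.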

Second: how does the kernel obtain prio? Through effective_prio(), which returns the prio of a given process (and refreshes normal_prio along the way).

static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p); /* recompute normal_prio */
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    /* prio > 99 means a normal task: prio = normal_prio = static_prio */
    if (!rt_prio(p->prio))
        return p->normal_prio;
    return p->prio;  /* RT or boosted to RT: return the dynamic priority unchanged */
}

Third: what is prio when a process is created? The child inherits the parent's normal_prio. To change the child's priority afterwards, call sched_setscheduler(), which updates prio.

Call path for process creation: kernel_clone-->copy_process

copy_process()
  sched_fork()
    __sched_fork()
    fair_sched_class->task_fork()->task_fork_fair()
      __set_task_cpu()
      update_curr()
      place_entity()
  wake_up_new_task()
    activate_task()
      enqueue_task
        fair_sched_class->enqueue_task-->enqueue_task_fair()

sched_fork() calls __sched_fork() to initialize struct task_struct:

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();    /* disable preemption and get the CPU number */

    __sched_fork(clone_flags, p);
    p->state = TASK_RUNNING;    /* not actually running yet: not on a run queue */
    p->prio = current->normal_prio;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {    /* if set, reset policy, static_prio, prio, weight, inv_weight, etc. */
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);

        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p);
        p->sched_reset_on_fork = 0;
    }

    if (dl_prio(p->prio)) {
        put_cpu();
        return -EAGAIN;
    } else if (rt_prio(p->prio)) {
        p->sched_class = &rt_sched_class;
    } else {
        p->sched_class = &fair_sched_class;    /* the scheduler class is chosen from task_struct->prio */
    }

    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);    /* class-specific task_fork hook; task_fork_fair() for CFS */

    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);    /* assign p to this CPU; if task_struct->stack->cpu differs from the current CPU, the per-CPU settings are moved to the new CPU */
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
    init_task_preempt_count(p);    /* initialize preempt_count */
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
    RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

    put_cpu();    /* re-enable preemption */
    return 0;
}

As the code shows, the new process's prio is:

 p->prio = current->normal_prio;

Changing the priority through sched_setscheduler():

sched_setscheduler->_sched_setscheduler->__setscheduler_prio

static void __setscheduler_prio(struct task_struct *p, int prio)
{
        if (dl_prio(prio))
                p->sched_class = &dl_sched_class;
        else if (rt_prio(prio))
                p->sched_class = &rt_sched_class;
        else
                p->sched_class = &fair_sched_class;

        p->prio = prio; /* store the new priority in prio */
}

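As the code shows, changing a process's priority also switches its scheduler class to match. In other words, the kernel selects the scheduler class, and therefore the scheduling algorithm, based on prio, which is why setting process priority correctly matters so much.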
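The prio-to-class mapping can be modelled in a few lines. This sketch assumes that dl_prio() means "prio below 0" and rt_prio() means "prio below 100", consistent with the ranges used in this article; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical labels for the three classes that __setscheduler_prio() can select. */
enum toy_class { TOY_CLASS_DL, TOY_CLASS_RT, TOY_CLASS_FAIR };

/* User-space model of the class selection: the class follows from prio alone. */
enum toy_class toy_class_for_prio(int prio)
{
    assert(prio >= -1 && prio < 140);
    if (prio < 0)       /* assumed dl_prio(): below MAX_DL_PRIO (0) */
        return TOY_CLASS_DL;
    if (prio < 100)     /* assumed rt_prio(): below MAX_RT_PRIO (100) */
        return TOY_CLASS_RT;
    return TOY_CLASS_FAIR;
}
```

A prio of 99 still selects the real-time class, while 100 (nice -20) is the first value handled by CFS.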

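Fourth: if the parent is a normal process, how do the parent's and the child's priorities relate when a child is created? And what if the parent is a real-time process?

The child gets the same scheduling policy and priority as its parent. The dynamic priority prio may differ, however: the child's prio is taken from the parent's normal_prio, so a temporary boost of the parent's prio is not inherited.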
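A tiny model makes the difference visible. It reproduces only the single assignment `p->prio = current->normal_prio` from sched_fork(); `struct toy_task` and `toy_fork()` are hypothetical, and the boosted value 90 is an arbitrary example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the priority fields in struct task_struct. */
struct toy_task {
    int prio;         /* dynamic priority, may be boosted temporarily */
    int static_prio;
    int normal_prio;
};

/*
 * Model of the priority handling in sched_fork(): the child starts as a
 * copy of the parent, but its dynamic priority restarts from the parent's
 * normal_prio rather than from the parent's current prio.
 */
struct toy_task toy_fork(const struct toy_task *parent)
{
    assert(parent != NULL);
    struct toy_task child = *parent;
    child.prio = parent->normal_prio;
    return child;
}
```

If a parent with normal_prio 120 is temporarily running at prio 90, the child still starts at prio 120.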

Summary:

Normal processes: static_prio = prio = normal_prio; rt_priority = 0.

Real-time processes: prio = normal_prio = 99 - rt_priority; rt_priority = sched_param.sched_priority, in the range [1, 99]; static_prio keeps its default value.

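1.1.1 The Relationship Between static_prio and nice

The kernel represents priority with the values 0~139; a lower value means a higher priority. 0~99 are used by real-time processes and 100~139 by normal processes (SCHED_NORMAL/SCHED_BATCH).

The nice value passed from user space maps onto the normal-process range 100~139: nice -20 maps to 100 and nice 19 maps to 139.

For converting between nice and prio, the kernel provides two macros, NICE_TO_PRIO and PRIO_TO_NICE.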

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO        MAX_USER_RT_PRIO

#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2)

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)    ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)    ((prio) - DEFAULT_PRIO)

/*
 * 'User priority' is the nice value converted to something we
 * can work with better when scaling various scheduler parameters,
 * it's a [ 0 ... 39 ] range.
 */
#define USER_PRIO(p)        ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p)    USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO        (USER_PRIO(MAX_PRIO))
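As a quick check of the arithmetic in these macros, the following user-space sketch re-implements the two conversions as functions. It assumes NICE_WIDTH is 40 (nice values -20 to 19), which is not shown in the excerpt above; the `toy_` names are hypothetical.

```c
#include <assert.h>

#define TOY_NICE_WIDTH   40    /* assumption: nice spans -20..19 */
#define TOY_MAX_RT_PRIO  100
#define TOY_DEFAULT_PRIO (TOY_MAX_RT_PRIO + TOY_NICE_WIDTH / 2)  /* 120 */

/* Same arithmetic as NICE_TO_PRIO(): nice -20..19 maps to prio 100..139. */
int toy_nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return nice + TOY_DEFAULT_PRIO;
}

/* Same arithmetic as PRIO_TO_NICE(): prio 100..139 maps back to nice -20..19. */
int toy_prio_to_nice(int prio)
{
    assert(prio >= 100 && prio <= 139);
    return prio - TOY_DEFAULT_PRIO;
}

/* Same arithmetic as USER_PRIO(): a 0..39 value for a static priority. */
int toy_user_prio(int static_prio)
{
    assert(static_prio >= 100 && static_prio <= 139);
    return static_prio - TOY_MAX_RT_PRIO;
}
```

A default process (nice 0) therefore has static_prio 120 and a user priority of 20, the midpoint of the 0 to 39 range.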

Reference article: Linux之进程优先级多方面分析 (a multi-angle analysis of Linux process priority), CSDN blog

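1.2 Computing the Weight

The kernel uses the struct load_weight data structure to record the weight of a scheduling entity.

The weight is computed from the priority, and a process's weight is read through task_struct->se.load.

Weights apply only to normal processes, whose nice values range from -20 to 19.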

struct task_struct {
...
    struct sched_entity se;
...
};

struct sched_entity {
    struct load_weight    load;        /* for load-balancing */
...
};

struct load_weight {
    unsigned long weight;    /* weight of the scheduling entity */
    u32 inv_weight;          /* inverse weight: a precomputed intermediate value derived from weight */
};

set_load_weight() sets the process's weight, looking it up in prio_to_weight[] and prio_to_wmult[] by task_struct->static_prio.

static void set_load_weight(struct task_struct *p)
{
    int prio = p->static_prio - MAX_RT_PRIO;    /* weight depends on static_prio; subtract 100, not 120, to get the index into the arrays below */
    struct load_weight *load = &p->se.load;

    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (p->policy == SCHED_IDLE) {
        load->weight = scale_load(WEIGHT_IDLEPRIO);    /* SCHED_IDLE tasks get a fixed weight: 1/5 of the lowest normal weight */
        load->inv_weight = WMULT_IDLEPRIO;             /* 5 times the inverse weight of the lowest normal priority */
        return;
    }

    load->weight = scale_load(prio_to_weight[prio]);
    load->inv_weight = prio_to_wmult[prio];
}

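prio_to_weight[] takes nice 0 as the baseline weight of 1024 and holds precomputed weights for nice -20~19, so set_load_weight() can look up a process's weight directly from its priority.

prio_to_wmult[] holds precomputed results that make the vruntime calculation cheaper.

inv_weight = 2^32 / weight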

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

static const u32 prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

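First, static_prio determines which weight is selected.

Second, the weights are stored in a precomputed table, so the kernel gets them with a fast table lookup instead of calculating them at run time.

Third, the weight is an input to the CFS run-time calculation: the larger the weight, the more time a process is allotted; the smaller the weight, the less.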
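The relation inv_weight = 2^32 / weight, and the reason for precomputing it, can be sketched as follows. The sample entries checked in the note after the code agree with rounding to the nearest integer, which is what this sketch assumes. The `toy_` names are hypothetical, and the overflow guard is a simplification of this sketch; the kernel's own `__calc_delta()` handles large values differently.

```c
#include <assert.h>
#include <stdint.h>

/* 2^32 / weight, rounded to nearest (assumed rounding mode). */
uint32_t toy_inv_weight(uint32_t weight)
{
    assert(weight > 1);  /* weight 1 would need 2^32, which does not fit in 32 bits */
    return (uint32_t)((((uint64_t)1 << 32) + weight / 2) / weight);
}

/*
 * Compute delta_ns * NICE_0_LOAD / weight without a division at scheduling
 * time: multiply by the precomputed inverse and shift right by 32.
 */
uint64_t toy_scale_delta(uint64_t delta_ns, uint32_t weight)
{
    const uint64_t nice_0_load = 1024;

    /* Toy restriction so the 64-bit product below cannot overflow. */
    assert(delta_ns < ((uint64_t)1 << 23));
    return (delta_ns * nice_0_load * toy_inv_weight(weight)) >> 32;
}
```

For weights 1024, 820 and 15 the first function yields 4194304, 5237765 and 286331153, the same values as the nice 0, nice 1 and nice 19 entries of prio_to_wmult[]. A 6 ms delta at weight 1024 is returned unchanged by the second function, since a nice 0 task's virtual time equals its real time.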

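1.3 Computing runtime and vruntime

The CFS scheduler no longer has the concept of a fixed time slice. Instead it tracks each task's actual run time, converts it to a virtual run time, and orders tasks by that value to decide which one runs next. So how are the run time and the virtual run time computed? Here is the flow: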

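(1) sysctl_sched_latency defaults to 6 ms in the Linux kernel and can be set from user space. sched_period is the interval within which every runnable task is guaranteed to run at least once;

(2) when there are more than 8 runnable tasks, sched_period is instead the number of tasks multiplied by the minimum scheduling granularity, which defaults to 0.75 ms;

(3) each task's run time is sched_period multiplied by the task's share of the total weight of the CFS run queue;

(4) virtual run time = actual run time * NICE_0_LOAD / the task's weight;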

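From these formulas, a task with a higher weight gets a larger runtime. In the vruntime, however, the task's weight appears once in the numerator (through the runtime) and once in the denominator, so it cancels out: over one scheduling period every task's vruntime advances by the same amount, sched_period * NICE_0_LOAD / total weight of the run queue.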
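The four steps can be worked through numerically. The sketch below hard-codes the default values quoted in the list (6 ms, 0.75 ms, 8 tasks) and uses plain integer division; all `toy_` names are hypothetical and the code is not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD        1024ULL
#define TOY_SCHED_LATENCY_NS   6000000ULL  /* 6 ms */
#define TOY_MIN_GRANULARITY_NS 750000ULL   /* 0.75 ms */
#define TOY_SCHED_NR_LATENCY   8U

/* Steps (1) and (2): length of one scheduling period. */
uint64_t toy_sched_period(unsigned int nr_running)
{
    if (nr_running > TOY_SCHED_NR_LATENCY)
        return nr_running * TOY_MIN_GRANULARITY_NS;
    return TOY_SCHED_LATENCY_NS;
}

/* Step (3): a task's share of the period, proportional to its weight. */
uint64_t toy_sched_slice(uint64_t period_ns, uint64_t weight, uint64_t total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return period_ns * weight / total_weight;
}

/* Step (4): virtual run time accrued for a given real run time. */
uint64_t toy_vruntime_delta(uint64_t runtime_ns, uint64_t weight)
{
    assert(weight > 0);
    return runtime_ns * TOY_NICE_0_LOAD / weight;
}
```

With two runnable tasks at nice 0 (weight 1024) and nice 5 (weight 335), the period is 6 ms, the slices come out at roughly 4.52 ms and 1.48 ms, and both tasks advance their vruntime by roughly 4.52 ms, differing only by integer rounding.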

When enqueue_entity(), the enqueue function of the CFS scheduler class, is called, the virtual run time described above is computed:

enqueue_entity->place_entity->sched_vslice->calc_delta_fair

static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

The helpers used by sched_vslice(), namely calc_delta_fair(), __sched_period() and sched_slice():

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        /* If the task's weight equals NICE_0_LOAD, the virtual time is simply delta and __calc_delta() is not needed. */
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
}

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
        if (unlikely(nr_running > sched_nr_latency)) /* more than 8 runnable tasks */
                return nr_running * sysctl_sched_min_granularity; /* task count * 0.75 ms */
        else
                return sysctl_sched_latency; /* 6 ms */
}

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned int nr_running = cfs_rq->nr_running;
        u64 slice;

        if (sched_feat(ALT_PERIOD))
                nr_running = rq_of(cfs_rq)->cfs.h_nr_running;

        slice = __sched_period(nr_running + !se->on_rq); /* scheduling period of the CFS run queue */

        for_each_sched_entity(se) {
                struct load_weight *load;
                stru