Linux scheduler notes

  • Time slices
a: How is the time slice computed?
b: How does the kernel detect that the time slice has expired?
c: What happens once the time slice expires?
d: The CFS rbtree is ordered by vruntime (the virtual time a task has executed so far). Consider the case where Pa has been running for a long time while Pb has only just started: how does CFS guarantee that Pa still gets to run within the period, so that its latency does not exceed the CFS target latency (sched_latency_ns)? Or perhaps my understanding of CFS is wrong and this is not how it works.

  • How is the time slice computed?
     NORMAL and RT compute it differently: NORMAL no longer uses the time-slice concept at all and has nothing to do with jiffies or the tick, while RT still uses the tick/jiffies notion.
  • SCHED_FIFO
     has no notion of a time slice;
  • SCHED_RR
     has a time slice, and it is tied to tick/jiffies.
static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
{
        /*  
         * Time slice is 0 for SCHED_FIFO tasks
         */
        if (task->policy == SCHED_RR)
                return sched_rr_timeslice;
        else
                return 0;
}
int sched_rr_timeslice = RR_TIMESLICE;

/*
* default timeslice is 100 msecs (used only for SCHED_RR tasks).
* Timeslices get refilled after they expire.
*/    
#define RR_TIMESLICE            (100 * HZ / 1000)

The default RR time slice is RR_TIMESLICE; it can be changed via /proc/sys/kernel/sched_rr_timeslice_ms. Since its value depends on HZ, it is tied to tick/jiffies.
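As a userspace cross-check, the sched_rr_get_interval(2) syscall reports the slice that the scheduling class's get_rr_interval hook (get_rr_interval_rt above) computes; a minimal sketch:

/* Sketch: query the RR time slice from userspace.
 * For a SCHED_RR task this prints the RR timeslice (default 100 ms);
 * for SCHED_FIFO it prints 0, matching get_rr_interval_rt().
 * Build: gcc -o rrslice rrslice.c
 */
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec ts;

        /* pid 0 means the calling thread */
        if (sched_rr_get_interval(0, &ts) != 0) {
                perror("sched_rr_get_interval");
                return 1;
        }
        printf("slice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
        return 0;
}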
  • SCHED_NORMAL/SCHED_BATCH
Ordinary (CFS) tasks no longer use the time-slice concept. Two parameters matter:
1: sysctl_sched_latency
/*
* Targeted preemption latency for CPU-bound tasks:
* (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length
* and have no persistent notion like in traditional, time-slice
* based scheduling concepts.
*
* (to see the precise effective timeslice length of your workload,
*  run vmstat and monitor the context-switches (cs) field)
*/
unsigned int sysctl_sched_latency = 6000000ULL;
unsigned int normalized_sysctl_sched_latency = 6000000ULL;
This is the scheduling period of the CFS runqueue: every task on the CFS runqueue must get a chance to run within it.
It can be tuned via /proc/sys/kernel/sched_latency_ns.

2: sysctl_sched_min_granularity
/*
* Minimal preemption granularity for CPU-bound tasks:
* (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
unsigned int sysctl_sched_min_granularity = 750000ULL;
unsigned int normalized_sysctl_sched_min_granularity = 750000ULL;

The minimum slice each task on the CFS runqueue gets (within one sysctl_sched_latency period). Its purpose is to keep slices from becoming so short that frequent context switches hurt performance.
It can be tuned via /proc/sys/kernel/sched_min_granularity_ns.
       >The number of tasks on the CFS runqueue is not fixed, but once the two parameters above are set, doesn't that fix the maximum number of tasks per period? This question is addressed below.

3: sched_nr_latency
/*
* is kept at sysctl_sched_latency / sysctl_sched_min_granularity
*/
static unsigned int sched_nr_latency = 8;
This parameter is simply the quotient of the two values above, kept up to date by sched_proc_update_handler():
>>int sched_proc_update_handler()
     sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,  sysctl_sched_min_granularity);

4: If the number of tasks on the CFS runqueue exceeds sched_nr_latency, how are sysctl_sched_min_granularity and sysctl_sched_latency reconciled?
In that case sysctl_sched_min_granularity must be guaranteed and sysctl_sched_latency is given up: only a guaranteed minimum run time prevents overly frequent switching.
/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small.
*
* p = (nr <= nl) ? l : l*nr/nl
*/
static u64 __sched_period(unsigned long nr_running)
{
        u64 period = sysctl_sched_latency;
        unsigned long nr_latency = sched_nr_latency;

        if (unlikely(nr_running > nr_latency)) {
                period = sysctl_sched_min_granularity;
                period *= nr_running;
        }

        return period;
}
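To make the formula p = (nr <= nl) ? l : l*nr/nl concrete, here is a userspace replica of __sched_period() using the single-CPU defaults (before the ilog(ncpus) scaling): sysctl_sched_latency = 6 ms, sysctl_sched_min_granularity = 0.75 ms, sched_nr_latency = 8.

/* Userspace sketch of __sched_period() with the single-CPU defaults. */
#include <stdio.h>

static const unsigned long long sysctl_sched_latency = 6000000ULL;        /* 6 ms    */
static const unsigned long long sysctl_sched_min_granularity = 750000ULL; /* 0.75 ms */
static const unsigned long sched_nr_latency = 8;   /* latency / min_granularity */

static unsigned long long sched_period(unsigned long nr_running)
{
        /* p = (nr <= nl) ? l : l*nr/nl, exactly as in the kernel */
        if (nr_running > sched_nr_latency)
                return sysctl_sched_min_granularity * nr_running;
        return sysctl_sched_latency;
}

int main(void)
{
        unsigned long nr;

        /* Up to 8 tasks the period stays 6 ms and each slice is period/nr;
         * beyond that the period stretches so every task still gets the
         * 0.75 ms minimum granularity. */
        for (nr = 1; nr <= 12; nr++)
                printf("nr_running=%2lu period=%llu ns\n", nr, sched_period(nr));
        return 0;
}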


  • How is time-slice expiry detected?
  • RR/FIFO
For RT tasks, .task_tick              = task_tick_rt. (Careful study of the callbacks in struct sched_class basically explains each scheduling policy.)
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
        struct sched_rt_entity *rt_se = &p->rt;

        update_curr_rt(rq);

        watchdog(rq, p);

        /*
         * RR tasks need a special form of timeslice management.
         *   FIFO tasks have no timeslices.
         */
        if (p->policy != SCHED_RR)
                return;

        if (--p->rt.time_slice)
                return;

        p->rt.time_slice = sched_rr_timeslice;

        /*
         *   Requeue to the end of queue  if we (and all of our ancestors) are not
         * the only element on the queue
         */
        for_each_sched_rt_entity(rt_se) {
                if (rt_se->run_list.prev != rt_se->run_list.next) {
                        requeue_task_rt(rq, p, 0);
                        set_tsk_need_resched(p);
                        return;
                }
        }
}
1: FIFO has no time slice, so nothing is done.
2: When the slice expires, the task is requeued to the end of its queue.
3: Question: the slice is judged purely per tick, so an RR task could start running just before a tick fires and lose part of its real running time. Can this happen? If so, why is it allowed? (A guess: while an RR task is runnable, nothing lower-priority gets scheduled, so the probability is low.)
It should work like this: the situation occurs at most once, when the RR task is woken between ticks and scheduled right away; every following tick lands fully on the RR task unless a higher-priority task shows up. If so, tick-based accounting is still quite accurate.
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static void update_curr_rt(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        struct sched_rt_entity *rt_se = &curr->rt;
        struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
        u64 delta_exec;

        if (curr->sched_class != &rt_sched_class)
                return;

        delta_exec = rq_clock_task(rq) - curr->se.exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;

        schedstat_set(curr->se.statistics.exec_max,
                      max(curr->se.statistics.exec_max, delta_exec));

        curr->se.sum_exec_runtime += delta_exec;
        account_group_exec_runtime(curr, delta_exec);

        curr->se.exec_start = rq_clock_task(rq);
        cpuacct_charge(curr, delta_exec);

        sched_rt_avg_update(rq, delta_exec);

        if (!rt_bandwidth_enabled())
                return;

        for_each_sched_rt_entity(rt_se) {
                rt_rq = rt_rq_of_se(rt_se);

                if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
                        raw_spin_lock(&rt_rq->rt_runtime_lock);
                        rt_rq->rt_time += delta_exec;
                        if (sched_rt_runtime_exceeded(rt_rq))
                                resched_task(curr);
                        raw_spin_unlock(&rt_rq->rt_runtime_lock);
                }
        }
}

4: The RR task's runtime statistics are precise and do not use tick/jiffies; the precision is the same as CFS. So why is the time slice judged with tick/jiffies instead of sum_exec_runtime?
  • NORMAL/BATCH
For NORMAL tasks, expiry is also checked from task_tick: .task_tick              = task_tick_fair,
/*
* scheduler tick hitting a task of our scheduling class:
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (numabalancing_enabled)
                task_tick_numa(rq, curr);

        update_rq_runnable_avg(rq, 1);
}

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

        /*
         * Ensure that runnable average is periodically updated.
         */
        update_entity_load_avg(curr, 1);
        update_cfs_rq_blocked_load(cfs_rq, 1);
        update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
        /*
         * queued ticks are scheduled to match the slice, so don't bother
         * validating it and just reschedule.
         */
        if (queued) {
                resched_task(rq_of(cfs_rq)->curr);
                return;
        }
        /*
         * don't let the period tick interfere with the hrtick preemption
         */
        if (!sched_feat(DOUBLE_TICK) &&
                        hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
                return;
#endif

        if (cfs_rq->nr_running > 1)
                check_preempt_tick(cfs_rq, curr);
}
1: task_tick_fair makes use of the queued parameter that the rt version ignores: under
#ifdef CONFIG_SCHED_HRTICK, if queued == 1 the current task is rescheduled immediately.
2: How should the comment about queued be read: "queued ticks are scheduled to match the slice, so don't bother
validating it and just reschedule"?

update_curr->__update_curr()
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static inline void  __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
        unsigned long delta_exec_weighted;

        schedstat_set(curr->statistics.exec_max,
                      max((u64)delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;
        schedstat_add(cfs_rq, exec_clock, delta_exec);
        delta_exec_weighted = calc_delta_fair(delta_exec, curr);

        curr->vruntime += delta_exec_weighted;
        update_min_vruntime(cfs_rq);
}

3: update_curr(cfs_rq) accounts the current task's runtime. Inside __update_curr, carefully distinguish two values: delta_exec and delta_exec_weighted.
Understanding them means understanding vruntime, the central concept of CFS. delta_exec is the real runtime in ns; delta_exec_weighted is that runtime converted to vruntime by a weighting formula. A CFS task thus carries two different notions of time, real runtime and vruntime, converted via
calc_delta_fair(). That conversion deserves its own discussion; it is really the principle of CFS.
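A rough arithmetic sketch of the conversion (not the kernel's fixed-point inverse-weight implementation): calc_delta_fair scales real time by NICE_0_LOAD / se->load.weight, so a nice 0 task (weight 1024) accrues vruntime at wall speed while heavier tasks accrue it more slowly. The weights below are the kernel's prio_to_weight[] values for nice -5/0/+5.

/* Sketch of the runtime -> vruntime conversion behind calc_delta_fair():
 * vruntime advances by delta_exec * NICE_0_LOAD / se->load.weight.
 * (The kernel uses precomputed inverse weights and shifts instead of a
 * division; this shows only the arithmetic idea.)
 */
#include <stdio.h>

#define NICE_0_LOAD 1024ULL

int main(void)
{
        /* prio_to_weight[] entries for nice -5, 0, +5 */
        struct { int nice; unsigned long long weight; } tasks[] = {
                { -5, 3121 }, { 0, 1024 }, { 5, 335 },
        };
        unsigned long long delta_exec = 1000000ULL; /* 1 ms of real runtime */
        int i;

        for (i = 0; i < 3; i++) {
                unsigned long long delta_weighted =
                        delta_exec * NICE_0_LOAD / tasks[i].weight;
                /* a heavier (lower nice) task's vruntime grows slower, so the
                 * leftmost-vruntime rule picks it again sooner */
                printf("nice %+d: 1 ms runtime -> %llu ns vruntime\n",
                       tasks[i].nice, delta_weighted);
        }
        return 0;
}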

/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_task(rq_of(cfs_rq)->curr);
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

        /*
         * Ensure that a task that missed wakeup preemption by a
         * narrow margin doesn't have to wait for a full slice.
         * This also mitigates buddy induced latencies under load.
         */
        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0)
                return;

        if (delta > ideal_runtime)
                resched_task(rq_of(cfs_rq)->curr);
}

4: If the CFS runqueue holds more than one task, then after updating the current task's runtime we check whether its slice expired and schedule accordingly.
That only happens when queued != 1; with queued == 1 no check is needed. "Schedule" here only means setting the resched flag, not switching directly: resched_task and __schedule are completely different functions.
5: sched_slice computes the slice the task is entitled to, in real time (ns). That function is covered separately.
6: if (delta_exec > ideal_runtime): if this run exceeded the allotted time, reschedule.
7: if (delta_exec < sysctl_sched_min_granularity): as discussed earlier, CFS guarantees a minimum run time so that frequent context switches do not eat too much cpu. The comment on this check is still unclear to me, though.
8: if (delta > ideal_runtime): 1) why check vruntime at all here? 2) why compare delta with ideal_runtime, two different kinds of time (vruntime vs runtime)? Referring to the comment in 7, I think that comment is narrower than it sounds; it describes the following case: a woken task joins the runqueue and can be scheduled at this point instead of waiting for the current task's slice to be fully used. If so, why compare runtime against vruntime?
  • When is task_tick called?
Two functions in kernel/sched/core.c call task_tick:
1:static enum hrtimer_restart hrtick(struct hrtimer *timer)
    >>rq->curr->sched_class->task_tick(rq, rq->curr,   1);
2:void scheduler_tick(void)
    >>curr->sched_class->task_tick(rq, curr,   0);
Their most important difference is the third argument, queued, which qualitatively changes behaviour for NORMAL tasks.
  • scheduler_tick(void)
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
void scheduler_tick(void)
>>curr->sched_class->task_tick(rq, curr, 0);
1: queued=0 means only update the task's runtime and then judge, rather than calling resched_task(rq_of(cfs_rq)->curr) directly.
2: The function is called from the timer interrupt code at HZ frequency, once per jiffy.
3: It only updates the current task's runtime and decides whether to resched; it does not take part in scheduling directly.
4: On interrupt return, TIF_NEED_RESCHED is checked; if set, __schedule() is called.
  • hrtick()
Note: this is hrtick, not hrtimer: a tick, but a high-resolution one.
/*
* High-resolution timer tick.
* Runs from hardirq context with interrupts disabled.
*/
static enum hrtimer_restart hrtick(struct hrtimer *timer)
{
        struct rq *rq = container_of(timer, struct rq, hrtick_timer);

        WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());

        raw_spin_lock(&rq->lock);
        update_rq_clock(rq);
        rq->curr->sched_class->task_tick(rq, rq->curr, 1);
        raw_spin_unlock(&rq->lock);

        return HRTIMER_NORESTART;
}
1: This is an hrtimer; if the system is not configured with CONFIG_SCHED_HRTICK, this function does not exist. Question: without the hrtimer, how is everything discussed below implemented?
2: The third argument queued=1 tells fair scheduling that the task's slice is used up and resched_task(curr) is needed; RR does not use it. Question: could the current sched_class be rt_class here? Then an RR task's slice could be decremented outside the HZ tick. Can that happen? Is it a bug?
Since RR runs above NORMAL and hrtick serves CFS, even if the situation exists its impact should be small: once an RR task runs, the hrtick interrupt case should not arise.
3: The return value is HRTIMER_NORESTART. So when does the hrtimer become active again?
4: When is this hrtimer enabled/re-armed?

  • Initializing hrtick
static void init_rq_hrtick(struct rq *rq)
{
#ifdef CONFIG_SMP
        rq->hrtick_csd_pending = 0;

        rq->hrtick_csd.flags = 0;
        rq->hrtick_csd.func = __hrtick_start;
        rq->hrtick_csd.info = rq;
#endif

        hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rq->hrtick_timer.function = hrtick;
}
1: hrtick belongs to the rq, not to CFS or RR; rq is the per-CPU run queue.


  • restart hrtick
/*
* Called to set the hrtick timer state.
*
* called with rq->lock held and irqs disabled
*/   
void hrtick_start(struct rq *rq, u64 delay)
1: The SMP and UP implementations differ; ignore the difference for now, both re-arm this hrtick.
2: What does the delay parameter mean?
  • hrtick_start_fair()
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
        struct sched_entity *se = &p->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        WARN_ON(task_rq(p) != rq);

        if (cfs_rq->nr_running > 1) {
                u64 slice = sched_slice(cfs_rq, se);
                u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
                s64 delta = slice - ran;

                if (delta < 0) {
                        if (rq->curr == p)
                                resched_task(p);
                        return;
                }   

                /*  
                 * Don't schedule slices shorter than 10000ns, that just
                 * doesn't make sense. Rely on vruntime for fairness.
                 */
                if (rq->curr != p)
                        delta = max_t(s64, 10000LL, delta);

                hrtick_start(rq, delta);
        }   
}

1: Only this function calls hrtick_start(). RR never re-arms the hrtick; in other words, hrtick only matters for CFS tasks.
2: slice should be the ideal runtime, the slice theoretically allotted to the task; ran is the time already consumed within this slice;
3: if (slice - ran) < 0: the slice is used up and a resched is needed; otherwise the timer is set for the remainder: delta = slice - ran;
4: So when the hrtick fires with the third argument queued=1, task_tick_fair->entity_tick reschedules directly, meaning this task's CFS slice is used up.
5: pick_next_task_fair()/hrtick_update() call this function.

  • pick_next_task_fair
.pick_next_task         = pick_next_task_fair,

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
        struct task_struct *p;
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;

        if (!cfs_rq->nr_running)
                return NULL;

        do {
                se = pick_next_entity(cfs_rq);
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);
        if (hrtick_enabled(rq))
                hrtick_start_fair(rq, p);

        return p;
}

1: This function picks the next task from the cfs runqueue to get the CPU. Callers of pick_next_task():
    a: __schedule(): at the real context switch
   b: migrate_data()

/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
        struct sched_entity *se = __pick_first_entity(cfs_rq);
        struct sched_entity *left = se;

        /*
         * Avoid running the skip buddy, if running something else can
         * be done without getting too unfair.
         */
        if (cfs_rq->skip == se) {
                struct sched_entity *second = __pick_next_entity(se);
                if (second && wakeup_preempt_entity(second, left) < 1)
                        se = second;
        }

        /*
         * Prefer last buddy, try to return the CPU to a preempted task.
         */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        /*
         * Someone really wants this to run. If it's not unfair, run it.
         */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);

        return se;
}
2: pick_next_entity() is fairly involved; for now, treat it as picking the most suitable task from the cfs runqueue.


 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
        struct rb_node *left = cfs_rq->rb_leftmost;

        if (!left)
                return NULL;

        return rb_entry(left, struct sched_entity, run_node);
}

3: Pick the leftmost node of the rbtree, i.e. the entity with the smallest key; the key is vruntime. For efficiency, rb_leftmost caches that node so the tree need not be walked down to the leftmost node every time.

static void set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
          /* 'current' is not kept within the tree. */
        if (se->on_rq) {
                /*
                 * Any task has to be enqueued before it get to execute on
                 * a CPU. So account for the time it spent waiting on the
                 * runqueue.
                 */
                update_stats_wait_end(cfs_rq, se);
                __dequeue_entity(cfs_rq, se);
        }

        update_stats_curr_start(cfs_rq, se);
        cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
        /*
         * Track our maximum slice length, if the CPU's load is at
         * least twice that of our own weight (i.e. dont track it
         * when there are only lesser-weight tasks around):
         */
        if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
                se->statistics.slice_max = max(se->statistics.slice_max,
                        se->sum_exec_runtime - se->prev_sum_exec_runtime);
        }
#endif
        se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
4: set_next_entity(): sets up the chosen CFS task. Note: the chosen task is removed from the rbtree. This differs from RR: the currently running RR task stays on its runqueue list, while the currently running CFS task is removed from the rbtree.
5: se->prev_sum_exec_runtime = se->sum_exec_runtime: at the next accounting point this reveals the absolute time the task ran this time.
6: update_stats_wait_end: per the comment, it accounts the time from entering the runqueue to reaching the CPU. CPU wait time is an important latency metric.
     put_prev_task-->put_prev_task_fair-->put_prev_entity->update_stats_wait_start()


  • hrtick_update()
/*
* called from enqueue/dequeue and updates the hrtick when the
* current task is from our class and nr_running is low enough
* to matter.
*/
static void hrtick_update(struct rq *rq)
{
        struct task_struct *curr = rq->curr;

        if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
                return;

        if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
                hrtick_start_fair(rq, curr);
}

1: dequeue_task_fair()/enqueue_task_fair() call this function, i.e. when a cfs task is removed from or added to the cfs runqueue.
2: Why the check nr_running < sched_nr_latency?

》》.enqueue_task           = enqueue_task_fair,
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
       
        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags);

                /*
                 * end evaluation on encountering a throttled cfs_rq
                 *
                 * note: in the case of encountering a throttled cfs_rq we will
                 * post the final h_nr_running increment below.
                */
                if (cfs_rq_throttled(cfs_rq))
                        break;
                cfs_rq->h_nr_running++;
       
                flags = ENQUEUE_WAKEUP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                cfs_rq->h_nr_running++;

                if (cfs_rq_throttled(cfs_rq))
                        break;

                update_cfs_shares(cfs_rq);
                update_entity_load_avg(se, 1);
        }

        if (!se) {
                update_rq_runnable_avg(rq, rq->nr_running);
                inc_nr_running(rq);
        }
        hrtick_update(rq);
}
1: Per the comment: adding a task changes nr_running, which changes the allotted slices, so they must be recomputed; hence the call to hrtick_update().
this function--> enqueue_entity--> account_entity_enqueue():cfs_rq->nr_running++;
hrtick_update->hrtick_start_fair-->sched_slice-->__sched_period(): uses cfs_rq->nr_running.
Adding a task changes the slice because __sched_period() depends on nr_running; after recomputation the current task's slice should shrink, so the hrtick must be updated.

Question: does executing this function always increment nr_running? From the code, if se->on_rq the task is still in the rbtree and enqueue_entity() does not run, yet cfs_rq->nr_running is only incremented there; if that function is skipped, why rerun hrtick_update()?
Or put differently: when can if (se->on_rq) be true? From the call sites it looks like it cannot happen.
Or: does se->on_rq == 1 imply the task is in the rbtree? It should.

>>From the call paths, plain tasks never hit this. It may be related to cgroups; to be confirmed. Almost certainly not a bug but a scenario I do not yet understand.

Analysis starting from __schedule():
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
                if (unlikely(signal_pending_state(prev->state, prev))) {
                        prev->state = TASK_RUNNING;
                } else {
                        deactivate_task(rq, prev, DEQUEUE_SLEEP);
                        prev->on_rq = 0;

                 }
                
 }
.........  

 put_prev_task(rq, prev);
 next = pick_next_task(rq);

Classifying tasks:
   >>deactivate_task is not executed
  >>deactivate_task is executed: the task is !TASK_RUNNING and called schedule() voluntarily.

a: When is deactivate_task() executed? State is !TASK_RUNNING and the schedule is voluntary, not forced. (If the task is preempted between set_task_state(TASK_IN/UN) and its own call to schedule(), i.e. preempt_schedule() runs, it must not be removed from the runqueue: the task must get the CPU again to call schedule() itself. The if condition holds only in the voluntary case.)
b: deactivate_task:
     A: Is the task removed from the rbtree?
     dequeue_entity()
     >>if (se != cfs_rq->curr)
                __dequeue_entity(cfs_rq, se);
     se->on_rq = 0;
     account_entity_dequeue(cfs_rq, se);
      What does this mean? Only __dequeue_entity really removes the entity from the rbtree, yet in all cases on_rq is set to 0 and cfs_rq->nr_running is decremented.
     B: The task is removed from the run queue.
c: put_prev_task--->put_prev_entity()
          if (prev->on_rq) {
               /* Put 'current' back into the tree. */
                __enqueue_entity(cfs_rq, prev);
     on_rq=1 means the task must not be removed from the rbtree. This covers: TASK_RUNNING (slice expired or preempted), and
!TASK_RUNNING && preempt_schedule(), i.e. the __schedule() paths that do not execute deactivate_task().

d:pick_next_task-->pick_next_task_fair-->set_next_entity():
    /* 'current' is not kept within the tree. */
        if (se->on_rq) {
            ........
                __dequeue_entity(cfs_rq, se);
        }

        update_stats_curr_start(cfs_rq, se);
        cfs_rq->curr = se;
       aa) Why test se->on_rq? Can on_rq=0 happen here?
     bb) The task picked from the rbtree normally has on_rq=1, so it must be removed from the rbtree. One can therefore conclude the currently running cfs task is never in the rbtree. Is aa) the exception?
     cc) cfs_rq->curr = se; curr is set to the chosen task's se.

e: Tasks for which deactivate_task runs: blocked tasks && voluntary schedule()
     aa)deactivate_task--->dequeue_task_fair-->dequeue_entity()
     if (se != cfs_rq->curr)
                __dequeue_entity(cfs_rq, se);
     se->on_rq = 0;
     account_entity_dequeue(cfs_rq, se);
     As I understand it, cfs_rq->curr == se here (set in pick_next_task_fair->set_next_entity(): cfs_rq->curr = se;).
     It should be: the currently running cfs task is guaranteed not to be in the rbtree; it was removed back in pick_next_task, so no removal is needed here.
       Still to find: when can se != cfs_rq->curr?

2: What is cfs_rq->h_nr_running for?
3: When is this function called:
    aa) forking a new task: do_fork-->wake_up_new_task-->activate_task(), putting the child on the run queue.
    bb) waking a blocked task: try_to_wake_up-->ttwu_queue-->ttwu_do_activate-->ttwu_activate-->activate_task--->enqueue_task (sched_class: enqueue_task)


》》.dequeue_task           = dequeue_task_fair,
/*
* The dequeue_task method is called   before  nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int task_sleep = flags & DEQUEUE_SLEEP;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                /*
                 * end evaluation on encountering a throttled cfs_rq
                 *
                 * note: in the case of encountering a throttled cfs_rq we will
                 * post the final h_nr_running decrement below.
                */
                if (cfs_rq_throttled(cfs_rq))
                        break;
                cfs_rq->h_nr_running--;

                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
                        /*
                         * Bias pick_next to pick a task from this cfs_rq, as
                         * p is sleeping when it is within its sched_slice.
                         */
                        if (task_sleep && parent_entity(se))
                                set_next_buddy(parent_entity(se));

                        /* avoid re-evaluating load for this entity */
                        se = parent_entity(se);
                        break;
                }
                flags |= DEQUEUE_SLEEP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                cfs_rq->h_nr_running--;

                if (cfs_rq_throttled(cfs_rq))
                        break;

                update_cfs_shares(cfs_rq);
                update_entity_load_avg(se, 1);
        }

        if (!se) {
                dec_nr_running(rq);
                update_rq_runnable_avg(rq, 1);
        }
        hrtick_update(rq);
}
 1: Purpose: if se->on_rq == 1 the task sits in the rbtree and must be removed from it.
 2: hrtick_update is called for the same reason as on enqueue: cfs_rq->nr_running changed (decremented).


  • What happens when the time slice expires?
  • NORMAL
     As covered above, with CFS && CONFIG_SCHED_HRTICK an hrtimer notifies a CFS task that its slice expired; besides that, the ordinary timer tick also checks. The difference: hrtick passes queued=1 while the tick passes queued=0, and the tick path still has to judge whether a resched is needed via check_preempt_tick() (see entity_tick() quoted in full earlier).

  • pick_next_task_fair
In the simplest case, pick the leftmost task from the cfs runqueue, an rbtree keyed by vruntime.
Other factors are considered too: the next/skip/last buddies in pick_next_entity(), quoted in full earlier, whose intent is not yet clear to me.

  • RT
     RT tasks of class FIFO have no time-slice concept; only RR does. As analysed above, an RR task's time slice still uses the tick.
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
        struct sched_rt_entity *rt_se = &p->rt;

        update_curr_rt(rq);
》》Accounts the current task's runtime at a precision higher than the tick, the same precision as CFS.
        watchdog(rq, p);

        /*  
         * RR tasks need a special form of timeslice management.
         * FIFO tasks have no timeslices.
         */
        if (p->policy != SCHED_RR)
                return;

        if (--p->rt.time_slice)
                return;
》》Why use tick precision here rather than the higher-precision runtime that has already been accounted?
1: How would that be implemented? 2: How to handle blocking or preemption in the middle?
Could se->prev_sum_exec_runtime replace time_slice for higher precision? time_slice is an unsigned int while prev_sum_exec_runtime is a u64.

        p->rt.time_slice = sched_rr_timeslice;

        /*  
         * Requeue to the end of queue if we (and all of our ancestors) are not
         * the only element on the queue
         */
        for_each_sched_rt_entity(rt_se) {
                if (rt_se->run_list.prev != rt_se->run_list.next) {
                        requeue_task_rt(rq, p, 0);
                        set_tsk_need_resched(p);
                        return;
                }   
        }
》》1: The current task is not removed from the rr runqueue, unlike a CFS task;
》》2: The RT runqueue data structure:
     /*   This is the priority-queue data structure of the RT scheduling class:  */
     struct rt_prio_array {
             DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
             struct list_head queue[MAX_RT_PRIO];
     };
     Tasks are kept in per-priority lists. When the slice expires, the task only has to be moved to the tail of its list.
}
  • pick_next_task_rt
A separate look at how the next RT task is picked. RT tasks split into RR/FIFO: is there a priority ordering between these two classes?
pick_next_task_rt-->_pick_next_task_rt-->pick_next_rt_entity()
static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq, struct rt_rq *rt_rq)
{
        struct rt_prio_array *array = &rt_rq->active;
        struct sched_rt_entity *next = NULL;
        struct list_head *queue;
        int idx;

        idx = sched_find_first_bit(array->bitmap);
        BUG_ON(idx >= MAX_RT_PRIO);

        queue = array->queue + idx;
        next = list_entry(queue->next, struct sched_rt_entity, run_list);

        return next;
}
1: sched_find_first_bit: find the first set bit. Priorities run from high to low, 0 highest and MAX_RT_PRIO-1 lowest, so this finds the highest-priority non-empty list in the runqueue and takes its first entry.
2: RR and FIFO are not distinguished here: neither class outranks the other; both are ordered purely by prio. Their only difference is the runtime logic: FIFO has no time slice, RR does. (A userspace sketch follows.)
3: RR can preempt FIFO and vice versa; check_preempt_curr_rt() handles this.
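A small sketch of the point in item 2 from userspace: SCHED_FIFO and SCHED_RR share one rtprio range, and only the policy (hence the time-slice logic) differs. Needs root or CAP_SYS_NICE; the priority 10 used here is arbitrary.

/* Sketch: FIFO and RR use the same sched_priority range; only the
 * policy differs.  Build: gcc -o rtdemo rtdemo.c
 */
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 10 }; /* arbitrary rtprio */

        printf("FIFO prio range: %d..%d\n",
               sched_get_priority_min(SCHED_FIFO),
               sched_get_priority_max(SCHED_FIFO));
        printf("RR   prio range: %d..%d\n",
               sched_get_priority_min(SCHED_RR),
               sched_get_priority_max(SCHED_RR));

        /* Same priority, either policy: neither class outranks the other. */
        if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
                perror("sched_setscheduler(SCHED_RR)");
        else
                printf("now SCHED_RR, rtprio 10\n");
        return 0;
}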
  • What does watchdog() do?
static void watchdog(struct rq *rq, struct task_struct *p)
{
        unsigned long soft, hard;

        /* max may change after cur was read, this will be fixed next tick */
        soft = task_rlimit(p, RLIMIT_RTTIME);
        hard = task_rlimit_max(p, RLIMIT_RTTIME);
               
        if (soft != RLIM_INFINITY) {
                unsigned long next;

                if (p->rt.watchdog_stamp != jiffies) {
                        p->rt.timeout++;
                        p->rt.watchdog_stamp = jiffies;
                }
       
                next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
                if (p->rt.timeout > next)
                        p->cputime_expires.sched_exp = p->se.sum_exec_runtime;
        }
}  

1: Here we see the trick of aliasing all the field names of a structure to other names:
struct task_cputime {
        cputime_t utime;
        cputime_t stime;
        unsigned long long sum_exec_runtime;
};
/* Alternate field names when used to cache expirations. */
#define prof_exp        stime
#define virt_exp        utime
#define sched_exp       sum_exec_runtime

A union could achieve the same; what is the advantage of doing it this way?

2: man getrlimit: getrlimit, setrlimit, prlimit - get/set resource limits
     /proc/$pid/limits
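A sketch of the limit that watchdog() enforces: RLIMIT_RTTIME caps the CPU time (in microseconds) a realtime task may consume without making a blocking system call, after which it receives SIGXCPU. The 500 ms / 1 s values below are arbitrary.

/* Sketch: set the RLIMIT_RTTIME limit that watchdog() enforces. */
#define _GNU_SOURCE          /* for RLIMIT_RTTIME on some libcs */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl = {
                .rlim_cur = 500000,   /* soft: 500 ms of rt CPU time */
                .rlim_max = 1000000,  /* hard: 1 s                   */
        };

        if (setrlimit(RLIMIT_RTTIME, &rl) != 0) {
                perror("setrlimit(RLIMIT_RTTIME)");
                return 1;
        }
        /* Also visible in /proc/self/limits. */
        printf("RLIMIT_RTTIME set: soft=%lluus hard=%lluus\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
}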
  • How does CFS ensure a task is scheduled within sched_latency_ns?
     Quoting Documentation/scheduler/sched-design-CFS.txt:
   CFS also maintains the rq->cfs.min_vruntime value, which is a   monotonic
increasing value tracking the smallest vruntime among all tasks in the
runqueue.  The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.
>>min_vruntime is a monotonic increasing value. This is an important point: min_vruntime only ever grows.
>>place_entity() runs 1) when a new task is created, task_fork_fair(); and 2) when a woken task is added to the rbtree:
static void  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
        u64 vruntime = cfs_rq->min_vruntime;

        /*  
         * The 'current' period is already promised to the current tasks,
         * however the extra weight of the new task will slow them down a
         * little, place the new task so that it fits in the slot that
         * stays open at the end.
         */
        if (initial && sched_feat(START_DEBIT))
                vruntime += sched_vslice(cfs_rq, se);

        /* sleeps up to a single latency don't count. */
        if (!initial) {
                unsigned long thresh = sysctl_sched_latency;

                /*  
                 * Halve their sleep time's effect, to allow
                 * for a gentler effect of sleepers:
                 */
                if (sched_feat(GENTLE_FAIR_SLEEPERS))
                        thresh >>= 1;

                vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
          se->vruntime = max_vruntime(se->vruntime, vruntime);
}
The vruntime of an entity newly added to the rbtree must never be lower than min_vruntime; that is what keeps min_vruntime growing.
For example, a task that slept a long time will have a vruntime far below the current min_vruntime; inserted unmodified it would drag min_vruntime backwards, so the sleeper's vruntime is updated on insertion. That is: se->vruntime = max_vruntime(se->vruntime, vruntime) guarantees min_vruntime never shrinks.


>>update_min_vruntime()
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
        u64 vruntime = cfs_rq->min_vruntime;

        if (cfs_rq->curr)
                vruntime = cfs_rq->curr->vruntime;

        if (cfs_rq->rb_leftmost) {
                struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                                   struct sched_entity,
                                                   run_node);

                if (!cfs_rq->curr)
                        vruntime = se->vruntime;
                else
                        vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards. */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
        smp_wmb();
        cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}
min_vruntime is updated here; the following functions call it:
1: __update_curr(), which accounts the task's CPU time
2: dequeue_entity(), which removes a task from the rbtree


Also from Documentation/scheduler/sched-design-CFS.txt:
CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it.
As the system progresses forwards, the executed tasks are put into the tree
more and more to the right --- slowly but surely giving a chance for every task
to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.

>>The code does not obviously show how the runqueue's maximum latency is guaranteed, yet the documentation says it is.
>>So how is "within a deterministic amount of time" achieved? My guess is it follows from how vruntime is computed.
  • struct rq
One rq per CPU; tasks on it are split into RT and !RT. The tasks present are those that merely need CPU (not necessarily TASK_RUNNING; TASK_IN/UN tasks are included too).
The only flag that decides whether a task is on the rq is se->on_rq == 1, not anything else such as task->state.
The CFS run queue

  • The RT run queue
     The RT runqueue, covered earlier:
     /*
     * This is the priority-queue data structure of the RT scheduling class:
     */
     struct rt_prio_array {
             DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
             struct list_head queue[MAX_RT_PRIO];
     };     

  • sched_rt_period_us
     Every RT runqueue has a timer. Its purpose is to keep RT tasks from consuming all cpu time: the CPU is voluntarily yielded to NORMAL tasks.
RT tasks may run for sched_rt_runtime_us (default 950000 us) per period, leaving 50 ms of every second to NORMAL tasks; the value is tunable.
     sched_rt_period_us is the timer's period (default 1000000 us, i.e. one second).
     Timer initialization: init_rt_bandwidth()
     
  • clock/clock_task
     All the times above are obtained via rq_clock_task():
     static inline u64 rq_clock_task(struct rq *rq)
     {      
             return rq->clock_task;
     }
    
     
  • When is it updated?
     void update_rq_clock(struct rq *rq)
{
        s64 delta;

        if (rq->skip_clock_update > 0)
                return;

        delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
        rq->clock += delta;
        update_rq_clock_task(rq, delta);
}
Presumably this updates "now".
     1: if (rq->skip_clock_update > 0)
     2: sched_clock_cpu(): covered in detail later
     3: Who calls this function, and when?
     A: hrtick(), the hrtimer handler for CFS slice expiry
     B: scheduler_tick(), the HZ interrupt handler
     C: enqueue_task(), adding a task to the run queue
     D: dequeue_task(), removing a task from the run queue
     ......
     Conclusion: any operation that is about to compute times updates the clock first, so the later computation uses a correct time.

  • Time source
     The time source is sched_clock_cpu(), implemented in kernel/sched/clock.c in two different ways.
     
/*
* sched_clock for unstable cpu clocks
*
* What:
*
*   cpu_clock(i) provides a fast (execution time) high resolution
* clock with bounded drift between CPUs. The value of cpu_clock(i)
* is   monotonic  for constant i. The timestamp returned is in nanoseconds.
*
* ######################### BIG FAT WARNING ##########################
* # when comparing cpu_clock(i) to cpu_clock(j) for i != j, time can #
* # go backwards !!                                                  #
* ####################################################################
*
* There is no strict promise about the base, although it tends to start
* at 0 on boot (but people really shouldn't rely on that).
*
* cpu_clock(i)       -- can be used from any context, including NMI.
* sched_clock_cpu(i) -- must be used with local IRQs disabled (implied by NMI)
* local_clock()      -- is cpu_clock() on the current cpu.
*
* How:
*
* The implementation either uses sched_clock() when
* !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, which means in that case the
* sched_clock() is assumed to provide these properties (mostly it means
* the architecture provides a globally synchronized highres time source).
*
* Otherwise it tries to create a semi stable clock from a mixture of other
* clocks, including:
*
*  - GTOD (clock monotomic)
*  - sched_clock()
*  - explicit idle events
*
* We use GTOD as base and use sched_clock() deltas to improve resolution. The
* deltas are filtered to provide monotonicity and keeping it within an
* expected window.
*
* Furthermore, explicit sleep and wakeup hooks allow us to account for time
* that is otherwise invisible (TSC gets stopped).
*
*
* Notes:
*
* The !IRQ-safetly of sched_clock() and sched_clock_cpu() comes from things
* like cpufreq interrupts that can change the base clock (TSC) multiplier
* and cause funny jumps in time -- although the filtering provided by
* sched_clock_cpu() should mitigate serious artifacts we cannot rely on it
* in general since for !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK we fully rely on
* sched_clock().
*/

1: This file essentially provides two implementations of sched_clock, selected by CONFIG_HAVE_UNSTABLE_SCHED_CLOCK.
》》The CPU's internal clock is the time source, but CPUs can scale frequency; is that why this option exists?
2: The comment at the top of the file: sched_clock for unstable cpu clocks
3: Scheduling implementation/statistics should use high resolution; cpu_clock() is the time source:
>*  cpu_clock(i) provides a fast (execution time) high resolution 
> clock with bounded drift between CPUs. 
There are two implementations, depending on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK.
4: cpu_clock()/local_clock() ultimately execute sched_clock_cpu() whether or not CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set; of course the implementation differs with the config.
  •  sched_clock()
/*
* Do not use outside of architecture code which knows its limitations.
*
* sched_clock() has no promise of monotonicity or bounded drift between
* CPUs, use (which you should not) requires disabling IRQs.
*
* Please use one of the three interfaces below.
*/
extern unsigned long long notrace sched_clock(void);

/*
* Scheduler clock - returns current time in  nanosec units.
* This is default implementation.
* Architectures and sub-architectures can override this.
*/
unsigned long long __attribute__((weak)) sched_clock(void)
{
        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                        * (NSEC_PER_SEC / HZ);
}
1: The unit is ns, but it is computed from jiffies and thus tied to HZ/TICK, so the precision is low. The comment explicitly says architectures can override it.
How do other architectures override this function to provide a more accurate implementation?
2: jiffies is not GTOD, so monotonicity of sched_clock()'s ns value is not guaranteed here. It should be used as a relative time, but a later reading may be smaller than an earlier one, which needs care.
  • !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
* The implementation either uses sched_clock() when 
* !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, which means in that case the 
* sched_clock() is assumed to provide these properties (mostly it means 
*   the architecture provides a globally synchronized highres time source). 

  • CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
If the system is configured with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK:
* Otherwise it tries to create a semi stable clock from a mixture of other 
* clocks, including: 

*  - GTOD (clock monotomic) 
*  - sched_clock() 
*  - explicit idle events 

* We use GTOD as base and use sched_clock() deltas to improve resolution. The 
* deltas are filtered to provide monotonicity and keeping it within an 
* expected window. 
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
struct sched_clock_data {
        u64                     tick_raw;
        u64                     tick_gtod;
        u64                     clock;
};
     x86 sets CONFIG_HAVE_UNSTABLE_SCHED_CLOCK. Looking through the kernel tree, only x86 has this option; other architectures apparently do not.
  •  sched_clock_cpu
u64 sched_clock_cpu(int cpu)
{
        struct sched_clock_data *scd;
        u64 clock;

        WARN_ON_ONCE(!irqs_disabled());

        if (sched_clock_stable)
                return sched_clock();

        if (unlikely(!sched_clock_running))
                return 0ull;

        scd = cpu_sdc(cpu);

        if (cpu != smp_processor_id())
                clock = sched_clock_remote(scd);
        else
                clock = sched_clock_local(scd);

        return clock;
}
1: Even with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, sched_clock_stable may hold, i.e. sched_clock is guaranteed steady; then the handling matches the !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK case and sched_clock() is called directly.
2: /proc/sched_debug shows this value.
  • How is sched_clock_stable determined?
     Enter the TSC (x86): at its core a 64-bit register inside the CPU, incremented once per CPU clock cycle.
     But the TSC rate is not necessarily fixed; cpu frequency scaling and power-saving modes change it.
     cat /proc/cpuinfo|grep constant_tsc shows whether the current cpu's TSC rate is constant.
     
     early_init_amd()
          /*      
         * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
         * with P/T states and does not stop in deep C-states
         */    
        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
                if (!check_tsc_unstable())
                        sched_clock_stable = 1;
        } 
     int check_tsc_unstable(void) { return tsc_unstable; }

      /*  TSC can be unstable due to cpufreq or due to unsynced TSCs*/
     static int __read_mostly tsc_unstable;

     》》AMD decides sched_clock_stable by checking whether the TSC is stable.
     The Intel code matches AMD's: both judge TSC stability.
     If the TSC is unstable, mark_tsc_unstable() is the function that flags it. What causes instability? cpufreq or unsynced TSCs (SMP)?

  • The x86 sched_clock() implementation
arch/x86/kernel/tsc.c:
    
/* We need to define a real function for sched_clock, to override the
   weak default version */
#ifdef CONFIG_PARAVIRT
unsigned long long sched_clock(void)
{
        return paravirt_sched_clock();
}
#else
unsigned long long sched_clock(void) __attribute__((alias("native_sched_clock")));
#endif
The sched_clock() implementation thus resolves to native_sched_clock():
/*
* Scheduler clock - returns current time in nanosec units.
*/
u64 native_sched_clock(void)
{
        u64 this_offset;

        /*
         * Fall back to jiffies if there's no TSC available:
         * ( But note that we still use it if the TSC is marked
         *   unstable. We do this because unlike Time Of Day,
         *   the scheduler clock tolerates small errors and it's
         *   very important for it to be as fast as the platform
         *   can achieve it. )
         */
        if (unlikely(tsc_disabled)) {
                /* No locking but a rare wrong value is not a big deal: */
                return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
        }

        /* read the Time Stamp Counter: */
        rdtscll(this_offset);

        /* return the value in ns */
        return __cycles_2_ns(this_offset);
}
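The TSC read that rdtscll() does in the kernel can be reproduced from userspace (x86 only, assuming rdtsc is permitted in user mode, which is the default); a sketch using the compiler intrinsic:

/* Sketch: read the x86 TSC from userspace, the same counter
 * native_sched_clock() converts to nanoseconds via __cycles_2_ns().
 */
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
        unsigned long long t0 = __rdtsc();
        unsigned long long t1 = __rdtsc();

        /* Raw cycle counts; converting to ns needs the TSC frequency,
         * which is what the kernel's cycles-to-ns scaling factor encodes. */
        printf("tsc: %llu -> %llu (delta %llu cycles)\n", t0, t1, t1 - t0);
        return 0;
}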
    
  • CONFIG_HAVE_UNSTABLE_SCHED_CLOCK  && !sched_clock_stable
void sched_clock_tick(void)
{
        struct sched_clock_data *scd;
        u64 now, now_gtod;

        if (sched_clock_stable)
                return;

        if (unlikely(!sched_clock_running))
                return;

        WARN_ON_ONCE(!irqs_disabled());
       
        scd = this_scd();
        now_gtod = ktime_to_ns(ktime_get());
        now = sched_clock();

        scd->tick_raw = now;
        scd->tick_gtod = now_gtod;
        sched_clock_local(scd);         
}

1: To be analysed again after studying Linux timers in depth
2: man 7 time / man clock_gettime
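For reference, the userspace counterpart with similar monotonic nanosecond semantics is clock_gettime(CLOCK_MONOTONIC); a minimal sketch (older glibc needs -lrt):

/* Sketch: userspace monotonic nanosecond timestamps, the closest
 * analogue to the GTOD base that sched_clock_tick() samples. */
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec ts;

        if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
                perror("clock_gettime");
                return 1;
        }
        printf("monotonic: %lld.%09ld s\n",
               (long long)ts.tv_sec, ts.tv_nsec);
        return 0;
}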
  • Collect scheduler statistics
  • CONFIG_SCHEDSTATS
          Kernel hacking  --->
                    [*] Kernel debugging 
                         Collect scheduler statistics
  • struct sched_statistics
     struct sched_statistics {
        u64                     wait_start;
        u64                     wait_max;
        u64                     wait_count;
        u64                     wait_sum;
        u64                     iowait_count;
        u64                     iowait_sum;

        u64                     sleep_start;
        u64                     sleep_max;
        s64                     sum_sleep_runtime;

        u64                     block_start;
        u64                     block_max;
        u64                     exec_max;
        u64                     slice_max;

        u64                     nr_migrations_cold;
        u64                     nr_failed_migrations_affine;
        u64                     nr_failed_migrations_running;
        u64                     nr_failed_migrations_hot;
        u64                     nr_forced_migrations;

        u64                     nr_wakeups;
        u64                     nr_wakeups_sync;
        u64                     nr_wakeups_migrate;
        u64                     nr_wakeups_local;
        u64                     nr_wakeups_remote;
        u64                     nr_wakeups_affine;
        u64                     nr_wakeups_affine_attempts;
        u64                     nr_wakeups_passive;
        u64                     nr_wakeups_idle;
};
The fields are described in detail below.
  • wait_**
     wait_start,wait_max,wait_count,wait_sum;
     wait_max: unit ns; wait_sum: unit ns;
     wait_start: accounting starts when the task is enqueued; update_stats_wait_start()
     
static void  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
                        rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
        schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
        schedstat_set(se->statistics.wait_sum, se->statistics.wait_sum +
                        rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
                trace_sched_stat_wait(task_of(se),
                        rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
        }   
#endif
        schedstat_set(se->statistics.wait_start, 0);
}

/*     
*   Tracepoint for accounting wait time (time the task is runnable
*   but not actually running due to scheduler contention).
*/
DEFINE_EVENT(sched_stat_template, sched_stat_wait,
             TP_PROTO(struct task_struct *tsk, u64 delay),
             TP_ARGS(tsk, delay));

wait_max deserves special attention: it is the task's scheduling delay. How can this value be traced with perf?

There is also a special case: when a task migrates from one cpu to another, this function is called as well:
     __migrate_task-->dequeue_task()-->update_stats_dequeue().

  • iowait_**/sleep_**/block_**
     A task is blocked when state != TASK_RUNNING.

  • enqueue_sleeper
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHEDSTATS
        struct task_struct *tsk = NULL;

        if (entity_is_task(se))
                tsk = task_of(se);

        if (se->statistics.sleep_start) {
                u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.sleep_start;

                if ((s64)delta < 0)
                        delta = 0;

                if (unlikely(delta > se->statistics.sleep_max))
                        se->statistics.sleep_max = delta;

                se->statistics.sleep_start = 0;
                se->statistics.sum_sleep_runtime += delta;

                if (tsk) {
                        account_scheduler_latency(tsk, delta >> 10, 1);
                          trace_sched_stat_sleep(tsk, delta);
                }
        }
        if (se->statistics.block_start) {
                u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.block_start;

                if ((s64)delta < 0)
                        delta = 0;

                if (unlikely(delta > se->statistics.block_max))
                        se->statistics.block_max = delta;

                se->statistics.block_start = 0;
                se->statistics.sum_sleep_runtime += delta;

                if (tsk) {
                        if (tsk->in_iowait) {
                                se->statistics.iowait_sum += delta;
                                se->statistics.iowait_count++;
                                  trace_sched_stat_iowait(tsk, delta);
                        }

                          trace_sched_stat_blocked(tsk, delta);

                        /*
                         * Blocking time is in units of nanosecs, so shift by
                         * 20 to get a milliseconds-range estimation of the
                         * amount of time that the task spent sleeping:
                         */
                        if (unlikely(prof_on == SLEEP_PROFILING)) {
                                profile_hits(SLEEP_PROFILING,
                                                (void *)get_wchan(tsk),
                                                delta >> 20);
                        }
                        account_scheduler_latency(tsk, delta >> 10, 0);
                }
        }
#endif
}
     enqueue_entity()
          if (flags & ENQUEUE_WAKEUP) {
                place_entity(cfs_rq, se, 0);
                enqueue_sleeper(cfs_rq, se);
        }
     1: The only entry point is when the task is being put back on the run queue; that is when these statistics are gathered. Distinguish sleep from block (iowait or !iowait).
Note ENQUEUE_WAKEUP in enqueue_entity: it means the task is being woken up, and a blocked task of course needs waking.
     2:sched_stat_sleep
     /*
     * Tracepoint for accounting sleep time (time the task is not runnable,
     * including iowait, see below).
     */
     DEFINE_EVENT(sched_stat_template, sched_stat_sleep,            
     3: sched_stat_iowait
     /*                      
     * Tracepoint for accounting iowait time (time the task is not runnable
     * due to waiting on IO to complete).
     */            
     DEFINE_EVENT(sched_stat_template, sched_stat_iowait,
     4:sched_stat_blocked
     /*             
     * Tracepoint for accounting blocked time (time the task is in uninterruptible).
     */    
     DEFINE_EVENT(sched_stat_template, sched_stat_blocked, 
  • dequeue_entity()
                  if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
                if (entity_is_task(se)) {
                        struct task_struct *tsk = task_of(se);

                        if (tsk->state & TASK_INTERRUPTIBLE)
                                se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
                        if (tsk->state & TASK_UNINTERRUPTIBLE)
                                se->statistics.block_start = rq_clock(rq_of(cfs_rq));
                }
#endif
        }
         1: The only entry point: counting starts when the task is removed from the run queue. Note the DEQUEUE_SLEEP flag; dequeue happens for other reasons too, not only DEQUEUE_SLEEP.
         2: sleep means TASK_INTERRUPTIBLE; block means TASK_UNINTERRUPTIBLE. A fairly simple distinction.
  • exec_max
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
        unsigned long delta_exec_weighted;

        schedstat_set(curr->statistics.exec_max,
                      max((u64)delta_exec, curr->statistics.exec_max));
1: exec_max vs slice_max: over the span from the task getting the cpu (set_next_entity) to releasing it (whether or not it stays on the run queue), are exec_max and slice_max updated the same way?
     From the code below, slice_max is only evaluated in set_next_entity, i.e. once per span from one cpu acquisition to the next (time not occupying the cpu already excluded). Can exec_max be updated multiple times?
     From the code, update_curr is called in other situations too, i.e. there can be several update_curr calls between two cpu acquisitions:
     A: entity_tick(): a non-HRTICK tick updates exec_max in between, but not slice_max.
     B: nice() changing the running task's nice value: set_user_nice:
      Conclusion: exec_max <= slice_max;
  • slice_max
     set_next_entity()
 #ifdef CONFIG_SCHEDSTATS
        /*
         * Track our maximum slice length, if the CPU's load is at
         * least twice that of our own weight (i.e. dont track it
         * when there are only lesser-weight tasks around):
         */
        if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
                se->statistics.slice_max = max(se->statistics.slice_max,
                        se->sum_exec_runtime - se->prev_sum_exec_runtime);
        }
#endif
        se->prev_sum_exec_runtime = se->sum_exec_runtime;
     1: Updated when the task is picked from the run queue (gets the cpu).
     2: What does the if condition mean?
     3: It records the longest cpu time the task actually occupied within one CFS scheduling period.
  • /proc/$pid/sched
    Shows the task's struct sched_statistics plus other information; the kernel handler is proc_sched_show_task. (A small reader sketch follows.)
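A trivial sketch that dumps the file for the current process; field names such as se.statistics.wait_max map onto struct sched_statistics above (exact names vary by kernel version):

/* Sketch: dump /proc/self/sched, produced by proc_sched_show_task().
 * Needs CONFIG_SCHED_DEBUG; the statistics fields additionally need
 * CONFIG_SCHEDSTATS.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/self/sched", "r");
        char line[256];

        if (!f) {
                perror("fopen /proc/self/sched");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}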
  • nr_switches
     The task's context-switch counts, incremented only in __schedule.
     nr_voluntary_switches: the cpu was given up voluntarily
     nr_involuntary_switches: involuntary, covering slice expiry and preemption. (See the getrusage() sketch after this item.)
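The same voluntary/involuntary split is exported to userspace through getrusage(2) as ru_nvcsw/ru_nivcsw; a sketch:

/* Sketch: read the voluntary/involuntary context-switch counters
 * for the calling process via getrusage(2). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
                perror("getrusage");
                return 1;
        }
        printf("voluntary:   %ld\n", ru.ru_nvcsw);
        printf("involuntary: %ld\n", ru.ru_nivcsw);
        return 0;
}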
  • __schedule()
        switch_count = &prev->nivcsw;
        if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
                if (unlikely(signal_pending_state(prev->state, prev))) {
                        prev->state = TASK_RUNNING;
                } else {
                        deactivate_task(rq, prev, DEQUEUE_SLEEP);
                        prev->on_rq = 0;

                        /*  
                         * If a worker went to sleep, notify and ask workqueue
                         * whether it wants to wake up a task to maintain
                         * concurrency.
                         */
                        if (prev->flags & PF_WQ_WORKER) {
                                struct task_struct *to_wakeup;

                                to_wakeup = wq_worker_sleeping(prev, cpu);
                                if (to_wakeup)
                                        try_to_wake_up_local(to_wakeup);
                        }   
                }   
                switch_count = &prev->nvcsw;
        }    
     1: Only a task in TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE that was not preempted counts as a voluntary switch.
     2: Slice expiry with the task in TASK_RUNNING is definitely involuntary.
     3: A TASK_RUNNING task preempted by a higher-priority task: preempt_count() & PREEMPT_ACTIVE != 0;
     4: A TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE task that gets preempted also counts as involuntary.
     5: cond_resched() also counts as an involuntary release.

  • Process lifetime analysis
1: fork
2: normal execution (user/kernel space transitions)
3: time slice runs out
4: preempted (while running / while blocking)
5: blocks (normally / while being preempted)
6: wakeup (joins the run queue)
7: gets the cpu (run queue ---> get cpu)
8: exit
9: runtime/scheduling statistics
10: draw a process state diagram covering all the states above

  • System-wide scheduling statistics
1: /proc/$pid/
     latency, sched, schedstat
2:/proc/sched_debug
3:/proc/schedstat

  • Related tools
1: nice/renice
2: chrt
3: taskset (a sketch of the underlying sched_setaffinity(2) call follows)
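As a sketch of what taskset does underneath, pin the current process to CPU 0 with sched_setaffinity(2):

/* Sketch: the syscall behind taskset -- pin the calling process
 * to CPU 0.  _GNU_SOURCE is needed for cpu_set_t / CPU_SET. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);           /* allow CPU 0 only */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }
        printf("pinned to CPU 0\n");
        return 0;
}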