Linux CFS Scheduling Model: Implementation Notes and Simulation

papaofdoudou

Last modified 2024-07-01 09:07:30

First published 2024-04-05 00:26:38

Article link: https://blog.csdn.net/tugouxp/article/details/137386442

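Resource scheduling in Linux

Computer systems generally manage resources in one of two ways: time division or space division. The idea is to exploit the regularity of the hardware, splitting a resource into similar pieces in time or in space so that software can treat every piece with the same logic. A CPU works the same way at time A as at time B; only the execution context differs. It can be seen as a resource that is symmetric in time, so dividing time is the natural way to manage it. Memory is different: its contents at different moments are unpredictable, so there is no similarity along the time axis. In space, however, the transistors in region A and region B do exactly the same job and differ only in their starting address, which makes space division the natural fit.

In the Linux implementation, the CPU is managed by time division, which is the job of the kernel's task scheduler. Memory is managed by space division, which is the job of the kernel's memory management subsystem: the kernel splits memory into zones and then manages the pages inside each zone with the buddy algorithm.

Can the CPU also be managed in space, or memory in time? It should be possible. Multi-core, SMP and SMT designs physically provide several sets of processors or pipeline resources, which extends the CPU along the space dimension. A modern processor running Linux therefore supports both time division and space division, and both fall under scheduling.

What about memory managed in time? The Linux swap mechanism can arguably be seen as one such implementation. To let as many processes as possible run in a limited amount of memory, Linux can swap anonymous pages out to disk when memory is tight and swap them back in when they are needed, which time-multiplexes physical page frames. So, just like the CPU, physical memory in Linux is managed by both time division and space division.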

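Differences between CFS scheduling and priority-queue scheduling

The Linux kernel supports both CFS scheduling and priority-queue scheduling. CFS is used for tasks with the SCHED_NORMAL policy, while the real-time policy SCHED_RR picks the next task from priority queues.

As for how the two differ inside the kernel, my understanding is that it comes down to what "priority" means. With priority queues, priority is the only criterion for choosing the next task. A higher-priority task has an absolute right to the CPU, so as long as any higher-priority task is runnable, lower-priority tasks never get to run, and thread starvation can occur.

CFS interprets "priority" differently. Priority only expresses the "share" or "number of portions" of the CPU-time pool that a task is entitled to; it is not an absolute right. Even a very low-priority thread on the run queue still receives its share of the CPU.

So the policy of a priority queue is:

$next\_task = \arg\max_{t \in \{A, B, \cdots\}} prio(t)$

CFS instead picks the task that is furthest behind on its way to its "expected share". That requires a measure of progress: for the same execution time, the progress increment is inversely proportional to the weight, so the larger the weight, the more slowly progress grows.

Suppose two processes have weights W1 and W2 with W1 < W2, and W1 is chosen as the reference for the other tasks:

Take two threads with weights 2 and 5 as an example. Under CFS, their progress evolves as shown in the table below, and the two end up with the same progress.

The logic in the table is that, for a given execution time, the rate of progress is inversely proportional to the weight, so to reach the same progress, more execution time has to be given to the higher-priority process.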

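Progress in the table is a ratio, so it has no unit, and for the task with the reference weight, time and progress grow at the same rate. If the reference weight is taken to be 1 instead, the table looks different but the scheduling result does not change; one can imagine a task of weight 1 being present: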

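So what CFS guarantees is that, as scheduling continues, every task's progress stays aligned with that of the reference task. The reference is arbitrary; usually the weight used by most threads is chosen, so that for them the scale factor is 1 and the time increment equals the progress increment, which makes it easier to account for each process's progress.

The Linux scheduler has a dedicated name for this progress measure: virtual time (vruntime). As the table above shows, a high-priority process has a large weight, so its virtual time is scaled down proportionally; virtual time flows at a rate inversely proportional to the weight. For a task with the reference weight, virtual time passes at the same rate as real time. For a task whose weight is above the reference, it passes more slowly, and since CFS prefers the thread with the smallest virtual time, high-weight tasks get more chances to run. For a low-priority task whose weight is below the reference, virtual time passes faster than real time, so it needs only a short execution to meet its progress target. As a result, within one complete scheduling period (the time in which every process on the queue is scheduled once, W1+W2 in the table above), every task gets a chance to run, and the virtual times are brought as close to each other as possible by the end of the period.

"As close as possible" because scheduling points in the kernel do not necessarily fall on whole units that divide evenly by the weights. Within one delta time, some process always runs a little longer than planned and gains extra virtual time, while others run a little less. The next scheduling period then has to compensate the threads that lost out in the previous one. Because the scheduling period is finite and CFS ensures that every task gets to run in each period, errors only arise at misaligned scheduling points, so the virtual times of the tasks never drift far apart overall. CFS is a dynamic process of continually pursuing fairness: it is fair in procedure and keeps the outcome fair through continuous correction. Fairness is an ideal; CFS may not be able to keep all tasks at the same progress at every instant, but it has been working toward that goal since the moment the system booted.

Testing on a real Linux system

Start 32 stress processes that do arithmetic on the CPU and stay runnable in user mode the whole time. The result shows that the parent process 17580 creates 32 worker processes and then blocks in wait4(). The parent is no longer on a run queue, so its virtual time is never updated, while the virtual times of the worker children stay roughly in sync.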

root@zlcao-RedmiBook-14:/sys/kernel/debug/tracing# for i in  `pidof stress`; do cat /proc/$i/sched|grep vruntime; done
se.vruntime                                  :       1244447.484042
se.vruntime                                  :       1177755.552582
se.vruntime                                  :       1190714.768543
se.vruntime                                  :       1280956.952213
se.vruntime                                  :       1190723.045775
se.vruntime                                  :       1280978.042641
se.vruntime                                  :       1280987.731056
se.vruntime                                  :       1025434.306120
se.vruntime                                  :       1281006.743422
se.vruntime                                  :       1291930.178007
se.vruntime                                  :       1138416.091181
se.vruntime                                  :       1025461.000309
se.vruntime                                  :       1025469.412830
se.vruntime                                  :       1190784.807417
se.vruntime                                  :       1177831.652813
se.vruntime                                  :       1177837.035659
se.vruntime                                  :       1138461.572388
se.vruntime                                  :       1190814.136239
se.vruntime                                  :       1244563.996583
se.vruntime                                  :       1138479.876476
se.vruntime                                  :       1292021.081603
se.vruntime                                  :       1292030.452872
se.vruntime                                  :       1138500.913325
se.vruntime                                  :       1025543.679330
se.vruntime                                  :       1281133.391813
se.vruntime                                  :       1177916.405541
se.vruntime                                  :       1138530.595394
se.vruntime                                  :       1025567.840336
se.vruntime                                  :       1244638.788856
se.vruntime                                  :       1244645.502439
se.vruntime                                  :       1292108.509432
se.vruntime                                  :       1244657.520996
se.vruntime                                  :        727447.696491
root@zlcao-RedmiBook-14:/sys/kernel/debug/tracing# pstree -p 17580
stress(17580)─┬─stress(17581)
              ├─stress(17582)
              ├─stress(17583)
              ├─stress(17584)
              ├─stress(17585)
              ├─stress(17586)
              ├─stress(17587)
              ├─stress(17588)
              ├─stress(17589)
              ├─stress(17590)
              ├─stress(17591)
              ├─stress(17592)
              ├─stress(17593)
              ├─stress(17594)
              ├─stress(17595)
              ├─stress(17596)
              ├─stress(17597)
              ├─stress(17598)
              ├─stress(17599)
              ├─stress(17600)
              ├─stress(17601)
              ├─stress(17602)
              ├─stress(17603)
              ├─stress(17604)
              ├─stress(17605)
              ├─stress(17606)
              ├─stress(17607)
              ├─stress(17608)
              ├─stress(17609)
              ├─stress(17610)
              ├─stress(17611)
              └─stress(17612)
root@zlcao-RedmiBook-14:/sys/kernel/debug/tracing# cat /proc/17580/stack
[<0>] do_wait+0x1cb/0x230
[<0>] kernel_wait4+0x89/0x130
[<0>] __do_sys_wait4+0x95/0xa0
[<0>] __x64_sys_wait4+0x1e/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
root@zlcao-RedmiBook-14:/sys/kernel/debug/tracing#

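By default stress does not pin its workers, so the processes are spread across different cores, and the CFS run queues on different cores can have different virtual times. Because of cross-core migration, a process's virtual runtime may be adjusted backward after it moves to the target core. In the recorded animation, a process's vruntime briefly steps backward once; this happens because the process migrated between cores whose CFS run queues have different virtual times.

If all stress test processes are pinned to one core so that migration plays no role, vruntime can be seen increasing steadily. The command to pin all stress processes to core 2 is: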

# taskset -c 2 stress -c 32

The virtual clock now increases steadily and never steps backward:

root@zlcao-RedmiBook-14:/proc# for i in  `pidof stress`; do cat /proc/$i/sched|grep vruntime; done
se.vruntime                                  :       1500394.995140
se.vruntime                                  :       1500398.051040
se.vruntime                                  :       1500396.242118
se.vruntime                                  :       1500396.603221
se.vruntime                                  :       1500394.832669
se.vruntime                                  :       1500394.851502
se.vruntime                                  :       1500394.885326
se.vruntime                                  :       1500395.936951
se.vruntime                                  :       1500398.300004
se.vruntime                                  :       1500395.390453
se.vruntime                                  :       1500395.752777
se.vruntime                                  :       1500395.783833
se.vruntime                                  :       1500396.889590
se.vruntime                                  :       1500398.371353
se.vruntime                                  :       1500398.367499
se.vruntime                                  :       1500395.002095
se.vruntime                                  :       1500396.348955
se.vruntime                                  :       1500395.081132
se.vruntime                                  :       1500395.407083
se.vruntime                                  :       1500395.819177
se.vruntime                                  :       1500395.017735
se.vruntime                                  :       1500395.278265
se.vruntime                                  :       1500396.333423
se.vruntime                                  :       1500395.131101
se.vruntime                                  :       1500395.291119
se.vruntime                                  :       1500395.172903
se.vruntime                                  :       1500397.777858
se.vruntime                                  :       1500397.491441
se.vruntime                                  :       1500395.535772
se.vruntime                                  :       1500398.786423
se.vruntime                                  :       1500395.266494
se.vruntime                                  :       1500395.223027
se.vruntime                                  :       1424450.417327
root@zlcao-RedmiBook-14:/proc# pstree -p 23911
stress(23911)─┬─stress(23912)
              ├─stress(23913)
              ├─stress(23914)
              ├─stress(23915)
              ├─stress(23916)
              ├─stress(23917)
              ├─stress(23918)
              ├─stress(23919)
              ├─stress(23921)
              ├─stress(23922)
              ├─stress(23923)
              ├─stress(23924)
              ├─stress(23925)
              ├─stress(23926)
              ├─stress(23927)
              ├─stress(23928)
              ├─stress(23930)
              ├─stress(23931)
              ├─stress(23932)
              ├─stress(23933)
              ├─stress(23934)
              ├─stress(23935)
              ├─stress(23936)
              ├─stress(23937)
              ├─stress(23938)
              ├─stress(23939)
              ├─stress(23940)
              ├─stress(23941)
              ├─stress(23942)
              ├─stress(23943)
              ├─stress(23944)
              └─stress(23945)
root@zlcao-RedmiBook-14:/proc#

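CFS virtual time slice calculation

The virtual time of task i is computed as:

$vruntime_i = realtime_i \times \frac{1024}{W_i}$

Here 1024 is the weight of a process at the reference priority 120. Priority 120 is the most common one in the kernel, so using it as the reference clock avoids any scaling for those tasks (the scale factor is 1). When W > 1024 the process has a higher priority than 120 (a negative nice value) and its virtual time runs slower; when W < 1024 the process has a lower priority than 120 (a positive nice value) and its virtual time runs faster. A thread with weight 1024 that runs for 10 ms has exactly 10 ms added to its virtual time, but if its weight is raised to 9548 (nice -10), its virtual time grows by only about 1.07 ms:

vruntime = exectime * (1024 / se->load.weight) = 10 * 1024 / 9548 ≈ 1.07 ms

The formula is invertible, and the inverse has the same form as the forward expression (a value multiplied by a ratio of two weights), so one function can perform the conversion in both directions.

$realtime_i =vruntime_i \times \frac{W_i}{1024}$

The Linux kernel uses this function both to compute the next process's time slice and to compute vruntime. In general terms, the function works out the ratio of a part (the second argument) to a whole (the third argument) and applies that ratio to obtain virtual or real time.

To avoid a costly division and simplify the computation, the kernel uses a trick and rewrites the formula as:

$vruntime_i=realtime_i \times 1024 \times \frac{2^{32}}{W_i}\times \frac{1}{2^{32}}$

Since static priorities map to only 40 weights, $2^{32}/W_i$ can be computed in advance, and the factor $1/2^{32}$ becomes a right shift by 32 bits. The kernel therefore defines two arrays, one holding the weight of each priority and one holding the precomputed $2^{32}/W_i$:

Working through the numbers shows that the product of the entries at the same index in the two arrays is approximately 2^32 (exactly 2^32 for nice 0, very close elsewhere because of integer rounding).

realtime*1024 may exceed 32 bits, and multiplying it by another 32-bit number could then overflow 64 bits. In that case the result of realtime*1024 is first shifted right by 16 bits, multiplied by inv_weight, and then shifted right by another 16 bits, which is equivalent to a right shift by 32 bits. Otherwise the product is computed directly.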

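In simulation, each step down in priority costs a task about 20% of its processor time, which matches the ratio between adjacent weights:

Following this pattern, the weight array can be approximated by the function

$y=\frac{1024}{1.25^x}$

where x is the nice value and y is the weight: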

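Total queue weight

The total weight of a queue is stored in cfs_rq->load.weight (cfs_rq being a struct cfs_rq *). When a task is enqueued or dequeued, update_load_add and update_load_sub are called to update it. Time slices are allocated according to each thread's share of the total weight:

slice = sched_period * se->load.weight / cfs_rq->load.weight

Here se->load.weight is the weight of the scheduling entity and cfs_rq->load.weight is the sum of the weights of all scheduling entities on the run queue. This formula only applies when group scheduling is disabled; with group scheduling enabled things are somewhat different.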

            bash-3977    [002] d..2    79.446678: update_load_add <-account_entity_enqueue
            bash-3977    [002] d..2    79.446682: <stack trace>
 => update_load_add
 => account_entity_enqueue
 => enqueue_entity
 => enqueue_task_fair
 => activate_task
 => wake_up_new_task
 => _do_fork
 => __x64_sys_clone
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
            bash-3977    [002] d..2    79.446941: update_load_sub <-account_entity_dequeue
            bash-3977    [002] d..2    79.446947: <stack trace>
 => update_load_sub
 => account_entity_dequeue
 => dequeue_entity
 => dequeue_task_fair
 => deactivate_task
 => __schedule
 => schedule
 => do_wait
 => kernel_wait4
 => __do_sys_wait4
 => __x64_sys_wait4
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
            bash-3977    [002] d..2    79.446949: update_load_sub <-account_entity_dequeue
            bash-3977    [002] d..2    79.446954: <stack trace>
 => update_load_sub
 => account_entity_dequeue
 => dequeue_entity
 => dequeue_task_fair
 => deactivate_task
 => __schedule
 => schedule
 => do_wait
 => kernel_wait4
 => __do_sys_wait4
 => __x64_sys_wait4
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
             cat-4123    [003] d.h1    79.447025: update_load_sub <-account_entity_dequeue
             cat-4123    [003] d.h1    79.447036: <stack trace>
 => update_load_sub
 => account_entity_dequeue
 => reweight_entity
 => update_cfs_group
 => task_tick_fair
 => scheduler_tick
 => update_process_times
 => tick_sched_handle
 => tick_sched_timer
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => copy_user_enhanced_fast_string
 => __x64_sys_rt_sigaction
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
             cat-4123    [003] d.h1    79.447037: update_load_add <-account_entity_enqueue
             cat-4123    [003] d.h1    79.447043: <stack trace>
 => update_load_add
 => account_entity_enqueue
 => reweight_entity
 => update_cfs_group
 => task_tick_fair
 => scheduler_tick
 => update_process_times
 => tick_sched_handle
 => tick_sched_timer
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => copy_user_enhanced_fast_string
 => __x64_sys_rt_sigaction
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
             cat-4123    [003] d.s4    79.447059: update_load_add <-account_entity_enqueue
             cat-4123    [003] d.s4    79.447067: <stack trace>
 => update_load_add
 => account_entity_enqueue
 => enqueue_entity
 => enqueue_task_fair
 => activate_task
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => swake_up_locked.part.4
 => swake_up_one
 => rcu_gp_kthread_wake
 => rcu_accelerate_cbs_unlocked
 => rcu_core
 => rcu_core_si
 => __do_softirq
 => irq_exit
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => copy_user_enhanced_fast_string
 => __x64_sys_rt_sigaction
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
     rcu_preempt-10      [000] d..2    79.447106: update_load_sub <-account_entity_dequeue
     rcu_preempt-10      [000] d..2    79.447116: <stack trace>
 => update_load_sub
 => account_entity_dequeue
 => dequeue_entity
 => dequeue_task_fair
 => deactivate_task
 => __schedule
 => schedule
 => schedule_timeout
 => rcu_gp_kthread
 => kthread
 => ret_from_fork
             cat-4123    [003] d.h2    79.448023: update_load_sub <-account_entity_dequeue
             cat-4123    [003] d.h2    79.448033: <stack trace>
 => update_load_sub
 => account_entity_dequeue
 => reweight_entity
 => update_cfs_group
 => task_tick_fair
 => scheduler_tick
 => update_process_times
 => tick_sched_handle
 => tick_sched_timer
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => memcg_kmem_put_cache
 => kmem_cache_alloc
 => vm_area_alloc
 => mmap_region
 => do_mmap
 => vm_mmap_pgoff
 => ksys_mmap_pgoff
 => __x64_sys_mmap
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

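Scheduling period

The scheduling period is the total time in which every process on the run queue is scheduled once. The set of tasks on the queue changes dynamically: new tasks can be added at any time, and tasks can leave at any time to wait on some other object, for example sleeping on a mutex wait queue. An exact scheduling period therefore cannot be given, so the kernel uses an empirical value, which users can tune through sysctl:

Empirically, the scheduling period is chosen so that every task can run once within it.

The two parameters can be adjusted through the procfs files /proc/sys/kernel/sched_min_granularity_ns and /proc/sys/kernel/sched_latency_ns.

sysctl_sched_min_granularity is the minimum preemption granularity for CPU-bound tasks, in nanoseconds: if the preemption logic finds that a thread has run for less than sysctl_sched_min_granularity, no preemption is triggered. sysctl_sched_latency is the default scheduling period, also in nanoseconds. If the queue holds no more than 8 processes, the default sysctl_sched_latency is used as the scheduling period; with more than 8 processes the period becomes nr_running*sysctl_sched_min_granularity.

sched_nr_latency = sysctl_sched_latency / sysctl_sched_min_granularity is the largest number of tasks the queue can hold under the default granularity and period. Since sysctl_sched_latency defaults to 6000000ULL and sysctl_sched_min_granularity to 750000ULL, the default sched_nr_latency is 8.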

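Execution time slice

The time slice is measured in real time. Once the scheduling period, the total queue weight and the task's own weight are known, the slice follows directly:

$\mathbf{task_{slice} = sched\_period \times \frac{W_i}{\sum_{s=1}^{s=n}W_s}}$

This logic is implemented by the function sched_slice.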

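Even a higher-priority process should not hold the CPU forever and keep lower-priority processes from running; this is the fundamental difference between CFS scheduling and RR scheduling. Once a process has used up its time slice, it should give way to others and be replaced. One way to picture this mechanism is to divide the tasks on the run queue into active and expired ones. Active processes have not yet used up their time slice and are allowed to run. Expired processes are formerly active ones whose slice is exhausted; they are not scheduled again in the current period, and the next period starts once every active process has expired. (CFS does not keep separate active and expired lists; its vruntime ordering produces the same effect.)

For the real-time scheduler there are no expired processes: a high-priority process stays active and never expires, so there is no notion of weight.

CFS run queue management

The CFS scheduler manages its task queue with a red-black tree whose root is the tasks_timeline field of struct cfs_rq.

The red-black tree holds tasks in the "ready" state. The quotation marks are there because the Linux kernel has no task state called ready, only TASK_RUNNING. In practice only one task per CPU is actually running (rq.curr), and it is not in the CFS red-black tree even though current->se.on_rq is 1; see set_next_task() and put_prev_task(). The dequeue and enqueue are necessary because only an insertion makes the tree reorder by vruntime. As the call stacks below show, the task chosen by pick_next_task_fair to run next is dequeued from the red-black tree. If the task being switched out is still on_rq, it did not block and is still runnable; it simply waits to be activated in a later scheduling period, so it has to be enqueued back into the tree.

Red-black tree enqueue and dequeue call stacks:

Reference: Linux CFS scheduling_nr_running (CSDN blog)

init_cfs_rq is called from sched_init during system initialization, which sets up one CFS run queue per CPU. It is also called in the context of task groups, which share CPU time as independent scheduling entities and have their own separate implementation:

CFS maintains a virtual time variable vruntime for every scheduling entity (SE), and each run queue cfs_rq maintains another variable, min_vruntime, which tracks the smallest virtual runtime among all processes on the queue (including the current process, even though it is not in the CFS tree while it runs). update_min_vruntime guarantees that min_vruntime only ever increases and never moves backward.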

Red-black tree key

The CFS red-black tree uses each scheduling entity's vruntime as its key, and insertion compares keys:

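Observing Linux CFS scheduling

Per-process scheduling data can be read from /proc/<pid>/sched. se.sum_exec_runtime records the total real CPU time the process has consumed; it is the sum of the delta times of all its scheduled runs and increases monotonically. se.vruntime records the process's virtual clock.

To keep cross-CPU migration from interfering, the three test processes stress, htop and top are pinned to the same CPU: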

$ taskset -c 2 stress -c 2
$ taskset -c 2 top
$ taskset -c 2 htop

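While the processes run, se.vruntime and se.sum_exec_runtime of all three are sampled, and the samples are plotted over time:

The figure below shows the virtual clocks of the three processes. Blue is the stress process, orange is htop and green is top. The following patterns are visible:

1. stress repeatedly computes square roots and is compute-bound. During the sampling interval it stays on the run queue the whole time, so its virtual clock increases almost linearly.

2. htop and top behave alike. Both are interactive, periodic tasks that can be woken by user input such as a key press, or by a periodic timer (the screen refresh). Their virtual time therefore grows in steps. When such a process goes into its periodic sleep, it leaves the run queue and its virtual clock stops advancing, so it falls behind the blue stress process. When the timer wakes it, the task rejoins the run queue; having slept for a while, its vruntime is smaller than the run queue's, so htop/top can preempt stress immediately and get scheduled, and its virtual clock catches up with the vruntime of stress again. This repeats over and over and produces the staircase shape. Comparing the lengths of the orange and green steps shows that top's refresh interval is about twice htop's.

(In fact, to keep a process that slept for a long time from having so small a virtual clock that the scheduler over-compensates it after wakeup, the scheduler normally adjusts the vruntime upward by a positive amount, so that the woken process's virtual clock does not end up too far from the queue average.)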

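In the figure below the difference between I/O-bound and CPU-bound processes is plain to see. The virtual clock of a compute-bound task stays in step with the virtual clock of the CPU's CFS queue, while an I/O-bound task keeps replaying the cycle "fall behind the CPU's virtual clock, catch up, fall behind again".

The figure above also shows that even when tasks differ in type and behavior, CFS keeps their virtual runtimes roughly equal.

CFS virtual time is real time normalized by weight. Although the virtual clocks advance neck and neck, processes differ enormously in how much real CPU time they obtain. Plotting the real CPU time of the same three processes makes this obvious, as the figure below shows. Most of an interactive task's virtual time is "granted" by the scheduler at wakeup rather than earned by actually running, so the real execution time of htop/top is worlds apart from that of the compute-bound stress. From the CFS scheduler's point of view the result is still fair, because the virtual clocks stay aligned.

What happens without CPU pinning?

As mentioned earlier, when a task migrates between CPUs, its virtual clock may step backward because each CPU's CFS queue has its own virtual time. Without pinning, this should be observable. As the figure below shows, the virtual clocks of all three processes step backward, which means each of them migrated across cores at those points. Backward steps occur only on cross-core migration; while a process runs steadily on one core, its virtual clock is monotonically increasing.

Whether or not a process migrates, its accumulated execution time increases monotonically, as the figure below confirms.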

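Barium meal (tracer) test

On a dual-core, four-thread processor, each hardware thread (CPU) has its own cfs rq, so there are four vruntime streams in total. To make the streams visible, four identical helloworld processes are created, each sleeping periodically for 1 second and each pinned to one of the four CPUs, so that the four vruntime tracks show up. Then stress, htop and top are run again without pinning. In the figure below the vruntime streams of the four cores are clearly visible, joined at random by thin lines; each thin line is a process migrating across cores. The figure also shows that vruntime on each core is monotonically increasing and never steps backward.

se.vruntime is 64 bits wide. Current CPUs are clocked above 1 GHz, so a clock cycle is on the order of a nanosecond. Overflow time of a 64-bit register: at 3 GHz the counter would grow by 3000000000 per second, so overflow takes 2^64 / (3*10^9) = 6148914691.2 s, about 195 years. Overflow is therefore not a concern in practice.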

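Handling migrated processes

It has been mentioned several times that migrating a process between CPUs causes its virtual time to be readjusted, possibly by a large jump that shows up in the plots as a cliff-like drop or rise. How exactly is the adjustment made?

First, the migration path calls the sched_class hook migrate_task_rq_fair. The entity leaves the current CPU's run queue, and before it does, se->vruntime -= cfs_rq->min_vruntime is applied. What is left after subtracting the current queue's minimum virtual time is the entity's relative standing within that queue, which can be carried over to the target CPU's queue. In enqueue_entity on the target CPU, the target queue's min_vruntime is added back.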

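After that, the migration counter is incremented with p->se.nr_migrations++:

Call stack of the process migration context

Trend of vruntime

The output of /proc/sched_debug contains detailed scheduler data, including min_vruntime for every CPU and scheduling group. Extracting and plotting min_vruntime of the run queues of all 8 CPUs on a 4-core, 8-thread machine gives the trend shown in the figure below:

Notes on the figure:

1. The capture covers one night, about 11 hours, and vruntime is seen to increase monotonically.

2. The vruntime values of the 8 processors are independent and are not synchronized with each other, but processors with similar load have similar curves. The figure should contain 8 curves. With a long sampling window, the accumulated vruntime reduces the resolution of the vertical axis, so curves of similarly loaded CPUs overlap and only four appear to be visible. There really are 8; plotting only a few minutes shows all 8 vruntime curves.

3. There is a phase where the slope of the curves clearly increases and then drops again. During the test, a stress process was pinned to one core while top ran unpinned. Because top was not pinned, vruntime grew faster on all CPUs and the slopes increased; the core with the pinned stress process grew fastest and has the largest slope of all (the orange line). After a while both processes were stopped, the load on all cores evened out again, and the slopes of the 8 curves gradually became parallel. Because the orange core's virtual clock had been sped up by stress for a while, its vruntime runs parallel to the other cores at a higher level.

4. So vruntime changes at different rates on different cores: it grows fast on heavily loaded cores and slowly on lightly loaded ones. If a core has no load and only runs idle, its vruntime does not increase, since the idle task is not in the CFS scheduling class.

5. There are as many vruntime timelines as there are SMT cores (PUs); each core's timeline is independent and monotonically increasing.

Check 1: for a nice 0 process (CFS priority 120), virtual clock == physical clock

The test process has CFS priority 120 and is pinned to one core so that it never migrates. Scheduling information is captured at four points in time; se.vruntime and se.sum_exec_runtime are the virtual clock and the actual execution clock, and their values relate as shown in the table below:

Check 2: for a nice -20 process (CFS priority 100), the ratio of virtual clock to physical clock is the inverse of the weight ratio:

The growth rate is computed as:

$\frac{se2.vruntime - se1.vruntime}{se2.sum\_exec\_runtime-se1.sum\_exec\_runtime}$

The growth rate of virtual time relative to real time is 1, which shows that for a CFS process at nice 0 the virtual clock equals the physical clock.

The animation below shows the relationship between the virtual clock and the actual clock, with se.vruntime on the horizontal axis and se.sum_exec_runtime on the vertical axis. The plot shows that the relationship is linear. Moreover, when the smallest tick intervals of the two axes change in step, the line makes an angle close to 45 degrees with the axes (the diagonal of a square), so the slope is 1. Hence for a CFS process at nice 0 the virtual clock equals the physical clock.

In practice a test cannot start when the CPU run queue's time is 0 (only the very first process to run on that CPU enjoys that), so by the time the test process starts, the virtual time of the CPU's run queue is already nonzero. Strictly speaking the line therefore crosses the positive X axis, and the X coordinate of the intersection is the run queue's virtual clock value at the moment the process started.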

$ taskset -c 1 stress -c 1

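The figure below shows three CPU-bound tasks under the CFS policy with nice values -20, 0 and 19, and how their virtual time changes relative to real time as they run. The blue line is the highest-priority process at nice -20, the green line is the low-priority process at nice 19, and the orange line is the medium-priority process at nice 0. The following patterns are visible:

1. For all three processes the virtual clock and the physical clock are linearly related, so the rate of change is constant.

2. The physical clock of the high-priority process grows fastest and has the largest slope: the blue line's slope is about 86 times the orange line's, and the orange line's is about 68 times the green line's, in proportion to the weights.

3. At the same virtual clock value, the higher a process's weight, the more real time it has actually executed.

4. Comparing the growth of the orange and green lines is interesting. The green line is the lowest-priority process, so each time it is scheduled and receives a bit of real execution time, its virtual clock takes a big step forward. As a result, although the orange process appears to run all the time, its virtual clock is constantly chasing the green one. The same holds between the blue and orange processes, except that the two have more scheduling points relative to each other and the plot's resolution is too low to show it precisely.

5. Thanks to its weight advantage, a process that starts late still overtakes lower-priority processes in actual execution time.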

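vruntime of the run queue advances in inverse proportion to the number of processes

The more processes on the run queue, the more slowly vruntime grows. The experiment:

With MAX_ENTRY set to 80 and MAX_TICKS_TEST set to 1000000UL, the result is:

Keeping MAX_TICKS_TEST unchanged and doubling MAX_ENTRY to 160:

When the number of processes doubles, vruntime goes from 115014.051033 to 57507.025516, i.e. it is halved. So the more tasks on the run queue, the more slowly vruntime increases.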

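Kernel scheduler clock

The CPU scheduling clock is obtained by calling sched_clock and is measured in nanoseconds. The run queue records the current scheduling clock of its CPU, rather than calling sched_clock again at the moment a virtual clock is computed. The aim is to avoid charging time spent inside the scheduler to the current process's virtual time, which improves fairness.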

 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
 gnome-terminal--4349    [002] d... 10984.592878: update_rq_clock <-try_to_wake_up
 gnome-terminal--4349    [002] d... 10984.592880: <stack trace>
 => update_rq_clock
 => try_to_wake_up
 => wake_up_process
 => insert_work
 => __queue_work
 => queue_work_on
 => tty_insert_flip_string_and_push_buffer
 => pty_write
 => n_tty_write
 => tty_write
 => __vfs_write
 => vfs_write
 => ksys_write
 => __x64_sys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
          <idle>-0       [006] dN.. 10984.592884: update_rq_clock <-__schedule
          <idle>-0       [006] dN.. 10984.592885: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
   kworker/u16:2-8356    [006] d... 10984.592889: update_rq_clock <-try_to_wake_up
   kworker/u16:2-8356    [006] d... 10984.592890: <stack trace>
 => update_rq_clock
 => try_to_wake_up
 => default_wake_function
 => pollwake
 => __wake_up_common
 => __wake_up_common_lock
 => __wake_up
 => n_tty_receive_buf_common
 => n_tty_receive_buf2
 => tty_ldisc_receive_buf
 => tty_port_default_receive_buf
 => flush_to_ldisc
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork
   kworker/u16:2-8356    [006] d... 10984.592893: update_rq_clock <-__schedule
   kworker/u16:2-8356    [006] d... 10984.592894: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => worker_thread
 => kthread
 => ret_from_fork
 gnome-terminal--4349    [002] d... 10984.592906: update_rq_clock <-__schedule
 gnome-terminal--4349    [002] d... 10984.592907: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => schedule_hrtimeout_range
 => poll_schedule_timeout.constprop.13
 => do_sys_poll
 => __x64_sys_poll
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
          <idle>-0       [003] dN.. 10984.592932: update_rq_clock <-__schedule
          <idle>-0       [003] dN.. 10984.592934: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
            bash-11236   [003] d... 10984.592947: update_rq_clock <-try_to_wake_up
            bash-11236   [003] d... 10984.592948: <stack trace>
 => update_rq_clock
 => try_to_wake_up
 => wake_up_process
 => insert_work
 => __queue_work
 => queue_work_on
 => tty_insert_flip_string_and_push_buffer
 => pty_write
 => do_output_char
 => n_tty_write
 => tty_write
 => __vfs_write
 => vfs_write
 => ksys_write
 => __x64_sys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
          <idle>-0       [006] dN.. 10984.592989: update_rq_clock <-__schedule
          <idle>-0       [006] dN.. 10984.592991: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
   kworker/u16:2-8356    [006] d... 10984.592993: update_rq_clock <-try_to_wake_up
   kworker/u16:2-8356    [006] d... 10984.592994: <stack trace>
 => update_rq_clock
 => try_to_wake_up
 => default_wake_function
 => pollwake
 => __wake_up_common
 => __wake_up_common_lock
 => __wake_up
 => n_tty_receive_buf_common
 => n_tty_receive_buf2
 => tty_ldisc_receive_buf
 => tty_port_default_receive_buf
 => flush_to_ldisc
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork
   kworker/u16:2-8356    [006] d... 10984.592996: update_rq_clock <-__schedule
          <idle>-0       [002] dN.. 10984.592997: update_rq_clock <-__schedule
   kworker/u16:2-8356    [006] d... 10984.592997: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => worker_thread
 => kthread
 => ret_from_fork
          <idle>-0       [002] dN.. 10984.592998: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
 gnome-terminal--4349    [002] d... 10984.593031: update_rq_clock <-__schedule
 gnome-terminal--4349    [002] d... 10984.593032: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => schedule_hrtimeout_range
 => poll_schedule_timeout.constprop.13
 => do_sys_poll
 => __x64_sys_poll
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
            bash-11236   [003] d... 10984.593059: update_rq_clock <-task_fork_fair
            bash-11236   [003] d... 10984.593060: <stack trace>
 => update_rq_clock
 => task_fork_fair
 => sched_fork
 => copy_process
 => _do_fork
 => __x64_sys_clone
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
            bash-11236   [003] d... 10984.593138: update_rq_clock <-cpu_cgroup_fork
            bash-11236   [003] d... 10984.593139: <stack trace>
 => update_rq_clock
 => cpu_cgroup_fork
 => cgroup_post_fork
 => copy_process
 => _do_fork
 => __x64_sys_clone
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
            bash-11236   [003] d... 10984.593141: update_rq_clock <-wake_up_new_task
            bash-11236   [003] d... 10984.593141: <stack trace>
 => update_rq_clock
 => wake_up_new_task
 => _do_fork
 => __x64_sys_clone
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
          <idle>-0       [006] dN.. 10984.593198: update_rq_clock <-__schedule
          <idle>-0       [006] dN.. 10984.593199: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
            bash-11236   [003] d... 10984.593203: update_rq_clock <-__schedule
            bash-11236   [003] d... 10984.593204: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => do_wait
 => kernel_wait4
 => __do_sys_wait4
 => __x64_sys_wait4
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
          <idle>-0       [001] d.h. 10984.593287: update_rq_clock <-try_to_wake_up
          <idle>-0       [001] d.h. 10984.593289: <stack trace>
 => update_rq_clock
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => cpuidle_enter_state
 => cpuidle_enter
 => call_cpuidle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
          <idle>-0       [001] dN.. 10984.593292: update_rq_clock <-__schedule
          <idle>-0       [001] dN.. 10984.593292: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule_idle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
            Xorg-3128    [001] d... 10984.593298: update_rq_clock <-__schedule
            Xorg-3128    [001] d... 10984.593299: <stack trace>
 => update_rq_clock
 => __schedule
 => schedule
 => schedule_hrtimeout_range_clock
 => schedule_hrtimeout_range
 => ep_poll
 => do_epoll_wait
 => __x64_sys_epoll_wait
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
root@zlcao-RedmiBook-14:/sys/kernel/debug/tracing#

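CFS algorithm simulation

Below is a simulation program modeled on the kernel's CFS scheduling algorithm. It manages the task queue in two ways, with a red-black tree and with a linked list, and supports adding tasks dynamically: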

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <float.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include "rbtree.h"

#define CFS_USE_RB_TREE
// A CFS Scheduler CModel.
#define DBG(fmt, ...) do { printf("%s line %d, "fmt, __func__, __LINE__, ##__VA_ARGS__); } while (0)
#define assert(expr) \
    if (!(expr)) { \
        printf("Assertion failed! %s,%s,%s,line=%d\n",\
            #expr,__FILE__,__func__,__LINE__); \
        while(1); \
    }

static pthread_mutex_t cfs_mutex;
double min_vruntime = 0.0f;
void update_min_vruntime(double vruntime)
{
	min_vruntime = vruntime;
}
/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const double sched_prio_to_weight[40] = {
#if 1
	/* -20 */     88761,     71755,     56483,     46273,     36291,
	/* -15 */     29154,     23254,     18705,     14949,     11916,
	/* -10 */      9548,      7620,      6100,      4904,      3906,
	/*  -5 */      3121,      2501,      1991,      1586,      1277,
	/*   0 */      1024,       820,       655,       526,       423,
	/*   5 */       335,       272,       215,       172,       137,
	/*  10 */       110,        87,        70,        56,        45,
	/*  15 */        36,        29,        23,        18,        15,
#else
	/* -20 */     10,     10,     10,     10,     10,
	/* -15 */     10,     10,     10,     10,     10,
	/* -10 */     10,     10,     10,     10,     10,
	/*  -5 */     10,     10,     10,     10,     10,
	/*   0 */     10,     10,     10,     10,     10,
	/*   5 */     10,     10,     10,     10,     10,
	/*  10 */     10,     10,     10,     10,     10,
	/*  15 */     10,     10,     10,     10,     10,
#endif
};

typedef struct sched_entity {
	struct rb_node node;
	int prio;
	int pid;
	double weight;
	double vruntime;
	double realtime;
	int ctx_switch;
	int on_rq;
} sched_entity_t;

struct rb_root cfs_root_tree = RB_ROOT;
int sched_period(void)
{
	return rand() % 10 + 1;
}

#define MAX_ENTRY 80
//#define SCHED_PERIOD 100
#define SCHED_PERIOD sched_period()
static sched_entity_t *tasks;

// https://zhuanlan.zhihu.com/p/673572911
// in linux kernel, cfs_rq->load used for statistic the total
// weight in rq, refer update_load_add / update_load_sub.
double caculate_total_weight(void)
{
	double total = 0.0;
#if 1
	struct rb_node *node;
	sched_entity_t *task;

	for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {
		task = rb_entry(node, sched_entity_t, node);
		total += task->weight;
	}
#else
	int i;
	for (i = 0; i < MAX_ENTRY; i ++) {
		total += tasks[i].weight;
	}
#endif

	return total;
}

double caculate_realtime(int task)
{
	double real = 0.0;

	real = SCHED_PERIOD * tasks[task].weight / caculate_total_weight();

	//printf("%s line %d, real %f.\n",  __func__, __LINE__, real);
	return real;
}

double caculate_vruntime(sched_entity_t *task, int deltatime)
{
	double vruntime = 0.0;

	vruntime = deltatime * sched_prio_to_weight[20] / task->weight ;

	//printf("%s line %d, vruntime %f.\n",  __func__, __LINE__, vruntime);
	return vruntime;
}

#define MAX_TICKS_TEST  100000000UL
double vruntime_total(unsigned long worldtime)
{
	return worldtime * sched_prio_to_weight[20] / caculate_total_weight();
}

double realtime_total(sched_entity_t *task, unsigned long worldtime)
{
#if 1
	return worldtime * task->weight / caculate_total_weight();
#else
	return vruntime_total(worldtime) * task->weight / sched_prio_to_weight[20];
#endif
}

static int compare_prio(sched_entity_t *task1, sched_entity_t *task2)
{
#if 0
	if (task1->vruntime < task2->vruntime) {
		// task1 prior than task2.
		return -1;
	} else if (task1->vruntime > task2->vruntime) {
		// task2 prior than task1.
		return 1;
	} else {
		if (task1->weight > task2->weight) {
			return -1;
		} else {
			// task2 prior than task1.
			return 1;
		}
	}
#else
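	// Smaller vruntime sorts first (leftmost); on a vruntime tie the heavier
	// task sorts first. This never returns 0, so fully equal keys go to the
	// right and cfs_rq_insert() never rejects a task.
	double res = (task1->vruntime == task2->vruntime) ? task1->weight - task2->weight : task2->vruntime - task1->vruntime;
	return res > 0.0 ? -1 : 1;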
#endif
}

static int cfs_rq_delete(struct rb_root *root, sched_entity_t *task)
{
	rb_erase(&task->node, root);
	return !RB_EMPTY_ROOT(root);
}

static int cfs_rq_insert(struct rb_root *root,  sched_entity_t *task)
{
	int ret;
	struct rb_node **tmp = &(root->rb_node), *parent = NULL;

	/* Figure out where to put new node */
	while (*tmp) {

		sched_entity_t *this = rb_entry(*tmp, sched_entity_t, node);

		parent = *tmp;

		ret = compare_prio(task, this);

		if (ret < 0)
			tmp = &((*tmp)->rb_left);
		else if (ret > 0)
			tmp = &((*tmp)->rb_right);
		else
			return -1;
	}

	/* Add new node and rebalance tree. */
	rb_link_node(&task->node, parent, tmp);
	rb_insert_color(&task->node, root);

	return 0;
}

static void cfs_rq_destroy(struct rb_root *root)
{
	struct rb_node *node, *next;
	sched_entity_t *task;

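	node = rb_first(root);
	while (node) {
		next = rb_next(node);
		task = rb_entry(node, sched_entity_t, node);
		rb_erase(node, root);
		// Tasks created by fork_thread() (pid >= MAX_ENTRY) were malloc'ed
		// one by one; the initial ones live in the tasks[] array.
		if (task->pid >= MAX_ENTRY)
			free(task);
		node = next;
	}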

	if (!RB_EMPTY_ROOT(root)) {
		printf("%s line %d, rb is not empty.\n", __func__, __LINE__);
	}

	return;
}

void print_rbtree(struct rb_root *tree)
{
	struct rb_node *node;
	sched_entity_t *task;

	for (node = rb_first(tree); node; node = rb_next(node)) {
		task = rb_entry(node, sched_entity_t, node);
		printf("%s line %d, task(%d) prio %d,weight %f vruntime %f, on rq %d.\n",
		       __func__, __LINE__, task->pid, task->prio, task->weight, task->vruntime, task->on_rq);
	}

	return;
}

void init_cfs_rbtree(void)
{
	int i;

	for (i = 0; i < MAX_ENTRY; i ++) {
		tasks[i].on_rq = 1;
		cfs_rq_insert(&cfs_root_tree, &tasks[i]);
	}

	print_rbtree(&cfs_root_tree);
	return;
}

#ifdef CFS_USE_RB_TREE
// O(log n) scheduler based on the rbtree.
sched_entity_t *schedule(void)
{
	struct rb_node *node;
	sched_entity_t *task;

	// The leftmost node holds the smallest vruntime.
	node = rb_first(&cfs_root_tree);
	if (!node)
		return NULL;	// empty run queue
	task = rb_entry(node, sched_entity_t, node);

	return task;
}

#else
// An O(n) linear scheduler implementation.
sched_entity_t *schedule(void)
{
	int i;
	int taskid = -1;
	double minruntime = DBL_MAX;

	// Scheduling policy:
	// 1. Find the task with the minimum vruntime.
	// 2. If several tasks share the minimum vruntime, select the one
	//    with the largest weight.
	for (i = 0; i < MAX_ENTRY; i ++) {
		if (minruntime > tasks[i].vruntime) {
			minruntime = tasks[i].vruntime;
			taskid = i;
		} else if (minruntime == tasks[i].vruntime) {
			if (tasks[i].weight > tasks[taskid].weight) {
				taskid = i;
			}
		}
	}

	return &tasks[taskid];
}
#endif

double list_task_info(unsigned long worldtime)
{
	double total = 0.0;

	printf("==================================================================================================================================================================\n");
#if 1
	struct rb_node *node;
	sched_entity_t *task;

	for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {
		task = rb_entry(node, sched_entity_t, node);

		total += task->realtime;
		printf("task(pid%d) vruntime %f, realtime %f, prio %d, weight %f, switches %d, ideal real time %f.\n",
		       task->pid, task->vruntime, task->realtime, task->prio, task->weight, task->ctx_switch,
		       realtime_total(task, worldtime));
	}
#else
	int i;
	for (i = 0; i < MAX_ENTRY; i ++) {
		double ratio = 0.0;

		if (i > 0) {
			ratio = (tasks[i - 1].realtime - tasks[i].realtime) / tasks[i].realtime;

		}
		total += tasks[i].realtime;
		printf("task %d(pid%d) vruntime %f, realtime %f, prio %d, weight %f, increase ratio %f, switches %d, ideal real time %f.\n",
		       i, tasks[i].pid, tasks[i].vruntime, tasks[i].realtime, tasks[i].prio, tasks[i].weight, ratio * 100, tasks[i].ctx_switch,
		       realtime_total(&tasks[i], worldtime));
	}
#endif
	double staticis(void);
	// staticis() returns the standard deviation of vruntime across the tree.
	printf("vruntime stddev %f.\n", staticis());
	printf("==================================================================================================================================================================\n");
	return total;
}

static void *fork_thread(void *arg)
{
	sched_entity_t *task;
	unsigned long *pworldtime = (unsigned long *)arg;
	static unsigned int pid = MAX_ENTRY;
	int i;

	while (1) {
		task = malloc(sizeof(sched_entity_t));
		if (!task) {
			// Out of memory: skip this round and retry later.
			sleep(1);
			continue;
		}
		memset(task, 0x00, sizeof(sched_entity_t));

		i = rand() % 40;
		pthread_mutex_lock(&cfs_mutex);
		task->prio = -20 + i;
		task->weight = sched_prio_to_weight[i];
		//task->vruntime = vruntime_total(*pworldtime) +1.5f;
		task->vruntime = min_vruntime + 1.5f;
		task->realtime = 0;
		task->ctx_switch = 0;
		task->pid = pid ++;
		task->on_rq = 0;
		cfs_rq_insert(&cfs_root_tree, task);
		task->on_rq = 1;
		pthread_mutex_unlock(&cfs_mutex);

		sleep(1);
	}

	return NULL;
}

static void *exit_thread(void *arg)
{
	unsigned long *pwordtime = (unsigned long *)arg;

	while (1) {
		sleep(1);
	}
	return NULL;
}

double staticis(void)
{
	double statics = 0.0f;
	double average = 0.0f;
	double total = 0.0f;
	int count = 0;
	struct rb_node *node;
	sched_entity_t *task;

	for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {
		task = rb_entry(node, sched_entity_t, node);

		total += task->vruntime;
		count ++;
	}
	// Avoid dividing by zero when the tree is empty.
	if (count == 0)
		return 0.0;
	average = total / count;
	for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {
		task = rb_entry(node, sched_entity_t, node);

		statics += (average - task->vruntime) * (average - task->vruntime);
	}

	return sqrt(statics / count);
}
int main(void)
{
	int i;
	unsigned long ticks;
	double total = 0.0;
	unsigned long worldtime = 0;
	pthread_t t1, t2;

	pthread_mutex_init(&cfs_mutex, NULL);

	tasks = malloc(sizeof(sched_entity_t) * MAX_ENTRY);
	// Check the allocation before touching the memory.
	if (!tasks) {
		printf("%s line %d, fatal error, alloc failure.\n",
		       __func__, __LINE__);
		return -1;
	}
	memset(tasks, 0x00, sizeof(sched_entity_t) * MAX_ENTRY);

	srand((unsigned int)time(0));
	for (i = 0; i < MAX_ENTRY; i ++) {
		tasks[i].prio = -20 + i % 40;
		tasks[i].weight = sched_prio_to_weight[i % 40];
		tasks[i].vruntime = 0;
		tasks[i].realtime = 0;
		tasks[i].ctx_switch = 0;
		tasks[i].pid = i;
		tasks[i].on_rq = 0;
	}

#ifdef CFS_USE_RB_TREE
	init_cfs_rbtree();
#endif

	// All vruntimes start at 0, so the heaviest task (index 0) should be picked first.
	printf("%s line %d, first schedule select %td.\n", __func__, __LINE__, schedule() - tasks);

	pthread_create(&t1, NULL, fork_thread, &worldtime);
	pthread_create(&t2, NULL, exit_thread, &worldtime);
	for (ticks = 0; /* ticks < MAX_TICKS_TEST */ 1; ticks ++) {
		double deltatime, vruntime;
		sched_entity_t *task;

		pthread_mutex_lock(&cfs_mutex);
		task = schedule();

#ifdef CFS_USE_RB_TREE
		cfs_rq_delete(&cfs_root_tree, task);
		task->on_rq = 0;

#endif
		deltatime = SCHED_PERIOD;

		vruntime = caculate_vruntime(task, deltatime);

		task->vruntime += vruntime;
		task->realtime += deltatime;
		task->ctx_switch ++;
		worldtime += deltatime;
		update_min_vruntime(task->vruntime);

#ifdef CFS_USE_RB_TREE
		// Dequeue and re-enqueue to reorder the CFS rbtree ready queue after the
		// vruntime update. The Linux kernel does the same in put_prev_entity()
		// (enqueue) and set_next_entity() (dequeue); the only difference is that
		// in the kernel on_rq stays set across those calls.

		cfs_rq_insert(&cfs_root_tree, task);
		task->on_rq = 1;
#endif
		if (ticks % 1000 == 0) {
			list_task_info(worldtime);
		}
		pthread_mutex_unlock(&cfs_mutex);
	}

	pthread_mutex_lock(&cfs_mutex);
	total = list_task_info(worldtime);
	assert(total == worldtime);
	printf("vruntime %f.\n", vruntime_total(worldtime));
	pthread_mutex_unlock(&cfs_mutex);

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

#ifdef CFS_USE_RB_TREE
	print_rbtree(&cfs_root_tree);
	cfs_rq_destroy(&cfs_root_tree);
#endif
	free(tasks);
	return 0;
}

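Simulation result: with roughly 1000 tasks scheduled by the CFS algorithm, the spread of VRUNTIME reported by the simulator (the standard deviation computed in staticis()) stays at about 70. In the captured run, the VRUNTIME of every task falls within [308177.014345, 308736.000000]. Against a spread of that size, the progress that VRUNTIME represents is tightly clustered, which shows that the CFS policy does look after tasks of every priority and keeps their progress within a dynamically consistent range.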

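Implementation of the simulated scheduler:

The scheduler picks, from the task array or the red-black tree, the task with the smallest virtual runtime as the next one to run. If two tasks have the same virtual runtime, it picks the one with the larger weight.

A scheduler is not as arcane as one might imagine. In Linux it is essentially a function whose sole job is to apply the scheduling algorithm and select, from the ready queue, the next task to put on the CPU.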
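The selection rule and the vruntime update can be condensed into a few lines. The following is a minimal sketch, not the simulator above: `struct mini_task`, `pick_next`, `run_cfs` and `cpu_share` are hypothetical names introduced only for this illustration, and it assumes a plain array instead of a red-black tree.

```c
#include <assert.h>
#include <stddef.h>

#define NICE_0_WEIGHT 1024.0

/* Hypothetical, stripped-down task record used only in this sketch. */
struct mini_task {
	double weight;    /* load weight, e.g. 1024 for nice 0 */
	double vruntime;  /* weighted progress */
	double realtime;  /* wall-clock time actually received */
};

/* Pick the task with the smallest vruntime; on a tie prefer the heavier one. */
static size_t pick_next(const struct mini_task *t, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++) {
		if (t[i].vruntime < t[best].vruntime ||
		    (t[i].vruntime == t[best].vruntime &&
		     t[i].weight > t[best].weight))
			best = i;
	}
	return best;
}

/* Run the picker for `ticks` slices of `slice` time units each. */
static void run_cfs(struct mini_task *t, size_t n, unsigned long ticks, double slice)
{
	for (unsigned long k = 0; k < ticks; k++) {
		struct mini_task *cur = &t[pick_next(t, n)];

		cur->realtime += slice;
		/* Heavier tasks accumulate vruntime more slowly. */
		cur->vruntime += slice * NICE_0_WEIGHT / cur->weight;
	}
}

/* Fraction of the total CPU time that task i received. */
static double cpu_share(const struct mini_task *t, size_t n, size_t i)
{
	double total = 0.0;

	for (size_t k = 0; k < n; k++)
		total += t[k].realtime;
	return total > 0.0 ? t[i].realtime / total : 0.0;
}
```

Running `run_cfs` on three tasks with weights 1024, 335 and 110 for a large number of slices should leave each task's `cpu_share` close to its weight divided by the total weight, while the vruntime values stay within one maximum increment (1024/110 here) of each other.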

To check the scheduling policy of a Linux process (PID 1 here):

$ sudo chrt -p 1
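The same query can be made from a program. The sketch below uses the POSIX call `sched_getscheduler()`; the helper names `policy_name` and `query_policy` are hypothetical and exist only for this illustration. Note that user space calls the default policy SCHED_OTHER, which corresponds to SCHED_NORMAL inside the kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes SCHED_BATCH / SCHED_IDLE on glibc */
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/* Map a policy number returned by sched_getscheduler() to its name. */
static const char *policy_name(int policy)
{
#ifdef SCHED_RESET_ON_FORK
	/* The kernel may OR this flag into the returned policy value. */
	policy &= ~SCHED_RESET_ON_FORK;
#endif
	switch (policy) {
	case SCHED_OTHER: return "SCHED_OTHER"; /* SCHED_NORMAL in the kernel */
	case SCHED_FIFO:  return "SCHED_FIFO";
	case SCHED_RR:    return "SCHED_RR";
#ifdef SCHED_BATCH
	case SCHED_BATCH: return "SCHED_BATCH";
#endif
#ifdef SCHED_IDLE
	case SCHED_IDLE:  return "SCHED_IDLE";
#endif
	default:          return "unknown";
	}
}

/*
 * Return the policy name of `pid` (0 means the calling process),
 * or NULL if the query fails, e.g. because the process does not exist.
 */
static const char *query_policy(pid_t pid)
{
	int policy;

	assert(pid >= 0);
	policy = sched_getscheduler(pid);
	return policy < 0 ? NULL : policy_name(policy);
}
```

Calling `query_policy(1)` asks about PID 1, as the `chrt` command above does; an ordinary process is expected to report SCHED_OTHER, meaning it is handled by CFS.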

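Can vruntime overflow?

The kernel records virtual time in a 64-bit unsigned integer, whose maximum value is 0xffffffffffffffff.

If the counter advanced once per millisecond, overflowing 64 bits would take (unsigned long)-1 / 31536000000 ms per year = 584942417 years. Because the rate at which virtual time passes is inversely proportional to weight, the task with the smallest weight has the fastest virtual clock. In the kernel's weight table a NICE 19 task has weight 15, so its virtual clock runs 1024/15 = 68.3 times faster than wall-clock time. Even if a NICE 19 task ran on the CPU without interruption, overflowing its virtual time would take 584942417/68.3 = 8564310 years, about 8.56 million years, far longer than human civilization has existed.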

$\frac{vruntime_i}{realtime_i} = \frac{1024}{W_i}$

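Taking my PC as an example, I measured the time granularity of the value returned by ktime_get_raw.

The measurement shows that ktime_get_raw reports time at nanosecond granularity. The TSC that backs the kernel's high-resolution timekeeping counts at a rate tied to the CPU clock, so it can resolve intervals of a nanosecond or less. The following call trace shows the timer tick reading the TSC: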

 => read_tsc
 => tk_clock_read
 => timekeeping_get_delta
 => timekeeping_get_ns
 => ktime_get
 => tick_sched_timer
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => cpuidle_enter_state
 => cpuidle_enter
 => call_cpuidle
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => secondary_startup_64
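A similar granularity check can be tried from user space without touching the kernel. The sketch below assumes that CLOCK_MONOTONIC_RAW is an acceptable stand-in for the raw monotonic clock read inside the kernel; that mapping is my assumption, not something the measurement above establishes. The helper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for CLOCK_MONOTONIC_RAW on older toolchains */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a nanosecond count. */
static int64_t timespec_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Read CLOCK_MONOTONIC_RAW as nanoseconds. */
static int64_t raw_now_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);

	assert(rc == 0);
	(void)rc;
	return timespec_to_ns(&ts);
}

/* Resolution the system advertises for the clock, in nanoseconds. */
static int64_t advertised_resolution_ns(void)
{
	struct timespec res;
	int rc = clock_getres(CLOCK_MONOTONIC_RAW, &res);

	assert(rc == 0);
	(void)rc;
	return timespec_to_ns(&res);
}

/*
 * Smallest non-zero step seen between back-to-back reads.
 * Returns INT64_MAX if samples <= 0.
 */
static int64_t smallest_observed_step_ns(int samples)
{
	int64_t best = INT64_MAX;

	for (int i = 0; i < samples; i++) {
		int64_t a = raw_now_ns();
		int64_t b;

		/* Spin until the clock value changes. */
		do {
			b = raw_now_ns();
		} while (b == a);
		if (b - a < best)
			best = b - a;
	}
	return best;
}
```

The observed step includes the cost of the `clock_gettime` call itself, so it is an upper bound on the clock's granularity rather than an exact figure.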

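The same conclusion follows from the kernel's ktime_to_ns implementation: in current kernels ktime_t is a signed 64-bit count of nanoseconds, and ktime_to_ns returns it unchanged.

If the counter advances once per nanosecond, a 64-bit value overflows after about 585 years. At the accelerated rate of the lowest weight, NICE 19, it still takes about 8.6 years to overflow.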

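At the other extreme, the largest weight (nice -20, weight 88761) makes virtual time pass 86.681 times slower than wall-clock time according to the formula above. For such a task the real clock grows faster than the virtual clock, so a real-time counter of the same width would overflow long before the virtual one, and the case needs no consideration. All of these figures also assume that the queue holds a single runnable task. As analyzed earlier, the more runnable tasks a queue holds, the more slowly its virtual time advances. In practice one CPU's queue usually serves several tasks, so VRUNTIME advances many times more slowly than the theoretical figure.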
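The arithmetic above can be reproduced with a small helper. This is only a sketch under the same assumptions as the text (a single runnable task that never leaves the CPU, and a 365-day year); `years_to_wrap` and the macro names are hypothetical.

```c
#include <assert.h>

#define NICE_0_WEIGHT    1024.0
#define SECONDS_PER_YEAR (365.0 * 24.0 * 3600.0)
#define U64_RANGE        18446744073709551616.0	/* 2^64 */

/*
 * Years of uninterrupted running before a 64-bit vruntime wraps.
 *
 * unit_seconds: wall-clock length of one vruntime unit
 *               (1e-3 for milliseconds, 1e-9 for nanoseconds)
 * weight:       load weight of the only runnable task
 */
static double years_to_wrap(double unit_seconds, double weight)
{
	double speedup;

	assert(unit_seconds > 0.0 && weight > 0.0);
	/* vruntime / realtime = 1024 / weight */
	speedup = NICE_0_WEIGHT / weight;
	return U64_RANGE * unit_seconds / SECONDS_PER_YEAR / speedup;
}
```

For example, `years_to_wrap(1e-3, 1024.0)` gives roughly 585 million years, `years_to_wrap(1e-9, 15.0)` gives roughly 8.6 years for a nice 19 task, and `years_to_wrap(1e-9, 88761.0)` gives roughly 50,700 years for a nice -20 task.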

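That is far longer than the time since the world's first UNIX system started running. Even counted in nanoseconds, overflow takes 585 years, which is like a server that has been running since the middle of the Ming dynasty, while human civilization passed through the darkness of the Middle Ages, the dawn of modern science, and four industrial revolutions to reach where it is today.

Given these numbers, worrying about vruntime overflow in the kernel may look like needless anxiety. But computing has taught us the power of exponential growth more than once. The remark often attributed to Bill Gates, "640K of memory ought to be enough for anybody", is the usual example: the storage each of us carries around today is many millions of times larger. Forecasts have limits, and even the smartest people cannot always predict the future; even a pioneer of the industry can underestimate the pace of innovation and the growing demand for memory. The kernel is developed and maintained by experienced and rather more humble people, and their code reflects that humility toward the way computing evolves. Take VRUNTIME: the kernel developers clearly do not regard its overflow as something impossibly remote, so when a runqueue is initialized they deliberately seed min_vruntime with a value close to overflow, (u64)(-(1LL << 20)), which makes the wrap happen shortly after boot and exposes any problem early.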

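The kernel's default value is too close to overflow to observe comfortably, so we initialize vruntime with 0xf000000000000000 instead, a value that is large but still far from wrapping. Note that the change has to cover both kinds of entity, task_group and rq. After rebuilding the kernel with this change, the vruntime of every process shows up as a negative number, which indicates that a process's VRUNTIME is derived from rq->min_vruntime.

The max_vruntime implementation shows how this is handled. Its two parameters are unsigned, but the comparison is done on the signed difference, so it can tell that 0 is "greater" than 0xfxxxxxxxxxxxxxxx. Viewed as an unsigned counter that wraps around, vruntime is always increasing.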

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

typedef long long __s64;
typedef unsigned long long __u64;

typedef __u64 u64;
typedef __s64 s64;

static u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max_vruntime);
	if (delta > 0)
		max_vruntime = vruntime;

	return max_vruntime;
}

int main(void)
{
	/* Ordinary case: both values are large and the later one wins. */
	u64 data1 = 0xf000000000000000;
	u64 data2 = 0xf000000000000001;

	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data1, data2), data2-data1);
	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data2, data1), data1-data2);

	/* Wrapped case: 0 is treated as later than 0xf000000000000000. */
	data1 = 0xf000000000000000;
	data2 = 0x0000000000000000;

	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data1, data2), data2-data1);
	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data2, data1), data1-data2);

	/* Small values far from any wrap. */
	data1 = 0x0000000000000000;
	data2 = 0x0000000000000008;

	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data1, data2), data2-data1);
	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data2, data1), data1-data2);

	/* Crossing the sign-bit boundary of the 64-bit value. */
	data1 = 0x7000000000000000;
	data2 = 0x8000000000000000;

	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data1, data2), data2-data1);
	printf("max is 0x%llx, 0x%llx.\n", max_vruntime(data2, data1), data1-data2);
	return 0;
}

So the kernel does take VRUNTIME overflow into account and makes sure it is handled correctly when it happens.
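To see why the near-overflow seed is harmless, it helps to compare a signed-difference ordering with a naive unsigned one right at the wrap point. The sketch below is illustrative only: `vruntime_before`, `naive_before`, `near_wrap_seed` and `wrap_demo` are hypothetical names, and the cast from an unsigned to a signed 64-bit value relies on the two's-complement behaviour that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * True when vruntime a is earlier than b, even across a 64-bit wrap.
 * Only meaningful while the two values are less than 2^63 apart.
 */
static int vruntime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* The naive unsigned comparison, shown only for contrast. */
static int naive_before(uint64_t a, uint64_t b)
{
	return a < b;
}

/* A seed 2^20 units below the wrap point, as in (u64)(-(1LL << 20)). */
static uint64_t near_wrap_seed(void)
{
	return (uint64_t)(-(1LL << 20));
}

/* Advance the seed past the wrap point and return the wrapped value. */
static uint64_t wrap_demo(void)
{
	uint64_t start = near_wrap_seed();
	uint64_t later = start + (1ULL << 21);	/* wraps around to 2^20 */

	/* The signed-difference test still sees `start` as earlier. */
	assert(vruntime_before(start, later));
	/* The naive test gets the order backwards after the wrap. */
	assert(!naive_before(start, later));
	return later;
}
```

After the wrap `later` is numerically tiny, yet the signed-difference comparison still orders it after `start`, which is the property the max_vruntime listing above relies on.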