Linux scheduler(cfs) introduction

最新推荐文章于 2024-02-06 21:55:25 发布

glmwu

最新推荐文章于 2024-02-06 21:55:25 发布

阅读量2.2k

点赞数

分类专栏：进程调度文章标签： Linux kernel cfs schedule load balance

本文链接：https://blog.csdn.net/glmwu/article/details/19937425

版权

进程调度专栏收录该内容

1 篇文章 0 订阅

订阅专栏

公司让做个cfs相关的培训，整理了个ppt，图片均来自网络，源作者在此不一一说明，深表歉意~~

The Outline

Basic concepts about Linux process & thread
Basic concepts about SMP
Linux bootup with BP and how to boot AP
Completely Fair Schedule(CFS) and RT Sched
How to load balance
How to debug smp issue with oprofile and other tools

0. Basic Concepts

Program
   A program is a combination of intructions and data, which are put together to perform a task.

Process
   A process is an abstraction created to embody the state of a program during its execution. Therefore, a process can also be viewed as an instance of a program, or called “a running program”.
   API:    fork(clone->do_fork->copy_process->copy_flag(CLONE_*: FILES, VM, FS, SIGHAND,    NEWNS, NEWPID, NEWNET, NEWUTS(NEW* for namespace, LXC)...))
       exec, wait(for freeing zombie child, have to wait, it is better to handle SIGCHLD from    child process)

Thread
   LWP. A process can have multiple execution contexts that work together to accomplish its goal. These different execution contexts are called “threads”. These threads share same virtual address space with the process.
   API:    pthread_create(fork), pthread_attr_init, pthread_attr_setschedpolicy,    pthread_attr_setschedparam, pthread_attr_getschedparam, pthread_exit,    pthread_cancel, pthread_join(like as wait, free child resource), pthread_detach(self    free all resource)

Kernel Thread
   A thread without user-mode virtual address space. All instructions and data of a kernel thread are in Kernel VA Space, and in Linux they are usually linked together with kernel as a part of kernel image, such as kswapd, kflushd and ksoftirqd.
   API:    kthread_run(create+wakeup), kthread_create, kthread_stop

task_struct{
state, pid, mm, active_mm, prio, se, cpus_allowed, children, sibling, fs, files, signal, cgroups}

((struct thread_info *)(sp & ~(THREAD_SIZE – 1)))->task
THREAD_SIZE=8K

init process 0
       #define INIT_TASK(tsk) {   \
       .state       = 0,   \
       .thread_info       = &init_thread_info,   \
       ……
       .mm       = NULL,       \\\NULL for kernel task
       .active_mm       = &init_mm,       \
       ……
   }
In init_mm, pgd= swapper_pg_dir(0xC0004000)

Former 768 PGD entries for User VA Space
All User process shares the same kernel PGD.
init process 0 becomes “idle” process ( “idle” kernel thread).

1. Scheduler History

2.4 O(n)

One simple runqueue

nice and counter

2.6 O(1)

Kernel preemptible

Bitmap

runqueue(active/expire) for per-cpu and per-priority

static_prio and time slice

2. CFS Introduction

From kernel 2.6.23 version
For task runtime balance
Sched policy:SCHED_NORMAL, SCHED_FIFO, SCHED_RR, SCHED_IDLE, SCHED_BATCH(for cpu consume type)
sched entity, sched class
sched domain , sched group

3.1 priority

Normal task Priority
static_prio(nice:-20~19->100~139), prio(dynamic for Priority Inversion): low value for high priority
RT task priority
rt_priority(0~99): high value for high priority
normal_prio(0~139): unify for nomarl and rt, low value for high priority
nice(), setpriority() for normal task
sched_setscheduler() for RT task

3.2 SCHED_FIFO and SCHED_RR

pick_next_rt_task: Find the highest rt_priority task list by bitmap; pick the first task of the highest rt_priority task list, after running, put it to tail of list
SCHED_FIFO: run till initiative schedule or be preempted by higher priority rt task
SCHED_RR: run till initiative schedule or be preempted by higher priority rt task, or out of timeslice.
RT task(throttling): sched_rt_period_us(1000000us), sched_rt_runtime

4.1 RB Tree for Normal Task Organization

4.2 Node sequence of RB Tree

RB tree, O(log n)
The RB node sequence is determined by vruntime(The most left leaf is the lowest vruntime value)
The speed of Consuming vruntime is determined by prio
Constant arrary for static_prio to weight:
prio_to_weight[ ]/prio_to_wmult[ ](1 nice~10%weight)
vruntime = delta_exec * (NICE_0_LOAD / weight)

4.3 task_struct to RB tree in CFS

4.4 From rq to task

rq is per cpu variable

4.5 Sched_class for Sched Policy

fair_sched_class/rt_sched_class

struct sched_class

{

    const struct sched_class *next;
   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
   void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
   void (*yield_task) (struct rq *rq);
   struct task_struct * (*pick_next_task) (struct rq *rq);
   void (*put_prev_task) (struct rq *rq, struct task_struct *p);
   void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
   void (*post_schedule) (struct rq *this_rq);
   void (*task_waking) (struct rq *this_rq, struct task_struct *task);
   void (*task_woken) (struct rq *this_rq, struct task_struct *task);

   void (*switched_from) (struct rq *this_rq, struct task_struct *task,
                   int running);
   void (*switched_to) (struct rq *this_rq, struct task_struct *task,
                 int running);

...... }

4.6 Sched domain

4.7 Group and Domain

5. How to schedule

l Load Balance Rule

l 1) In cpu_domain level, all cpus share cache and cpu_power(power in cpu group), free to load balance

l 2) In core_domain level, all cpus share L2 cache, load balance when core domain is imbalance

l 3) In phys_domain level, must flush all cache

4) In numa_domain level, load balance costs much

5.1 Two Schedule Entry

l When no running task in current rq||initiative calling||sleep||from system space to user space, call schedule()

l When hrtimer(HZ) timeout, call scheduler_tick()

5.1.1 Schedule()

    1) Disable preempt
   2) cancel hrtimer
   3) clear flag of need resched
   4.1) If task is in unrunnable&&preemptible
       4.1.1) If has signal,set it to running state
       4.1.2) Else remove the task from the runqueue
   5) If no task on rq
       5.1) idle load balance(for each domain from down to up)
   6) put_prev_task
   7) pick_next_task
       7.1) if (rq->nr_running == rq->cfs.nr_running)//it is SCHED_NORMAL(fai sched class)
           call fair_sched_class.pick_next_task
       7.2) else
           for each class in class list from sched_class_highest, call
       class->pick_next_task, untill find task
   8) sched_info_switch
   9) context_switch
   10) Enable preempt

11) Check if need resched, if so go to 1)

5.1.2 How to pick and put task for fair Sched

pick_next_task(fair scheduler) on rq
       1) Find the rb_leftmost node on the cfs_rq
       2) get sched entity from rb node with rb_entry()
       3) if my_q of the se is null(it is a task)
           3.1) find it and dequeue entity from rq
       4) else, go to 1)
put_prev_task
       6.1) if it belongs to sched group(parent != null)
           it will be put in every iterator entity's rq from it to parent)
       6.2) else(parent == null)
           put it into current rq

5.1.3 How to pick and put task for RT Sched

pick_next_task_rt:
   Find the highest rt_priority task list by bitmap; pick the first task of the highest rt_priority task list.
put_prev_task_rt:
   //if task is in running state&&has allowed cpu
   if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
       1) del it from current node of list
       2) add it to tail of list

5.1.4 Scheduler_tick()

    1) update rq clock and load
   2) notify sched_class to update process's vruntime
   3) update timestamp of next load balance
   4) raise softirq for load balance

5.1.5 How to load balance

    1) find busiest group in the domain
   2) find busiest rq in the busiest group
   3) if running tasks >1 on the busiest rq
   3.1) Disable local irq and lock the rq
   3.2) move task from the busiest rq to current rq
   3.3) If this_cpu is not current cpu, send ipi to wake up this_cpu
   3.4) If this_cpu is set to all_pinned(affinity), clear it from busiest cpu mask
   3.5) Update schedule some flag and info
   3.6) return
   4) if running tasks <= 1 && moving task is fail
       4.1) if the busiest cpu needs active_balance
       4.2) set the busiest cpu's push_cpu with this_cpu
       4.3) wake up the busiest cpu's migration_thread to move task to other cpu, then    it can be in idle state.
       4.4) update some info
   5) if not active_balance for busiest cpu
       5.1)bring forward hrtimer by set min_interval to sd->balance_interval
   6) else
       6.1)delay hrtimer by sd->balance_interval *= 2

5.2 How to find busiest group and rq

Condition:

    avg_load>prev.max_load&&
   sum_nr_running>group_capacity||
   group_imb

   struct sg_lb_stats {
   unsigned long avg_load; /*Avg load across the CPUs of the group */
   unsigned long group_load; /* Total load over the CPUs of the group */
   unsigned long sum_nr_running; /* Nr tasks running in the group */
   unsigned long sum_weighted_load; /* Weighted load of group's tasks */
   unsigned long group_capacity;
   int group_imb; /* Is there an imbalance in the group ? */
   };

find max rq->load.weight in group

6.1 Oprofile

Sampling: event based and time based
Two part:
   Kernel module oprofile.ko: for saving sampling data in memory
       get performance counter                        register_timer_hook

   User daemon oprofiled: get sampling data, save it to file and parse.

6.2 config and compile

Kernel:
   1) menuconfig:
       enable Oprofile in profiling menu
       enable Local APIC and IO-APIC in Processor type and features menu
   2) .config: set CONFIG_PROFILING=y and CONFIG_OPROFILE=y
oprofile toolkit compile:
   ./configure --with-kernel-support
   make
   make install

6.3 oprofile toolkit

oprofiled
opcontrol: user interface
opannotate: comments source code for sampling data
opreport: binary and symble map
ophelp: list supported events
opgprof: generate gprof format data(a program analyzer)
opstack： generate call stack, with call-graph patch for kernel
oparchive： archive raw sampling data
op_import： change data format

6.4 How to use oprofile

# opcontrol --setup --ctr0-event=CPU_CLK_UNHALTED
--ctr0-count=600000 --vmlinux=/usr/src/linux-*/vmlinux
# opcontrol --start
# opcontrol --stop/--shutdown/--dump(/var/lib/oprofile/samples /oprofiled.log)
# opcontrol --status
# opcontrol --list-events
# opcontrol --event=L2_CACHE_MISS:500 --event=L1_DTLB_MISS_AND_L2_DTLB_HIT:500
# opreport -l ./testbinary

Command: detail referrence(http://linux.chinaunix.net/techdoc/system/2009/08/28/1132702.shtml)
cat /proc/cpuinfo&meminfo
grep processor /proc/cpuinfo | wc -l
top/ps
sar -q 1 5(load)/sar -u 2 3(cpu utilization)/sar -r(mem)

vmstat   all purpose performance tool                     yes(in kernel base version)
mpstat   provides statistics per CPU                      no
sar      all purpose performance monitoring tool          no
iostat   provides disk statistics                             no
netstat provides network statistics                      yes
dstat    monitoring statistics aggregator                 no
iptraf   traffic monitoring dashboard                     no
netperf Network bandwidth tool                           no
ethtool reports on Ethernet interface configuration     yes
iperf    Network bandwidth tool                               no
tcptrace Packet analysis tool                                 no

glmwu

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Linux scheduler(cfs) introduction

要给公司做个cfs培训（偏重于load balance），整了个ppt，图片都是从网上搜的，原作者不详，在此深表歉意。2幻灯片3The Outlinel Basic concepts about Linux process & threadl Basic concepts about SMPl Completely Fair Schedule(CFS)l How to
复制链接

扫一扫