Linux Scheduler (CFS) Introduction

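My company asked me to give a training session on CFS, so I put together a slide deck. All images come from the internet; my apologies to the original authors for not crediting each of them individually.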

The Outline

Basic concepts of Linux processes and threads
Basic concepts of SMP
Linux boot-up on the BP (bootstrap processor) and how the APs (application processors) are started
Completely Fair Scheduler (CFS) and the RT scheduler
How load balancing works
How to debug SMP issues with OProfile and other tools

0. Basic Concepts

Program
    A program is a combination of instructions and data, put together to perform a task.

Process
    A process is an abstraction created to embody the state of a program during its execution. A process can therefore also be viewed as an instance of a program, or "a running program".
    API: fork (clone -> do_fork -> copy_process -> copy_flags; CLONE_* flags: CLONE_FILES, CLONE_VM, CLONE_FS, CLONE_SIGHAND, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWUTS, ...; the CLONE_NEW* flags create namespaces, as used by LXC),
        exec, wait (the parent has to wait in order to free a zombie child; it is better to do so by handling SIGCHLD from the child process)
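The fork/wait pairing can be sketched in user space as follows. This is a minimal illustration, not code from the slides; the helper name run_child_and_reap is hypothetical, and the child simply exits instead of calling exec.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with `code`, then reap it.
 * Returns the child's exit status, or -1 on error.
 * (run_child_and_reap is a hypothetical helper name.)
 */
int run_child_and_reap(int code)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed */
    if (pid == 0)
        _exit(code);        /* child: terminate right away */

    /* Parent: waitpid() reaps the child so it does not stay a zombie. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_and_reap(7) should return 7 once the child has been reaped. A long-running parent would typically do the waitpid() from a SIGCHLD handler instead of blocking.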

Thread
    Lightweight process (LWP). A process can have multiple execution contexts that work together to accomplish its goal. These execution contexts are called "threads". All threads of a process share the same virtual address space.
    API: pthread_create (built on clone, like fork), pthread_attr_init, pthread_attr_setschedpolicy, pthread_attr_setschedparam, pthread_attr_getschedparam, pthread_exit, pthread_cancel, pthread_join (like wait: frees the joined thread's resources), pthread_detach (the thread frees all of its own resources when it exits)

Kernel Thread
    A thread without a user-mode virtual address space. All instructions and data of a kernel thread live in the kernel VA space, and in Linux they are usually linked into the kernel image. Examples: kswapd, kflushd and ksoftirqd.
    API: kthread_run (create + wake up), kthread_create, kthread_stop

task_struct{
    state, pid, mm, active_mm, prio, se, cpus_allowed, children, sibling, fs, files, signal, cgroups}

((struct thread_info *)(sp & ~(THREAD_SIZE - 1)))->task
    THREAD_SIZE = 8K
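The masking trick works because each kernel stack is THREAD_SIZE-aligned and thread_info sits at its lowest address. Below is a small user-space sketch of the arithmetic only; stack_base is a hypothetical helper name and the addresses are made up for illustration.

```c
#include <stdint.h>

#define THREAD_SIZE 8192UL   /* 8K kernel stack, as stated above */

/*
 * Round a stack pointer down to the base of its THREAD_SIZE-aligned stack.
 * In the kernel, struct thread_info lives at that base address.
 * (stack_base is a hypothetical helper name.)
 */
static inline uintptr_t stack_base(uintptr_t sp)
{
    return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}
```

For example, any stack pointer between 0xC1234000 and 0xC1235FFF maps to the base 0xC1234000.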

init process 0
        #define INIT_TASK(tsk) {    \
        .state        = 0,    \
        .thread_info  = &init_thread_info,    \
        ……
        .mm           = NULL,    /* NULL for a kernel task */    \
        .active_mm    = &init_mm,    \
        ……
    }
In init_mm, pgd = swapper_pg_dir (0xC0004000).

The first 768 PGD entries cover the user VA space.
All user processes share the same kernel PGD entries.
init process 0 becomes the "idle" process (the "idle" kernel thread).


1. Scheduler History

2.4 O(n)
    One single runqueue shared by all CPUs
    Scheduling based on nice and counter

2.6 O(1)
    Preemptible kernel
    Priority bitmap to find the next task
    Per-CPU runqueues, each with active/expired arrays and one list per priority
    static_prio and time slices


2. CFS Introduction

Merged in kernel 2.6.23
Goal: share CPU runtime fairly among tasks
Scheduling policies: SCHED_NORMAL, SCHED_FIFO, SCHED_RR, SCHED_IDLE, SCHED_BATCH (for CPU-bound tasks)
Key concepts: sched entity, sched class
For SMP: sched domain, sched group


3.1 priority

Normal task priority
    static_prio (nice -20~19 maps to 100~139), prio (dynamic, can be boosted to handle priority inversion): a lower value means a higher priority
RT task priority
    rt_priority (0~99): a higher value means a higher priority
normal_prio (0~139): unified scale for normal and RT tasks; a lower value means a higher priority
nice(), setpriority() for normal tasks
sched_setscheduler() for RT tasks
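The mapping between the three scales can be sketched as below. The arithmetic follows the ranges listed above (nice -20~19 to 100~139, and RT priorities inverted onto 0~99); the function names and the is_rt parameter are simplified stand-ins, not the kernel's exact signatures.

```c
#define MAX_RT_PRIO 100   /* priorities 0..99 are reserved for RT tasks */

/* nice -20..19 maps to static_prio 100..139 */
int static_prio_from_nice(int nice)
{
    return MAX_RT_PRIO + nice + 20;
}

/*
 * Unified priority, where a lower value means a higher priority.
 * RT tasks: the user-visible rt_priority is inverted.
 * Normal tasks: static_prio derived from nice.
 * (is_rt is a simplification of checking the task's policy.)
 */
int normal_prio(int is_rt, int rt_priority, int nice)
{
    if (is_rt)
        return MAX_RT_PRIO - 1 - rt_priority;
    return static_prio_from_nice(nice);
}
```

With this mapping a nice 0 task gets 120, while an RT task with rt_priority 99 gets 0, the highest priority on the unified scale.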

3.2 SCHED_FIFO and SCHED_RR

pick_next_rt_task: find the highest-priority non-empty RT task list via the bitmap and pick the first task on that list; after it has run, it is put at the tail of the list
SCHED_FIFO: runs until it gives up the CPU voluntarily or is preempted by a higher-priority RT task
SCHED_RR: runs until it gives up the CPU voluntarily, is preempted by a higher-priority RT task, or uses up its time slice
RT throttling: sched_rt_period_us (1000000 us) and sched_rt_runtime_us limit how much of each period RT tasks may consume
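The bitmap-plus-lists idea and the SCHED_RR rotation can be sketched in a few lines. This is a simplified model, not kernel code: one byte per priority stands in for the packed bitmap, fixed-size arrays stand in for linked lists, and all names (rt_rq fields, rt_enqueue, rt_pick, rr_timeslice_expired) are hypothetical. rt_priority is assumed to be in 0~99.

```c
#define MAX_RT_PRIO 100
#define QUEUE_LEN   8

struct rt_queue {
    int pids[QUEUE_LEN];
    int len;
};

struct rt_rq {
    unsigned char   bitmap[MAX_RT_PRIO];  /* 1 if queue[i] is non-empty */
    struct rt_queue queue[MAX_RT_PRIO];   /* index 0 = highest priority */
};

/* User-visible rt_priority (high = important) to queue index (low = important). */
static int prio_index(int rt_priority)
{
    return MAX_RT_PRIO - 1 - rt_priority;
}

/* Append a task to the tail of its priority list. Returns -1 if the list is full. */
int rt_enqueue(struct rt_rq *rq, int rt_priority, int pid)
{
    int idx = prio_index(rt_priority);
    struct rt_queue *q = &rq->queue[idx];

    if (q->len == QUEUE_LEN)
        return -1;
    q->pids[q->len++] = pid;
    rq->bitmap[idx] = 1;
    return 0;
}

/* Index of the first non-empty list, or -1 if there is no RT task. */
static int first_set(const struct rt_rq *rq)
{
    for (int i = 0; i < MAX_RT_PRIO; i++)
        if (rq->bitmap[i])
            return i;
    return -1;
}

/* Pick the head of the highest-priority list (returns its pid, or -1). */
int rt_pick(const struct rt_rq *rq)
{
    int idx = first_set(rq);
    return idx < 0 ? -1 : rq->queue[idx].pids[0];
}

/* SCHED_RR: when the time slice runs out, move the head to the tail. */
void rr_timeslice_expired(struct rt_rq *rq)
{
    int idx = first_set(rq);
    if (idx < 0)
        return;

    struct rt_queue *q = &rq->queue[idx];
    int head = q->pids[0];
    for (int i = 1; i < q->len; i++)
        q->pids[i - 1] = q->pids[i];
    q->pids[q->len - 1] = head;
}
```

A SCHED_FIFO task would never go through rr_timeslice_expired; it stays at the head of its list until it blocks, yields, or a higher-priority list becomes non-empty.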


4.1 RB Tree for Normal Task Organization

 

4.2 Node sequence of RB Tree

RB tree: insert and delete in O(log n)
Nodes are ordered by vruntime (the leftmost node has the lowest vruntime)
How fast a task's vruntime advances is determined by its priority
Constant arrays mapping static_prio to weight:
    prio_to_weight[ ]/prio_to_wmult[ ] (one nice level is roughly a 10% change in CPU share)
vruntime += delta_exec * (NICE_0_LOAD / weight)
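As a worked example of the formula, here is a simplified sketch of the vruntime increment. The weight table is the kernel's prio_to_weight[] as I recall it; the function name calc_delta_vruntime is hypothetical, and the plain 64-bit division is a simplification (the kernel multiplies by the precomputed inverses in prio_to_wmult[] instead of dividing).

```c
#include <stdint.h>

#define NICE_0_LOAD 1024

/* Weight per nice level, index 0 = nice -20, index 39 = nice 19. */
static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/*
 * How much vruntime a task accrues for delta_exec_ns of real runtime.
 * nice must be in -20..19.
 */
uint64_t calc_delta_vruntime(uint64_t delta_exec_ns, int nice)
{
    int weight = prio_to_weight[nice + 20];
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```

For 1 ms of real runtime, a nice 0 task accrues exactly 1000000 ns of vruntime, while a nice 5 task (weight 335) accrues about three times as much, so it moves right in the tree faster and is picked less often.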

4.3 task_struct to RB tree in CFS

 

 

4.4 From rq to task

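rq (the runqueue) is a per-CPU variable
Path from runqueue to task: rq -> cfs_rq -> tasks_timeline (RB tree) -> sched_entity -> task_struct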


4.5 Sched_class for Sched Policy

fair_sched_class/rt_sched_class

struct sched_class {
    const struct sched_class *next;
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task) (struct rq *rq);
    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);
    void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
    void (*post_schedule) (struct rq *this_rq);
    void (*task_waking) (struct rq *this_rq, struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*switched_from) (struct rq *this_rq, struct task_struct *task,
                   int running);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task,
                 int running);
    ......
};


4.6 Sched domain



4.7 Group and Domain

 


5. How to schedule

Load balance rules

1) At the cpu_domain level, all CPUs share the caches and cpu_power (the power of the CPU group), so tasks can be balanced freely
2) At the core_domain level, all CPUs share the L2 cache; balance when the core domain is imbalanced
3) At the phys_domain level, no cache is shared, so a migrated task loses all of its cached data
4) At the numa_domain level, load balancing is expensive


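5.1 Two Scheduling Entry Points

schedule() is called when the current rq has no running task, when a task gives up the CPU voluntarily, when a task goes to sleep, or on return from kernel space to user space.

scheduler_tick() is called when the periodic tick timer (hrtimer, HZ times per second) fires.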


5.1.1 schedule()

    1) Disable preemption
    2) Cancel the hrtimer
    3) Clear the need-resched flag
    4) If the previous task is not runnable and this is not a kernel preemption
        4.1) If it has a pending signal, set it back to the running state
        4.2) Else remove the task from the runqueue
    5) If there is no task on the rq
        5.1) Idle load balance (for each domain, from bottom to top)
    6) put_prev_task
    7) pick_next_task
        7.1) if (rq->nr_running == rq->cfs.nr_running) // all runnable tasks are SCHED_NORMAL (fair sched class)
            call fair_sched_class.pick_next_task
        7.2) else
            for each class in the class list, starting from sched_class_highest, call class->pick_next_task until a task is found
    8) sched_info_switch
    9) context_switch
    10) Enable preemption
    11) Check whether a reschedule is needed; if so, go to 1)
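Step 7 can be sketched with a cut-down sched_class that keeps only the next pointer and the pick_next_task hook. This is an illustrative model under stated assumptions: the rq fields (cfs_nr_running, rt_next, cfs_next, idle) and the three pick_* helpers are hypothetical stand-ins for the real per-class queues.

```c
#include <stddef.h>

struct task { int pid; };
struct rq;

/* Cut-down sched_class: only the two members needed for picking. */
struct sched_class {
    const struct sched_class *next;
    struct task *(*pick_next_task)(struct rq *rq);
};

/* Hypothetical runqueue: each class exposes at most one ready task. */
struct rq {
    unsigned int nr_running;      /* runnable tasks of all classes */
    unsigned int cfs_nr_running;  /* runnable SCHED_NORMAL tasks */
    struct task *rt_next;         /* NULL if no RT task is runnable */
    struct task *cfs_next;        /* NULL if no fair task is runnable */
    struct task  idle;            /* the idle task is always available */
};

static struct task *pick_rt(struct rq *rq)   { return rq->rt_next; }
static struct task *pick_fair(struct rq *rq) { return rq->cfs_next; }
static struct task *pick_idle(struct rq *rq) { return &rq->idle; }

/* Class list ordered by priority: rt -> fair -> idle. */
static const struct sched_class idle_class = { NULL,        pick_idle };
static const struct sched_class fair_class = { &idle_class, pick_fair };
static const struct sched_class rt_class   = { &fair_class, pick_rt };

struct task *pick_next_task(struct rq *rq)
{
    /* Fast path (7.1): every runnable task belongs to the fair class. */
    if (rq->nr_running == rq->cfs_nr_running) {
        struct task *p = fair_class.pick_next_task(rq);
        if (p)
            return p;
    }

    /* Slow path (7.2): walk the classes from the highest one down. */
    for (const struct sched_class *c = &rt_class; c; c = c->next) {
        struct task *p = c->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;  /* not reached: the idle class always returns a task */
}
```

The fast path exists because on most systems all runnable tasks are SCHED_NORMAL, so the class walk can usually be skipped.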


5.1.2 How to pick and put task for fair Sched

pick_next_task (fair scheduler) on rq
        1) Find the rb_leftmost node of the cfs_rq
        2) Get the sched entity from the RB node with rb_entry()
        3) If my_q of the se is NULL (the se is a task)
            3.1) Pick it and dequeue the entity from the rq
        4) Else (the se is a group), switch to its my_q and go to 1)
put_prev_task
        1) If the task belongs to a sched group (parent != NULL)
            walk from the task's entity up to the top-level parent and put each entity back on its own rq
        2) Else (parent == NULL)
            put it back on the current rq
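The descent through group entities in pick_next_task can be sketched as follows. To keep it short, a linear scan for the smallest vruntime stands in for rb_leftmost, and the dequeue step is left out; the array-based cfs_rq and the names leftmost and pick_next_entity are simplifications for illustration, not the kernel's data layout.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTITIES 8

struct cfs_rq;

struct sched_entity {
    uint64_t       vruntime;
    struct cfs_rq *my_q;   /* NULL for a task, the group's own queue otherwise */
    int            pid;    /* only meaningful for tasks */
};

/* Simplified queue: an array instead of an RB tree. */
struct cfs_rq {
    struct sched_entity *entities[MAX_ENTITIES];
    int nr;
};

/* Stand-in for rb_leftmost: the entity with the smallest vruntime. */
static struct sched_entity *leftmost(const struct cfs_rq *q)
{
    struct sched_entity *best = NULL;

    for (int i = 0; i < q->nr; i++)
        if (!best || q->entities[i]->vruntime < best->vruntime)
            best = q->entities[i];
    return best;
}

/* Descend through group entities until a task entity is reached. */
struct sched_entity *pick_next_entity(struct cfs_rq *q)
{
    struct sched_entity *se = NULL;

    while (q) {
        se = leftmost(q);
        if (!se)
            return NULL;   /* empty queue */
        q = se->my_q;      /* NULL ends the loop: se is a task */
    }
    return se;
}
```

Fairness is applied level by level: the group entity competes with its siblings through its own vruntime, and only once the group is chosen do the tasks inside it compete with each other.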

5.1.3 How to pick and put task for RT Sched

pick_next_task_rt:
    Find the highest-priority non-empty RT task list via the bitmap, then pick the first task on that list.
put_prev_task_rt:
    // if the task is still on the rq and is allowed to run on more than one CPU
    if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
        1) delete it from its current node in the list
        2) add it to the tail of the list


5.1.4 scheduler_tick()

    1) Update the rq clock and load
    2) Call into the current task's sched_class to update the task's vruntime
    3) Update the timestamp of the next load balance
    4) Raise the softirq for load balancing


5.1.5 How to load balance

    1) Find the busiest group in the domain
    2) Find the busiest rq in the busiest group
    3) If more than one task is running on the busiest rq
        3.1) Disable local IRQs and lock the rq
        3.2) Move tasks from the busiest rq to the current rq
        3.3) If this_cpu is not the current CPU, send an IPI to wake up this_cpu
        3.4) If all tasks on the busiest rq are pinned (all_pinned, by affinity), clear the busiest CPU from the CPU mask
        3.5) Update some scheduling flags and info
        3.6) Return
    4) If there is at most one running task, or moving tasks failed
        4.1) If the busiest CPU needs active_balance
        4.2) Set the busiest CPU's push_cpu to this_cpu
        4.3) Wake up the busiest CPU's migration_thread to move a task to the other CPU
        4.4) Update some info
    5) If active_balance was not started for the busiest CPU
        5.1) Bring the next balance forward by resetting sd->balance_interval to min_interval
    6) Else
        6.1) Delay the next balance: sd->balance_interval *= 2
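Steps 5 and 6 amount to a simple back-off on the balancing interval, sketched below. The struct is a three-field stand-in for sched_domain and update_balance_interval is a hypothetical name; the cap at max_interval is an assumption based on my reading of the kernel source and is not stated in the steps above.

```c
/* Three-field stand-in for struct sched_domain. */
struct sched_domain_sketch {
    unsigned long min_interval;
    unsigned long max_interval;
    unsigned long balance_interval;
};

/*
 * Adjust the interval until the next balancing attempt.
 * active_balance != 0 means active balancing had to be started.
 */
void update_balance_interval(struct sched_domain_sketch *sd, int active_balance)
{
    if (!active_balance) {
        /* Normal case: balance again as soon as allowed. */
        sd->balance_interval = sd->min_interval;
    } else if (sd->balance_interval < sd->max_interval) {
        /* Active balancing started: back off (assumed cap at max_interval). */
        sd->balance_interval *= 2;
    }
}
```

Starting from min_interval = 1 and max_interval = 8, repeated active balancing yields 2, 4, 8, 8, and the first ordinary pass resets the interval to 1.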


5.2 How to find busiest group and rq

Condition for a group to become the busiest candidate (max_load is the largest avg_load seen so far):

    avg_load > max_load &&
    (sum_nr_running > group_capacity ||
     group_imb)

    struct sg_lb_stats {
        unsigned long avg_load; /* Avg load across the CPUs of the group */
        unsigned long group_load; /* Total load over the CPUs of the group */
        unsigned long sum_nr_running; /* Nr tasks running in the group */
        unsigned long sum_weighted_load; /* Weighted load of group's tasks */
        unsigned long group_capacity;
        int group_imb; /* Is there an imbalance in the group ? */
    };

Busiest rq: the rq with the largest rq->load.weight in the busiest group
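The group selection can be sketched as a single pass over per-group statistics. This is a simplified model: the struct keeps only the four fields used by the condition, the function name find_busiest_group_idx is hypothetical, and details such as skipping the local group are left out.

```c
#include <stddef.h>

/* Only the fields the selection condition needs. */
struct sg_lb_stats_sketch {
    unsigned long avg_load;
    unsigned long sum_nr_running;
    unsigned long group_capacity;
    int           group_imb;
};

/*
 * Return the index of the busiest group, or -1 if no group is
 * overloaded or imbalanced.
 */
int find_busiest_group_idx(const struct sg_lb_stats_sketch *groups, size_t n)
{
    int busiest = -1;
    unsigned long max_load = 0;

    for (size_t i = 0; i < n; i++) {
        const struct sg_lb_stats_sketch *sgs = &groups[i];

        /* Highest load so far AND (more tasks than capacity OR imbalanced). */
        if (sgs->avg_load > max_load &&
            (sgs->sum_nr_running > sgs->group_capacity || sgs->group_imb)) {
            max_load = sgs->avg_load;
            busiest = (int)i;
        }
    }
    return busiest;
}
```

Note that a group with the highest avg_load is still skipped if it is neither over capacity nor imbalanced, since there is nothing worth pulling from it.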


6.1 Oprofile

Sampling: event-based and time-based
Two parts:
    Kernel module oprofile.ko: keeps the sampling data in memory
        event-based sampling: reads the hardware performance counters
        time-based sampling: uses register_timer_hook
    User daemon oprofiled: fetches the sampling data, saves it to file and parses it.

6.2 config and compile

Kernel:
    1) menuconfig:
        enable Oprofile in profiling menu
        enable Local APIC and IO-APIC in Processor type and features menu
    2) .config:  set CONFIG_PROFILING=y and CONFIG_OPROFILE=y
oprofile toolkit compile:
    ./configure --with-kernel-support
    make
    make install

6.3 oprofile toolkit

oprofiled: the user-space daemon
opcontrol: user interface
opannotate: annotates source code with sampling data
opreport: maps samples to binaries and symbols
ophelp: lists the supported events
opgprof: generates data in gprof format (gprof is a program profiler)
opstack: generates call stacks; requires the call-graph patch for the kernel
oparchive: archives raw sampling data
op_import: converts the sample data format

6.4 How to use oprofile

# opcontrol --setup --ctr0-event=CPU_CLK_UNHALTED \
    --ctr0-count=600000 --vmlinux=/usr/src/linux-*/vmlinux
# opcontrol --start
# opcontrol --stop/--shutdown/--dump(/var/lib/oprofile/samples /oprofiled.log)
# opcontrol --status
# opcontrol --list-events
# opcontrol --event=L2_CACHE_MISS:500 --event=L1_DTLB_MISS_AND_L2_DTLB_HIT:500
# opreport -l ./testbinary

Commands (detailed reference: http://linux.chinaunix.net/techdoc/system/2009/08/28/1132702.shtml)
cat /proc/cpuinfo /proc/meminfo
grep processor /proc/cpuinfo | wc -l
top / ps
sar -q 1 5 (load) / sar -u 2 3 (CPU utilization) / sar -r (memory)
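When the CPU count is needed inside a program rather than on the command line, the same number that the grep on /proc/cpuinfo produces can usually be obtained with sysconf(). A minimal sketch, assuming glibc on Linux, where _SC_NPROCESSORS_ONLN is available; online_cpus is a hypothetical helper name.

```c
#include <unistd.h>

/* Number of CPUs currently online, or -1 if it cannot be determined. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```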

Tool      Description                                   In base install?
vmstat    all-purpose performance tool                  yes
mpstat    provides statistics per CPU                   no
sar       all-purpose performance monitoring tool       no
iostat    provides disk statistics                      no
netstat   provides network statistics                   yes
dstat     monitoring statistics aggregator              no
iptraf    traffic monitoring dashboard                  no
netperf   network bandwidth tool                        no
ethtool   reports on Ethernet interface configuration   yes
iperf     network bandwidth tool                        no
tcptrace  packet analysis tool                          no

