WALT算法简介

liglei

已于 2024-10-23 09:43:37 修改

阅读量2.7k

点赞数 8

文章标签： walt

于 2024-06-03 19:18:52 首次发布

本文链接：https://blog.csdn.net/liglei/article/details/139422243

版权

WALT(Windows-Assist Load Tracing)算法是由Qcom开发，通过把时间划分为窗口，对

task运行时间和CPU负载进行跟踪计算的方法。为任务调度、迁移、负载均衡及CPU调频

提供输入。

WALT相对PELT算法，更能及时反映负载变化，更适用于移动设备的交互场景。

1. 不同cpu不同频点的相对runtime

task在不同cpu的不同频点上运行，运行时长是不同的。那么怎么度量task的大小，体现task对CPU的能力需求呢？我们知道task在某个cpu某个频点的绝对runtime时间，此runtime期间这个cpu的能力，和最大capacity cpu在最高频点运行的能力，如果能有对应关系，就可以把这个绝对runtime时间，归一化到相对最大cpu能力的runtime时间。

WALT中，使用wrq->task_exec_scale表征cpu当前频点的能力。把capacity最大的cpu在最高频点运行的能力定义为1024。task_exec_scale的计算方法如下：

赋值调用路径：walt_update_task_ravg->update_task_rq_cpu_cycles

如果不使用cpu cycles计算负载：

task_exec_scale=

{cpu_cur_freq(cpu)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

如果使用cpu cycles计算负载：

task_exec_scale=

{(cycles_delta /time_delta)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

wrq->task_exec_scale会在cpu频点发生变化时从新计算更新，此时就可以通过这个归一化后的cpu能力，计算task的相对runtime了。

WALT是通过scale_exec_time函数实现的：

delta * (wrq->task_exec_scale) / 1024

例如，此时task 在cpu0 1G频点对应的task_exec_scale=200运行了10ms，那么task的相对运行时间即为:

10*200/1024=1.953125(ms)

2. scaled runtime

在实际项目中，WALT定义了5个历史窗口，每个窗口长16ms(可根据帧率动态设置为8ms)。在每次对task runtime delta进行更新时，都要转换为相对runtime。

但对调度器默认的单位util，还要做进一步的归一化。WALT中把task 在capacity最大cpu最高频点连续运行一个窗口(16ms)时间，定义为最大uti值1024。所以把相对runtime按一个窗口归一化到1024的util值为：

runtime_scaled = (delta/sched_ravg_window)*1024

实现函数为：

scale_time_to_util(runtime)

一、关键数据结构

1. 变量

sched_ravg_window（EXPORT_SYMBOL()）

walt负载统计窗口大小

定义变量时初始化：20000000

walt初始化时：walt_init->walt_tunables->DEFAULT_SCHED_RAVG_WINDOW：

300HZ时对齐刷新率：（3333333*5）；其它刷新率16000000

单位：ns

NUM_LOAD_INDICES

#define NUM_LOAD_INDICES 1000

把window size（现在是16ms，不会根据动态window size调整）分为1000份

sched_load_granule

unsigned int __read_mostly sched_load_granule

sched_load_granule = DEFAULT_SCHED_RAVG_WINDOW / NUM_LOAD_INDICES

#define DEFAULT_SCHED_RAVG_WINDOW 16000000

计算得知window size分为1000份之后的单位粒度是16000ns=16us

2. 结构体

walt_task_struct

struct walt_task_struct {
    u32             flags;
    u64             mark_start;
    u64             window_start;
    u32             sum, demand;
    u32             coloc_demand;
    u32             sum_history[RAVG_HIST_SIZE];
    u16             sum_history_util[RAVG_HIST_SIZE];
    u32             curr_window_cpu[WALT_NR_CPUS];
    u32             prev_window_cpu[WALT_NR_CPUS];
    u32             curr_window, prev_window;
    u8              busy_buckets[NUM_BUSY_BUCKETS];
    u16             bucket_bitmask;
    u16             demand_scaled;
    u16             pred_demand_scaled;
    u64             active_time;
    u64             last_win_size;
    int             boost;
    bool                wake_up_idle;
    bool                misfit;
    bool                rtg_high_prio;
    u8              low_latency;
    u64             boost_period;
    u64             boost_expires;
    u64             last_sleep_ts;
    u32             init_load_pct;
    u32             unfilter;
    u64             last_wake_ts;
    u64             last_enqueued_ts;
    struct walt_related_thread_group __rcu  *grp;
    struct list_head        grp_list;
    u64             cpu_cycles;
    bool                iowaited;
    int             prev_on_rq;
    int             prev_on_rq_cpu;
    struct list_head        mvp_list;
    u64             sum_exec_snapshot_for_slice;
    u64             sum_exec_snapshot_for_total;
    u64             total_exec;
    int             mvp_prio;
    int             cidx;
    int             load_boost;
    int64_t             boosted_task_load;
    int             prev_cpu;
    int             new_cpu;
    u8              enqueue_after_migration;
    u8              hung_detect_status;
    int             pipeline_cpu;
};

3.mark_start：同步wrq->mark_start

记录task某些event的开始时间，这些event在walt.h task_event中定义。

4.window_start

标识了per task window。同步于wrq->window_start：

walt_update_task_ravg->update_cpu_busy_time:

    /*
     * Handle per-task window rollover. We don't care about the
     * idle task.
     */
    if (new_window) {
        if (!is_idle_task(p))
            rollover_task_window(p, full_window);
        wts->window_start = window_start;
    }

新task wts->window_start初始化为0，在第一次调用walt_update_task_ravg时同步为wrq->window_start

5.wts->sum += scale_exec_time(delta, rq, wts)

计算当前窗口task运行时间总和, 这些时间是实际运行时间delta按cpu当前频率和cpu capacity归一化到最大capacity cpu最高频率的运行时长。比如在最大capacity cpu上以最高频点运行的task，实际运行时间delta和统计到sum的时间是1:1的。

更新路径：

walt_update_task_ravg->update_task_demand->add_to_task_demand->(wts->sum += scale_exec_time(delta, rq, wts))

walt_update_task_ravg->update_task_demand->update_history->(wts->sum = 0), 在event跨窗口时，才会复位为0。

5.wts->demand

历史窗口中统计的相对runtime，通过不同的策略计算出当前采用的相对runtime值

#define WINDOW_STATS_RECENT 0

#define WINDOW_STATS_MAX 1

#define WINDOW_STATS_MAX_RECENT_AVG 2

#define WINDOW_STATS_AVG 3

6.wts->coloc_demand

5个历史窗口task 相对runtime的平均值

wts->coloc_demand = div64_u64((sum += hist[i]), RAVG_HIST_SIZE)

7.wts->sum_history[5]

记录了task 5个历史窗口的相对runtime。相对 runtime，即为task实际running时间归一化到最大capacity cpu最高频点后的相对运行时间, 最大值为16ms。

walt_update_task_ravg()->update_task_demand()->update_history()->

hist[wts->cidx] = runtime

8.wts->sum_history_util[5]

把task在每个历史窗口累加的相对runtime(0ms~16ms), 归一化到(0~1024)的util值, 这里称之为scaled runtime：

walt_update_task_ravg()->update_task_demand()->update_history()->

scale_time_to_util(runtime)=(wts->sum/sched_ravg_window)*1024

9.curr_window_cpu[WALT_NR_CPUS]

统计task在每个cpu上当前窗口中相对runtime

10.prev_window_cpu[WALT_NR_CPUS]

统计task在每个cpu上前一窗口中相对runtime

11.curr_window, prev_window

walt_update_task_ravg()->update_cpu_busy_time()

统计task在当前窗和前一窗中，相对runtime。包含task运行期间的irqtime。

12.参考如何更新wts->bucket_bitmask及busy_buckets[bidx]

13.参考如何更新wts->bucket_bitmask及busy_buckets[bidx]

14.demand_scaled

参考5），把计算出当前的相对runtime值归一化到scaled runtime(0~1024).

walt_update_task_ravg()->update_task_demand()->update_history()->

scale_time_to_util(demand)

16.active_time

is_new_task中判断此值小于100ms就认为是新task，rollover_task_window是唯一调用路径。

28.last_sleep_ts:

记录task被开始sleep的时间，在__schedule->android_rvh_schedule中，当prev!=next且!prev->on_rq时, 更新prev task的last_sleep_ts。prev task不再被调度，且不在rq的wait list时，即表示prev task被schedule out后，进入非running状态。

41.wts->cidx

current idx, 索引hist和hist_util数组中当前窗口，存储task相对running time和util。在window更新时，在update_history中更新索引值，达到滚动更新历史窗口信息的目的。

task_event

task event标识了task的一组行为，这些行为发生的时间点，是对task demand变化更新的关键时间点。理解walt算法，首先要理解这些event的含义及如何通过这些event更新task demand和cpu 负载。

enum task_event {
    PUT_PREV_TASK   = 0,
    PICK_NEXT_TASK  = 1,
    TASK_WAKE   = 2,
    TASK_MIGRATE    = 3,
    TASK_UPDATE = 4,
    IRQ_UPDATE  = 5,
};

PUT_PREV_TASK/PICK_NEXT_TASK：

调用路径__schedule->android_rvh_schedule()->

walt_update_task_ravg(prev, rq, PUT_PREV_TASK/PICK_NEXT_TASK, wallclock, 0). CPU上发生任务调度,通过pick_next_task选出cpu上将要运行的next task后，通过hook调用核心函数对prev task和next task的util及cpu的负载进行更新。PUT_PREV_TASK标识prev task从rq上被调度走时对prev task的util和cpu负载进行更新。

TASK_WAKE:

调用路径try_to_wake_up->android_rvh_try_to_wake_up(*p)->

walt_update_task_ravg(p, rq, TASK_WAKE, wallclock, 0)

当wakeup一个task，对非current(当前cpu)&&非runnalbe(不在run queue wait list中) &&非running(其它cpu)的task p，在select_task_rq选择target cpu之前，通过hook调用核心函数。此时的task_cpu(p)为task最后一次running的cpu。

walt_rq

struct walt_rq {
    struct task_struct  *push_task;
    struct walt_sched_cluster *cluster;
    struct cpumask      freq_domain_cpumask;
    struct walt_sched_stats walt_stats;


    u64         window_start;
    u32         prev_window_size;
    unsigned long       walt_flags;


    u64         avg_irqload;
    u64         last_irq_window;
    u64         prev_irq_time;
    struct task_struct  *ed_task;
    u64         task_exec_scale;
    u64         old_busy_time;
    u64         old_estimated_time;
    u64         curr_runnable_sum;
    u64         prev_runnable_sum;
    u64         nt_curr_runnable_sum;
    u64         nt_prev_runnable_sum;
    struct group_cpu_time   grp_time;
    struct load_subtractions load_subs[NUM_TRACKED_WINDOWS];
    DECLARE_BITMAP_ARRAY(top_tasks_bitmap,
            NUM_TRACKED_WINDOWS, NUM_LOAD_INDICES);
    u8          *top_tasks[NUM_TRACKED_WINDOWS];
    u8          curr_table;
    int         prev_top;
    int         curr_top;
    bool            notif_pending;
    bool            high_irqload;
    u64         last_cc_update;
    u64         cycles;
    u64         util;
    struct list_head    mvp_tasks;
    int                     num_mvp_tasks;
    u64         latest_clock;
    u32         enqueue_counter;
};

15.task_exec_scale: task所在cpu的当前频点归一化到cpu最大能力1024的值，体现当前频点cpu的能力.

例如在最大cluster中cpu在最高频点运行， task_exec_scale=1024.

赋值调用路径：walt_update_task_ravg->update_task_rq_cpu_cycles

!use_cycle_counter：如果不使用cpu cycles计算负载：task_exec_scale=

{cpu_cur_freq(cpu)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

use_cycle_counter：如果使用cpu cycles计算负载：task_exec_scale=

{(cycles_delta /time_delta)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

(https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/cpu-capacity.txt)

18.curr_runnable_sum

当前窗口cpu 负载信息

19.prev_runnable_sum

前一窗cpu负载信息

20.nt_curr_runnable_sum

cpu当前窗new task负载信息, wts->active_time小于100ms属于new task

21.nt_prev_runnable_sum

cpu前一窗new task负载信息

24.unsigned long top_tasks_bitmap[2][BITS_TO_LONGS(1000)]，

跟踪curr和prev两个窗口。BITS_TO_LONGS(1000)=16, 即为大小为16*64=1024bit

26.u8 *top_tasks[2]

两个指针元素的指针数组, 分别指向大小1000 Byte(u8)类型的mem。记录当前cpu上最近两个窗口curr和prev上，翻转？

27.curr_table 0?1

使用两个window进行跟踪，标识哪个是curr的，curr和prev构成一个环形数组，不停翻转.

在update_window_start->rollover_top_tasks(rq, full_window)翻转更新：

wrq->curr_table

wrq->prev_top = curr_top

wrq->curr_top = 0

wrq->top_tasks[prev/curr]

wrq->top_tasks_bitmap[prev/curr]

28.prev_top

前窗最大task runtime落在1000个bucket中的index

29.curr_top

是index值，当前窗口的max load_to_index(wts->curr_window)，记录在当前窗口最大task runtime落在的bucket id。如当前窗口运行过task和相对runtime分别为：

A(2ms) B(6ms) C(15ms)，

则：wrq->curr_top = load_to_index(15ms)=15*1000/16

34.util

walt gov调频获取cpu util：

waltgov_update_freq->waltgov_get_util(wg_cpu)->cpu_util_freq_walt()->__cpu_util_freq_walt()

wrq->util = scale_time_to_util(freq_policy_load(rq, reason))

37.记录最近一次调用update_window_start的时间戳，但未必满足更新条件(delta \geq sched_ravg_window)对wrq->window_start进行更新

walt_sched_cluster

174 struct walt_sched_cluster {
175 raw_spinlock_t load_lock;
176 struct list_head list;
177 struct cpumask cpus;
178 int id;
179 /*
180 * max_possible_freq = maximum supported by hardware
181 * max_freq = max freq as per cpufreq limits
182 */
183 unsigned int cur_freq;
184 unsigned int max_possible_freq;
185 unsigned int max_freq;
186 unsigned int walt_internal_freq_limit;
187 u64 aggr_grp_load;
188 unsigned long util_to_cost[1024];
189 u64 found_ts;
190 };

186 walt_internal_freq_limit:

二、核心函数

1. 函数调用关系

walt_update_task_ravg是WALT算法的入口函数，

主要函数调用关系如下图。主要涉及:

1. 随着时间推移会产生新的window，更新window的

start time

update_window_start

2.计算 wrq->task_exec_scale， task_exec_scale反

映了在当前频点下cpu的能力，参见：

《二、task_runtime 1. 不同cpu不同频点的相对runtim》

update_task_rq_cpu_cycles

3. 对task runtime和历史记录的更新，实现函数为：

update_task_demand

4. 对task 和 rq curr和prev两个window统计信息的更新

update_cpu_busy_time

5. 更新task预测需求

update_task_pred_demand

2. 函数说明

a） update_window_start(rq, wallclock, event)

- - 更新wrq->window_start += (u64)nr_windows * (u64)sched_ravg_window
  - 更新wrq->prev_window_size = sched_ravg_window
  - 更新wrq的历史窗口统计信息，初始化新窗口信息：
    update_window_start()->rollover_cpu_window(rq, nr_windows > 1)

rollover_cpu_window(struct rq *rq, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    u64 curr_sum = wrq->curr_runnable_sum;
    u64 nt_curr_sum = wrq->nt_curr_runnable_sum;
    u64 grp_curr_sum = wrq->grp_time.curr_runnable_sum;
    u64 grp_nt_curr_sum = wrq->grp_time.nt_curr_runnable_sum;


    if (unlikely(full_window)) {
        curr_sum = 0;
        nt_curr_sum = 0;
        grp_curr_sum = 0;
        grp_nt_curr_sum = 0;
    }


    wrq->prev_runnable_sum = curr_sum;
    wrq->nt_prev_runnable_sum = nt_curr_sum;
    wrq->grp_time.prev_runnable_sum = grp_curr_sum;
    wrq->grp_time.nt_prev_runnable_sum = grp_nt_curr_sum;


    wrq->curr_runnable_sum = 0;
    wrq->nt_curr_runnable_sum = 0;
    wrq->grp_time.curr_runnable_sum = 0;
    wrq->grp_time.nt_curr_runnable_sum = 0;
}

9）~21）：

scheduler_tick()->trace_android_rvh_tick_entry(rq)->update_window_start在每个tick(250HZ 4ms)中断处理函数中触发更新wrq->window_start，所以超过一个window_size(8ms/16ms)满足更新条件(delta >=sched_ravg_window)后，在第二个window_size内还没更新wrq->window_start的情况(full_window=2)极少发生:(异常task长时间关中断运行，或tickless的idle cpu(https://docs.kernel.org/timers/no_hz.html))

当这种情况发生时，因为前窗更新window start异常，prev值只参考前窗，所以(nr_windows > 1)后更新的prev值设置为0.

更新
- update_window_start()->rollover_top_tasks(struct rq *rq, bool full_window)

static void rollover_top_tasks(struct rq *rq, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    u8 curr_table = wrq->curr_table;
    u8 prev_table = 1 - curr_table;
    int curr_top = wrq->curr_top;


    clear_top_tasks_table(wrq->top_tasks[prev_table]);
    clear_top_tasks_bitmap(wrq->top_tasks_bitmap[prev_table]);


    if (full_window) {
        curr_top = 0;
        clear_top_tasks_table(wrq->top_tasks[curr_table]);
        clear_top_tasks_bitmap(wrq->top_tasks_bitmap[curr_table]);
    }


    wrq->curr_table = prev_table;
    wrq->prev_top = curr_top;
    wrq->curr_top = 0;
}

b）update_task_rq_cpu_cycles(p, rq, event, wallclock, irqtime)

主要计算 wrq->task_exec_scale， task_exec_scale反映了在当前频点下cpu的能力，参见walt_rq.task_exec_scale说明。

c）u64 update_task_demand(p, rq, event, wallclock)

此函数是WALT算法计算task demand核心函数，把此次event发生时需要更新的task busy time delta，归一化到最大capacity、最高频点的相对时间，即相对runtime，并累加到wts->sum，并前滚更新task的历史窗口信息。根据5个历史窗口得到最终表示task大小的demand。详细参见《三、算法 1. 如何更新task demnd》

d）update_cpu_busy_time(p, rq, event, wallclock, irqtime)

详细参考《三、算法 2. 如何更新cpu负载信息》

e）update_task_pred_demand(rq, p, event)

static void update_task_pred_demand(struct rq *rq, struct task_struct *p, int event)
{
    u16 new_pred_demand_scaled;
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;
    u16 curr_window_scaled;


    if (is_idle_task(p))
        return;


    if (event != PUT_PREV_TASK && event != TASK_UPDATE &&
            (!SCHED_FREQ_ACCOUNT_WAIT_TIME ||
             (event != TASK_MIGRATE &&
             event != PICK_NEXT_TASK)))
        return;


    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (!p->on_rq && !SCHED_FREQ_ACCOUNT_WAIT_TIME)
            return;
    }


    curr_window_scaled = scale_time_to_util(wts->curr_window);
    if (wts->pred_demand_scaled >= curr_window_scaled)
        return;


    new_pred_demand_scaled = get_pred_busy(p, busy_to_bucket(curr_window_scaled),
                 curr_window_scaled, wts->bucket_bitmask);


    if (task_on_rq_queued(p) && (!task_has_dl_policy(p) ||
                !p->dl.dl_throttled))
        fixup_walt_sched_stats_common(rq, p,
                wts->demand_scaled,
                new_pred_demand_scaled);


    wts->pred_demand_scaled = new_pred_demand_scaled;
}

10~14.只处理PUT_PREV_TASK，TASK_UPDATE，TASK_MIGRATE，PICK_NEXT_TASK四个event，SCHED_FREQ_ACCOUNT_WAIT_TIME=0 (default)

20~22.当sleep task在不同group间迁移时，TASK_UPDATE会被调用。SCHED_FREQ_ACCOUNT_WAIT_TIME=0(default)，不对sleep task 更新。

25.(wts->curr_window/sched_ravg_window)*1024, 把wts->curr_window以ns为单位的数值归一化到1024，归一化后的值为util。

29.参考《三、算法 1.如何更新task runtime》

f）run_walt_irq_work_rollover(old_window_start, rq)

三、算法

1. 如何更新task runtime

WALT(Windows-Assist Load Tracing)算法是由Qcom开发，通过把时间划分为窗口，对task运行时间和CPU负载进行跟踪计算的方法。为任务调度、迁移、负载均衡及CPU调频提供输入。

WALT相对PELT算法，更能及时反映负载变化，更适用于移动设备的交互场景。

不同cpu不同频点的相对runtime

WALT中，使用wrq->task_exec_scale表征cpu当前频点的能力。把capacity最大的cpu在最高频点运行一个window定义为1024。task_exec_scale的计算方法如下：

赋值调用路径：walt_update_task_ravg->update_task_rq_cpu_cycles

如果不使用cpu cycles计算负载：

task_exec_scale=

{cpu_cur_freq(cpu)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

如果使用cpu cycles计算负载：

task_exec_scale=

{(cycles_delta /time_delta)/(wrq->cluster->max_possible_freq)}

*arch_scale_cpu_capacity(cpu)

wrq->task_exec_scale会在cpu频点发生变化时从新计算更新，此时就可以通过这个归一化后的cpu能力，计算task的相对runtime了。

WALT是通过scale_exec_time函数实现的：

delta * (wrq->task_exec_scale) / 1024

例如，此时task 在cpu0 1G频点对应的task_exec_scale=200运行了10ms，那么task的相对运行时间即为:

10*200/1024=1.953125(ms)

scaled runtime

在实际项目中，WALT定义了5个历史窗口，每个窗口长16ms(可根据帧率动态设置为8ms)。在每次对task runtime delta进行更新时，都要转换为相对runtime。

runtime_scaled = (delta/sched_ravg_window)*1024

实现函数为：

scale_time_to_util(runtime)

ms: wts->mark_start: event的开始时间

ws: wrq->window_start

wc: wallclock: 当前event时间戳，walt_sched_clock()->sched_clock()

在计算task的demand时分三种情况：

- event未跨越窗口
- event跨两个窗口
- event跨多个窗口

update_task_demand
()是WALT算法的核心函数，计算task 相对runtime并更新历史窗口(默认配置记录5个history window)。涉及三个主要函数：

1) account_busy_for_task_demand(struct rq *rq, struct task_struct *p, int event)
判断task的busy time，对应event是否应该累加task running time。

static int
account_busy_for_task_demand(struct rq *rq, struct task_struct *p, int event)
{
    /*
     * No need to bother updating task demand for the idle task.
     */
    if (is_idle_task(p))
        return 0;


    /*
     * When a task is waking up it is completing a segment of non-busy
     * time. Likewise, if wait time is not treated as busy time, then
     * when a task begins to run or is migrated, it is not running and
     * is completing a segment of non-busy time.
     */
    if (event == TASK_WAKE || (!SCHED_ACCOUNT_WAIT_TIME &&
             (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
        return 0;


    /*
     * The idle exit time is not accounted for the first task _picked_ up to
     * run on the idle CPU.
     */
    if (event == PICK_NEXT_TASK && rq->curr == rq->idle)
        return 0;


    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (rq->curr == p)
            return 1;


        return p->on_rq ? SCHED_ACCOUNT_WAIT_TIME : 0;
    }


    return 1;
}

如何确定event发生时, task在rq上是busy状态, 并应该被累加到task running time中呢？

7.不统计idle task的demand。

16.TASK_WAKE视为非busytime。SCHED_ACCOUNT_WAIT_TIME = flase, PICK_NEXT_TASK, TASK_MIGRATE event视为非busy time，不统计task demand。

24.如果设置了SCHED_ACCOUNT_WAIT_TIME，并且event=PICK_NEXT_TASK，当task被调度到idle cpu运行时，cpu退出idle的时间不能视为task的busy time，不统计。

31.event == TASK_UPDATE 且

- - - task p正在running，属于busy time。
    - task p在等待队列中，可以通过配置 SCHED_ACCOUNT_WAIT_TIME 决定是否把task在等待队列中的时间视为busy time;
      非running非runnable不统计。

38.排除以上情况，默认是需要更新的。

2) update_history(rq, p, runtime, samples, event)

WALT记录了task 5个历史窗口运行时间。在event跨窗口时统计task busy time，因为产生了新的历史窗口，需要对task历史窗口runtime进行滚动更新。使用wts->cidx作为hist[]数组中current window的索引，通过更新索引值达到滚动更新历史信息。这里的runtime，是task在被更新窗口内的runtime，即wts->sum。当跨越并更新full window时，为window_size(16ms)的相对runtime(大小取决于当前频点计算出的wrq->task_exec_scale)。

static void update_history(struct rq *rq, struct task_struct *p,
             u32 runtime, int samples, int event)
{  
    runtime_scaled = scale_time_to_util(runtime);
    /* Push new 'runtime' value onto stack */
    for (; samples > 0; samples--) {
        hist[wts->cidx] = runtime;
        hist_util[wts->cidx] = runtime_scaled;
        wts->cidx = ++(wts->cidx) % RAVG_HIST_SIZE;
    }

wts->demand的计算策略

WINDOW_STATS_RECENT	取上一窗的wts->sum作为demand
WINDOW_STATS_MAX	取5个历史窗口中最大值作为task demand
WINDOW_STATS_MAX_RECENT_AVG	取 max{5个历史窗口的平均值，上一窗的wts->sum}，作为task demand
WINDOW_STATS_AVG	取5个窗口的平均值作为task demnad

如何计算predict task demand？

static inline u16 predict_and_update_buckets(
            struct task_struct *p, u16 runtime_scaled) {
    int bidx;
    u32 pred_demand_scaled;
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;


    bidx = busy_to_bucket(runtime_scaled);
    pred_demand_scaled = get_pred_busy(p, bidx, runtime_scaled, wts->bucket_bitmask);
    bucket_increase(wts->busy_buckets, &wts->bucket_bitmask, bidx);


    return pred_demand_scaled;
}

在调用update_history更新task的历史runtime后，会调用get_pred_busy()预测task的runtime，即未来task对cpu能力的需求。把0~1024的scaled runtime划分为16个buckets，每个buckets对应64 util，把要更新的runtime_scaled通过busy_to_bucket映射到对应的bucket id，作为起始start bucket id，根据wts->bucket_bitmask, 向上寻找第一个置位的bit n, 表示在之前的窗口, task的runtime落到这个bit所对应的busy_buckets[n]，这是一个busy bucket。如果hist_util[5]记录的5个最近历史window的task runtime有落在这个busy_buckets[n]范围内的(例如hist_util[m])，则这个hist_util[m]作为预测的task predict demand返回。

如何更新wts->bucket_bitmask及busy_buckets[bidx]？

#define INC_STEP 8
#define DEC_STEP 2
#define CONSISTENT_THRES 16
#define INC_STEP_BIG 16
static inline void bucket_increase(u8 *buckets, u16 *bucket_bitmask, int idx)
{
    int i, step;


    for (i = 0; i < NUM_BUSY_BUCKETS; i++) {
        if (idx != i) {
            if (buckets[i] > DEC_STEP)
                buckets[i] -= DEC_STEP;
            else {
                buckets[i] = 0;
                *bucket_bitmask &= ~BIT_MASK(i);
            }
        } else {
            step = buckets[i] >= CONSISTENT_THRES ?
                        INC_STEP_BIG : INC_STEP;
            if (buckets[i] > U8_MAX - step)
                buckets[i] = U8_MAX;
            else
                buckets[i] += step;
            *bucket_bitmask |= BIT_MASK(i);
        }
    }
}

wts->bucket_bitmask的16个bit位对应16个busy_buckets，表示对应的bucket是否还处于busy状态。处于busy状态的bucket会在计算task的predict demand时被选中。当调用update_history更新task的历史负载时，需要更新进历史窗口的task scaled runtime通过busy_to_bucket(runtime_scaled)获取到runtime_scaled所属的bucket id(参数idx)，并将bucket_bitmask中的对应的bit置位，表示task在历史窗口的runtime曾经落在这个bucket，表征了task在历史窗口中运行的大小特性。每次调用update_history更新的runtime，runtime落在的bucket，对应的busy_buckets[bidx]值会步进增加，其它busy_buckets则会衰减，当衰减到小于DEC_STEP时，busy_buckets[i]被清零，对应的bit mask被清除。这样busy_buckets[bidx]就反映了近期task runtime特征，值越大表明task在历史窗口的负载越多的落在这个bucket。bucket_bitmask[bit_n]=0表示task的runtime已经很久没有落在这个bucket区间了。目前bucket_bitmask[bidx]的值除了步进或衰减以设置或清除bucket_bitmask ，没有其它用途。

3) add_to_task_demand(struct rq *rq, struct task_struct *p, u64 delta)

static u64 add_to_task_demand(struct rq *rq, struct task_struct *p, u64 delta)
{
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;


    delta = scale_exec_time(delta, rq, wts);
    wts->sum += delta;
    if (unlikely(wts->sum > sched_ravg_window))
        wts->sum = sched_ravg_window;


    return delta;
}

当前窗口task busy time统计，把实际运行时间delta按cpu当前频率和capacity归一化到最大capacity cpu上最高频率运行的时长。累加到到wts->sum。wts->sum最大值为sched_ravg_window，即16ms。在event跨窗口时调用update_history归0，从新开始统计新窗口task的相对运行时间。参考wts.sum。

2. 如何更新cpu负载信息

update_cpu_busy_time是cpu负载信息更新核心函数, 将task runningtime或irqtime更新到:

wrq->grp_time->curr/prev_runnable_sum

wrq->grp_time->nt_curr/prev_runnable_sum

wts->curr/prev_window

wts->curr/prev_window_cpu[cpu]

task属于grp时才更grp_time对应的 sum，irq只更新到wrq对应的sum。

关键函数包括：

1) rollover_task_window(p, full_window)

如果非idle task跨窗口，需要将curr_window赋值给prev_window, rollover_task_window task的runtime统计：
wts->curr/prev_window
wts->curr/prev_window_cpu[i]

wts->active_time

static void rollover_task_window(struct task_struct *p, bool full_window)
{
    u32 *curr_cpu_windows = empty_windows;
    u32 curr_window;
    int i;
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(task_rq(p)));
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;


    /* Rollover the sum */
    curr_window = 0;


    if (!full_window) {
        curr_window = wts->curr_window;
        curr_cpu_windows = wts->curr_window_cpu;
    }


    wts->prev_window = curr_window;
    wts->curr_window = 0;


    /* Roll over individual CPU contributions */
    for (i = 0; i < nr_cpu_ids; i++) {
        wts->prev_window_cpu[i] = curr_cpu_windows[i];
        wts->curr_window_cpu[i] = 0;
    }


    if (is_new_task(p))
        wts->active_time += wrq->prev_window_size;
}

3 10 17 22.当经历一个完整窗口，表示可能存在异常，前窗清零：

wts->prev_window=0，

wts->prev_window_cpu[i]=0

13 14 17 22. 当只是跨一个窗口，更新：

wts->prev_window = wts->curr_window

wts->prev_window_cpu[i] = wts->curr_window_cpu[i]

18 23.wts->curr_window和wts->curr_window_cpu[i]在跨窗口时会被重置为0

2) account_busy_for_cpu_time(rq, p, irqtime, event)

walt_update_task_ravg->update_cpu_busy_time->account_busy_for_cpu_time

static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
                     u64 irqtime, int event)
{
    if (is_idle_task(p)) {
        /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
        if (event == PICK_NEXT_TASK)
            return 0;


        /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
        return irqtime || cpu_is_waiting_on_io(rq);
    }


    if (event == TASK_WAKE)
        return 0;


    if (event == PUT_PREV_TASK || event == IRQ_UPDATE)
        return 1;


    /*
     * TASK_UPDATE can be called on sleeping task, when its moved between
     * related groups
     */
    if (event == TASK_UPDATE) {
        if (rq->curr == p)
            return 1;


        return p->on_rq ? SCHED_FREQ_ACCOUNT_WAIT_TIME : 0;
    }


    /* TASK_MIGRATE, PICK_NEXT_TASK left */
    return SCHED_FREQ_ACCOUNT_WAIT_TIME;
}

4.先看task是否是idle task。对于非idle task，再根据event具体判断cpu的busy time

10.p是idle task，此时只有更新irqtime和cpu在等待io(rq->nr_iowait >0), cpu才属于busytime

13.非idle task的TASK_WAKE event发生时，cpu不是busytime

16.非idle taks的PUT_PREV_TASK 和 IRQ_UPDATE属于busytime

23.非idle taks，TASK_UPDATE时

24.p是curr task，cpu在busytime
27.p不是curr task，但是在runqueue的wait list中，取决于是否设置了SCHED_FREQ_ACCOUNT_WAIT_TIME
27.p不是curr task，也不是runnable状态，cpu不在busytime

31.非idle task，TASK_MIGRATE, PICK_NEXT_TASK对cpu来说，取决于是否设置了SCHED_FREQ_ACCOUNT_WAIT_TIME

3) update_top_tasks(p, rq, old_curr_window, new_window, full_window)

调用路径：

walt_update_task_ravg->update_cpu_busy_time->update_top_tasks

这里的old_curr_window是此次更新前的wts->curr_window。此函数作用是在更新

wrq->top_tasks[curr/prev],

wrq->top_tasks_bitmap[curr/prev]

wrq->top_tasks[2]是两个指针的指针数组，分别记录task在curr和prev window的相对runtime。分别指向1000个u8类型的mem, 即把0~16ms分成1000个buckets，每个bucket为16us。task在curr/prev window的相对runtime如果落在某个bucket，对应bucket的值+1，当bucket值等于1时，置位对应的top_tasks_bitmap位，表示此bucket为busy bucket。当bucket值从1变为0时，clear对应的top_tasks_bitmap位，表示此bucket为非busy bucket。

static void update_top_tasks(struct task_struct *p, struct rq *rq,
        u32 old_curr_window, int new_window, bool full_window)
{
    struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
    struct walt_task_struct *wts = (struct walt_task_struct *) p->android_vendor_data1;
    u8 curr = wrq->curr_table;
    u8 prev = 1 - curr;
    u8 *curr_table = wrq->top_tasks[curr];
    u8 *prev_table = wrq->top_tasks[prev];
    int old_index, new_index, update_index;
    u32 curr_window = wts->curr_window;
    u32 prev_window = wts->prev_window;
    bool zero_index_update;


    if (old_curr_window == curr_window && !new_window)
        return;
    
    old_index = load_to_index(old_curr_window);
    new_index = load_to_index(curr_window);
    
    if (!new_window) {
        zero_index_update = !old_curr_window && curr_window;
        if (old_index != new_index || zero_index_update) {
            if (old_curr_window)
                curr_table[old_index] -= 1;
            if (curr_window)
                curr_table[new_index] += 1;
            if (new_index > wrq->curr_top)
                wrq->curr_top = new_index;
        }


        if (!curr_table[old_index])
            __clear_bit(NUM_LOAD_INDICES - old_index - 1,
                wrq->top_tasks_bitmap[curr]);


        if (curr_table[new_index] == 1)
            __set_bit(NUM_LOAD_INDICES - new_index - 1,
                wrq->top_tasks_bitmap[curr]);


        return;
    }

没跨窗口

21.没跨窗口，更新前的wts->curr_window是0，当前非0，表示task在这个window里第一次被统计

23.如果此次更新task的runtime发生了变化，或者是第一次被统计busy time

25.curr_table指向1000个u8类型的mem, 更新前task runtime所属bucket id即为old_index，因为在同一window中发生了再次更新，原来task runtime落在的curr_table[old_index]减一，当前task runtime落在的curr_table[new_index]加一。

跨一个或多个窗口

    /*
     * The window has rolled over for this task. By the time we get
     * here, curr/prev swaps would has already occurred. So we need
     * to use prev_window for the new index.
     */
    update_index = load_to_index(prev_window);


    if (full_window) {
        /*
         * Two cases here. Either 'p' ran for the entire window or
         * it didn't run at all. In either case there is no entry
         * in the prev table. If 'p' ran the entire window, we just
         * need to create a new entry in the prev table. In this case
         * update_index will be correspond to sched_ravg_window
         * so we can unconditionally update the top index.
         */
        if (prev_window) {
            prev_table[update_index] += 1;
            wrq->prev_top = update_index;
        }


        if (prev_table[update_index] == 1)
            __set_bit(NUM_LOAD_INDICES - update_index - 1,
                wrq->top_tasks_bitmap[prev]);
    } else {
        zero_index_update = !old_curr_window && prev_window;
        if (old_index != update_index || zero_index_update) {
            if (old_curr_window)
                prev_table[old_index] -= 1;


            prev_table[update_index] += 1;


            if (update_index > wrq->prev_top)
                wrq->prev_top = update_index;


            if (!prev_table[old_index])
                __clear_bit(NUM_LOAD_INDICES - old_index - 1,
                        wrq->top_tasks_bitmap[prev]);


            if (prev_table[update_index] == 1)
                __set_bit(NUM_LOAD_INDICES - update_index - 1,

8.跨多个窗口

25.跨一个窗口

top task的使用

waltgov_update_freq->waltgov_get_util(wg_cpu)->cpu_util_freq_walt()->__cpu_util_freq_walt()->

util = scale_time_to_util(freq_policy_load(rq, reason))

wrq->util = util->

freq_policy_load(rq, reason)->tt_load = top_task_load(rq)

WALT在调频获取cpu的util时，会使用top task load：

freq_policy_load->tt_load = top_task_load(rq)

tt_load = (wrq->prev_top + 1) * sched_load_granule

这里的wrq->prev_top表示前窗中，最大相对runtime时长的task，落在1000个bucket中的index值。

 551 /*
 552  * Special case the last index and provide a fast path for index = 0.
 553  * Note that sched_load_granule can change underneath us if we are not
 554  * holding any runqueue locks while calling the two functions below.
 555  */
 556 static u32 top_task_load(struct rq *rq)
 557 {
 558     struct walt_rq *wrq = &per_cpu(walt_rq, cpu_of(rq));
 559     int index = wrq->prev_top;
 560     u8 prev = 1 - wrq->curr_table;
 561 
 562     if (!index) {
 563         int msb = NUM_LOAD_INDICES - 1;
 564 
 565         if (!test_bit(msb, wrq->top_tasks_bitmap[prev]))
 566             return 0;
 567         else
 568             return sched_load_granule;
 569     } else if (index == NUM_LOAD_INDICES - 1) {
 570         return sched_ravg_window;
 571     } else {
 572         return (index + 1) * sched_load_granule;
 573     }
 574 }

packing cpu

walt_find_and_choose_cluster_packing_cpu

1068 /* walt_find_and_choose_cluster_packing_cpu - Return a packing_cpu choice common for this cluster.                                                                                                                                     
1069  * @start_cpu:  The cpu from the cluster to choose from                                                                                                                                                                                
1070  *                                                                                                                                                                                                                                     
1071  * If the cluster has a 32bit capable cpu return it regardless                                                                                                                                                                         
1072  * of whether it is halted or not.                                                                                                                                                                                                     
1073  *                                                                                                                                                                                                                                     
1074  * If the cluster does not have a 32 bit capable cpu, find the                                                                                                                                                                         
1075  * first unhalted, active cpu in this cluster.                                                                                                                                                                                         
1076  *                                                                                                                                                                                                                                     
1077  * Returns -1 if packing_cpu if not found or is unsuitable to be packed on  to                                                                                                                                                         
1078  * Returns a valid cpu number if packing_cpu is found and is usable                                                                                                                                                                    
1079  */                                                                                                                                                                                                                                    
1080 static inline int walt_find_and_choose_cluster_packing_cpu(int start_cpu, struct task_struct *p)                                                                                                                                       
1081 {                                                                                                                                                                                                                                      
1082     struct walt_rq *wrq = &per_cpu(walt_rq, start_cpu);                                                                                                                                                                                
1083     struct walt_sched_cluster *cluster = wrq->cluster;                                                                                                                                                                                 
1084     cpumask_t unhalted_cpus;                                                                                                                                                                                                           
1085     int packing_cpu;                                                                                                                                                                                                                   
1086                                                                                                                                                                                                                                        
1087     /* if idle_enough feature is not enabled */                                                                                                                                                                                        
1088     if (!sysctl_sched_idle_enough)                                                                                                                                                                                                     
1089         return -1;                                                                                                                                                                                                                     
1090     if (!sysctl_sched_cluster_util_thres_pct)                                                                                                                                                                                          
1091         return -1;                                                                                                                                                                                                                     
1092                                                                                                                                                                                                                                        
1093     /* find all unhalted active cpus */                                                                                                                                                                                                
1094     cpumask_andnot(&unhalted_cpus, cpu_active_mask, cpu_halt_mask);                                                                                                                                                                    
1095                                                                                                                                                                                                                                        
1096     /* find all unhalted active cpus in this cluster */                                                                                                                                                                                
1097     cpumask_and(&unhalted_cpus, &unhalted_cpus, &cluster->cpus);                                                                                                                                                                       
1098                                                                                                                                                                                                                                        
1099     if (is_compat_thread(task_thread_info(p)))                                                                                                                                                                                         
1100         /* try to find a packing cpu within 32 bit subset */                                                                                                                                                                           
1101         cpumask_and(&unhalted_cpus, &unhalted_cpus, system_32bit_el0_cpumask());                                                                                                                                                       
1102                                                                                                                                                                                                                                        
1103     /* return the first found unhalted, active cpu, in this cluster */                                                                                                                                                                 
1104     packing_cpu = cpumask_first(&unhalted_cpus);                                                                                                                                                                                       
1105                                                                                                                                                                                                                                        
1106     /* packing cpu must be a valid cpu for runqueue lookup */                                                                                                                                                                          
1107     if (packing_cpu >= nr_cpu_ids)                                                                                                                                                                                                     
1108         return -1;                                                                                                                                                                                                                     
1109                                                                                                                                                                                                                                        
1110     /* if cpu is not allowed for this task */                                                                                                                                                                                          
1111     if (!cpumask_test_cpu(packing_cpu, p->cpus_ptr))                                                                                                                                                                                   
1112         return -1;
1113                                                                                                                                                                                                                                        
1114     /* if cluster util is high */                                                                                                                                                                                                      
1115     if (sched_get_cluster_util_pct(cluster) >= sysctl_sched_cluster_util_thres_pct)                                                                                                                                                    
1116         return -1;                                                                                                                                                                                                                     
1117                                                                                                                                                                                                                                        
1118     /* if cpu utilization is high */                                                                                                                                                                                                   
1119     if (cpu_util(packing_cpu) >= sysctl_sched_idle_enough)                                                                                                                                                                             
1120         return -1;                                                                                                                                                                                                                     
1121                                                                                                                                                                                                                                        
1122     /* don't pack big tasks */                                                                                                                                                                                                         
1123     if (task_util(p) >= sysctl_sched_idle_enough)                                                                                                                                                                                      
1124         return -1;                                                                                                                                                                                                                     
1125                                                                                                                                                                                                                                        
1126     if (task_reject_partialhalt_cpu(p, packing_cpu))                                                                                                                                                                                   
1127         return -1;                                                                                                                                                                                                                     
1128                                                                                                                                                                                                                                        
1129     /* don't pack if running at a freq higher than 43.9pct of its fmax */                                                                                                                                                              
1130     if (arch_scale_freq_capacity(packing_cpu) > 450)                                                                                                                                                                                   
1131         return -1;                                                                                                                                                                                                                     
1132                                                                                                                                                                                                                                        
1133     /* the packing cpu can be used, so pack! */                                                                                                                                                                                        
1134     return packing_cpu;                                                                                                                                                                                                                
1135 }

代码看起来是的，

1. 首先要设置sysctl_sched_idle_enough，表示让cpu尽可能进入idle状态，packing才会生效；

2. 在start cpu对应的cluster里，考虑cpu_active_mask，cpu_halt_mask，如果是32位进程考虑支持32位进程的cpu mask system_32bit_el0_cpumask，考虑task的cpu affinity p->cpus_ptr。在符合这些条件的cpu中找packing cpu。

3. 检查cluster的cpu使用率(前一个window)是否大于sysctl_sched_cluster_util_thres_pct(40), 只有cluster负载轻时才packing

4. 检查找到的packing cpu的cpu util是否大于sysctl_sched_idle_enough(30), 只有cpu的util小于30时才能被选座packing cpu。

5. 检查将要packing的task的util是否大于sysctl_sched_idle_enough(30)，只有task util小于30的小task才能packing

6. 检查packing cpu是否能是partial halt cpu（加入uifirst和 frameboost检查）

7. 检查packing cpu当前频点的freq_scale是否小于450. 要求packing cpu的当前要处在较低频点。freq_scale=1024*(cpu_curr_freq/cpu_max_freq)