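The goal of schedutil frequency scaling is to give the CPU exactly as much performance as it needs. Does that alone balance power and performance? Only when the system has a single core. The phone industry, however, is fiercely competitive: a phone without an eight-core processor is hardly taken seriously, let alone a single-core one, and vendors have gone beyond eight cores to heterogeneous multiprocessors. The HMP (heterogeneous multi-processing) architecture is meant to reconcile performance and power. One chip packages two classes of ARM cores: high-performance cores (usually called big cores) for heavy workloads such as games and software video encoding, and low-performance cores (usually called little cores) for light workloads such as short videos, phone calls and music.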
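With heterogeneous multiprocessors, how load is distributed across CPUs becomes especially important. For example, when the CPU loads within a cluster are distributed as in the figure below, the frequency governor sets the target frequency from the highest CPU load in the cluster. That satisfies CPU1's demand for performance, but for the other CPUs the extra performance is clearly wasted.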
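The effect of a shared cluster frequency can be sketched in a few lines of user-space C. This is only an illustration, not the kernel's code: the function names and the sample numbers are hypothetical, and the 25% headroom in `util_to_freq()` is an assumption modelled on how schedutil maps utilization to frequency in kernels of this era. The result is not clamped to the maximum frequency here.

```c
#include <stddef.h>

/* Hypothetical helper: map a utilization value to a frequency with
 * 25% headroom, i.e. freq = 1.25 * max_freq * util / max_cap.
 * The result is not clamped to max_freq in this sketch. */
static unsigned long util_to_freq(unsigned long util, unsigned long max_freq,
				  unsigned long max_cap)
{
	return (max_freq + (max_freq >> 2)) * util / max_cap;
}

/* All CPUs of a cluster share one clock, so the busiest CPU decides
 * the frequency that every CPU in the cluster runs at. */
static unsigned long cluster_target_freq(const unsigned long *cpu_util,
					 size_t nr_cpus,
					 unsigned long max_freq,
					 unsigned long max_cap)
{
	unsigned long max_util = 0;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		if (cpu_util[i] > max_util)
			max_util = cpu_util[i];
	}
	return util_to_freq(max_util, max_freq, max_cap);
}
```

With utilizations of {100, 800, 50, 60} on a cluster whose maximum frequency is 2000000 kHz and whose capacity is 1024, the single busy CPU pulls the whole cluster close to its top frequency, even though the other three CPUs would be served by a fraction of it.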
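Another case is shown in the figure below: CPUs 0-3 are little cores and 4-7 are big cores, and none of them is heavily loaded, so all the work could run on the little cores. It is like hauling with a truck a load that would fit in a car: the energy efficiency is naturally poor. The goal of EAS, an energy-saving scheduling algorithm, is to reduce power consumption as far as possible through load placement while still guaranteeing system performance.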
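To balance performance and power, the kernel introduced the EAS (Energy Aware Scheduling) mechanism. When picking a CPU, EAS works out on which CPU the task's demand for compute capacity can be met at the lowest energy cost.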
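1. Energy Model
The EAS scheduler has to consider not only CPU compute capacity (which depends on big versus little core and on frequency) but also power, so it needs to know how much power a CPU draws at each capacity level. The kernel stores this frequency and power information in struct em_cap_state. The member unsigned long frequency is the frequency, unsigned long power is the power drawn at that frequency, and unsigned long cost is a helper value introduced to simplify later calculations: cost = power * max_frequency / frequency.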
struct em_cap_state {
	unsigned long frequency;	// frequency
	unsigned long power;		// power at this frequency
	unsigned long cost;		// power * max_frequency / frequency
};
// Computing cost
static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
						struct em_data_callback *cb)
{
	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
	unsigned long power, freq, prev_freq = 0;
	int i, ret, cpu = cpumask_first(span);
	struct em_cap_state *table;
	struct em_perf_domain *pd;
	u64 fmax;
	.....................................................................
	// Compute cost for every state: cost = power * fmax / frequency
	fmax = (u64) table[nr_states - 1].frequency;
	for (i = 0; i < nr_states; i++) {
		unsigned long power_res = em_scale_power(table[i].power);
		table[i].cost = div64_u64(fmax * power_res,
					  table[i].frequency);
	}
	pd->table = table;
	pd->nr_cap_states = nr_states;
	cpumask_copy(to_cpumask(pd->cpus), span);
	em_debug_create_pd(pd, cpu);
	return pd;
	.............................................................
}
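The cost formula can be tried outside the kernel with a small stand-alone sketch. The struct below only mirrors the three fields of struct em_cap_state; the name `fill_costs()` and the sample table are hypothetical, and the power scaling step (`em_scale_power()`) from the excerpt is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in with the same three fields as struct em_cap_state. */
struct cap_state {
	unsigned long frequency;	/* e.g. kHz */
	unsigned long power;		/* e.g. mW */
	unsigned long cost;		/* filled in by fill_costs() */
};

/* Fill in cost = power * fmax / frequency for every entry.
 * The table is assumed to be sorted by ascending frequency, so the
 * last entry holds the maximum frequency. */
static void fill_costs(struct cap_state *table, size_t nr_states)
{
	uint64_t fmax = table[nr_states - 1].frequency;
	size_t i;

	for (i = 0; i < nr_states; i++)
		table[i].cost = (unsigned long)(fmax * table[i].power /
						table[i].frequency);
}
```

For a table of (500000, 100), (1000000, 300), (2000000, 900) the costs come out as 400, 600 and 900. The point of precomputing cost is that the capacity delivered at frequency f is proportional to f / fmax, so the energy spent on a given amount of utilization at that state is proportional to power * fmax / f; storing that product once avoids a division on every wake-up.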
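The relationship between CPUs and the energy model is shown in the figure below. Each CPU has a run queue (struct rq) that manages the threads running on it, and the run queue reaches the list of performance domains through its root domain (rq->rd->pd). CPUs in the same cluster have the same performance and share one performance domain, struct perf_domain. Its member struct em_perf_domain *em_pd stores the power drawn at each frequency point.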
struct rq {
	..........................................................
	struct cfs_rq cfs;
	struct rt_rq rt;
	struct dl_rq dl;
	..........................................................
	struct root_domain *rd;
	struct sched_domain __rcu *sd;
	unsigned long cpu_capacity;
	unsigned long cpu_capacity_orig;
	.........................................................
};
struct root_domain {
	atomic_t refcount;
	atomic_t rto_count;
	struct rcu_head rcu;
	cpumask_var_t span;
	cpumask_var_t online;
	..........................................................
	/*
	 * NULL-terminated list of performance domains intersecting with the
	 * CPUs of the rd. Protected by RCU.
	 */
	struct perf_domain __rcu *pd;
};
struct perf_domain {
	struct em_perf_domain *em_pd;
	struct perf_domain *next;
	struct rcu_head rcu;
};
struct em_perf_domain {
	struct em_cap_state *table;	// power at each frequency
	int nr_cap_states;
	unsigned long cpus[0];
};
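To make the chain from performance-domain list to cost table more concrete, here is a simplified user-space sketch that walks a NULL-terminated list of domains and estimates the energy of each one as cost * sum_util / capacity. It is a sketch under stated assumptions, not the kernel's implementation: the `*_sketch` types, the `scale_cpu`, `max_util` and `sum_util` fields, and both function names are hypothetical, and the headroom the kernel applies when mapping utilization to a frequency is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures quoted above. */
struct cap_state {
	unsigned long frequency;
	unsigned long power;
	unsigned long cost;		/* power * fmax / frequency */
};

struct em_pd_sketch {
	const struct cap_state *table;	/* sorted by ascending frequency */
	int nr_cap_states;
	unsigned long scale_cpu;	/* hypothetical: capacity of one CPU at fmax */
};

struct pd_sketch {
	const struct em_pd_sketch *em_pd;
	const struct pd_sketch *next;	/* NULL-terminated list */
	unsigned long max_util;		/* hypothetical: busiest CPU in the domain */
	unsigned long sum_util;		/* hypothetical: summed utilization of the domain */
};

/* Pick the lowest state whose frequency covers max_util, then charge
 * that state's cost in proportion to the summed utilization. */
static unsigned long pd_energy(const struct pd_sketch *pd)
{
	const struct em_pd_sketch *em = pd->em_pd;
	const struct cap_state *cs = &em->table[em->nr_cap_states - 1];
	uint64_t freq = (uint64_t)cs->frequency * pd->max_util / em->scale_cpu;
	int i;

	for (i = 0; i < em->nr_cap_states; i++) {
		cs = &em->table[i];
		if (cs->frequency >= freq)
			break;
	}
	return (unsigned long)((uint64_t)cs->cost * pd->sum_util / em->scale_cpu);
}

/* Walk the list the same way the scheduler walks rd->pd. */
static unsigned long total_energy(const struct pd_sketch *pd)
{
	unsigned long sum = 0;

	for (; pd; pd = pd->next)
		sum += pd_energy(pd);
	return sum;
}
```

Because the busiest CPU selects the state while the summed utilization scales the cost, one hot CPU makes every other CPU in the same domain more expensive, which is exactly the waste described at the start of this article.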
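2. EAS CPU selection process
The entry point for EAS CPU selection is find_energy_efficient_cpu(), which picks the most energy-efficient CPU. It walks every CPU of every performance domain, takes the CPU with the most spare capacity in each domain as that domain's candidate, estimates how much the energy would increase if the task were migrated there, and chooses the candidate with the smallest increase. Two restrictions apply. First, a CPU is skipped if its utilization with the task added would reach 80% of its capacity. Second, if migrating to the target CPU saves less than about 6% (1/16, the ">> 4" in the code) of the estimated energy compared with staying on the previous CPU, the task stays on the previous CPU.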
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
	unsigned long cpu_cap, util, base_energy = 0;
	int cpu, best_energy_cpu = prev_cpu;
	struct sched_domain *sd;
	struct perf_domain *pd;

	rcu_read_lock();
	pd = rcu_dereference(rd->pd);
	if (!pd || READ_ONCE(rd->overutilized))
		goto fail;

	/*
	 * Energy-aware wake-up happens on the lowest sched_domain starting
	 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
	 */
	sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
		sd = sd->parent;
	if (!sd)
		goto fail;

	sync_entity_load_avg(&p->se);
	if (!task_util_est(p))
		goto unlock;

	// Walk every performance domain
	for (; pd; pd = pd->next) {
		unsigned long cur_delta, spare_cap, max_spare_cap = 0;
		unsigned long base_energy_pd;
		int max_spare_cap_cpu = -1;

		/* Compute the 'base' energy of the pd, without @p */
		base_energy_pd = compute_energy(p, -1, pd);
		base_energy += base_energy_pd;

		// Walk every CPU in the performance domain
		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
				continue;

			// Utilization of cpu if p were migrated to it
			util = cpu_util_next(cpu, p, cpu);
			cpu_cap = capacity_of(cpu);
			// Skip the CPU if that utilization would reach 80% of its capacity
			if (!fits_capacity(util, cpu_cap))
				continue;

			/* Always use prev_cpu as a candidate. */
			if (cpu == prev_cpu) {
				prev_delta = compute_energy(p, prev_cpu, pd);
				prev_delta -= base_energy_pd;
				best_delta = min(best_delta, prev_delta);
			}

			// Find the CPU with the most spare capacity in this performance domain.
			// CPUs in one domain share the same energy model, so comparing utilization is enough.
			spare_cap = cpu_cap - util;
			if (spare_cap > max_spare_cap) {
				max_spare_cap = spare_cap;
				max_spare_cap_cpu = cpu;
			}
		}

		/* Compare the energy increase of migrating into each performance
		 * domain and keep the smallest. Different domains have different
		 * energy models, so the increase has to be computed to compare them. */
		if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
			cur_delta = compute_energy(p, max_spare_cap_cpu, pd); // energy of the pd with p on max_spare_cap_cpu
			cur_delta -= base_energy_pd; // energy increase
			if (cur_delta < best_delta) { // keep the smaller one
				best_delta = cur_delta;
				best_energy_cpu = max_spare_cap_cpu;
			}
		}
	}
unlock:
	rcu_read_unlock();

	// If migrating saves less than 1/16 (about 6%) of the estimated energy
	// with p on prev_cpu, keep running on prev_cpu
	if (prev_delta == ULONG_MAX)
		return best_energy_cpu;
	if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
		return best_energy_cpu;
	return prev_cpu;

fail:
	rcu_read_unlock();
	return -1;
}
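The two thresholds used by this function, the 80% capacity check and the 1/16 energy-saving margin, are easy to check with concrete numbers. The following stand-alone sketch reproduces only that arithmetic. `fits()` is written in the same form as the kernel's fits_capacity() macro in kernels of this era, which is an assumption about the version being discussed; `pick_cpu()` is a hypothetical helper that mirrors the final three return statements.

```c
#include <limits.h>
#include <stdbool.h>

/* util * 1280 < cap * 1024 is equivalent to util < 0.8 * cap. */
static bool fits(unsigned long util, unsigned long cap)
{
	return util * 1280 < cap * 1024;
}

/* Decide between prev_cpu and the best candidate, given the energy
 * increases (deltas) and the base energy of the system without the task. */
static int pick_cpu(int prev_cpu, int best_cpu,
		    unsigned long prev_delta, unsigned long best_delta,
		    unsigned long base_energy)
{
	/* prev_cpu was never a valid candidate: take the best one found. */
	if (prev_delta == ULONG_MAX)
		return best_cpu;

	/* Migrate only if the saving exceeds 1/16 of the total energy
	 * estimated with the task left on prev_cpu. */
	if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
		return best_cpu;

	return prev_cpu;
}
```

On a CPU of capacity 1024, a projected utilization of 819 still fits while 820 does not. With a base energy of 500 and a prev_delta of 100 the margin is 600 >> 4 = 37, so a candidate with a delta of 60 (saving 40) wins, while one with a delta of 70 (saving 30) does not and the task stays where it was. The margin avoids migrations, which have their own cost, for savings too small to matter.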