d519329 sched/fair: Update util_est only on util_avg updates
a07630b sched/cpufreq/schedutil: Use util_est for OPP selection
f9be3e5 sched/fair: Use util_est in LB and WU paths
7f65ea4 sched/fair: Add util_est on top of PELT
-
Introduction
After a big task sleeps for a long time, its PELT util_avg decays to a very small value, and the PELT signal of a running task changes at millisecond granularity. The instantaneous util therefore does not objectively reflect a task's utilization for scheduler decisions, so util_est was introduced to give a more stable view of task/cfs_rq utilization. The AOSP 4.14 kernel already implements this feature.
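As a concrete example of how the estimate is consumed (commits a07630b and f9be3e5 above), cpu_util() ends up taking the max of the PELT signal and the enqueued estimate, clamped to the CPU's capacity. Roughly, per the ~v4.19 code (the exact body varies between kernel versions, so treat this as a sketch):

static unsigned long cpu_util(int cpu)
{
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned int util = READ_ONCE(cfs_rq->avg.util_avg);

        if (sched_feat(UTIL_EST))
                util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

        return min_t(unsigned long, util, capacity_orig_of(cpu));
}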
-
Data structures
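util_est lives in struct sched_avg, one instance per sched_entity and per cfs_rq. A rough sketch of the structure and of the helpers the dequeue path below relies on, following the original patch set (~v4.19; field layout and details may differ in other trees):

struct util_est {
        unsigned int            enqueued;  /* snapshot of util taken at enqueue time */
        unsigned int            ewma;      /* exponentially weighted moving average of past snapshots */
#define UTIL_EST_WEIGHT_SHIFT   2          /* weight of a new sample: w = 1/4 */
} __attribute__((__aligned__(sizeof(u64))));

static inline unsigned long task_util(struct task_struct *p)
{
        return READ_ONCE(p->se.avg.util_avg);
}

static inline unsigned long _task_util_est(struct task_struct *p)
{
        struct util_est ue = READ_ONCE(p->se.avg.util_est);

        return max(ue.ewma, ue.enqueued);
}

/* The utilization value the scheduler actually uses for a task */
static inline unsigned long task_util_est(struct task_struct *p)
{
        return max(task_util(p), _task_util_est(p));
}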
-
Function call relationships
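A rough sketch of the call paths involved (v4.19-era names; not exhaustive):

enqueue_task_fair() -> util_est_enqueue()
dequeue_task_fair() -> util_est_dequeue()
update_load_avg() -> __update_load_avg_se() -> cfs_se_util_change()   (clears UTIL_AVG_UNCHANGED)
schedutil OPP selection and the wake-up/load-balance paths read the estimate via cpu_util() / task_util_est()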
-
Code logic
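util_est_dequeue(), annotated below, does the following in outline:
1. subtract the task's estimated utilization from the root cfs_rq's util_est.enqueued;
2. if the task is not going to sleep (e.g. it is only being migrated), stop;
3. if the task's util_avg has not changed since enqueue (UTIL_AVG_UNCHANGED still set), skip the task-side update;
4. take a new sample ue.enqueued = task_util(p); with UTIL_EST_FASTUP (v5.4+), jump the ewma straight up to it if it is higher;
5. if the new sample is within ~1% of the ewma, skip the update;
6. if the task's util exceeds the CPU's original capacity (no idle time can be guaranteed), skip the update;
7. otherwise fold the new sample into the ewma with weight w = 1/4 and write it back.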
-
Parameters
task_sleep: DEQUEUE_SLEEP - task is no longer runnable
1.1 UTIL_AVG_UNCHANGED uses the last bit (LSB) of the task's util_est.enqueued to mark whether the task's util_avg has been updated. The task's util_est is updated only if its util_avg has been updated. For small, frequently running microsecond-scale tasks (the minimum update granularity of util_avg is 1ms), util_est is therefore not updated on every dequeue, which cuts the cost of updating util_est and avoids pointless work while the task's util_avg has not changed.
-
Annotated code
3751 static void
3752 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
3753 {
3754 long last_ewma_diff;
3755 struct util_est ue;
3756 int cpu;
3757
3758 if (!sched_feat(UTIL_EST))
3759 return;
3760
3761 /* Update root cfs_rq's estimated utilization */
3762 ue.enqueued = cfs_rq->avg.util_est.enqueued;
3763 ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
3764 WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
3765
3766 /*
3767 * Skip update of task's estimated utilization when the task has not
3768 * yet completed an activation, e.g. being migrated.
3769 */
If the task was migrated away rather than going to sleep, there is no need to update the task's util_est.
3770 if (!task_sleep)
3771 return;
3772
3773 /*
3774 * If the PELT values haven't changed since enqueue time,
3775 * skip the util_est update.
3776 */
UTIL_AVG_UNCHANGED: d519329 sched/fair: Update util_est only on util_avg updates
The last bit of util_est.enqueued marks whether the task's PELT util_avg has been updated since the task was enqueued. If it has not been updated (a very small, microsecond-scale task; util_avg is updated at a minimum granularity of 1ms), util_est does not need to be updated either, which reduces the overhead of updating util_est.
1. Where is UTIL_AVG_UNCHANGED cleared?
__update_load_avg_se() -> cfs_se_util_change(&se->avg) clears the UTIL_AVG_UNCHANGED bit in the task's avg->util_est.enqueued; that is, once the task's (and its rq's) util has been updated, the flag is cleared to report that the task's util_avg has changed.
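For reference, cfs_se_util_change() in kernel/sched/pelt.h looks roughly like this (~v4.19/v5.4; comments paraphrased):

static inline void cfs_se_util_change(struct sched_avg *avg)
{
        unsigned int enqueued;

        if (!sched_feat(UTIL_EST))
                return;

        /* Avoid the store if the flag has already been cleared */
        enqueued = avg->util_est.enqueued;
        if (!(enqueued & UTIL_AVG_UNCHANGED))
                return;

        /* Clear the flag to report that util_avg has been updated */
        enqueued &= ~UTIL_AVG_UNCHANGED;
        WRITE_ONCE(avg->util_est.enqueued, enqueued);
}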
3777 ue = p->se.avg.util_est;
3778 if (ue.enqueued & UTIL_AVG_UNCHANGED)
3779 return;
3780
3781 /*
3782 * Skip update of task's estimated utilization when its EWMA is
3783 * already ~1% close to its last activation value.
3784 */
2. Where is UTIL_AVG_UNCHANGED set?
On assignment: util_est.enqueued = task_util(p) (the current util_avg), with UTIL_AVG_UNCHANGED OR'd in.
3785 ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
Kernel 5.4 adds the UTIL_EST_FASTUP sched feature, which speeds up ramping the ewma by taking max{ewma(t-1), enqueued}, where enqueued is task_util(p); i.e. the ewma jumps up immediately on utilization increases, and the moving average is only used to smooth utilization decreases.
3874 if (sched_feat(UTIL_EST_FASTUP)) {
3875 if (ue.ewma < ue.enqueued) {
3876 ue.ewma = ue.enqueued;
3877 goto done;
3878 }
3879 }
3786 last_ewma_diff = ue.enqueued - ue.ewma;
3787 if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
3788 return;
3789
3790 /*
3791 * To avoid overestimation of actual task utilization, skip updates if
3792 * we cannot grant there is idle time in this CPU.
3793 */
3794 cpu = cpu_of(rq_of(cfs_rq));
3795 if (task_util(p) > capacity_orig_of(cpu))
3796 return;
3797
3798 /*
3799 * Update Task's estimated utilization
3800 *
3801 * When *p completes an activation we can consolidate another sample
3802 * of the task size. This is done by storing the current PELT value
3803 * as ue.enqueued and by using this value to update the Exponential
3804 * Weighted Moving Average (EWMA):
3805 *
3806 * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
3807 * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
3808 * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
3809 * = w * ( last_ewma_diff ) + ewma(t-1)
3810 * = w * (last_ewma_diff + ewma(t-1) / w)
3811 *
ewma: the latest task_util(p) and the previous ewma are blended proportionally into the new ewma. With w = 0.25, the latest task util contributes 1/4 of the new ewma and the previous ewma contributes 3/4; the smaller w, the stronger the smoothing, i.e. the more gradually the ewma changes.
3812 * Where 'w' is the weight of new samples, which is configured to be
3813 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
3814 */
3815 ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
3816 ue.ewma += last_ewma_diff;
3817 ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
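To make the shift arithmetic above concrete, a minimal standalone sketch in plain C (ewma_update() is a made-up name, not kernel code; the input numbers are only for illustration):

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT 2   /* w = 1/4 */

/* ewma(t) = ewma(t-1) + w * (enqueued - ewma(t-1)), computed with shifts */
static unsigned int ewma_update(unsigned int ewma, unsigned int enqueued)
{
        long last_ewma_diff = (long)enqueued - (long)ewma;

        ewma <<= UTIL_EST_WEIGHT_SHIFT;   /* ewma * 4            */
        ewma += last_ewma_diff;           /* + (enqueued - ewma) */
        ewma >>= UTIL_EST_WEIGHT_SHIFT;   /* / 4                 */
        return ewma;
}

int main(void)
{
        /* previous ewma = 100, latest task_util(p) = 300:
         * new ewma = 100 + (300 - 100) / 4 = 150
         */
        printf("%u\n", ewma_update(100, 300));
        return 0;
}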
3. Update p->se.avg.util_est
3818 WRITE_ONCE(p->se.avg.util_est, ue);
3819 }