前言
这篇文章会介绍 CFS 的基本原理并从源码上分析流程,内容顺序是按照我看源码的顺序展开的。
NOTES:
- 本文在名词上「task」和「进程」是混用的,都是指代 task_struct 结构体。
- 实际上由于本文作者非常菜,所以内容很浅。
基本设计和公平调度
CFS 的实现存放在 fair.c 文件中,毕竟全称是完全公平调度器(Completely Fair Scheduler)。
那怎样才算公平?你可以通过作者对 CFS 的介绍(sched-design-CFS.txt)来了解,总结如下:
- 作者希望模拟一个理想硬件,即使是单 cpu,task 也是可以并行处理的(假设有 2 个 task,那就同时刻均分 50% 的 cpu 计算能力)。
- 而对比真实硬件,一个 task 在执行,另一个 task 就需要等待,对于被等待的 task 来说,属于受到不公平的待遇,作者在这里用了一个等待时间的概念去统计这个不公平程度,等待时间越长,越是意味着迫切需要运行。
- 为了尽可能模拟理想硬件,那就缩短等待时间,每个关键路径让等待最长的运行,于是有了 timeline sort 的做法;并且在现在的实现中改用运行时间而非等待时间去统计,因此有了现在的 vruntime 和 rbtree。
个人认为,公平性是体现在 2 个方面:
- 一是保证(runqueue 上的)所有 task 在一个调度周期(或者说调度延迟)内都能得到至少一次运行的机会(进一步可以推出每个 task 单次运行不能超过该周期的长度)。
- 二是每次挑选最小运行时间的实体来执行,在这种贪心思想的长期运行下所有实体的运行时间会趋于相同(理想情况下,所有 task 的虚拟运行时间相等)。
总而言之,CFS 的设计是非常朴素的。
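朴素到它的核心想法可以用几十行用户态代码模拟出来。下面是一个极简的示意(纯属演示,与内核实现无关,结构与数值均为假设):每个 tick 都挑 vruntime 最小的实体运行,运行后按 \(\Delta vruntime = \Delta t \times \frac{NICE\_0\_LOAD}{weight}\) 推进它的虚拟时间。
// 一个极简的 CFS 模拟:每次贪心地挑 vruntime 最小的实体(演示代码,非内核实现)
#include <stdio.h>
#include <stdint.h>
#define NICE_0_LOAD 1024
#define TICK_NS     1000000ULL   // 假设每次运行 1ms
struct toy_entity {
    const char *name;
    uint64_t    weight;    // 权重,越大代表优先级越高
    uint64_t    vruntime;  // 虚拟运行时间
};
int main(void)
{
    struct toy_entity tasks[] = {
        { "A(nice=0)",  1024, 0 },
        { "B(nice=0)",  1024, 0 },
        { "C(heavier)", 2048, 0 },  // 假设一个权重更大的 task
    };
    int n = sizeof(tasks) / sizeof(tasks[0]);
    for (int step = 0; step < 8; step++) {
        // 1. 挑 vruntime 最小的(内核里对应 rbtree 的 leftmost)
        int min = 0;
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < tasks[min].vruntime)
                min = i;
        // 2. 让它运行一个 tick,并按权重折算推进 vruntime
        tasks[min].vruntime += TICK_NS * NICE_0_LOAD / tasks[min].weight;
        printf("run %-11s vruntime=%llu\n", tasks[min].name,
               (unsigned long long)tasks[min].vruntime);
    }
    return 0;
}
可以看到权重更大的实体被挑中的次数更多,但所有实体的 vruntime 长期会趋于相同。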
从进程创建说起:调度初始化
在 fork() 或者 clone() 流程中,对调度进行初始化的工作在子流程 sched_fork() 中完成。
RTFSC/sched_fork.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
/*
* fork()/clone()-time setup:
*/
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
int cpu = get_cpu();
__sched_fork(clone_flags, p);
/*
* We mark the process as NEW here. This guarantees that
* nobody will actually run it, and a signal or other external
* event cannot wake it up and insert it on the runqueue either.
*/
// 首次创建的进程设为TASK_NEW状态
// 确保仍在初始化阶段,不会唤醒,也不会丢到ready queue中
p->state = TASK_NEW;
/*
* Make sure we do not leak PI boosting priority to the child.
*/
p->prio = current->normal_prio;
/*
* Revert to default priority/policy on fork if requested.
*/
if (unlikely(p->sched_reset_on_fork)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
p->rt_priority = 0;
} else if (PRIO_TO_NICE(p->static_prio) < 0)
p->static_prio = NICE_TO_PRIO(0);
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
*/
p->sched_reset_on_fork = 0;
}
// 根据prio的值,决定是dl、rt还是fair调度
// 从这里可看到调度器的优先级
if (dl_prio(p->prio)) {
put_cpu();
return -EAGAIN;
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
} else {
p->sched_class = &fair_sched_class;
}
// 初始化se的负载信息
init_entity_runnable_average(&p->se);
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
* is ran before sched_fork().
*
* Silence PROVE_RCU.
*/
raw_spin_lock_irqsave(&p->pi_lock, flags);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
*/
// 设定CPU,通过get_cpu()来得到cpu变量,由TI管理,用完记得put_cpu()
__set_task_cpu(p, cpu);
if (p->sched_class->task_fork)
// 进入到sched class相关的回调
// 目前看的fair调度,对应于task_fork_fair
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
#ifdef CONFIG_SCHED_INFO
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
p->on_cpu = 0;
#endif
init_task_preempt_count(p);
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
put_cpu();
return 0;
}
/*
* Perform scheduler related setup for a newly forked process p.
* p is forked by current.
*
* __sched_fork() is basic setup used by init_idle() too:
*/
// 对p执行基本的构造
// 大部分调度相关的信息存放在se字段(schedule entity),这样可以抽象出组调度特性
static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
p->on_rq = 0;
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
init_dl_inactive_task_timer(&p->dl);
__dl_clear_params(p);
INIT_LIST_HEAD(&p->rt.run_list);
p->rt.timeout = 0;
p->rt.time_slice = sched_rr_timeslice;
p->rt.on_rq = 0;
p->rt.on_list = 0;
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
init_numa_balancing(clone_flags, p);
}
// 文件:/kernel/sched/fair.c
/*
* called on fork with the child task as argument from the parent's context
* - child not yet on the tasklist
* - preemption disabled
*/
static void task_fork_fair(struct task_struct *p)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se, *curr;
struct rq *rq = this_rq();
struct rq_flags rf;
rq_lock(rq, &rf);
update_rq_clock(rq);
cfs_rq = task_cfs_rq(current);
curr = cfs_rq->curr;
if (curr) {
// 更新curr的调度信息
// 比如让vruntime更新到最新时间
// 以及更新rq上的min vruntime记录
update_curr(cfs_rq);
// 初始化se中的vruntime
se->vruntime = curr->vruntime;
}
// 进一步处理se中的vruntime
place_entity(cfs_rq, se, 1);
if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
/*
* Upon rescheduling, sched_class::put_prev_task() will place
* 'current' within the tree based on its new key value.
*/
swap(curr->vruntime, se->vruntime);
resched_curr(rq);
}
se->vruntime -= cfs_rq->min_vruntime;
rq_unlock(rq, &rf);
}
// 文件:/kernel/sched/core.c
/*
* resched_curr - mark rq's current task 'to be rescheduled now'.
*
* On UP this means the setting of the need_resched flag, on SMP it
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
int cpu;
lockdep_assert_held(&rq->lock);
if (test_tsk_need_resched(curr))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
// 打上TIF标记,表示需要调度
set_tsk_need_resched(curr);
set_preempt_need_resched();
return;
}
if (set_nr_and_not_polling(curr))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
}
流程总结如下:
- 首先是 task 状态,对于新创建的进程设为 TASK_NEW,表示不会被调度器拉入队或者唤醒。
- 根据 task 的 prio 优先级决定使用哪种调度器,CFS 对应于实现 fair_sched_class。
- fork 出来的进程会调用 task_fork_fair(),目的是调整新建进程的 vruntime:初始值与当前进程(curr)的 vruntime 相同,但内部还有进一步调整 vruntime 的核心流程 place_entity()。
- 最后是一个特性:child_runs_first,调度器会比较并交换 parent 和 child 的 vruntime,保证 child 尽可能先执行。
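顺带一提,上面 sched_fork() 里的 p->sched_reset_on_fork 分支对应用户态的 SCHED_RESET_ON_FORK 标志(见 man 7 sched):实时进程带上它之后,fork 出来的子进程会被重置回普通调度策略。下面是一个示意用法(演示代码,宏的取值仅作假设,请以实际头文件为准):
// 示例:从用户态触发 sched_fork() 里的 sched_reset_on_fork 分支(示意代码)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000  /* 假设取值,与 uapi 头文件一致时才有效 */
#endif
int main(void)
{
    struct sched_param param = { .sched_priority = 10 };
    // 当前进程设为实时 FIFO,同时带上 RESET_ON_FORK 标志(通常需要 root 权限)
    if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        // 子进程:由于 RESET_ON_FORK,policy 会被重置回 SCHED_OTHER(0)
        printf("child policy = %d\n", sched_getscheduler(0));
        _exit(0);
    }
    // 父进程仍是实时策略
    printf("parent policy = %d\n", sched_getscheduler(0));
    wait(NULL);
    return 0;
}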
进程创建时的 vruntime 调整:vslice 计算
RTFSC/place_entity.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/fair.c
// 进行vruntime调整
// fork流程中,initial传参为1
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
/*
* The 'current' period is already promised to the current tasks,
* however the extra weight of the new task will slow them down a
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
/* sleeps up to a single latency don't count. */
if (!initial) {
unsigned long thresh = sysctl_sched_latency;
/*
* Halve their sleep time's effect, to allow
* for a gentler effect of sleepers:
*/
if (sched_feat(GENTLE_FAIR_SLEEPERS))
thresh >>= 1;
vruntime -= thresh;
}
/* ensure we never gain time by being placed backwards. */
se->vruntime = max_vruntime(se->vruntime, vruntime);
}
/*
* We calculate the vruntime slice of a to-be-inserted task.
*
* vs = s/w
*/
// vslice大概是算出一个将要插入task的惩罚时间
static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
* s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
// 传入当前在队列的task个数,se可能在队列也可能不在
// 算出来的应该是个墙上时间,但仍需要后面calc_delta调教转为se视角的虚拟时间
u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
for_each_sched_entity(se) {
struct load_weight *load;
struct load_weight lw;
cfs_rq = cfs_rq_of(se);
load = &cfs_rq->load;
// 一些原因导致se不在rq上,因此需要算上se重新调整rq上的load
// (这里我还不太明白,不在rq上却仍统计进去,是否合理?)
if (unlikely(!se->on_rq)) {
lw = cfs_rq->load;
update_load_add(&lw, se->load.weight);
load = &lw;
}
slice = __calc_delta(slice, se->load.weight, load);
}
return slice;
}
/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small.
*
* p = (nr <= nl) ? l : l*nr/nl
*/
// 调度周期的计算流程,保证每个task至少跑一遍
// 大概是max(nr_running * 0.75ms, 6ms),nr大于8就取左边
// 6ms并不是一个传统意义上的时间片概念,只是一个CPU密集型task的调度抢占延迟
static u64 __sched_period(unsigned long nr_running)
{
// sched_nr_latency初始化为8,
// 注释有说是相除得到的数,6 / 0.75 = 8
/*
* This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
*/
if (unlikely(nr_running > sched_nr_latency))
// sysctl_sched_min_granularity定义如下,不考虑CPU对数大概就是0.75ms的意思
/*
* Minimal preemption granularity for CPU-bound tasks:
*
* (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
return nr_running * sysctl_sched_min_granularity;
else
// 搬运一下sysctl_sched_latency的解释,默认不考虑CPU对数为6ms
/*
* Targeted preemption latency for CPU-bound tasks:
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length
* and have no persistent notion like in traditional, time-slice
* based scheduling concepts.
*
* (to see the precise effective timeslice length of your workload,
* run vmstat and monitor the context-switches (cs) field)
*
* (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
*/
return sysctl_sched_latency;
}
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
lw->inv_weight = 0;
}
/*
* delta /= w
*/
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
// 因为vslice = slice,所以不用算
return delta;
}
/*
* delta_exec * weight / lw.weight
* OR
* (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
*
* Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
* we're guaranteed shift stays positive because inv_weight is guaranteed to
* fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
*
* Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
* weight/lw.weight <= 1, and therefore our shift will also be positive.
*/
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
int shift = WMULT_SHIFT;
__update_inv_weight(lw);
if (unlikely(fact >> 32)) {
while (fact >> 32) {
fact >>= 1;
shift--;
}
}
/* hint to use a 32x32->64 mul */
fact = (u64)(u32)fact * lw->inv_weight;
while (fact >> 32) {
fact >>= 1;
shift--;
}
return mul_u64_u32_shr(delta_exec, fact, shift);
}
place_entity() 其实不仅用于新生 task,还可用于被唤醒时调整 vruntime 的场合,只需要传参上的 initial 加以区分即可。这里顺着刚才进程创建的流程,借这个机会分析该函数。
在 task_fork_fair() 调用 place_entity() 前,task 的 vruntime 其实是和 parent 一致的(前面总结提过)。随后的流程如下:
- 根据防饥饿特性 START_DEBIT,vruntime 需要加上一定的惩罚 vslice,作为插入队列的代价。
- 确保虚拟时间是单调递增的,算出来的 vruntime 绝不小于当前队列的 min_vruntime。
也就是说,如果不使用 START_DEBIT 特性,新生进程与父进程的 vruntime 一致。
接下来讨论 START_DEBIT 特性,也就是 vslice 的计算。其大致过程是先算出墙上时间的 slice,再经过 calc_delta 转换为虚拟时间上的 vslice,最后 vruntime 会加上这个数值。
同时需要引入一个调度周期 \(period\)(前面提过)的计算:
- 如果 runqueue 上的 task 不超过 8,那就是固定的 \(6ms \times (1 + log_2(ncpu))\)。
- 否则为 \(nr\_running \times (0.75ms \times (1 + log_2(ncpu)))\)。
得出的 \(period\) 即为一个队列上整体的 slice,还需要进一步根据权重 weight 来得到当前 task 实体能分享到的 vslice,该过程为 calc_delta。
一个公式可以这样描述 kernel 上的计算行为:
\[calc\_delta = (delta\_exec \times weight \times lw\_inv\_weight) \gg 32\]
其实在人类视角应该理解成这样:
\[calc\_delta = delta\_exec \times \frac{weight}{lw\_weight}\]
这里的 \(lw\) 指的是整个 queue 上的权重之和 \(lw\_weight = \sum{weight}\),而 \(inv\_weight\) 是指 \(\frac{2^{32}}{weight}\),作为一个整数计算上的优化,我们不关心这种操作,只是提一下。
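虽说不必深究,但可以用十几行代码感受一下这种「乘法 + 移位」替代除法的定点技巧(演示代码,仅与内核 __calc_delta 思路相同,并非其实现):
// 定点数演示:预先算好 inv_weight = 2^32 / weight,用乘法 + 右移代替除法
#include <stdio.h>
#include <stdint.h>
#define WMULT_SHIFT 32
int main(void)
{
    uint64_t delta_exec = 6000000;   // 假设 6ms(单位 ns)
    uint64_t weight     = 1024;      // NICE_0_LOAD(不考虑 scale_load)
    uint64_t lw_weight  = 3072;      // 假设队列总权重为 3 个 nice=0 task
    // 正常的除法视角
    uint64_t by_div = delta_exec * weight / lw_weight;
    // 乘法 + 移位视角(内核还会分步乘以避免溢出,这里简化)
    uint64_t inv_weight = 0xffffffffULL / lw_weight;   // 约等于 2^32 / lw_weight
    uint64_t by_shift   = (delta_exec * weight * inv_weight) >> WMULT_SHIFT;
    printf("by_div = %llu, by_shift = %llu\n",
           (unsigned long long)by_div, (unsigned long long)by_shift);
    return 0;
}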
总之一个 vruntime 会因此增加:
\[\begin{align*} vslice &= period \times \frac{weight}{\sum weight} \times \frac{NICE\_0\_LOAD}{weight} \\ &= period \times \frac{NICE\_0\_LOAD}{\sum weight} \end{align*}\]
注意 \(\sum weight\) 已经算入了当前进程的权重。
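代入一个具体数字(假设单 CPU,队列上已有 2 个 nice=0 的 task,新 fork 出来的也是 nice=0):period 为 6ms,\(\sum weight = 3 \times 1024\),于是:
\[vslice = 6ms \times \frac{1024}{3 \times 1024} = 2ms\]
也就是说,新进程一出生,vruntime 就在 min_vruntime 的基础上多背了约 2ms 的「债」。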
所以一个进程创建时的初始化到这里就可以了……
但是我们还没把 task 放到调度器运行队列的管辖范围内。
唤醒一个全新的 task
初始化完成后,就应该唤醒(wake up)了。
RTFSC/wake_up_new_task.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
// 在前面sched_fork流程完成必要初始化后,从TASK_NEW转换到TASK_RUNNING
p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
* - cpus_allowed can change in the fork path
* - any previously selected CPU might disappear through hotplug
*
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
// TODO select_task_rq
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
// 求出初始化se的util_avg
post_init_entity_util_avg(&p->se);
// sched入队回调的封装
activate_task(rq, p, ENQUEUE_NOCLOCK);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
// 检查是否满足抢占的一个关键点(其实关键点数目不多)
// 满足则打上调度标记
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
// CFS没有task_woken实现
if (p->sched_class->task_woken) {
/*
* Nothing relies on rq->lock after this, so its fine to
* drop it.
*/
rq_unpin_lock(rq, &rf);
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, &rf);
}
#endif
task_rq_unlock(rq, p, &rf);
}
/*
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
lockdep_assert_held(&p->pi_lock);
if (p->nr_cpus_allowed > 1)
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
else
cpu = cpumask_any(&p->cpus_allowed);
/*
* In order not to call set_task_cpu() on a blocking task we need
* to rely on ttwu() to place the task on a valid ->cpus_allowed
* CPU.
*
* Since this is common to all placement strategies, this lives here.
*
* [ this allows ->select_task() to simply return task_cpu(p) and
* not worry about this generic constraint ]
*/
if (unlikely(!is_cpu_allowed(p, cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
return cpu;
}
// 文件:/kernel/sched/core.c
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
enqueue_task(rq, p, flags);
}
总结流程如下:
- 在唤醒的时候,task 的状态从之前的 TASK_NEW 转换为 TASK_RUNNING。
- 执行入队操作。
- 检查是否可抢占。
这里有 2 个明确的目标需要去了解,就是入队干什么,抢占怎么检查。
入队操作
RTFSC/enqueue_task.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
// 如果不考虑附加feature,只论CFS最基础的流程的话,就是:
// - 调整vruntime
// - 插入rbtree
// - 标记on_rq
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
// 文件:/kernel/sched/fair.c
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
* Let's add the task's estimated utilization to the cfs_rq's
* estimated utilization, before we update schedutil.
*/
util_est_enqueue(&rq->cfs, p);
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
* passed.
*/
if (p->in_iowait)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) {
// 这是加入前的代码,如果se已经on_rq,那就不需要处理了
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
/*
* end evaluation on encountering a throttled cfs_rq
*
* note: in the case of encountering a throttled cfs_rq we will
* post the final h_nr_running increment below.
*/
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running++;
flags = ENQUEUE_WAKEUP;
}
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running++;
if (cfs_rq_throttled(cfs_rq))
break;
update_load_avg(cfs_rq, se, UPDATE_TG);
update_cfs_group(se);
}
if (!se)
add_nr_running(rq, 1);
hrtick_update(rq);
}
/*
* MIGRATION
*
* dequeue
* update_curr()
* update_min_vruntime()
* vruntime -= min_vruntime
*
* enqueue
* update_curr()
* update_min_vruntime()
* vruntime += min_vruntime
*
* this way the vruntime transition between RQs is done when both
* min_vruntime are up-to-date.
*
* WAKEUP (remote)
*
* ->migrate_task_rq_fair() (p->state == TASK_WAKING)
* vruntime -= min_vruntime
*
* enqueue
* update_curr()
* update_min_vruntime()
* vruntime += min_vruntime
*
* this way we don't have the most up-to-date min_vruntime on the originating
* CPU and an up-to-date min_vruntime on the destination CPU.
*/
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
bool curr = cfs_rq->curr == se;
/*
* If we're the current task, we must renormalise before calling
* update_curr().
*/
// 可能有迁移操作,因此这里加回cpu对应cfs_rq的min_vruntime
//
// 比如在task_fork_fair时vruntime就在当前的cfs_rq减去min_vruntime
// 但是加入的时候可能不同CPU,那就要加对应cfs_rq的min_vruntime
//
// dequeue操作也是一个道理
if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;
update_curr(cfs_rq);
/*
* Otherwise, renormalise after, such that we're placed at the current
* moment in time, instead of some random moment in the past. Being
* placed in the past could significantly boost this task to the
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
/*
* When enqueuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Add its load to cfs_rq->runnable_avg
* - For group_entity, update its weight to reflect the new share of
* its group cfs_rq
* - Add its new weight to cfs_rq->load.weight
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
// 针对唤醒的进程,进行一些补偿
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
check_schedstat_required();
update_stats_enqueue(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
// rbtree相关流程
__enqueue_entity(cfs_rq, se);
// 总算标记on_rq
se->on_rq = 1;
if (cfs_rq->nr_running == 1) {
list_add_leaf_cfs_rq(cfs_rq);
check_enqueue_throttle(cfs_rq);
}
}
/*
* Enqueue an entity into the rb-tree:
*/
// 大概就是插入到rbtree当中吧
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
bool leftmost = true;
/*
* Find the right place in the rbtree:
*/
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
if (entity_before(se, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(&se->run_node, parent, link);
rb_insert_color_cached(&se->run_node,
&cfs_rq->tasks_timeline, leftmost);
}
static inline int entity_before(struct sched_entity *a,
struct sched_entity *b)
{
return (s64)(a->vruntime - b->vruntime) < 0;
}
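顺带解释一下 entity_before() 为什么用 (s64)(a - b) < 0 而不是直接比较:vruntime 是 u64,长时间运行后理论上会回绕,带符号的差值在回绕时仍能给出正确的先后关系。下面是一个小演示(非内核代码):
// 演示:为什么用 (s64)(a - b) < 0 而不是直接 a < b
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    // 假设 vruntime 已经接近 u64 上限,b 又往前跑了一点点,发生回绕
    uint64_t a = UINT64_MAX - 100;  // "较小"的 vruntime
    uint64_t b = a + 200;           // 回绕后变成一个很小的数值
    printf("a < b          : %d\n", a < b);                  // 0,直接比较判断错误
    printf("(s64)(a-b) < 0 : %d\n", (int64_t)(a - b) < 0);   // 1,带符号差值仍然正确
    return 0;
}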
其实整个过程是相对清晰的,无非是:
- 调整 vruntime。
- 基于 vruntime 插入 rbtree。
- 标记 on_rq。
在这里的调整 vruntime 有 2 个层面:
- 一是入队不一定来自新创建的 task,也可能是已出队睡眠的 task 被唤醒,后者需要进行一个调整:回顾前面的 place_entity(),它们需要一个补偿时间,即固定的调度延迟时间(\(6ms \times (1 + log_2(ncpu))\));如果开启 GENTLE_FAIR_SLEEPERS 特性,会让补偿数值减半。
- 二是 vruntime 在前面 task_fork_fair() 的最后一行代码「莫名其妙」地减去了 min_vruntime,这是因为创建和唤醒可能不在同一个 CPU,而 runqueue 是 per-cpu 级别的,在这里重新加回 min_vruntime,相当于基于当前 CPU 运行队列的基准重新修正(而前面减去的直接原因是新建 task 的初始 vruntime 来自执行 fork 的实体 curr 的 vruntime,两者基准不一致)。
注意这里说的「补偿」不是增加 vruntime,而是减少,这样会在调度器挑选任务时更有优势。
总之可以从中得知:
- 无论睡眠时间长短,补偿的幅度都不会变化;
- 无论进程本身的优先级高低,补偿的幅度也都一样。
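把数值代进去感受一下(假设单 CPU、开启 GENTLE_FAIR_SLEEPERS):
\[vruntime_{new} = \max\left(vruntime_{old},\; min\_vruntime - \frac{6ms}{2}\right)\]
也就是说,无论 task 睡了 10ms 还是 10s,唤醒入队后它最多也只比队列上「最落后」的实体多出 3ms 的优势。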
检查抢占
kernel 当中的任务分配是细粒度的:对于抢占行为来说,标记抢占和执行抢占是两回事。就像你干多少活和领多少薪资也是两回事一样。
现在来看下检查(并标记)抢占 flag 的流程:
// 文件:/kernel/sched/core.c
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
if (p->sched_class == rq->curr->sched_class) {
// fair实现对应于check_preempt_wakeup
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
if (class == rq->curr->sched_class)
break;
if (class == p->sched_class) {
resched_curr(rq);
break;
}
}
}
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}
// 文件:/kernel/sched/fair.c
/*
* Preempt the current task with a newly woken task if needed:
*/
// 检查是否能抢占,会打上TIF标记
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
int next_buddy_marked = 0;
if (unlikely(se == pse))
return;
/*
* This is possible from callers such as attach_tasks(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
// TODO NEXT_BUDDY特性
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
next_buddy_marked = 1;
}
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
* Note: this also catches the edge-case of curr being in a throttled
* group (e.g. via set_curr_task), since update_curr() (in the
* enqueue of curr) will have resulted in resched being set. This
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
// 已经设置TIF就直接跳过整个path了
if (test_tsk_need_resched(curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
goto preempt;
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
// 需要抢占的时候
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
* triggering this preemption.
*/
if (!next_buddy_marked)
set_next_buddy(pse);
goto preempt;
}
return;
preempt:
// 设置TIF,大概是一个set_bit的过程
resched_curr(rq);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);
}
/*
* Should 'se' preempt 'curr'.
*
* |s1
* |s2
* |s3
* g
* |<--->|c
*
* w(c, s1) = -1
* w(c, s2) = 0
* w(c, s3) = 1
*
*/
// 通过gran来避免over-scheduling
// 因此通过vruntime判断抢占会这样:
// - 如果c比se小,返回-1
// - 如果c比se大但是差距不超过gran,返回0
// - 否则返回1
// 1才允许抢占
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0)
return -1;
gran = wakeup_gran(se);
if (vdiff > gran)
return 1;
return 0;
}
static unsigned long wakeup_gran(struct sched_entity *se)
{
// sysctl_sched_wakeup_granularity注释如下,不考虑CPU对数则为1ms
/*
* SCHED_OTHER wake-up granularity.
*
* This option delays the preemption effects of decoupled workloads
* and reduces their over-scheduling. Synchronous workloads will still
* have immediate wakeup/sleep latencies.
*
* (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
unsigned long gran = sysctl_sched_wakeup_granularity;
/*
* Since its curr running now, convert the gran from real-time
* to virtual-time in his units.
*
* By using 'se' instead of 'curr' we penalize light tasks, so
* they get preempted easier. That is, if 'se' < 'curr' then
* the resulting gran will be larger, therefore penalizing the
* lighter, if otoh 'se' > 'curr' then the resulting gran will
* be smaller, again penalizing the lighter task.
*
* This is especially important for buddies when the leftmost
* task is higher priority than the buddy.
*/
// 后面还需要calc_delta转换为虚拟时间
return calc_delta_fair(gran, se);
}
其实过程也简单:
- 如果已经在 thread_info 打上 TIF_NEED_RESCHED 标记,那就直接跳过整个过程,无需再检测。
- 否则需要判断是否允许打标记:只有当前运行实体的 vruntime 明显大于被唤醒实体(pse)的 vruntime 时才允许抢占,以避免过度抢占影响吞吐。
在这里,是否「明显大于」的标准是通过一个 gran 来判断,具体是 \(1ms \times (1 + log_2(ncpu))\)(还需经 calc_delta_fair 换算成虚拟时间),如果两者 vruntime 之差小于等于这个值,那也是不可以抢占的。
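代入数字感受一下(假设单 CPU):如果被唤醒的实体是 nice=0(weight = 1024),换算后的虚拟粒度就是 1ms;如果它更重,比如 nice=-5(权重约为 3121,取自内核的权重表 sched_prio_to_weight),则:
\[gran = 1ms \times \frac{1024}{3121} \approx 0.33ms\]
粒度更小,也就更容易发起抢占;反过来,更轻的实体换算出的粒度更大,更难去抢占别人。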
现在整个 fork-time setup(这里用了内核的注释名字,也就是进程的 fork 初始化),以及基本的进程唤醒都已经整理完了。
周期性调度
周期性调度的触发时机涉及到时钟中断,CONFIG_HZ 因不同配置环境而异。
比如 Google 就曾推荐移动终端设为 333 以获得一个足够好的用户体验(我也不知道怎么测出来的)。
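简单换算一下 tick 周期(\(T_{tick} = 1/HZ\)):
\[HZ=100 \Rightarrow 10ms,\quad HZ=250 \Rightarrow 4ms,\quad HZ=333 \Rightarrow \approx 3ms,\quad HZ=1000 \Rightarrow 1ms\]
tick 越密,周期性调度检查抢占的粒度就越细,相应的时钟中断开销也越大。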
至于周期性调度的入口,它位于 scheduler_tick()。
RTFSC/scheduler_tick.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
struct rq_flags rf;
sched_clock_tick();
rq_lock(rq, &rf);
update_rq_clock(rq);
curr->sched_class->task_tick(rq, curr, 0);
cpu_load_update_active(rq);
calc_global_load_tick(rq);
rq_unlock(rq, &rf);
perf_event_task_tick();
#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq);
#endif
}
// 文件:/kernel/sched/fair.c
/*
* scheduler tick hitting a task of our scheduling class.
*
* NOTE: This function can be called remotely by the tick offload that
* goes along full dynticks. Therefore no local assumption can be made
* and everything must be accessed through the @rq and @curr passed in
* parameters.
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
}
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
/*
* Ensure that runnable average is periodically updated.
*/
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued) {
resched_curr(rq_of(cfs_rq));
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
if (cfs_rq->nr_running > 1)
check_preempt_tick(cfs_rq, curr);
}
/*
* Preempt the current task with a newly woken task if needed:
*/
// 这里涉及到TIF抢占标记
// 打上标记的可能有2个
// 一是本身持续时间超出ideal time
// 二是跟first entity差距过于明显
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
struct sched_entity *se;
s64 delta;
ideal_runtime = sched_slice(cfs_rq, curr);
// 求出被调度器选中后持续了多久时间
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
// 如果超出理想范围
if (delta_exec > ideal_runtime) {
// 则打TIF标记(SMP且非本地CPU需要考虑IPI中断),并返回
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
*/
clear_buddies(cfs_rq, curr);
return;
}
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
// 如果没超出且小于最小粒度,也直接返回
if (delta_exec < sysctl_sched_min_granularity)
return;
// 如果已达到最小粒度,会尝试跟最左那位对比
// 如果差距大于一个理想数值,仍会打上标记
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
if (delta < 0)
return;
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
其实只考虑框架本身的话,做的事情只有一个:检测并标记抢占。
这里 check_preempt_tick 的做法和前面的 check_preempt_curr 并不相同。
判断方式如下:
- 求出被调度器选中后持续运行的时间 delta_exec。
- 如果超出理想运行时间 ideal_runtime(即该 task 按权重在调度周期内应得的 slice),那就直接打标记。
- 否则进一步判断:若 delta_exec 已达到最小粒度(\(0.75ms \times (1 + log_2(ncpu))\)),且与 leftmost 实体的 vruntime 之差大于 ideal_runtime,同样打上标记。
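套用前面 vslice 的例子(假设单 CPU、3 个 nice=0 的 task),此时 \(ideal\_runtime = 6ms \times \frac{1024}{3 \times 1024} = 2ms\),判断可以概括为:
\[delta\_exec > 2ms \Rightarrow resched\]
\[0.75ms \le delta\_exec \le 2ms \;\land\; vruntime_{curr} - vruntime_{leftmost} > 2ms \Rightarrow resched\]
其余情况则什么也不做,等下一次 tick 再看。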
主动调度 schedule
schedule() 函数的应用极其广泛,kernel 上被调用的关键点可能有一百多个(caller 太多了……)。
RTFSC/schedule.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
// 主动调度的简单流程如下:
// - 关闭抢占
// - 挑选next task(见pick_next_task流程)
// - 执行切换(见switch流程)
// - 开启抢占
asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;
sched_submit_work(tsk);
do {
preempt_disable();
__schedule(false);
sched_preempt_enable_no_resched();
} while (need_resched());
}
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPT=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev);
if (sched_feat(HRTICK))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up().
*
* The membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
switch_count = &prev->nivcsw;
// 如果是主动调度且处于非TASK_RUNNING,则本次切换贡献在nvcsw
// 否则贡献在nivcsw
if (!preempt && prev->state) {
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
// prev被拔出runqueue,依赖于dequeue_task/dequeue_task_fair
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
prev->on_rq = 0;
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
/*
* If a worker went to sleep, notify and ask workqueue
* whether it wants to wake up a task to maintain
* concurrency.
*/
if (prev->flags & PF_WQ_WORKER) {
struct task_struct *to_wakeup;
to_wakeup = wq_worker_sleeping(prev);
if (to_wakeup)
try_to_wake_up_local(to_wakeup, &rf);
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
}
balance_callback(rq);
}
schedule() 其实思路很直接:
- 挑出下一个要执行的 task。
- 执行上下文切换。
这里也可看出,调度统计上区分了 nvcsw(自愿上下文切换的次数)和 nivcsw(非自愿上下文切换的次数)。我们这里讨论的前提是主动调度(preempt = false),如果此时 prev 也处于非 TASK_RUNNING 状态,本次切换就计入自愿上下文切换。
(上面的周期性调度会间接引起 preempt schedule,这块有时间再补充?)
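顺带一提,这两个计数可以直接在 /proc/<pid>/status 中观察到(对应字段 voluntary_ctxt_switches 与 nonvoluntary_ctxt_switches)。下面是一个简单的读取示例(演示代码):
// 读取当前进程的上下文切换统计
#include <stdio.h>
#include <string.h>
int main(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        // 分别对应 nvcsw 和 nivcsw
        if (strncmp(line, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0)
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}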
schedule() 的第一步才是与 CFS 相关的,所以接下来观察如何挑选 task。
主动调度的核心:挑选任务
RTFSC/pick_next_task.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
* higher scheduling class, because otherwise those loose the
* opportunity to pull in more work from other CPUs.
*/
if (likely((prev->sched_class == &idle_sched_class ||
prev->sched_class == &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_running)) {
p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
goto again;
/* Assumes fair_sched_class->next == idle_sched_class */
if (unlikely(!p))
p = idle_sched_class.pick_next_task(rq, prev, rf);
return p;
}
again:
for_each_class(class) {
p = class->pick_next_task(rq, prev, rf);
if (p) {
if (unlikely(p == RETRY_TASK))
goto again;
return p;
}
}
/* The idle class should always have a runnable task: */
BUG();
}
// 文件:/kernel/sched/fair.c
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
// 如果rq此时已经没有可以挑的了
// 尝试idle balance(pull tasks from other CPUs)
// 操作一下再次走again可能就有可以跑的task了
if (!cfs_rq->nr_running)
goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (prev->sched_class != &fair_sched_class)
goto simple;
/*
* Because of the set_next_buddy() in dequeue_task_fair() it is rather
* likely that a next task is from the same cgroup as the current.
*
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
*/
do {
struct sched_entity *curr = cfs_rq->curr;
/*
* Since we got here without doing put_prev_entity() we also
* have to consider cfs_rq->curr. If it is still a runnable
* entity, update_curr() will update its vruntime, otherwise
* forget we've ever seen it.
*/
if (curr) {
if (curr->on_rq)
update_curr(cfs_rq);
else
curr = NULL;
/*
* This call to check_cfs_rq_runtime() will do the
* throttle and dequeue its entity in the parent(s).
* Therefore the nr_running test will indeed
* be correct.
*/
if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_running)
goto idle;
goto simple;
}
}
se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
p = task_of(se);
/*
* Since we haven't yet done put_prev_entity and if the selected task
* is a different task than we started out with, try and touch the
* least amount of cfs_rqs.
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
if (se_depth <= pse_depth) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
}
}
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
}
goto done;
simple:
#endif
// 处理prev后事,比如更新vruntime,处理curr=NULL
// fair实现对应于put_prev_task_fair
put_prev_task(rq, prev);
do {
// 选择next的核心流程
se = pick_next_entity(cfs_rq, NULL);
// 选出的se需要进一步处理,重新设置curr
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
// 选出来的就是p
p = task_of(se);
done: __maybe_unused;
#ifdef CONFIG_SMP
/*
* Move the next running task to the front of
* the list, so our cfs_tasks list becomes MRU
* one.
*/
list_move(&p->se.group_node, &rq->cfs_tasks);
#endif
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
return p;
idle:
new_tasks = idle_balance(rq, rf);
/*
* Because idle_balance() releases (and re-acquires) rq->lock, it is
* possible for any higher priority task to appear. In that case we
* must re-start the pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
if (new_tasks > 0)
goto again;
return NULL;
}
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
// 一般来说挑选vruntime最小(红黑树最左)的se就行了
// 但是如注释所言:
// 1. skip实体不能选
// 2. last实体对cache更好
// 3. next实体更有实际价值
// 理想和现实差得远啊,不仅要避开1,还要衡量2和3值不值
// - 其中,衡量使用wakeup_preempt_entity比较,
// - 只要与left相差不大(一个gran内,表示not unfair)即可考虑
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
// 很显然,我要挑rbtree中vruntime最小的那个
struct sched_entity *left = __pick_first_entity(cfs_rq);
struct sched_entity *se;
/*
* If curr is set we have to see if its left of the leftmost entity
* still in the tree, provided there was anything in the tree at all.
*/
if (!left || (curr && entity_before(curr, left)))
// 但仍需要和curr比较
left = curr;
// 一般情况就是你了
se = left; /* ideally we run the leftmost entity */
/*
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
// 但如果需要避开skip buddy
if (cfs_rq->skip == se) {
struct sched_entity *second;
// 且刚才挑了curr
if (se == curr) {
// 那就找回left
second = __pick_first_entity(cfs_rq);
// 且刚才没有算上curr
} else {
// 那就找left的后继,或者curr
second = __pick_next_entity(se);
if (!second || (curr && entity_before(curr, second)))
second = curr;
}
if (second && wakeup_preempt_entity(second, left) < 1)
se = second;
}
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
// 但如果需要last buddy
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;
/*
* Someone really wants this to run. If it's not unfair, run it.
*/
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;
// 从代码上来看
// 似乎在选取的优先级 next > last
// 只有命中了se才会清除对应的buddy
clear_buddies(cfs_rq, se);
return se;
}
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
if (!left)
return NULL;
return rb_entry(left, struct sched_entity, run_node);
}
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
struct rb_node *next = rb_next(&se->run_node);
if (!next)
return NULL;
return rb_entry(next, struct sched_entity, run_node);
}
/*
* Account for a descheduled task:
*/
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
}
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
if (prev->on_rq)
update_curr(cfs_rq);
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
check_spread(cfs_rq, prev);
if (prev->on_rq) {
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
// curr为空的一个时机
cfs_rq->curr = NULL;
}
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
* Any task has to be enqueued before it get to execute on
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
update_stats_curr_start(cfs_rq, se);
// 重新设置curr(之前是NULL)
cfs_rq->curr = se;
/*
* Track our maximum slice length, if the CPU's load is at
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
在本文最开始的地方已经说过,CFS 的公平性体现在每次都挑选 vruntime 最小的实体。这里也是如此,但是存在一个 CFS buddy 的机制需要注意。
在这里需要考虑如下 buddy:
- skip buddy:需要跳过的实体,如果选中的实体恰好是它,需要重新考虑。
- last buddy:从局部性来看最优的实体,也许比直接挑 vruntime 最小的 leftmost 要好。
- next buddy:从考虑实际应用的场合来看更倾向于该 buddy(刚被唤醒),优先级比 last 更高。
Buddies: Prefers running task other than the leftmost
- Prev – run the task that ran before
- Next – run the task that just got woken
- Skip – do not run the task
如何考虑呢?同样是通过前面讲述的 gran 来判断,只要与 leftmost 相差不大(在一个 gran 内),都优先考虑 buddy。
那么所谓的 buddy 在哪里用到了?似乎没有什么用途?其实有一个最直接的应用就是系统调用 sched_yield():除去细节,这个函数就是一个 set_skip_buddy(),正因为挑选时跳过了该 buddy,才实现了 yield 语义(够简单吧)。
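从用户态的角度看,一个常见的用法就是在自旋等待时主动让出 CPU(演示代码,线程与变量均为假设):
// sched_yield() 的一个典型用法:自旋等待时主动让出 CPU
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
static atomic_int ready;
static void *worker(void *arg)
{
    (void)arg;
    // 条件未满足时调用 sched_yield():
    // 对应内核里给当前 se 打上 skip buddy,pick_next_entity() 会尽量跳过它
    while (!atomic_load(&ready))
        sched_yield();
    printf("worker: ready\n");
    return NULL;
}
int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    usleep(1000);             // 让 worker 先自旋一会
    atomic_store(&ready, 1);
    pthread_join(tid, NULL);
    return 0;
}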
总结
本文只是按照进程的生命周期的角度来简单整理了 CFS 最基本的框架:从进程的创建、唤醒到调度。这里简单列出有用的点:
- 进程创建时,会基于父进程(curr)的 vruntime 附加惩罚时间(vslice)来初始化新进程的 vruntime。
- 进程唤醒时,会以睡眠时间和优先级无关的延迟常数对 vruntime 进行补偿(减小)。
- 进程调度时,除去挑选 leftmost vruntime 以外,还需考虑 gran 和 buddy 机制。
文中还提了一些关于抢占式调度的基础,后续可能另开一篇笔记再去整理。
UPDATE: 抢占调度和 nvcsw/nivcsw。