前言
这篇文章会介绍 CFS 的基本原理并从源码上分析流程,内容顺序是按照我看源码的顺序展开的。
NOTES:
- 本文在名词上「task」和「进程」是混用的,都是指代 task_struct 结构体。
- 实际上由于本文作者非常菜,所以内容很浅。
基本设计和公平调度
CFS 的实现存放在 fair.c 文件中,毕竟全称是完全公平调度器(Completely Fair Scheduler)。
那怎样才算公平?你可以通过作者对 CFS 的介绍(sched-design-CFS.txt)来了解,总结如下:
- 作者希望模拟一个理想硬件,即使是单 cpu,task 也是可以并行处理的(假设有 2 个 task,那就同时刻均分 50% 的 cpu 计算能力)。
- 而对比真实硬件,一个 task 在执行,另一个 task 就需要等待,对于被等待的 task 来说,属于受到不公平的待遇,作者在这里用了一个等待时间的概念去统计这个不公平程度,等待时间越长,越是意味着迫切需要运行。
- 为了尽可能模拟理想硬件,那就缩短等待时间,每个关键路径让等待最长的运行,于是有了 timeline sort 的做法;并且在现在的实现中改用运行时间而非等待时间去统计,因此有了现在的 vruntime 和 rbtree。
个人认为,公平性是体现在 2 个方面:
- 一是保证(runqueue 上的)所有 task 在一个调度周期(或者说调度延迟)内都能得到至少一次运行的机会(进一步可以推出每个 task 单次运行不能超过该周期的长度)。
- 二是每次挑选最小运行时间的实体来执行,在这种贪心思想的长期运行下所有实体的运行时间会趋于相同(理想情况下,所有 task 的虚拟运行时间相等)。
总而言之,CFS 的设计是非常朴素的。
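朴素到它的核心想法可以用几十行用户态代码模拟出来。下面是一个极简的示意(纯属演示,与内核实现无关,结构与数值均为假设):每个 tick 都挑 vruntime 最小的实体运行,运行后按 \(\Delta vruntime = \Delta t \times \frac{NICE\_0\_LOAD}{weight}\) 推进它的虚拟时间。
// 一个极简的 CFS 模拟:每次贪心地挑 vruntime 最小的实体(演示代码,非内核实现)
#include <stdio.h>
#include <stdint.h>
#define NICE_0_LOAD 1024
#define TICK_NS     1000000ULL   // 假设每次运行 1ms
struct toy_entity {
    const char *name;
    uint64_t    weight;    // 权重,越大代表优先级越高
    uint64_t    vruntime;  // 虚拟运行时间
};
int main(void)
{
    struct toy_entity tasks[] = {
        { "A(nice=0)",  1024, 0 },
        { "B(nice=0)",  1024, 0 },
        { "C(heavier)", 2048, 0 },  // 假设一个权重更大的 task
    };
    int n = sizeof(tasks) / sizeof(tasks[0]);
    for (int step = 0; step < 8; step++) {
        // 1. 挑 vruntime 最小的(内核里对应 rbtree 的 leftmost)
        int min = 0;
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < tasks[min].vruntime)
                min = i;
        // 2. 让它运行一个 tick,并按权重折算推进 vruntime
        tasks[min].vruntime += TICK_NS * NICE_0_LOAD / tasks[min].weight;
        printf("run %-11s vruntime=%llu\n", tasks[min].name,
               (unsigned long long)tasks[min].vruntime);
    }
    return 0;
}
可以看到权重更大的实体被挑中的次数更多,但所有实体的 vruntime 长期会趋于相同。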
从进程创建说起:调度初始化
在 fork() 或者 clone() 流程中,对调度进行初始化的工作在子流程 sched_fork() 中完成。
RTFSC/sched_fork.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
/*
* fork()/clone()-time setup:
*/
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
int cpu = get_cpu();
__sched_fork(clone_flags, p);
/*
* We mark the process as NEW here. This guarantees that
* nobody will actually run it, and a signal or other external
* event cannot wake it up and insert it on the runqueue either.
*/
// 首次创建的进程设为TASK_NEW状态
// 确保仍在初始化阶段,不会唤醒,也不会丢到ready queue中
p->state = TASK_NEW;
/*
* Make sure we do not leak PI boosting priority to the child.
*/
p->prio = current->normal_prio;
/*
* Revert to default priority/policy on fork if requested.
*/
if (unlikely(p->sched_reset_on_fork)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
p->rt_priority = 0;
} else if (PRIO_TO_NICE(p->static_prio) < 0)
p->static_prio = NICE_TO_PRIO(0);
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
*/
p->sched_reset_on_fork = 0;
}
// 根据prio的值,决定是dl、rt还是fair调度
// 从这里可看到调度器的优先级
if (dl_prio(p->prio)) {
put_cpu();
return -EAGAIN;
} else if (rt_prio(p->prio)) {
p->sched_class = &rt_sched_class;
} else {
p->sched_class = &fair_sched_class;
}
// 初始化se的负载信息
init_entity_runnable_average(&p->se);
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
* is ran before sched_fork().
*
* Silence PROVE_RCU.
*/
raw_spin_lock_irqsave(&p->pi_lock, flags);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
*/
// 设定CPU,通过get_cpu()来得到cpu变量,由TI管理,用完记得put_cpu()
__set_task_cpu(p, cpu);
if (p->sched_class->task_fork)
// 进入到sched class相关的回调
// 目前看的fair调度,对应于task_fork_fair
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
#ifdef CONFIG_SCHED_INFO
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
p->on_cpu = 0;
#endif
init_task_preempt_count(p);
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
put_cpu();
return 0;
}
/*
* Perform scheduler related setup for a newly forked process p.
* p is forked by current.
*
* __sched_fork() is basic setup used by init_idle() too:
*/
// 对p执行基本的构造
// 大部分调度相关的信息存放在se字段(schedule entity),这样可以抽象出组调度特性
static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
p->on_rq = 0;
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
init_dl_inactive_task_timer(&p->dl);
__dl_clear_params(p);
INIT_LIST_HEAD(&p->rt.run_list);
p->rt.timeout = 0;
p->rt.time_slice = sched_rr_timeslice;
p->rt.on_rq = 0;
p->rt.on_list = 0;
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
init_numa_balancing(clone_flags, p);
}
// 文件:/kernel/sched/fair.c
/*
* called on fork with the child task as argument from the parent's context
* - child not yet on the tasklist
* - preemption disabled
*/
static void task_fork_fair(struct task_struct *p)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se, *curr;
struct rq *rq = this_rq();
struct rq_flags rf;
rq_lock(rq, &rf);
update_rq_clock(rq);
cfs_rq = task_cfs_rq(current);
curr = cfs_rq->curr;
if (curr) {
// 更新curr的调度信息
// 比如让vruntime更新到最新时间
// 以及更新rq上的min vruntime记录
update_curr(cfs_rq);
// 初始化se中的vruntime
se->vruntime = curr->vruntime;
}
// 进一步处理se中的vruntime
place_entity(cfs_rq, se, 1);
if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
/*
* Upon rescheduling, sched_class::put_prev_task() will place
* 'current' within the tree based on its new key value.
*/
swap(curr->vruntime, se->vruntime);
resched_curr(rq);
}
se->vruntime -= cfs_rq->min_vruntime;
rq_unlock(rq, &rf);
}
// 文件:/kernel/sched/core.c
/*
* resched_curr - mark rq's current task 'to be rescheduled now'.
*
* On UP this means the setting of the need_resched flag, on SMP it
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
int cpu;
lockdep_assert_held(&rq->lock);
if (test_tsk_need_resched(curr))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
// 打上TIF标记,表示需要调度
set_tsk_need_resched(curr);
set_preempt_need_resched();
return;
}
if (set_nr_and_not_polling(curr))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
}
流程总结如下:
- 首先是 task 状态,对于新创建的进程设为 TASK_NEW,表示不会被调度器拉入队或者唤醒。
- 根据 task 的 prio 优先级决定使用哪种调度器,CFS 对应于实现 fair_sched_class。
- fork 出来的进程会调用 task_fork_fair(),目的是调整新建进程的 vruntime:初始值与当前进程(curr)的 vruntime 相同,但内部还有进一步调整 vruntime 的核心流程 place_entity()。
- 最后是一个特性:child_runs_first,调度器会比较并交换 parent 和 child 的 vruntime,保证 child 尽可能先执行。
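顺带一提,上面 sched_fork() 里的 p->sched_reset_on_fork 分支对应用户态的 SCHED_RESET_ON_FORK 标志(见 man 7 sched):实时进程带上它之后,fork 出来的子进程会被重置回普通调度策略。下面是一个示意用法(演示代码,宏的取值仅作假设,请以实际头文件为准):
// 示例:从用户态触发 sched_fork() 里的 sched_reset_on_fork 分支(示意代码)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000  /* 假设取值,与 uapi 头文件一致时才有效 */
#endif
int main(void)
{
    struct sched_param param = { .sched_priority = 10 };
    // 当前进程设为实时 FIFO,同时带上 RESET_ON_FORK 标志(通常需要 root 权限)
    if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {
        // 子进程:由于 RESET_ON_FORK,policy 会被重置回 SCHED_OTHER(0)
        printf("child policy = %d\n", sched_getscheduler(0));
        _exit(0);
    }
    // 父进程仍是实时策略
    printf("parent policy = %d\n", sched_getscheduler(0));
    wait(NULL);
    return 0;
}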
进程创建时的 vruntime 调整:vslice 计算
RTFSC/place_entity.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/fair.c
// 进行vruntime调整
// fork流程中,initial传参为1
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
/*
* The 'current' period is already promised to the current tasks,
* however the extra weight of the new task will slow them down a
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
/* sleeps up to a single latency don't count. */
if (!initial) {
unsigned long thresh = sysctl_sched_latency;
/*
* Halve their sleep time's effect, to allow
* for a gentler effect of sleepers:
*/
if (sched_feat(GENTLE_FAIR_SLEEPERS))
thresh >>= 1;
vruntime -= thresh;
}
/* ensure we never gain time by being placed backwards. */
se->vruntime = max_vruntime(se->vruntime, vruntime);
}
/*
* We calculate the vruntime slice of a to-be-inserted task.
*
* vs = s/w
*/
// vslice大概是算出一个将要插入task的惩罚时间
static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
* s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
// 传入当前在队列的task个数,se可能在队列也可能不在
// 算出来的应该是个墙上时间,但仍需要后面calc_delta调教转为se视角的虚拟时间
u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
for_each_sched_entity(se) {
struct load_weight *load;
struct load_weight lw;
cfs_rq = cfs_rq_of(se);
load = &cfs_rq->load;
// 一些原因导致se不在rq上,因此需要算上se重新调整rq上的load
// (这里我还不太明白,不在rq上却仍统计进去,是否合理?)
if (unlikely(!se->on_rq)) {
lw = cfs_rq->load;
update_load_add(&lw, se->load.weight);
load = &lw;
}
slice = __calc_delta(slice, se->load.weight, load);
}
return slice;
}
/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small.
*
* p = (nr <= nl) ? l : l*nr/nl
*/
// 调度周期的计算流程,保证每个task至少跑一遍
// 大概是max(nr_running * 0.75ms, 6ms),nr大于8就取左边
// 6ms并不是一个传统意义上的时间片概念,只是一个CPU密集型task的调度抢占延迟
static u64 __sched_period(unsigned long nr_running)
{
// sched_nr_latency初始化为8,
// 注释有说是相除得到的数,6 / 0.75 = 8
/*
* This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
*/
if (unlikely(nr_running > sched_nr_latency))
// sysctl_sched_min_granularity定义如下,不考虑CPU对数大概就是0.75ms的意思
/*
* Minimal preemption granularity for CPU-bound tasks:
*
* (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
return nr_running * sysctl_sched_min_granularity;
else
// 搬运一下sysctl_sched_latency的解释,默认不考虑CPU对数为6ms
/*
* Targeted preemption latency for CPU-bound tasks:
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length
* and have no persistent notion like in traditional, time-slice
* based scheduling concepts.
*
* (to see the precise effective timeslice length of your workload,
* run vmstat and monitor the context-switches (cs) field)
*
* (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
*/
return sysctl_sched_latency;
}
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
lw->inv_weight = 0;
}
/*
* delta /= w
*/
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
// 因为vslice = slice,所以不用算
return delta;
}
/*
* delta_exec * weight / lw.weight
* OR
* (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
*
* Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
* we're guaranteed shift stays positive because inv_weight is guaranteed to
* fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
*
* Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
* weight/lw.weight <= 1, and therefore our shift will also be positive.
*/
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
int shift = WMULT_SHIFT;
__update_inv_weight(lw);
if (unlikely(fact >> 32)) {
while (fact >> 32) {
fact >>= 1;
shift--;
}
}
/* hint to use a 32x32->64 mul */
fact = (u64)(u32)fact * lw->inv_weight;
while (fact >> 32) {
fact >>= 1;
shift--;
}
return mul_u64_u32_shr(delta_exec, fact, shift);
}
place_entity() 其实不仅用于新生 task,还可用于被唤醒时调整 vruntime 的场合,只需要传参上的 initial 加以区分即可。这里顺着刚才进程创建的流程,借这个机会分析该函数。
在 task_fork_fair() 调用 place_entity() 前,task 的 vruntime 其实是和 parent 一致的(前面总结提过)。随后的流程如下:
- 根据防饥饿特性 START_DEBIT,vruntime 需要加上一定的惩罚 vslice,作为插入队列的代价。
- 确保虚拟时间是单调递增的,算出来的 vruntime 绝不小于当前队列的 min_vruntime。
也就是说,如果不使用 START_DEBIT 特性,新生进程与父进程的 vruntime 一致。
接下来讨论 START_DEBIT 特性,也就是 vslice 的计算。其大致过程是先算出墙上时间的 slice,再经过 calc_delta 转换为虚拟时间上的 vslice,最后 vruntime 会加上这个数值。
同时需要引入一个调度周期 \(period\)(前面提过)的计算:
- 如果 runqueue 上的 task 不超过 8,那就是固定的 \(6ms \times (1 + log_2(ncpu))\)。
- 否则为 \(nr\_running \times (0.75ms \times (1 + log_2(ncpu)))\)。
得出的 \(period\) 即为一个队列上整体的 slice,还需要进一步根据权重 weight 来得到当前 task 实体能分享到的 vslice,该过程为 calc_delta。
一个公式可以这样描述 kernel 上的计算行为:
\[calc\_delta = (delta\_exec \times weight \times lw\_inv\_weight) \gg 32\]
其实在人类视角应该理解成这样:
\[calc\_delta = delta\_exec \times \frac{weight}{lw\_weight}\]
这里的 \(lw\) 指的是整个 queue 上的权重之和 \(lw\_weight = \sum{weight}\),而 \(inv\_weight\) 是指 \(\frac{2^{32}}{weight}\),作为一个整数计算上的优化,我们不关心这种操作,只是提一下。
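虽说不必深究,但可以用十几行代码感受一下这种「乘法 + 移位」替代除法的定点技巧(演示代码,仅与内核 __calc_delta 思路相同,并非其实现):
// 定点数演示:预先算好 inv_weight = 2^32 / weight,用乘法 + 右移代替除法
#include <stdio.h>
#include <stdint.h>
#define WMULT_SHIFT 32
int main(void)
{
    uint64_t delta_exec = 6000000;   // 假设 6ms(单位 ns)
    uint64_t weight     = 1024;      // NICE_0_LOAD(不考虑 scale_load)
    uint64_t lw_weight  = 3072;      // 假设队列总权重为 3 个 nice=0 task
    // 正常的除法视角
    uint64_t by_div = delta_exec * weight / lw_weight;
    // 乘法 + 移位视角(内核还会分步乘以避免溢出,这里简化)
    uint64_t inv_weight = 0xffffffffULL / lw_weight;   // 约等于 2^32 / lw_weight
    uint64_t by_shift   = (delta_exec * weight * inv_weight) >> WMULT_SHIFT;
    printf("by_div = %llu, by_shift = %llu\n",
           (unsigned long long)by_div, (unsigned long long)by_shift);
    return 0;
}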
总之一个 vruntime 会因此增加:
\[\begin{align*} vslice &= period \times \frac{weight}{\sum weight} \times \frac{NICE\_0\_LOAD}{weight} \\ &= period \times \frac{NICE\_0\_LOAD}{\sum weight} \end{align*}\]
注意 \(\sum weight\) 已经算入了当前进程的权重。
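代入一个具体数字(假设单 CPU,队列上已有 2 个 nice=0 的 task,新 fork 出来的也是 nice=0):period 为 6ms,\(\sum weight = 3 \times 1024\),于是:
\[vslice = 6ms \times \frac{1024}{3 \times 1024} = 2ms\]
也就是说,新进程一出生,vruntime 就在 min_vruntime 的基础上多背了约 2ms 的「债」。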
所以一个进程创建时的初始化到这里就可以了……
但是我们还没把 task 放到调度器运行队列的管辖范围内。
唤醒一个全新的 task
初始化完成后,就应该唤醒(wake up)了。
RTFSC/wake_up_new_task.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
// 在前面sched_fork流程完成必要初始化后,从TASK_NEW转换到TASK_RUNNING
p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
* - cpus_allowed can change in the fork path
* - any previously selected CPU might disappear through hotplug
*
* Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
// TODO select_task_rq
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
// 求出初始化se的util_avg
post_init_entity_util_avg(&p->se);
// sched入队回调的封装
activate_task(rq, p, ENQUEUE_NOCLOCK);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
// 检查是否满足抢占的一个关键点(其实关键点数目不多)
// 满足则打上调度标记
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
// CFS没有task_woken实现
if (p->sched_class->task_woken) {
/*
* Nothing relies on rq->lock after this, so its fine to
* drop it.
*/
rq_unpin_lock(rq, &rf);
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, &rf);
}
#endif
task_rq_unlock(rq, p, &rf);
}
/*
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
lockdep_assert_held(&p->pi_lock);
if (p->nr_cpus_allowed > 1)
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
else
cpu = cpumask_any(&p->cpus_allowed);
/*
* In order not to call set_task_cpu() on a blocking task we need
* to rely on ttwu() to place the task on a valid ->cpus_allowed
* CPU.
*
* Since this is common to all placement strategies, this lives here.
*
* [ this allows ->select_task() to simply return task_cpu(p) and
* not worry about this generic constraint ]
*/
if (unlikely(!is_cpu_allowed(p, cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
return cpu;
}
// 文件:/kernel/sched/core.c
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
enqueue_task(rq, p, flags);
}
总结流程如下:
- 在唤醒的时候,task 的状态从之前的 TASK_NEW 转换为 TASK_RUNNING。
- 执行入队操作。
- 检查是否可抢占。
这里有 2 个明确的目标需要去了解,就是入队干什么,抢占怎么检查。
入队操作
RTFSC/enqueue_task.c at master · Caturra000/RTFSC · GitHub
// 文件:/kernel/sched/core.c
// 如果不考虑附加feature,只论CFS最基础的流程的话,就是:
// - 调整vruntime
// - 插入rbtree
// - 标记on_rq
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
// 文件:/kernel/sched/fair.c
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
* Let's add the task's estimated utilization to the cfs_rq's
* estimated utilization, before we update schedutil.
*/
util_est_enqueue(&rq->cfs, p);
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
* passed.
*/
if (p->in_iowait)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) {
// 这是加入前的代码,如果se已经on_rq,那就不需要处理了
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
/*
* end evaluation on encountering a throttled cfs_rq
*
* note: in the case of encountering a throttled cfs_rq we will
* post the final h_nr_running increment below.
*/
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running++;
flags = ENQUEUE_WAKEUP;
}
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running++;
if (cfs_rq_throttled(cfs_rq))
break;
update_load_avg(cfs_rq, se, UPDATE_TG);
update_cfs_group(se);
}
if (!se)
add_nr_running(rq, 1);
hrtick_update(rq);
}
/*
* MIGRATION
*
* dequeue
* update_curr()
* update_min_vruntime()
* vruntime -= min_vruntime
*
* enqueue
* update_curr()
* update_min_vruntime()
* vruntime += min_vruntime
*
* this way the vruntime transition between RQs is done when both
* min_vruntime are up-to-date.
*
* WAKEUP (remote)
*
* ->migrate_task_rq_fair() (p->state == TASK_WAKING)
* vruntime -= min_vruntime
*
* enqueue
* update_curr()
* update_min_vruntime()
* vruntime += min_vruntime
*
* this way we don't have the most up-to-date min_vruntime on the originating
* CPU and an up-to-date min_vruntime on the destination CPU.
*/
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
bool curr = cfs_rq->curr == se;
/*
* If we're the current task, we must renormalise before calling
* update_curr().
*/
// 可能有迁移操作,因此这里加回cpu对应cfs_rq的min_vruntime
//
// 比如在task_fork_fair时vruntime就在当前的cfs_rq减去min_vruntime
// 但是加入的时候可能不同CPU,那就要加对应cfs_rq的min_vruntime
//
// dequeue操作也是一个道理
if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;
update_curr(cfs_rq);
/*
* Otherwise, renormalise after, such that we're placed at the current
* moment in time, instead of some random moment in the past. Being
* placed in the past could significantly boost this task to the
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
/*
* When enqueuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Add its load to cfs_rq->runnable_avg
* - For group_entity, update its weight to reflect the new share of
* its group cfs_rq
* - Add its new weight to cfs_rq->load.weight
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
// 针对唤醒的进程,进行一些补偿
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
check_schedstat_required();
update_stats_enqueue(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
// rbtree相关流程
__enqueue_entity(cfs_rq, se);
// 总算标记on_rq
se->on_rq = 1;
if (cfs_rq->nr_running == 1) {
list_add_leaf_cfs_rq(cfs_rq);
check_enqueue_throttle(cfs_rq);
}
}
/*
* Enqueue an entity into the rb-tree:
*/
// 大概就是插入到rbtree当中吧
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
bool leftmost = true;
/*
* Find the right place in the rbtree:
*/
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
if (entity_before(se, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(&se->run_node, parent, link);
rb_insert_color_cached(&se->run_node,
&cfs_rq->tasks_timeline, leftmost);
}
static inline int entity_before(struct sched_entity *a,
struct sched_entity *b)
{
return (s64)(a->vruntime - b->vruntime) < 0;
}
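顺带解释一下 entity_before() 为什么用 (s64)(a - b) < 0 而不是直接比较:vruntime 是 u64,长时间运行后理论上会回绕,带符号的差值在回绕时仍能给出正确的先后关系。下面是一个小演示(非内核代码):
// 演示:为什么用 (s64)(a - b) < 0 而不是直接 a < b
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    // 假设 vruntime 已经接近 u64 上限,b 又往前跑了一点点,发生回绕
    uint64_t a = UINT64_MAX - 100;  // "较小"的 vruntime
    uint64_t b = a + 200;           // 回绕后变成一个很小的数值
    printf("a < b          : %d\n", a < b);                  // 0,直接比较判断错误
    printf("(s64)(a-b) < 0 : %d\n", (int64_t)(a - b) < 0);   // 1,带符号差值仍然正确
    return 0;
}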
其实整个过程是相对清晰的,无非是:
- 调整 vruntime。
- 基于 vruntime 插入 rbtree。
- 标记 on_rq。
在这里的调整 vruntime 有 2 个层面:
- 一是入队不一定来自新创建的 task,也可能是已出队睡眠的 task 被唤醒,后者需要进行一个调整:回顾前面的 place_entity(),它们需要一个补偿时间,即固定的调度延迟时间(\(6ms \times (1 + log_2(ncpu))\));如果开启 GENTLE_FAIR_SLEEPERS 特性,会让补偿数值减半。
- 二是 vruntime 在前面 task_fork_fair() 的最后一行代码「莫名其妙」地减去了 min_vruntime,这是因为创建和唤醒可能不在同一个 CPU,而 runqueue 是 per-cpu 级别的,在这里重新加回 min_vruntime,相当于基于当前 CPU 运行队列的基准重新修正(而前面减去的直接原因是新建 task 的初始 vruntime 来自执行 fork 的实体 curr 的 vruntime,两者基准不一致)。
注意这里说的「补偿」不是增加 vruntime,而是减少,这样会在调度器挑选任务时更有优势。
总之可以从中得知:
- 无论睡眠时间长短,补偿的幅度都不会变化;
- 无论进程本身的优先级高低,补偿的幅度也都一样。
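把数值代进去感受一下(假设单 CPU、开启 GENTLE_FAIR_SLEEPERS):
\[vruntime_{new} = \max\left(vruntime_{old},\; min\_vruntime - \frac{6ms}{2}\right)\]
也就是说,无论 task 睡了 10ms 还是 10s,唤醒入队后它最多也只比队列上「最落后」的实体多出 3ms 的优势。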
检查抢占
kernel 当中的任务分配是细粒度的:对于抢占行为来说,标记抢占和执行抢占是两回事。就像你干多少活和领多少薪资也是两回事一样。
现在来看下检查(并标记)抢占 flag 的流程:
// 文件:/kernel/sched/core.c
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
if (p->sched_class == rq->curr->sched_class) {
// fair实现对应于check_preempt_wakeup
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
if (class == rq->curr->sched_class)
break;
if (class == p->sched_class) {
resched_curr(rq);
break;
}
}
}
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}
// 文件:/kernel/sched/fair.c
/*
* Preempt the current task with a newly woken task if needed:
*/
// 检查是否能抢占,会打上TIF标记
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
int next_buddy_marked = 0;
if (unlikely(se == pse))
return;
/*
* This is possible from callers such as attach_tasks(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
// TODO NEXT_BUDDY特性
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
next_buddy_marked = 1;
}
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
* Note: this also catches the edge-case of curr being in a throttled
* group (e.g. via set_curr_task), since update_curr() (in the
* enqueue of curr) will have resulted in resched being set. This
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
// 已经设置TIF就直接跳过整个path了
if (test_tsk_need_resched(curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
goto preempt;
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
// 需要抢占的时候
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
* triggering this preemption.
*/
if (!next_buddy_marked)
set_next_buddy(pse);
goto preempt;
}
return;
preempt:
// 设置TIF,大概是一个set_bit的过程
resched_curr(rq);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);
}
/*
* Should 'se' preempt 'curr'.
*
* |s1
* |s2
* |s3
* g
* |<--->|c
*
* w(c, s1) = -1
* w(c, s2) = 0
* w(c, s3) = 1
*
*/
// 通过gran来避免over-scheduling
// 因此通过vruntime判断抢占会这样:
// - 如果c比se小,返回-1
// - 如果c比se大但是差距不超过gran,返回0
// - 否则返回1
// 1才允许抢占
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0)
return -1;
gran = wakeup_gran(se);
if (vdiff > gran)
return 1;
return 0;
}
static unsigned long wakeup_gran(struct sched_entity *se)
{
// sysctl_sched_wakeup_granularity注释如下,不考虑CPU对数则为1ms
/*
* SCHED_OTHER wake-up granularity.
*
* This option delays the preemption effects of decoupled workloads
* and reduces their over-scheduling. Synchronous workloads will still
* have immediate wakeup/sleep latencies.
*
* (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
unsigned long gran = sysctl_sched_wakeup_granularity;
/*
* Since its curr running now, convert the gran from real-time
* to virtual-time in his units.
*
* By using 'se' instead of 'curr' we penalize light tasks, so
* they get preempted easier. That is, if 'se' < 'curr' then
* the resulting gran will be larger, therefore penalizing the
* lighter, if otoh 'se' > 'curr' then the resulting gran will
* be smaller, again penalizing the lighter task.
*
* This is especially important for buddies when the leftmost
* task is higher priority than the buddy.
*/
// 后面还需要calc_delta转换为虚拟时间
return calc_delta_fair(gran, se);
}
其实过程也简单:
- 如果已经在 thread_info 打上 TIF_NEED_RESCHED 标记,那就直接跳过整个过程,无需再检测。
- 否则需要判断是否允许打标记:只有当前运行实体的 vruntime 明显大于被唤醒实体(pse)的 vruntime 时才允许抢占,以避免过度抢占影响吞吐。
在这里,是否「明显大于」的标准是通过一个 gran 来判断,具体是 \(1ms \times (1 + log_2(ncpu))\)(还需经 calc_delta_fair 换算成虚拟时间),如果两者 vruntime 之差小于等于这个值,那也是不可以抢占的。
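代入数字感受一下(假设单 CPU):如果被唤醒的实体是 nice=0(weight = 1024),换算后的虚拟粒度就是 1ms;如果它更重,比如 nice=-5(权重约为 3121,取自内核的权重表 sched_prio_to_weight),则:
\[gran = 1ms \times \frac{1024}{3121} \approx 0.33ms\]
粒度更小,也就更容易发起抢占;反过来,更轻的实体换算出的粒度更大,更难去抢占别人。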
现在整个 fork-time setup(这里用了内核的注释名字,也就是进程的 fork 初始化),以及基本的进程唤醒都已经整理完了。
周期性调度
周期性调度的触发时机涉及到时钟中断,CONFIG_HZ 因不同配置环境而异。
比如 Google 就曾推荐移动终端设为 333 以获得一个足够好的用户体验(我也不知道怎么测出来的)。
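简单换算一下 tick 周期(\(T_{tick} = 1/HZ\)):
\[HZ=100 \Rightarrow 10ms,\quad HZ=250 \Rightarrow 4ms,\quad HZ=333 \Rightarrow \approx 3ms,\quad HZ=1000 \Rightarrow 1ms\]
tick 越密,周期性调度检查抢占的粒度就越细,相应的时钟中断开销也越大。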
至于周期性调度的入口,它位于 scheduler_tick()。
RTFSC/scheduler_tick.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
struct rq_flags rf;
sched_clock_tick();
rq_lock(rq, &rf);
update_rq_clock(rq);
curr->sched_class->task_tick(rq, curr, 0);
cpu_load_update_active(rq);
calc_global_load_tick(rq);
rq_unlock(rq, &rf);
perf_event_task_tick();
#ifdef CONFIG_SMP
rq->idle_balance = idle_cpu(cpu);
trigger_load_balance(rq);
#endif
}
// 文件:/kernel/sched/fair.c
/*
* scheduler tick hitting a task of our scheduling class.
*
* NOTE: This function can be called remotely by the tick offload that
* goes along full dynticks. Therefore no local assumption can be made
* and everything must be accessed through the @rq and @curr passed in
* parameters.
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
}
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
/*
* Ensure that runnable average is periodically updated.
*/
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued) {
resched_curr(rq_of(cfs_rq));
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
if (cfs_rq->nr_running > 1)
check_preempt_tick(cfs_rq, curr);
}
/*
* Preempt the current task with a newly woken task if needed:
*/
// 这里涉及到TIF抢占标记
// 打上标记的可能有2个
// 一是本身持续时间超出ideal time
// 二是跟first entity差距过于明显
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
struct sched_entity *se;
s64 delta;
ideal_runtime = sched_slice(cfs_rq, curr);
// 求出被调度器选中后持续了多久时间
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
// 如果超出理想范围
if (delta_exec > ideal_runtime) {
// 则打TIF标记(SMP且非本地CPU需要考虑IPI中断),并返回
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
*/
clear_buddies(cfs_rq, curr);
return;
}
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
// 如果没超出且小于最小粒度,也直接返回
if (delta_exec < sysctl_sched_min_granularity)
return;
// 如果已达到最小粒度,会尝试跟最左那位对比
// 如果差距大于一个理想数值,仍会打上标记
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
if (delta < 0)
return;
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
其实只考虑框架本身的话,做的事情只有一个:检测并标记抢占。
这里 check_preempt_tick 的做法和前面的 check_preempt_curr 并不相同。
判断方式如下:
- 求出被调度器选中后持续运行的时间 delta_exec。
- 如果超出理想运行时间 ideal_runtime(即该 task 按权重在调度周期内应得的 slice),那就直接打标记。
- 否则进一步判断:若 delta_exec 已达到最小粒度(\(0.75ms \times (1 + log_2(ncpu))\)),且与 leftmost 实体的 vruntime 之差大于 ideal_runtime,同样打上标记。
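套用前面 vslice 的例子(假设单 CPU、3 个 nice=0 的 task),此时 \(ideal\_runtime = 6ms \times \frac{1024}{3 \times 1024} = 2ms\),判断可以概括为:
\[delta\_exec > 2ms \Rightarrow resched\]
\[0.75ms \le delta\_exec \le 2ms \;\land\; vruntime_{curr} - vruntime_{leftmost} > 2ms \Rightarrow resched\]
其余情况则什么也不做,等下一次 tick 再看。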
主动调度 schedule
schedule() 函数的应用极其广泛,kernel 上被调用的关键点可能有一百多个(caller 太多了……)。
RTFSC/schedule.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
// 主动调度的简单流程如下:
// - 关闭抢占
// - 挑选next task(见pick_next_task流程)
// - 执行切换(见switch流程)
// - 开启抢占
asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;
sched_submit_work(tsk);
do {
preempt_disable();
__schedule(false);
sched_preempt_enable_no_resched();
} while (need_resched());
}
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPT=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev);
if (sched_feat(HRTICK))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up().
*
* The membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
switch_count = &prev->nivcsw;
// 如果是主动调度且处于非TASK_RUNNING,则本次切换贡献在nvcsw
// 否则贡献在nivcsw
if (!preempt && prev->state) {
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
// prev被拔出runqueue,依赖于dequeue_task/dequeue_task_fair
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
prev->on_rq = 0;
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
/*
* If a worker went to sleep, notify and ask workqueue
* whether it wants to wake up a task to maintain
* concurrency.
*/
if (prev->flags & PF_WQ_WORKER) {
struct task_struct *to_wakeup;
to_wakeup = wq_worker_sleeping(prev);
if (to_wakeup)
try_to_wake_up_local(to_wakeup, &rf);
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
}
balance_callback(rq);
}
schedule() 其实思路很直接:
- 挑出下一个要执行的 task。
- 执行上下文切换。
这里也可看出,调度统计上区分了 nvcsw(自愿上下文切换的次数)和 nivcsw(非自愿上下文切换的次数)。我们这里讨论的前提是主动调度(preempt = false),如果此时 prev 也处于非 TASK_RUNNING 状态,本次切换就计入自愿上下文切换。
(上面的周期性调度会间接引起 preempt schedule,这块有时间再补充?)
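顺带一提,这两个计数可以直接在 /proc/<pid>/status 中观察到(对应字段 voluntary_ctxt_switches 与 nonvoluntary_ctxt_switches)。下面是一个简单的读取示例(演示代码):
// 读取当前进程的上下文切换统计
#include <stdio.h>
#include <string.h>
int main(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        // 分别对应 nvcsw 和 nivcsw
        if (strncmp(line, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0)
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}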
schedule() 的第一步才是与 CFS 相关的,所以接下来观察如何挑选 task。
主动调度的核心:挑选任务
RTFSC/pick_next_task.c at master · Caturra000/RTFSC (github.com)
// 文件:/kernel/sched/core.c
/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
* higher scheduling class, because otherwise those loose the
* opportunity to pull in more work from other CPUs.
*/
if (likely((prev->sched_class == &idle_sched_class ||
prev->sched_class == &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_running)) {
p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
goto again;
/* Assumes fair_sched_class->next == idle_sched_class */
if (unlikely(!p))
p = idle_sched_class.pick_next_task(rq, prev, rf);
return p;
}
again:
for_each_class(class) {
p = class->pick_next_task(rq, prev, rf);
if (p) {
if (unlikely(p == RETRY_TASK))
goto again;
return p;
}
}
/* The idle class should always have a runnable task: */
BUG();
}
// 文件:/kernel/sched/fair.c
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
// 如果rq此时已经没有可以挑的了
// 尝试idle balance(pull tasks from other CPUs)
// 操作一下再次走again可能就有可以跑的task了
if (!cfs_rq->nr_running)
goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (prev->sched_class != &fair_sched_class)
goto simple;
/*
* Because of the set_next_buddy() in dequeue_task_fair() it is rather
* likely that a next task is from the same cgroup as the current.
*
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
*/
do {
struct sched_entity *curr = cfs_rq->curr;
/*
* Since we got here without doing put_prev_entity() we also
* have to consider cfs_rq->curr. If it is still a runnable
* entity, update_curr() will update its vruntime, otherwise
* forget we've ever seen it.
*/
if (curr) {
if (curr->on_rq)
update_curr(cfs_rq);
else
curr = NULL;
/*
* This call to check_cfs_rq_runtime() will do the
* throttle and dequeue its entity in the parent(s).
* Therefore the nr_running test will indeed
* be correct.
*/
if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_running)
goto idle;
goto simple;
}
}
se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
p = task_of(se);
/*
* Since we haven't yet done put_prev_entity and if the selected task
* is a different task than we started out with, try and touch the
* least amount of cfs_rqs.
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
if (se_depth <= pse_depth) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
}
}
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
}
goto done;
simple:
#endif
// 处理prev后事,比如更新vruntime,处理curr=NULL
// fair实现对应于put_prev_task_fair
put_prev_task(rq, prev);
do {
// 选择next的核心流程
se = pick_next_entity(cfs_rq, NULL);
// 选出的se需要进一步处理,重新设置curr
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
// 选出来的就是p
p = task_of(se);
done: __maybe_unused;
#ifdef CONFIG_SMP
/*
* Move the next running task to the front of
* the list, so our cfs_tasks list becomes MRU
* one.
*/
list_move(&p->se.group_node, &rq->cfs_tasks);
#endif
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
return p;
idle:
new_tasks = idle_balance(rq, rf);
/*
* Because idle_balance() releases (and re-acquires) rq->lock, it is
* possible for any higher priority task to appear. In that case we
* must re-start the pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
if (new_tasks > 0)
goto again;
return NULL;
}
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
// 一般来说挑选vruntime最小(红黑树最左)的se就行了
// 但是如注释所言:
// 1. skip实体不能选
// 2. last实体对cache更好
// 3. next实体更有实际价值
// 理想和现实差得远啊,不仅要避开1,还要衡量2和3值不值
// - 其中,衡量使用wakeup_preempt_entity比较,
// - 只要与left相差不大(一个gran内,表示not unfair)即可考虑
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
// 很显然,我要挑rbtree中vruntime最小的那个
struct sched_entity *left = __pick_first_entity(cfs_rq);
struct sched_entity *se;
/*
* If curr is set we have to see if its left of the leftmost entity
* still in the tree, provided there was anything in the tree at all.
*/
if (!left || (curr && entity_before(curr, left)))
// 但仍需要和curr比较
left = curr;
// 一般情况就是你了
se = left; /* ideally we run the leftmost entity */
/*
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
// 但如果需要避开skip buddy
if (cfs_rq->skip == se) {
struct sched_entity *second;
// 且刚才挑了curr
if (se == curr) {
// 那就找回left
second = __pick_first_entity(cfs_rq);
// 且刚才没有算上curr
} else {
// 那就找left的后继,或者curr
second = __pick_next_entity(se);
if (!second || (curr && entity_before(curr, second)))
second = curr;
}
if (second && wakeup_preempt_entity(second, left) < 1)
se = second;
}
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
// 但如果需要last buddy
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;
/*
* Someone really wants this to run. If it's not unfair, run it.
*/
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;
// 从代码上来看
// 似乎在选取的优先级 next > last
// 只有命中了se才会清除对应的buddy
clear_buddies(cfs_rq, se);
return se;
}
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
if (!left)
return NULL;
return rb_entry(left, struct sched_entity, run_node);
}
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
struct rb_node *next = rb_next(&se->run_node);
if (!next)
return NULL;
return rb_entry(next, struct sched_entity, run_node);
}
/*
* Account for a descheduled task:
*/
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
}
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
if (prev->on_rq)
update_curr(cfs_rq);
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
check_spread(cfs_rq, prev);
if (prev->on_rq) {
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
// curr为空的一个时机
cfs_rq->curr = NULL;
}
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
* Any task has to be enqueued before it get to execute on
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
update_stats_curr_start(cfs_rq, se);
// 重新设置curr(之前是NULL)
cfs_rq->curr = se;
/*
* Track our maximum slice length, if the CPU's load is at
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
在本文最开始的地方已经说过,CFS 的公平性体现在每次都挑选 vruntime 最小的实体。这里也是如此,但是存在一个 CFS buddy 的机制需要注意。
在这里需要考虑如下 buddy:
- skip buddy:需要跳过的实体,如果选中的实体恰好是它,需要重新考虑。
- last buddy:从局部性来看最优的实体,也许比直接挑 vruntime 最小的 leftmost 要好。
- next buddy:从考虑实际应用的场合来看更倾向于该 buddy(刚被唤醒),优先级比 last 更高。
Buddies: Prefers running task other than the leftmost
- Prev – run the task that ran before
- Next – run the task that just got woken
- Skip – do not run the task
如何考虑呢?同样是通过前面讲述的 gran 来判断,只要与 leftmost 相差不大(在一个 gran 内),都优先考虑 buddy。
那么所谓的 buddy 在哪里用到了?似乎没有什么用途?其实有一个最直接的应用就是系统调用 sched_yield():除去细节,这个函数就是一个 set_skip_buddy(),正因为挑选时跳过了该 buddy,才实现了 yield 语义(够简单吧)。
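从用户态的角度看,一个常见的用法就是在自旋等待时主动让出 CPU(演示代码,线程与变量均为假设):
// sched_yield() 的一个典型用法:自旋等待时主动让出 CPU
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
static atomic_int ready;
static void *worker(void *arg)
{
    (void)arg;
    // 条件未满足时调用 sched_yield():
    // 对应内核里给当前 se 打上 skip buddy,pick_next_entity() 会尽量跳过它
    while (!atomic_load(&ready))
        sched_yield();
    printf("worker: ready\n");
    return NULL;
}
int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    usleep(1000);             // 让 worker 先自旋一会
    atomic_store(&ready, 1);
    pthread_join(tid, NULL);
    return 0;
}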
总结
本文只是按照进程的生命周期的角度来简单整理了 CFS 最基本的框架:从进程的创建、唤醒到调度。这里简单列出有用的点:
- 进程创建时,会基于父进程(curr)的 vruntime 附加惩罚时间(vslice)来初始化新进程的 vruntime。
- 进程唤醒时,会以睡眠时间和优先级无关的延迟常数对 vruntime 进行补偿(减小)。
- 进程调度时,除去挑选 leftmost vruntime 以外,还需考虑 gran 和 buddy 机制。
文中还提了一些关于抢占式调度的基础,后续可能另开一篇笔记再去整理。
UPDATE: 抢占调度和 nvcsw/nivcsw。