cgroup process scheduling: Borrowed-virtual-time (BVT) scheduling

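BVT is meant to work around the fairness problems in CFS (sleeper compensation and so on). The original paper was published in 1999, and the 2015 Heracles paper revisited and improved BVT. Starting from the paper author's name, I tracked down the corresponding source code, which he had posted as a gist: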

https://gist.github.com/leverich/5913713

Paper: https://rcs.uwaterloo.ca/papers/bvt.pdf

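1. CFS sleeper compensation

Before getting into BVT, it is worth introducing the CFS sleeper compensation mechanism.

The goal of the CFS scheduler is fairness: CFS wants every process to get the same chance of being scheduled, and that "chance" is measured by vruntime.

A process that has been sleeping for a long time, however, has not accumulated any vruntime in the meantime, so its vruntime is very small compared with the others. When such a process wakes up, the CFS logic would keep running it until its vruntime is no longer the smallest, and only then pick the next process.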

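To keep a sleeper from getting too long a run, the kernel applies a threshold: when a process wakes up, it is placed at the run queue's minimum vruntime minus an offset (and never behind its own previous vruntime). By default the offset is half a scheduling period, i.e. 12 ms.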

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     */
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

    /* sleeps up to a single latency don't count. */
    if (!initial) {
        unsigned long thresh = sysctl_sched_latency;

        /*
         * Halve their sleep time's effect, to allow
         * for a gentler effect of sleepers:
         */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;

        vruntime -= thresh;
    }

    /* ensure we never gain time by being placed backwards. */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

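In other words, once a process has been woken up it gets at least 4 ms of complete run time (because 12 ms is larger than the minimum guaranteed run time, which is another CFS fairness mechanism). It cannot be interrupted during that time unless it gives up the CPU voluntarily.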
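The wakeup branch of place_entity() can be tried out in user space. The following is only a minimal sketch: place_on_wakeup() and SCHED_LATENCY_NS are names made up for this illustration, and the 24 ms latency is an assumed value picked so that half of it matches the 12 ms mentioned above.

```c
#include <stdint.h>

/* Assumed value for this sketch only: 24 ms, so thresh >> 1 gives 12 ms. */
#define SCHED_LATENCY_NS 24000000ULL

/* Same idea as the kernel's max_vruntime(): compare through a signed
 * delta so the result is still right if the 64-bit counter wraps. */
static uint64_t max_vruntime(uint64_t a, uint64_t b)
{
    int64_t delta = (int64_t)(b - a);
    return delta > 0 ? b : a;
}

/* Hypothetical model of the !initial (wakeup) branch of place_entity(). */
static uint64_t place_on_wakeup(uint64_t min_vruntime, uint64_t se_vruntime,
                                int gentle_fair_sleepers)
{
    uint64_t thresh = SCHED_LATENCY_NS;

    if (gentle_fair_sleepers)
        thresh >>= 1;               /* halve the sleeper credit */

    /* Never move the entity backwards past its own vruntime. */
    return max_vruntime(se_vruntime, min_vruntime - thresh);
}
```

With min_vruntime at 100 ms, a long sleeper whose own vruntime is 10 ms is placed at 88 ms, while a task that slept only briefly and already sits at 95 ms keeps its 95 ms.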

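2. How BVT works

Judging from the source, the implementation is very simple (and probably no longer matches the design in the original paper). BVT does not change the fairness of CFS; it only modifies the sleeper compensation applied when a process wakes up.

The patch adds a new cgroup interface: bvt_warp_ns

Here is how the patch documents it: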

+ * If the BVT_PLACEMENT scheduler feature is enabled, waking BVT tasks
+ * are placed differently from CFS tasks when they wakeup.  Rather
+ * than being placed some large factor (i.e. sched_latency >> 1)
+ * before min_vruntime (which gives waking tasks an unfair advantage
+ * in preempting currently runng tasks), they are placed
+ * sched_bvt_place_epsilon nanoseconds relative to min_vruntime.  If
+ * you really want a BVT task to preempt currently running tasks, it
+ * should have a greater "warp" value than the current running task.

Put simply: the larger this value, the better the group's chance of preempting whatever is currently running.

+#ifdef CONFIG_CFS_BVT
+static inline void update_effective_vruntime(struct sched_entity *se)
+{
+    s64 warp;
+    struct task_group *tg;
+
+    if (entity_is_task(se)) {
+        se->effective_vruntime = se->vruntime;
+        return;
+    }
+
+    tg = se->my_q->tg;
+    warp = tg->bvt_warp_ns;
+
+    /* FIXME: Should we calc_delta_fair on warp_ns? */
+    se->effective_vruntime = se->vruntime - warp;
+    se->is_warped = warp ? 1 : 0;
+}
+#endif /* CONFIG_CFS_BVT */

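Every time update_effective_vruntime runs, a group entity's effective_vruntime is set to its vruntime minus bvt_warp_ns (a plain task entity just keeps effective_vruntime equal to vruntime). The CFS queue is a red-black tree, and on every scheduling decision the entity with the smallest vruntime is picked, so the smaller the value, the better the chance of being scheduled.

update_effective_vruntime is called in many places: when an entity enters or leaves the CFS queue, on wakeup, on preemption, and so on. See the patch for details.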
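The effect of the warp on who gets picked can be sketched in user space. This is only an illustration: struct model_entity and pick_next() are hypothetical names, a linear scan stands in for the leftmost node of the red-black tree, and it assumes the queue is ordered by effective_vruntime, which should be checked against the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a group scheduling entity. */
struct model_entity {
    uint64_t vruntime;     /* accumulated virtual run time, in ns */
    int64_t  bvt_warp_ns;  /* warp of the owning cgroup, 0 means no warp */
};

/* Mirrors the group branch of update_effective_vruntime(). */
static int64_t effective_vruntime(const struct model_entity *e)
{
    return (int64_t)e->vruntime - e->bvt_warp_ns;
}

/* Returns the index of the entity with the smallest effective vruntime.
 * Assumption: this is the key the queue is sorted by. */
static size_t pick_next(const struct model_entity *es, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (effective_vruntime(&es[i]) < effective_vruntime(&es[best]))
            best = i;
    return best;
}
```

Given two entities at 50 ms and 60 ms with no warp, the first one is picked. Giving the second a 40 ms warp drops its effective vruntime to 20 ms, so it is picked instead even though its real vruntime is larger.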

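3. Testing BVT

In actual testing, however, we found that BVT does not do much for latency-sensitive programs.

hackbench is an open-source benchmark for measuring scheduler performance. It starts N pairs of reader/writer processes or threads that read and write concurrently over IPC (sockets or pipes), in order to measure scheduling latency, responsiveness, and so on.

case 1: start a CPU-bound program and hackbench on the same CPU, both in the same cpu cgroup, and observe the run time of hackbench

case 2: start a CPU-bound program and hackbench on the same CPU, where the CPU-bound program (eatcpu) is not affected by BVT and hackbench is, and observe the run time of hackbench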

Run time (by run #)	case 1	case 2 (bvt = 40ms)	case 2 (bvt = 1ms)	case 2 (bvt = 0.1ms)
1	2.585	5.045	5.415	5.534
2	2.656	5.329	5.330
3	2.727	5.467	5.438
4	2.652
9	2.634

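So the tests show that performance actually got worse with BVT enabled.

After some analysis, the cause turned out to be sysctl_sched_bvt_place_epsilon in BVT: it leaves processes that wake up from sleep without any compensation, which makes hackbench performance drop sharply.

The fix: when computing the new vruntime with sysctl_sched_bvt_place_epsilon, start from the vruntime that has already been compensated by one scheduling period, rather than from the run queue's current minimum vruntime.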
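The difference between the two placement bases can be written down as plain arithmetic. This is a rough sketch under several assumptions, not the actual patch: the function names are invented, the sign with which the epsilon is applied is a guess, and THRESH_NS simply reuses the 12 ms sleeper credit from section 1.

```c
#include <stdint.h>

/* Assumed for illustration: the 12 ms sleeper credit from section 1. */
#define THRESH_NS 12000000LL

/* Assumed reading of the original behaviour: the waking BVT task is
 * placed relative to min_vruntime only, with no sleeper credit. */
static int64_t place_bvt_original(int64_t min_vruntime, int64_t epsilon_ns)
{
    return min_vruntime - epsilon_ns;
}

/* Assumed reading of the fix: start from the vruntime that already
 * includes the sleeper credit, then apply the epsilon. */
static int64_t place_bvt_fixed(int64_t min_vruntime, int64_t epsilon_ns)
{
    return (min_vruntime - THRESH_NS) - epsilon_ns;
}
```

Under these assumptions the fixed placement always lands THRESH_NS earlier than the original one, which is the compensation the waking hackbench tasks were missing.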

Test results:

Run time (by run #)	case 1	case 2 (bvt = 4ms)	case 2 (bvt = 40ms)	case 2 (bvt = 0.1ms)
1	2.429	2.441	2.404	2.417
2	2.451	2.436	2.426
3	2.408	2.495	2.317
4	2.407	2.354	2.403
5	2.399	2.400
6	2.322	2.396
7	2.399	2.360
8	2.416	2.449
9	2.339	2.395

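Conclusion:

With the modification described above, the negative impact of BVT is gone, but the tests show that BVT has no real effect on task latency.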
