The Linux Scheduler: a Decade of Wasted Cores

https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

The Linux Scheduler: a Decade of Wasted Cores – Lozi et al. 2016

This is the first in a series of papers from EuroSys 2016. There are three strands here: first of all, there’s some great background into how scheduling works in the Linux kernel; secondly, there’s a story about Software Aging and how changing requirements and maintenance can cause decay; and finally, the authors expose four bugs in Linux scheduling that caused cores to remain idle even when there was pressing work waiting to be scheduled. Hence the paper title, “A Decade of Wasted Cores.”

In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database.


Tracking Down a dentry Cache Leak

While investigating a problem on a production service recently, I found that frequently creating and deleting directories or files consumes a large amount of physical memory.

The reason is that when a directory or file is deleted, its dentry is not freed immediately; it is kept on the superblock's LRU list instead, and the kernel normally reclaims this memory only when it decides memory is tight.

That said, because the dentry structure is very small, memory growth in this case is slow.
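
One way to observe this accumulation is to watch /proc/sys/fs/dentry-state (whose first two fields are nr_dentry and nr_unused) while churning files. The sketch below is illustrative and not from the original investigation; the exact numbers depend on kernel version and filesystem, and /tmp is assumed to be a local filesystem.

// dentry_churn.c -- hypothetical demo, not kernel code
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Read nr_dentry and nr_unused from /proc/sys/fs/dentry-state. */
static void read_dentry_state(long *nr_dentry, long *nr_unused)
{
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");
	if (!f || fscanf(f, "%ld %ld", nr_dentry, nr_unused) != 2) {
		perror("read /proc/sys/fs/dentry-state");
		exit(EXIT_FAILURE);
	}
	fclose(f);
}

int main(void)
{
	long d0, u0, d1, u1;
	char path[64];
	int i, fd;

	read_dentry_state(&d0, &u0);

	/* Create and immediately delete many files; each dput() typically
	 * lands the dentry on the superblock LRU list instead of freeing it. */
	for (i = 0; i < 100000; i++) {
		snprintf(path, sizeof(path), "/tmp/dentry_demo_%d", i);
		fd = open(path, O_CREAT | O_WRONLY, 0600);
		if (fd >= 0)
			close(fd);
		unlink(path);
	}

	read_dentry_state(&d1, &u1);
	printf("nr_dentry: %ld -> %ld\n", d0, d1);
	printf("nr_unused: %ld -> %ld\n", u0, u1);
	return 0;
}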

// linux-2.6.32.68/fs/dcache.c
/*
 * dput - release a dentry
 * @dentry: dentry to release 
 *
 * Release a dentry. This will drop the usage count and if 
 * appropriate call the dentry unlink method as well as
 * removing it from the queues and releasing its resources. 
 * If the parent dentries were scheduled for release
 * they too may now get deleted.
 *
 * no dcache lock, please.
 */

void dput(struct dentry *dentry)
{
	if (!dentry)
		return;

repeat:
	if (atomic_read(&dentry->d_count) == 1)
		might_sleep();
	if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
		return;

	spin_lock(&dentry->d_lock);
	if (atomic_read(&dentry->d_count)) {
		spin_unlock(&dentry->d_lock);
		spin_unlock(&dcache_lock);
		return;
	}

	/*
	 * AV: ->d_delete() is _NOT_ allowed to block now.
	 */
	if (dentry->d_op && dentry->d_op->d_delete) {
		if (dentry->d_op->d_delete(dentry))
			goto unhash_it;
	}
	/* Unreachable? Get rid of it */
	if (d_unhashed(dentry))
		goto kill_it;
	if (list_empty(&dentry->d_lru)) {
		dentry->d_flags |= DCACHE_REFERENCED;
		// place the dentry on the superblock's LRU list
		dentry_lru_add(dentry);
	}
	spin_unlock(&dentry->d_lock);
	spin_unlock(&dcache_lock);
	return;

unhash_it:
	__d_drop(dentry);
kill_it:
	/* if dentry was on the d_lru list delete it from there */
	dentry_lru_del(dentry);
	dentry = d_kill(dentry);
	if (dentry)
		goto repeat;
}

This memory can be reclaimed in two ways:

  1. The kernel triggers reclaim on its own when it detects memory pressure, for example when an allocation fails or a watermark is reached.
  2. Reclaim is triggered manually via sysctl vm.drop_caches.

Only the second method is discussed here. During initialization the kernel registers dcache_shrinker to release dcache memory; the core logic of shrink_dcache_memory is the prune_dcache() function.

static struct shrinker dcache_shrinker = {
	.shrink = shrink_dcache_memory,
	.seeks = DEFAULT_SEEKS,
};
static void __init dcache_init(void)
{
	int loop;
	/* 
	 * A constructor could be added for stable state like the lists,
	 * but it is probably not worth it because of the cache nature
	 * of the dcache. 
	 */
	dentry_cache = KMEM_CACHE(dentry,
		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
	// register the shrinker hook: when memory reclaim is triggered (e.g. manually), dcache_shrinker is called
	register_shrinker(&dcache_shrinker);
	/* ... */
}
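
To trigger this path by hand (the second reclaim method above), it is enough to write to /proc/sys/vm/drop_caches; a value of 2 asks the kernel to reclaim reclaimable slab objects such as dentries and inodes, which ends up invoking the registered shrinkers, dcache_shrinker included. A minimal sketch, assuming root privileges:

// drop_dcache.c -- illustrative user-space trigger, not from the original text
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* 1 = drop page cache, 2 = drop dentries and inodes, 3 = both. */
	FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
	if (!f) {
		perror("fopen /proc/sys/vm/drop_caches");
		return EXIT_FAILURE;
	}
	fputs("2\n", f);
	fclose(f);
	return 0;
}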

How CPU Affinity Is Implemented

There are currently two ways to set CPU affinity:

  1. Bind the process to specific CPUs with sched_setaffinity().
  2. Add the process to a cpuset control group and set its affinity by modifying the cpuset.cpus parameter.

Note, however, that the cpu_mask passed to sched_setaffinity() is not allowed to go beyond cpuset.cpus; any CPUs outside that range are simply ignored.

For example, suppose process A belongs to a cpuset whose cpuset.cpus is 0,1,2,3,4, and at runtime it calls sched_setaffinity() with a cpu_mask of 0,1,2,4,7. Because CPU 7 is not in cpuset.cpus, the kernel ignores it, and the process ends up running only on CPUs 0, 1, 2 and 4.
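
This behaviour can be checked with a short sketch (illustrative, not from the original text): it requests CPUs 0,1,2,4,7 with sched_setaffinity() and then reads back the mask the kernel actually installed with sched_getaffinity(). If the task's cpuset.cpus is 0-4, only CPUs 0, 1, 2 and 4 should be reported.

// affinity_demo.c -- hypothetical example
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t want, got;
	int cpu;

	/* Request CPUs 0,1,2,4,7 for the current task (pid 0 = self). */
	CPU_ZERO(&want);
	CPU_SET(0, &want);
	CPU_SET(1, &want);
	CPU_SET(2, &want);
	CPU_SET(4, &want);
	CPU_SET(7, &want);
	if (sched_setaffinity(0, sizeof(want), &want) == -1) {
		perror("sched_setaffinity");
		return EXIT_FAILURE;
	}

	/* Read back the effective mask: CPUs outside the task's
	 * cpuset.cpus have been dropped by the kernel. */
	if (sched_getaffinity(0, sizeof(got), &got) == -1) {
		perror("sched_getaffinity");
		return EXIT_FAILURE;
	}
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &got))
			printf("allowed on cpu %d\n", cpu);
	return 0;
}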


A Brief Analysis of the Kernel OOM Process

Taking kernel 3.10.79 as an example, this section looks at how the kernel handles OOM for a memory cgroup, and at what kind of OOM policy is needed in a co-located (mixed-deployment) environment.

When OOM is triggered

The kernel maintains a counter for each memory cgroup that tracks how much memory the cgroup is currently using. Whenever a process in the cgroup allocates a page, the page's size is charged to the counter through res_counter_charge_locked(). Once usage exceeds the threshold set by memory.limit_in_bytes, the charge fails and -ENOMEM is returned.

int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
                              bool force)
{
        int ret = 0;
        if (counter->usage + val > counter->limit) {
                counter->failcnt++;
                ret = -ENOMEM;
                if (!force)
                        return ret;
        }
        counter->usage += val;
        if (counter->usage > counter->max_usage)
                counter->max_usage = counter->usage;
        return ret;
}

Another thing to note is that the memory subsystem is controlled hierarchically, not as a flat set of groups: the sum of limit_in_bytes across a parent's child cgroups must not exceed the parent's limit_in_bytes, otherwise setting the limit fails.

So when memory is accounted:

  1. A page newly allocated by a process is charged recursively upward into all of its ancestor cgroups.
  2. The counter at the root of the memory subsystem is therefore the sum of the memory used by all processes on the system. (Note that memory-cgroup accounting differs from the proc filesystem's accounting, so the two do not report exactly the same values.)

A child cgroup may still have headroom in its quota while its parent does not. Because charges recurse upward, whichever cgroup exceeds its limit is the one that goes OOM: some process under that cgroup is chosen and killed. So it is entirely possible for a process to be killed even though its own cgroup never exceeded its limit.
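
To make this concrete, here is a toy user-space model (purely illustrative, not the kernel implementation) of the upward-recursing charge: each counter has a parent, a charge is checked against every ancestor the way res_counter charging walks up the hierarchy, and the first level whose limit would be exceeded is the level where OOM fires.

// toy_charge.c -- toy model of hierarchical charging, not kernel code
#include <stdio.h>

struct toy_counter {
	const char *name;
	unsigned long usage;
	unsigned long limit;
	struct toy_counter *parent;
};

/* Try to charge `val` against `c` and all of its ancestors. Returns the
 * first counter whose limit would be exceeded (the level that would go
 * OOM), or NULL if the charge fits everywhere, in which case it is applied. */
static struct toy_counter *toy_charge(struct toy_counter *c, unsigned long val)
{
	struct toy_counter *it;

	for (it = c; it; it = it->parent)
		if (it->usage + val > it->limit)
			return it;
	for (it = c; it; it = it->parent)
		it->usage += val;
	return NULL;
}

int main(void)
{
	struct toy_counter root = { "root", 0, 500, NULL };
	struct toy_counter a    = { "A",    0, 300, &root };
	struct toy_counter b    = { "B",    0, 300, &root };
	struct toy_counter *over;

	toy_charge(&a, 200);	/* root usage: 200 */
	toy_charge(&b, 250);	/* root usage: 450 */

	/* A still has headroom (200/300), but the parent does not:
	 * 450 + 80 > 500, so OOM fires at the root level even though A
	 * never exceeded its own limit -- and a task inside A may be
	 * the one that gets killed. */
	over = toy_charge(&a, 80);
	if (over)
		printf("limit exceeded at level: %s\n", over->name);
	return 0;
}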