Using kernel 3.10.79 as the example, this article walks through how the kernel performs OOM on a memory cgroup, and what kind of OOM policy a mixed-deployment environment needs.

When OOM is triggered

For every memory cgroup the kernel maintains a counter that tracks how much memory the cgroup is currently using. Whenever a process in the cgroup creates a page, the page's size is charged to the counter through res_counter_charge_locked(). If the resulting usage would exceed the threshold set by memory.limit_in_bytes, the charge fails and the function returns -ENOMEM.
```c
int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
			      bool force)
{
	int ret = 0;

	if (counter->usage + val > counter->limit) {
		counter->failcnt++;
		ret = -ENOMEM;
		if (!force)
			return ret;
	}

	counter->usage += val;
	if (counter->usage > counter->max_usage)
		counter->max_usage = counter->usage;
	return ret;
}
```
Another point worth noting is that the memory subsystem is controlled hierarchically, not as a flat set of groups. The sum of limit_in_bytes across the child cgroups must not exceed the parent's limit_in_bytes, otherwise setting the value fails.

So when memory is accounted:

- A page newly created by a process is charged recursively upward into every ancestor cgroup.
- The counter at the root of the memory subsystem is therefore the sum of the memory used by all processes in the system. (Note that cgroup memory accounting and the proc filesystem use different accounting methods, so the two do not report exactly the same usage numbers.)

A child cgroup may still have quota to spare while an ancestor does not. During the recursive charge, whichever cgroup exceeds its threshold is the one that gets OOM'd: a process somewhere under that cgroup is chosen and killed. This means a cgroup that has never exceeded its own limit can still have one of its processes killed, seemingly out of nowhere.
```c
static int __res_counter_charge(struct res_counter *counter, unsigned long val,
				struct res_counter **limit_fail_at, bool force)
{
	int ret, r;
	unsigned long flags;
	struct res_counter *c, *u;

	r = ret = 0;
	*limit_fail_at = NULL;
	local_irq_save(flags);
	for (c = counter; c != NULL; c = c->parent) {
		spin_lock(&c->lock);
		r = res_counter_charge_locked(c, val, force);
		spin_unlock(&c->lock);
		if (r < 0 && !ret) {
			ret = r;
			*limit_fail_at = c;
			if (!force)
				break;
		}
	}
	if (ret < 0 && !force) {
		for (u = counter; u != c; u = u->parent) {
			spin_lock(&u->lock);
			res_counter_uncharge_locked(u, val);
			spin_unlock(&u->lock);
		}
	}
	local_irq_restore(flags);
	return ret;
}
```
When the kernel finds that an ancestor cgroup is over its limit, it first tries to reclaim memory via mem_cgroup_reclaim(). In the code below, mem_over_limit is the cgroup that exceeded its limit; if reclaim can free enough, the caller is told to retry the charge, otherwise OOM is triggered:
```c
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
	return CHARGE_RETRY;
/*
 * Even though the limit is exceeded at this point, reclaim
 * may have been able to free some pages.  Retry the charge
 * before killing the task.
 *
 * Only for regular pages, though: huge pages are rather
 * unlikely to succeed so close to the limit, and we fall back
 * to regular pages anyway in case of failure.
 */
if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
	return CHARGE_RETRY;
/*
 * At task move, charge accounts can be doubly counted. So, it's
 * better to wait until the end of task_move if something is going on.
 */
if (mem_cgroup_wait_acct_move(mem_over_limit))
	return CHARGE_RETRY;

if (invoke_oom)
	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));

return CHARGE_NOMEM;
```
mem_cgroup_oom() does not actually trigger the OOM. It only records the over-limit cgroup in current->memcg_oom and then returns -ENOMEM; the caller decides, based on context, whether OOM should really happen. Once it does, the kernel calls mem_cgroup_oom_synchronize() to carry out the OOM of that cgroup.
The basic OOM flow

The function that performs OOM on a memory cgroup is mem_cgroup_out_of_memory(). It is actually quite simple: it largely mirrors out_of_memory() in oom_kill.c and reuses that file's helpers.

Its flow is: iterate over all processes under current's memcg, calling oom_scan_process_thread() on each one to decide whether it takes part in scoring; for each process that does, compute its score with oom_badness(). The highest scorer gets killed, which is done by sending it SIGKILL (signal 9).
```c
totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
for_each_mem_cgroup_tree(iter, memcg) {
	struct cgroup *cgroup = iter->css.cgroup;
	struct cgroup_iter it;
	struct task_struct *task;

	cgroup_iter_start(cgroup, &it);
	while ((task = cgroup_iter_next(cgroup, &it))) {
		switch (oom_scan_process_thread(task, totalpages, NULL,
						false)) {
		case OOM_SCAN_SELECT:
			if (chosen)
				put_task_struct(chosen);
			chosen = task;
			chosen_points = ULONG_MAX;
			get_task_struct(chosen);
			/* fall through */
		case OOM_SCAN_CONTINUE:
			continue;
		case OOM_SCAN_ABORT:
			cgroup_iter_end(cgroup, &it);
			mem_cgroup_iter_break(memcg, iter);
			if (chosen)
				put_task_struct(chosen);
			return;
		case OOM_SCAN_OK:
			break;
		};
		points = oom_badness(task, memcg, NULL, totalpages);
		if (points > chosen_points) {
			if (chosen)
				put_task_struct(chosen);
			chosen = task;
			chosen_points = points;
			get_task_struct(chosen);
		}
	}
	cgroup_iter_end(cgroup, &it);
}

if (!chosen)
	return;
points = chosen_points * 1000 / totalpages;
oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
		 NULL, "Memory cgroup out of memory");
```
An important job of oom_scan_process_thread() here is to filter out processes that do not need to be scored, which speeds up the OOM pass. Which processes are skipped?

- Processes that have already exited
- Kernel threads
- Processes whose memory pages have already been released, detected by task->mm being NULL, which means the process is in the middle of exiting
The badness scoring mechanism

Three factors affect the oom_badness() score:

- A process can tune its OOM weight through /proc/${pid}/oom_score_adj; the larger the value, the more likely the process is to be killed
- Memory usage: the more memory a process occupies, the more likely it is to be killed
- The kernel slightly lowers the risk for root processes: tasks with CAP_SYS_ADMIN get a 3% discount on the memory portion of the score

The core of oom_badness() is as follows:
```c
adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN) {
	task_unlock(p);
	return 0;
}

/*
 * The baseline for the badness score is the proportion of RAM that each
 * task's rss, pagetable and swap space use.
 */
points = get_mm_rss(p->mm) + p->mm->nr_ptes +
	 get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);

/*
 * Root processes get 3% bonus, just like the __vm_enough_memory()
 * implementation used by LSMs.
 */
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
	points -= (points * 3) / 100;

/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;

/*
 * Never return 0 for an eligible task regardless of the root bonus and
 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
 */
return points > 0 ? points : 1;
```
OOM policy for mixed deployment

Mixed deployment means co-locating online services and offline batch jobs on the same machines. At least so far, single-machine resource isolation is far from perfect, and runaway offline jobs can easily trigger a machine-wide OOM or otherwise hurt the online services. Memory in particular needs special care.

Since offline jobs have low priority anyway, when the machine runs short of memory, rather than computing badness process by process and killing selectively, it is better to simply kill all the offline jobs outright. This saves a great deal of time and restores the affected online services as quickly as possible.