Tags: oom
1 Scenario
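A mongos process on a production machine was killed by the OOM killer, so I spent some time studying how the Linux kernel's OOM mechanism works, to make future analysis easier.
$ sudo grep mongos /var/log/messages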
Apr 10 15:35:38 localhost sz[32066]: [xxxx] check_mongos.sh/ZMODEM: 211 Bytes, 229 BPS
Apr 23 14:50:18 localhost sz[5794]: [xxxxx] mongos/ZMODEM: 297 Bytes, 151 BPS
Apr 23 15:01:55 localhost kernel: [20387] 497 20387 694326 427932 0 0 0 mongos
Apr 23 15:01:55 localhost kernel: Out of memory: Kill process 20387 (mongos) score 890 or sacrifice child
Apr 23 15:01:55 localhost kernel: Killed process 20387, UID 497, (mongos) total-vm:2777304kB, anon-rss:1711700kB, file-rss:28kB
The machine running mongos ran out of memory, which triggered the Linux kernel's OOM mechanism, and the kernel killed the mongos process.
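The two kernel lines in the log describe the same memory in different units. A minimal sketch of the conversion follows, assuming a 4 kB page size (typical on x86, but an assumption here) and assuming the fourth and fifth numeric columns of the task dump line are total_vm and rss in pages, as the dump_tasks comment in oom_kill.c suggests. The names `PAGE_KB`, `pages_to_kb` and `check_mongos_log_line` are hypothetical helpers, not kernel symbols.

```c
#include <assert.h>

/* Assumed page size in kB. 4 kB is common on x86, but it is an assumption. */
#define PAGE_KB 4UL

/* Convert a page count, as printed in the task dump line, to kB. */
unsigned long pages_to_kb(unsigned long pages)
{
    return pages * PAGE_KB;
}

/* Cross-check the mongos task dump line against the "Killed process" line. */
void check_mongos_log_line(void)
{
    /* 694326 pages of total_vm should equal total-vm:2777304kB */
    assert(pages_to_kb(694326UL) == 2777304UL);
    /* 427932 pages of rss should equal anon-rss + file-rss (1711700 + 28 kB) */
    assert(pages_to_kb(427932UL) == 1711700UL + 28UL);
}
```

Under these assumptions both checks hold, so the figures in the "Killed process" line are the task dump's page counts multiplied by the page size.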
Download the Linux kernel source and look at the OOM-related code.
The relevant file is oom_kill.c:
linux-2.6.32.65/mm/oom_kill.c
/*
* linux/mm/oom_kill.c
*
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
* in mm/page_alloc.c when we really run out of memory.
*
* Since we won't call these routines often (on a well-configured
* machine) this file will double as a 'coding guide' and a signpost
* for newbie kernel hackers. It features several pointers to major
* kernel subsystems and hints as to where to find out what things do.
*/
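The routines in this file choose and kill a process when the system is seriously out of memory. On a well-configured machine they are rarely called.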
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
* @uptime: current uptime in seconds
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
* to kill when we run out of memory.
*
* Good in this context means that:
* 1) we lose the minimum amount of work done
* 2) we recover a large amount of memory
* 3) we don't kill anything innocent of eating tons of memory
* 4) we want to kill the minimum amount of processes (one)
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
*/
unsigned long badness(struct task_struct *p, unsigned long uptime)
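The badness() function computes a value for each process that describes how "bad" the task is.
The process chosen to be killed should meet the following criteria:
1) killing it loses the minimum amount of work already done
2) killing it recovers a large amount of memory
3) no process that is innocent of consuming large amounts of memory gets killed
4) as few processes as possible are killed (ideally one)
5) the process killed is the one the user would expect to be killed
{
unsigned long points, cpu_time, run_time;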
struct mm_struct *mm;
struct task_struct *child;
int oom_adj = p->signal->oom_adj;
struct task_cputime task_time;
unsigned long utime;
unsigned long stime;
/*
* The memory size of the process is the basis for the badness.
*/
points = mm->total_vm;
The memory size of the process (mm->total_vm) is the basis of the badness score.
/*
* swapoff can easily use up all memory, so kill those first.
*/
if (p->flags & PF_OOM_ORIGIN)
return ULONG_MAX;
swapoff can easily use up all memory, so a task flagged PF_OOM_ORIGIN gets the maximum score and is killed first.
/*
* Processes which fork a lot of child processes are likely
* a good choice. We add half the vmsize of the children if they
* have an own mm. This prevents forking servers to flood the
* machine with an endless amount of children. In case a single
* child is eating the vast majority of memory, adding only half
* to the parents will make the child our kill candidate of choice.
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
if (child->mm != mm && child->mm)
points += child->mm->total_vm/2 + 1;
task_unlock(child);
}
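A process that forks many children is a good candidate: for each child that has its own mm, half of the child's total_vm is added to the parent's score.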
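The child accounting above can be sketched in user space as follows. This is only an illustration of the arithmetic, not kernel code: `struct child_info` and `add_children_points` are hypothetical names, and the `has_own_mm` flag stands in for the kernel's `child->mm != mm && child->mm` test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a child task; only the fields needed here. */
struct child_info {
    int has_own_mm;          /* nonzero if the child has an mm distinct from the parent's */
    unsigned long total_vm;  /* child's total_vm, in pages */
};

/* Add half of each qualifying child's total_vm (plus one) to the parent's points. */
unsigned long add_children_points(unsigned long points,
                                  const struct child_info *children, size_t n)
{
    assert(children != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Children sharing the parent's mm (threads) are not counted. */
        if (children[i].has_own_mm)
            points += children[i].total_vm / 2 + 1;
    }
    return points;
}
```

Because only half of a child's size is charged to the parent, a single child that holds most of the memory still scores higher than its parent and remains the preferred victim.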
/*
* CPU time is in tens of seconds and run time is in thousands
* of seconds. There is no particular reason for this other than
* that it turned out to work very well in practice.
*/
thread_group_cputime(p, &task_time);
utime = cputime_to_jiffies(task_time.utime);
stime = cputime_to_jiffies(task_time.stime);
cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
/*
* Niced processes are most likely less important, so double
* their badness points.
*/
if (task_nice(p) > 0)
points *= 2;
Niced processes (nice value greater than 0) are most likely less important, so their badness score is doubled.
/*
* Superuser processes are usually more important, so we make it
* less likely that we kill those.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
has_capability_noaudit(p, CAP_SYS_RESOURCE))
points /= 4;
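Superuser processes (those with CAP_SYS_ADMIN or CAP_SYS_RESOURCE) are usually more important, so their score is divided by 4 to make them less likely to be killed.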
/*
* We don't want to kill a process with direct hardware access.
* Not only could that mess up the hardware, but usually users
* tend to only have this flag set on applications they think
* of as important.
*/
if (has_capability_noaudit(p, CAP_SYS_RAWIO))
points /= 4;
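Processes with direct hardware access (CAP_SYS_RAWIO) are also less likely to be killed: their score is divided by 4 as well.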
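The nice and capability adjustments can be combined into a small user-space sketch. The function name `adjust_for_privilege` and its flag parameters are hypothetical; in the kernel the checks are `task_nice(p) > 0` and `has_capability_noaudit()`. The CPU-time and run-time steps are deliberately left out here.

```c
#include <assert.h>

/*
 * Apply the nice and capability adjustments in the order they appear in
 * badness(): double for niced tasks, then divide by 4 for superuser-like
 * tasks, then divide by 4 again for tasks with raw I/O access.
 */
unsigned long adjust_for_privilege(unsigned long points, int nice,
                                   int is_admin, int has_rawio)
{
    assert(nice >= -20 && nice <= 19);  /* valid nice range */
    if (nice > 0)
        points *= 2;   /* niced: probably less important */
    if (is_admin)
        points /= 4;   /* CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
    if (has_rawio)
        points /= 4;   /* CAP_SYS_RAWIO */
    return points;
}
```

A task with both kinds of capability ends up with one sixteenth of its original score, which makes it unlikely to be selected but does not fully protect it.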
/*
* If p's nodes don't overlap ours, it may still help to kill p
* because p may have allocated or otherwise mapped memory on
* this node before. However it will be less likely.
*/
if (!has_intersects_mems_allowed(p))
points /= 8;
/*
* Adjust the score by oom_adj.
*/
if (oom_adj) {
if (oom_adj > 0) {
if (!points)
points = 1;
points <<= oom_adj;
} else
points >>= -(oom_adj);
}
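Finally the score is adjusted by oom_adj: a positive value shifts the score left by that many bits, a negative value shifts it right.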
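Since each step of oom_adj is one bit shift, the score doubles or halves per step. A minimal sketch of that adjustment, with the hypothetical name `apply_oom_adj`, is shown below. The range check assumes the -17 to 15 range of oom_adj in 2.6.32-era kernels and is there only to keep the shifts well defined in this sketch.

```c
#include <assert.h>

/* Scale a badness score by oom_adj using bit shifts, as badness() does. */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
    /* Assumed valid range; also keeps the shift count below the word size. */
    assert(oom_adj >= -17 && oom_adj <= 15);
    if (oom_adj > 0) {
        if (!points)
            points = 1;        /* make sure a zero score can still grow */
        points <<= oom_adj;    /* each step doubles the score */
    } else if (oom_adj < 0) {
        points >>= -oom_adj;   /* each step halves the score */
    }
    return points;
}
```

For example, an oom_adj of 3 multiplies the score by 8, and an oom_adj of -3 divides it by 8.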
#ifdef DEBUG
printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
p->pid, p->comm, points);
#endif
return points;
}
When DEBUG is defined the score is printed, and the function returns it.
/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
The selection loop compares the scores and picks the process with the highest one.
/*
* skip kernel threads and tasks which have already released
* their mm.
*/
if (!p->mm)
continue;
/* skip the init task */
if (is_global_init(p))
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
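Kernel threads and tasks that have already released their mm are skipped, as are the init task and any task outside the target memory cgroup.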
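The selection loop can be sketched over a plain array. This is a simplified user-space illustration: `struct task_info` and `select_worst` are hypothetical names, the score is assumed to be precomputed, and the memory cgroup check and locking are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a precomputed badness score. */
struct task_info {
    int pid;
    int has_mm;            /* zero for kernel threads and tasks that released their mm */
    int is_init;           /* nonzero for the init task */
    unsigned long points;  /* badness score */
};

/* Return the eligible task with the highest score, or NULL if none is eligible. */
const struct task_info *select_worst(const struct task_info *tasks, size_t n)
{
    const struct task_info *chosen = NULL;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].has_mm)
            continue;      /* skip kernel threads and exiting tasks */
        if (tasks[i].is_init)
            continue;      /* never pick init */
        if (!chosen || tasks[i].points > chosen->points)
            chosen = &tasks[i];
    }
    return chosen;
}
```

Note that the skipped tasks are excluded regardless of their score, so a high-scoring init task or kernel thread can never be returned.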
/**
* dump_tasks - dump current memory state of all system tasks
* @mem: target memory controller
*
* Dumps the current memory state of all system tasks, excluding kernel threads.
* State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
* score, and name.
*
* If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
* shown.
*
* Call with tasklist_lock read-locked.
*/
static void dump_tasks(const struct mem_cgroup *mem)
/*
* Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
* flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
* set.
*/
static void __oom_kill_task(struct task_struct *p, int verbose)
/*
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
/* Try to kill a child first */
/**
* out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
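out_of_memory() kills the "best" process when the system runs out of memory.
When memory runs out, the choices are to kill a random task (bad), to let the system crash (worse), or to try to be smart about which process to kill.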
Original article: http://john88wang.blog.51cto.com/2165294/1637895