Linux VM control

TheGameIsFives

Article link: https://blog.csdn.net/TheGameIsFives/article/details/22868351

On Linux, the files under /proc/sys/vm/ can be used to tune the behavior of the virtual memory (VM) subsystem.

==============================================================

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
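
The two warnings above (not below 1024KB, not too high) can be turned into a small sanity check. This is only a sketch: the 1024 KB floor comes from the text, while the `max_fraction` ceiling of 5% of RAM is a hypothetical value picked for illustration, not a kernel limit.

```python
from pathlib import Path

# Floor taken from the text above: below 1024 KB the system is prone to deadlock.
MIN_SAFE_KBYTES = 1024


def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Return the current setting in kilobytes (Linux only)."""
    return int(Path(path).read_text().strip())


def check_min_free_kbytes(value_kb, total_ram_kb, max_fraction=0.05):
    """Classify a proposed min_free_kbytes value.

    max_fraction is an assumed ceiling for this sketch, not something
    the kernel enforces.
    """
    if value_kb < MIN_SAFE_KBYTES:
        return "too low"
    if value_kb > total_ram_kb * max_fraction:
        return "suspiciously high"
    return "ok"
```

On a Linux box, `check_min_free_kbytes(read_min_free_kbytes(), total_ram_kb)` classifies the live setting; `total_ram_kb` would have to be supplied by the caller, for example from the MemTotal line of /proc/meminfo.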

=============================================================
==============================================================

lowmem_reserve_ratio

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32
-
Note: the number of elements is one fewer than the number of zones, because the
      highest zone's value is not needed for the following calculation.

But these values are not used directly. The kernel calculates the number of
protection pages for each zone from them. They are shown as an array of protection
pages in /proc/zoneinfo, as in the following example from an x86-64 box.
Each zone has an array of protection pages like this.

-
Node 0, zone      DMA
pages free     1355
        min      3
        low      3
        high     4
    :
    :
    numa_other   0
        protection: (0, 2004, 2004, 2004)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pagesets
    cpu: 0 pcp: 0
        :
-
These protections are added to the score used to judge whether this zone should
be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone and
watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone
should not be used because pages_free (1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value were 0, this zone would be used for
the normal page request. If the request is for the DMA zone (index=0), protection[0]
(=0) is used.
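
The comparison in this example can be written out as a few lines of Python. This is a simplified sketch of the check described above, not the kernel's actual allocator logic (which also considers allocation order and flags); the function name `zone_allows` is made up for illustration.

```python
def zone_allows(free_pages, watermark, protection, requested_index):
    """Return True if the zone may serve a request for zone `requested_index`.

    The zone is skipped when its free pages do not exceed
    watermark + protection[requested_index].
    """
    return free_pages > watermark + protection[requested_index]


# Values from the DMA zone shown in the /proc/zoneinfo excerpt above.
dma_protection = (0, 2004, 2004, 2004)

print(zone_allows(1355, 4, dma_protection, 2))  # False: 1355 < 4 + 2004
print(zone_allows(1355, 4, dma_protection, 0))  # True:  1355 > 4 + 0
```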

zone[i]'s protection[j] is calculated by the following expression.

(i < j):
  zone[i]->protection[j]
  = (total sum of present_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
  (should not be protected) = 0;
(i > j):
  (not necessary, but shown as 0)

The default values of lowmem_reserve_ratio[i] are
    256 (if zone[i] is the DMA or DMA32 zone)
    32 (others).
As the expression above shows, they are the reciprocal of the ratio.
256 means 1/256. The number of protection pages becomes about "0.39%" of the total
present pages of the higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
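
The expression for zone[i]->protection[j] can be checked with a short Python sketch. The zone sizes in the example are hypothetical numbers chosen for illustration (they are not taken from the zoneinfo excerpt), and the function name `protection_table` is likewise made up.

```python
def protection_table(present_pages, ratios):
    """Compute protection[i][j] for every pair of zones on one node.

    present_pages: pages per zone, lowest zone first.
    ratios: lowmem_reserve_ratio values, one fewer than the number of zones.
    """
    n = len(present_pages)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Sum of present_pages from zone[i+1] through zone[j].
            higher = sum(present_pages[i + 1 : j + 1])
            table[i][j] = higher // ratios[i]
    return table


# Hypothetical node with three zones: DMA, DMA32, Normal.
pages = [4096, 500000, 1000000]
ratios = [256, 256]

for row in protection_table(pages, ratios):
    print(row)
# [0, 1953, 5859]
# [0, 0, 3906]
# [0, 0, 0]
```

With a ratio of 256, each entry is roughly 0.39% of the higher zones' pages; lowering the ratio to 32 would protect eight times as many pages.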

==============================================================

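The kernel uses low memory to track all memory allocations, so a system with 16 GB of RAM consumes more low memory than a system with 4 GB. Once low memory is exhausted, the OOM killer is triggered even if the system still has free memory overall.
About min_free_kbytes: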

The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size. If you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
Setting this too high will OOM your machine instantly.
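The Linux kernel's policy is to use as much memory as possible to cache file system data and so speed up I/O. In principle the page cache is released automatically when a process needs more memory, but the release may come too late, or the freed memory may be too fragmented to satisfy the process's request.
So we need a way to limit how large the page cache can grow.
Linux provides the min_free_kbytes parameter, which sets the threshold at which the system starts reclaiming memory and thereby controls how much memory is kept free. The higher the value, the earlier the kernel starts reclaiming and the more memory stays free.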

About lowmem_reserve_ratio:
The Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

----------------------------------------------------------------------------------

About overcommit_memory

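What kind of mechanism is the Linux OOM killer? When does it kick in, and which processes does it go after?
When it kicks in
Take the first question: when does it kick in? Is it when malloc returns NULL? No. The malloc man page contains the following passage: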

By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available. This is a really bad bug. In
case it turns out that the system is out of memory, one or more processes
will be killed by the infamous OOM killer. In case Linux is
employed under circumstances where it would be less desirable to suddenly
lose some randomly picked processes, and moreover the kernel version
is sufficiently recent, one can switch off this overcommitting
behavior using a command like:

# echo 2 > /proc/sys/vm/overcommit_memory

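The passage above tells us that on Linux a non-NULL pointer returned by malloc does not necessarily mean the memory it points to is actually available. Linux lets programs request more memory than the system has available; this feature is called overcommit. It is an optimization: not every program uses memory the moment it requests it, and by the time it does, the system may have reclaimed some resources. Unfortunately, if the system still has nothing to spare when you touch the overcommitted memory, the OOM killer kicks in.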

Linux has three overcommit policies (see the kernel document vm/overcommit-accounting), configured through /proc/sys/vm/overcommit_memory. It takes the values 0, 1, and 2; the default is 0.

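0: Heuristic policy. Blatant overcommits are refused, for example suddenly requesting 128 TB of memory, while mild overcommits are allowed. In addition, root may overcommit slightly more than ordinary users.

1: Always allow overcommit. This policy suits applications that cannot tolerate memory allocation failures, such as some scientific computing applications.

2: Never overcommit. In this mode the memory the system can commit does not exceed swap + RAM * ratio (/proc/sys/vm/overcommit_ratio, 50% by default and adjustable). Once that much has been used up, any further attempt to request memory returns an error, which usually means no new programs can be started.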
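
The limit used by policy 2 can be worked through numerically. The sketch below applies the simplified formula given above (swap + RAM * ratio); the function names are made up for illustration, and the real kernel accounting has additional terms that are not modeled here.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Commit limit under policy 2, per the simplified formula swap + RAM * ratio."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def request_allowed(committed_kb, request_kb, limit_kb):
    """Under policy 2 a request fails once it would push commitments past the limit."""
    return committed_kb + request_kb <= limit_kb


# Example: 8 GiB of RAM, 2 GiB of swap, default ratio of 50%.
limit = commit_limit_kb(ram_kb=8 * 1024 * 1024, swap_kb=2 * 1024 * 1024)
print(limit)                                      # 6291456 (6 GiB)
print(request_allowed(6000000, 200000, limit))    # True
print(request_allowed(6000000, 400000, limit))    # False
```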

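Addendum (unverified): in the article "Memory overcommit in Linux", the author mentions that the heuristic policy actually only takes effect when the SMACK or SELinux module is enabled; otherwise it is equivalent to the always-allow policy.
How a victim is chosen once it kicks in
So, wherever overcommit exists, the OOM killer may kick in. What is its policy for choosing a target once it does? What we would hope for is that useless programs that consume a lot of memory get shot.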

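This selection policy has kept evolving in Linux. As users, we can influence the OOM killer's decision by setting certain values. Every process on Linux has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed. (Newer kernels deprecate oom_adj in favor of /proc/<pid>/oom_score_adj, which ranges from -1000 to +1000.)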

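In the end, the OOM killer decides which process to kill based on /proc/<pid>/oom_score. The system computes this value from the process's memory consumption, CPU time (utime + stime), run time (uptime - start time), and oom_adj: the more memory consumed, the higher the score; the longer the process has been alive, the lower the score. In short, the overall policy is to lose the least work, free the most memory, avoid hurting innocent processes that happen to use a lot of memory, and kill as few processes as possible.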

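In addition, when computing a process's memory consumption, Linux also counts half of the memory used by its child processes toward the parent. Processes with many children should therefore beware.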
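
To see which processes currently rank highest, the /proc/<pid>/oom_score files mentioned above can be read and sorted. This is a minimal sketch: the helper names are made up, and the `proc_root` parameter exists only so the function can be pointed at a different directory.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Map each PID under proc_root to the integer in its oom_score file."""
    scores = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                scores[int(entry)] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # the process exited or the file was unreadable
    return scores


def likely_victims(scores, count=5):
    """Return the `count` (pid, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:count]
```

On a Linux machine, `likely_victims(read_oom_scores())` lists the processes the OOM killer would currently be most inclined to pick.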