Introduction to Linux Kernel Engineering: Memory Management (3)

Latest recommended article published on 2022-04-26 16:19:21

淡定情

Tuning Kernel Memory Parameters from User Space

/proc/sys/vm/ (the available entries vary with the kernel version)

Swap-related

swap_token_timeout

This file contains the valid hold time of the swap-out protection token. The Linux VM has a token-based thrashing control mechanism and uses the token to prevent unnecessary page faults in thrashing situations. The value is in seconds and can be useful for tuning thrashing behavior. This tunable was removed in 2.6.20 when the algorithm was improved.

swappiness

swappiness is a parameter which sets the kernel's balance between reclaiming pages from the page cache and swapping process memory. The default value is 60. If you want the kernel to swap out more process memory and thus cache more file contents, increase the value. Otherwise, if you would like the kernel to swap less, decrease it.

page-cluster

page-cluster controls the number of pages which are written to swap in a single attempt, that is, the swap I/O size. It is a logarithmic value: setting it to zero means "1 page", setting it to 1 means "2 pages", setting it to 2 means "4 pages", etc. The default value is three (eight pages at a time). There may be some small benefits in tuning this to a different value if your workload is swap-intensive.
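
Because the setting is a base-2 logarithm, the page count per swap I/O is 2 raised to the page-cluster value. A minimal sketch of that conversion follows; the function name pages_per_swap_io is made up for this illustration and is not a kernel or system tool.

```shell
# pages_per_swap_io PAGE_CLUSTER
# Hypothetical helper: page-cluster is log2 of the page count,
# so the number of pages per swap I/O is 2^PAGE_CLUSTER.
pages_per_swap_io() {
    echo $((1 << $1))
}

# Show the mapping for the settings mentioned in the text.
for setting in 0 1 2 3; do
    echo "page-cluster=$setting -> $(pages_per_swap_io "$setting") page(s)"
done
```

The live setting can be read from /proc/sys/vm/page-cluster and passed to the helper in the same way.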

File cache-related

vfs_cache_pressure

Controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. At the default value of vfs_cache_pressure = 100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.

nr_pdflush_threads

The count of currently-running pdflush threads. This is a read-only value.

min_free_kbytes

This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a pages_min value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.
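
The proportional split can be illustrated with a little arithmetic. The sketch below only follows the description above (each zone's share equals its fraction of all lowmem pages); it is not the kernel's actual code, and the helper name, the 4 kB page size and the sample numbers are assumptions.

```shell
# zone_pages_min MIN_FREE_KBYTES ZONE_PAGES TOTAL_LOWMEM_PAGES [PAGE_KB]
# Hypothetical helper: convert the reserve to pages, then give the zone
# a share proportional to its size. PAGE_KB defaults to an assumed 4 kB.
zone_pages_min() {
    min_free_kbytes=$1
    zone_pages=$2
    total_pages=$3
    page_kb=${4:-4}
    reserve_pages=$((min_free_kbytes / page_kb))
    echo $((reserve_pages * zone_pages / total_pages))
}

# Assumed example: 65536 kB reserve, a zone holding 1/4 of lowmem.
zone_pages_min 65536 1000000 4000000
```

With these sample numbers the reserve is 16384 pages in total, and the zone holding a quarter of lowmem is assigned a quarter of them.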

dirty_background_ratio

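dirty_background_ratio is the threshold, as a percentage of memory, at which pdflush starts writing back once the total size of all modified (dirty) pages exceeds it. Raising this percentage lets pages stay in memory longer.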

dirty_expire_centisecs

dirty_expire_centisecs controls how long, in hundredths of a second, a modified page may exist before it is considered expired and must be written back.

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.
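
Both dirty_background_ratio and dirty_ratio are percentages, so it can help to turn them into absolute sizes. The sketch below does that against a memory size passed as an argument; the helper name and the sample values (10, 20 and 8000000 kB) are assumptions for illustration, and the kernel's own accounting of the memory base may differ from a plain total.

```shell
# dirty_threshold_kb RATIO_PERCENT MEMORY_KB
# Hypothetical helper: size in kB that RATIO_PERCENT of MEMORY_KB represents.
dirty_threshold_kb() {
    echo $(( $2 * $1 / 100 ))
}

# Assumed example values, not read from the running system.
mem_kb=8000000
echo "background writeback starts near $(dirty_threshold_kb 10 "$mem_kb") kB of dirty data"
echo "writers start flushing themselves near $(dirty_threshold_kb 20 "$mem_kb") kB of dirty data"
```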

dirty_writeback_centisecs

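dirty_writeback_centisecs is the interval at which the pdflush threads wake up periodically. In other words, each time this interval elapses, pdflush writes modified data back to disk.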
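
The *_centisecs tunables are expressed in hundredths of a second, which is easy to misread. A small conversion sketch follows; the helper name is hypothetical, and 500 and 3000 are used only as example inputs.

```shell
# centisecs_to_seconds CENTISECS
# Hypothetical helper: print a centisecond value as seconds with two decimals.
centisecs_to_seconds() {
    printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# Example inputs only; read the real values from
# /proc/sys/vm/dirty_writeback_centisecs and /proc/sys/vm/dirty_expire_centisecs.
echo "500 centisecs  = $(centisecs_to_seconds 500) s"
echo "3000 centisecs = $(centisecs_to_seconds 3000) s"
```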

drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free. To free pagecache:

echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed. This tunable was added in 2.6.16.
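
The advice to run "sync" first can be folded into a small wrapper. This is a sketch only: the function name drop_caches and its mode names are made up here, and the optional second argument exists purely so the function can be tried against an ordinary file without root privileges.

```shell
# drop_caches MODE [TARGET]
# Hypothetical wrapper: MODE is pagecache (1), slab (2, dentries and inodes)
# or all (3). TARGET defaults to the real sysctl file, which needs root.
drop_caches() {
    case "$1" in
        pagecache) value=1 ;;
        slab)      value=2 ;;
        all)       value=3 ;;
        *) echo "usage: drop_caches pagecache|slab|all [target]" >&2
           return 1 ;;
    esac
    target=${2:-/proc/sys/vm/drop_caches}
    sync                        # flush dirty data so more objects are freeable
    echo "$value" > "$target"
}
```

As root, "drop_caches all" would then sync and free pagecache, dentries and inodes in one step.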

laptop_mode

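In "laptop mode" the kernel uses the I/O system more intelligently and tries to keep the disk in a low-power state. Laptop mode groups many I/O operations together and completes them in one burst, leaving an inactive period between bursts of disk I/O that by default lasts as long as 10 minutes, which greatly reduces the number of disk spin-ups. To sustain such long inactive periods, the kernel has to complete as much I/O as possible during each active period: it performs a large amount of read-ahead and then syncs all buffers.

Memory allocation-related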

percpu_pagelist_fraction

This is the fraction of pages at most (high mark pcp->high) in each zone that are allocated for each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist.

This entry only changes the value of hot per cpu pagelists. User can specify a number like 100 to allocate 1/100th of each zone to each per cpu page list. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high / 4. The upper limit of batch is (PAGE_SHIFT * 8). The initial value is zero. Kernel does not use this value at boot time to set the high water marks for each per cpu page list.
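
The relationship between the fraction, pcp->high and the batch value can be worked through numerically. The sketch below follows only the rules stated above (high = zone pages / fraction, batch = high / 4, capped at PAGE_SHIFT * 8); the helper name, the default PAGE_SHIFT of 12 and the zone sizes are assumptions.

```shell
# pcp_high_batch ZONE_PAGES FRACTION [PAGE_SHIFT]
# Hypothetical helper: prints "HIGH BATCH" for one per-cpu page list.
# PAGE_SHIFT defaults to an assumed 12 (4 kB pages).
pcp_high_batch() {
    zone_pages=$1
    fraction=$2
    page_shift=${3:-12}
    if [ "$fraction" -lt 8 ]; then
        echo "fraction must be at least 8" >&2
        return 1
    fi
    high=$((zone_pages / fraction))
    batch=$((high / 4))
    max_batch=$((page_shift * 8))
    if [ "$batch" -gt "$max_batch" ]; then
        batch=$max_batch        # upper limit of batch is PAGE_SHIFT * 8
    fi
    echo "$high $batch"
}

# Assumed example: a zone of 262144 pages with the fraction set to 100.
pcp_high_batch 262144 100
```

For a large zone the batch value quickly hits the PAGE_SHIFT * 8 cap, so in this sketch only small zones end up with batch equal to high / 4.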

overcommit_memory

Controls overcommit of system memory, possibly allowing processes to allocate (but not use) more memory than is actually available.

0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

1 - Always overcommit. Appropriate for some scientific applications.

2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap plus a configurable percentage (default is 50) of physical RAM. Depending on the percentage you use, in most situations this means a process will not be killed while attempting to use already-allocated memory but will receive errors on memory allocation as appropriate.

overcommit_ratio

Percentage of physical memory size to include in overcommit calculations.

Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)

swapspace = total size of all swap areas

physmem = size of physical memory in system
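
The formula can be checked with concrete numbers. The sketch below is a direct transcription of it into shell arithmetic; the function name and the sample sizes (2 GiB of swap, 8 GiB of RAM, ratio 50) are assumptions for illustration.

```shell
# commit_limit_kb SWAP_KB PHYSMEM_KB OVERCOMMIT_RATIO
# Hypothetical helper implementing:
#   limit = swapspace + physmem * (overcommit_ratio / 100)
# The multiplication is done first to avoid integer truncation.
commit_limit_kb() {
    echo $(( $1 + $2 * $3 / 100 ))
}

# Assumed example: 2 GiB swap, 8 GiB RAM, overcommit_ratio = 50.
echo "allocation limit: $(commit_limit_kb 2097152 8388608 50) kB"
```

On a running system the kernel reports its own figure as CommitLimit in /proc/meminfo, which can be compared against this estimate.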

max_map_count

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries. While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation. The default value is 65536.
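
To see how close a process is to this limit, its map areas can be counted: each line of /proc/PID/maps describes one area. A minimal sketch, with a hypothetical helper name, that counts the lines of any maps-style file:

```shell
# count_maps MAPS_FILE
# Hypothetical helper: one line per memory map area, so count the lines.
count_maps() {
    wc -l < "$1" | tr -d ' '
}

# /proc/self/maps refers to whichever process opens it (here a short-lived
# child of the shell), so this is only a quick demonstration.
if [ -r /proc/self/maps ]; then
    echo "map areas in use: $(count_maps /proc/self/maps)"
fi
```

For a real check, pass /proc/PID/maps of the process of interest and compare the result with /proc/sys/vm/max_map_count.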

mmap_min_addr

This file indicates the amount of address space which a user process will be restricted from mmaping. Since kernel null dereference bugs could accidentally operate based on the information in the first couple of pages of memory, userspace processes should not be allowed to write to them. By default this value is set to 0 and no protections will be enforced by the security module. Setting this value to something like 64k will allow the vast majority of applications to work correctly and provide defense in depth against future potential kernel bugs.

lowmem_reserve_ratio

Ratio of total pages to free pages for each memory zone.

legacy_va_layout

If non-zero, this sysctl disables the new 32-bit mmap map layout: the kernel will use the legacy (2.4) layout for all processes.

Other

block_dump

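block_dump enables block I/O debugging when set to a non-zero value. If you want to find out which process caused the disk to spin up (see /proc/sys/vm/laptop_mode), you can gather information by setting this flag. Once it is set, Linux reports all read and write operations that reach the disk and all dirtying of blocks belonging to files. This makes it possible to work out why a disk needs to spin up, and so to extend battery life further. The block_dump output is written to the kernel log and can be retrieved with "dmesg". When you use block_dump and your kernel logging level also includes kernel debugging messages, you probably want to turn off klogd; otherwise the block_dump output itself gets logged, causing disk activity that would not normally occur.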

hugepages_treat_as_movable

When a non-zero value is written to this tunable, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Huge pages are not movable so are not allocated from ZONE_MOVABLE by default. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE.

hugetlb_shm_group

hugetlb_shm_group contains the group id that is allowed to create SysV shared memory segments using hugetlb pages.

nr_hugepages

nr_hugepages configures the number of hugetlb pages reserved for the system.
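
Since the setting is a page count, the memory it reserves depends on the huge page size. A minimal sketch of that multiplication follows; the helper name is hypothetical, and the 2048 kB page size is an assumption (the actual size is listed as Hugepagesize in /proc/meminfo).

```shell
# hugepage_pool_kb NR_HUGEPAGES HUGEPAGE_SIZE_KB
# Hypothetical helper: memory reserved by the pool, in kB.
hugepage_pool_kb() {
    echo $(( $1 * $2 ))
}

# Assumed example: 512 huge pages of 2048 kB each.
echo "reserved: $(hugepage_pool_kb 512 2048) kB"
```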

numa_zonelist_order

This sysctl is only for NUMA. 'Where the memory is allocated from' is controlled by zonelists. In the non-NUMA case, the zonelist for GFP_KERNEL is ordered as follows: ZONE_NORMAL -> ZONE_DMA. This means that a memory allocation request for GFP_KERNEL will get memory from ZONE_DMA only when ZONE_NORMAL is not available. In the NUMA case, you can think of the following 2 types of order. Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL:

(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL

(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA

Type (A) offers the best locality for processes on Node(0), but ZONE_DMA will be used before ZONE_NORMAL is exhausted. This increases the possibility of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small. Type (B) cannot offer the best locality but is more robust against OOM of the DMA zone.

Type (A) is called "Node" order. Type (B) is "Zone" order. "Node order" orders the zonelists by node, then by zone within each node. Specify "[Nn]ode" for node order. "Zone order" orders the zonelists by zone type, then by node within each zone. Specify "[Zz]one" for zone order. Specify "[Dd]efault" to request automatic configuration. Autoconfiguration will select "node" order in the following cases:

(1) if the DMA zone does not exist or

(2) if the DMA zone comprises greater than 50% of the available memory or

(3) if any node's DMA zone comprises greater than 60% of its local memory and the amount of local memory is big enough.

Otherwise, "zone" order will be selected. Default order is recommended unless this is causing problems for your system/application.
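
The accepted spellings and the first two autoconfiguration rules can be sketched in shell. Both function names are hypothetical, and this is a simplification, not the kernel's logic: rule (3) is deliberately left out because "big enough" is not quantified in the text.

```shell
# parse_zonelist_order VALUE
# Hypothetical helper: normalise the spellings "[Nn]ode", "[Zz]one", "[Dd]efault".
parse_zonelist_order() {
    case "$1" in
        [Nn]ode)    echo node ;;
        [Zz]one)    echo zone ;;
        [Dd]efault) echo default ;;
        *) echo "unknown order: $1" >&2
           return 1 ;;
    esac
}

# auto_zonelist_order DMA_PAGES TOTAL_PAGES
# Simplified autoconfiguration: applies rules (1) and (2) only.
auto_zonelist_order() {
    if [ "$1" -eq 0 ] || [ $(( $1 * 2 )) -gt "$2" ]; then
        echo node               # no DMA zone, or DMA zone above 50% of memory
    else
        echo zone
    fi
}

# Assumed example: a DMA zone of 4096 pages out of 2097152 pages in total.
echo "requested: $(parse_zonelist_order Default), auto picks: $(auto_zonelist_order 4096 2097152)"
```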

panic_on_oom

This enables or disables the panic on out-of-memory feature. If this is set to 1, the kernel panics when out-of-memory happens. If this is set to 0, the kernel will kill some rogue process by calling oom_kill(). Usually, oom_killer can kill rogue processes and the system will survive. If you want to panic the system rather than killing rogue processes, set this to 1. The default value is 0.

stat_interval

With this tunable you can configure the VM statistics update interval. The default value is 1. This tunable first appeared in the 2.6.22 kernel.

vdso_enabled

When this flag is set, the kernel maps a vDSO page into newly created processes and passes its address down to glibc upon exec(). This feature is enabled by default. vDSO is a virtual DSO (dynamic shared object) exposed by the kernel at some address in every process' memory. Its purpose is to speed up system calls. The mapping address used to be fixed (0xffffe000), but starting with 2.6.18 it's randomized (besides the security implications, this also helps debuggers