Linux | Memory | Processes killed by the OOM Killer because of page allocation failure

License: CC 4.0 BY-SA

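This article looks at how the Linux kernel handles `page allocation failure`, tracing the allocation path through the callers `__alloc_pages_slowpath()`, `__vmalloc_area_node()` and `__vmalloc_node_range()`. When fragmentation makes a high-order allocation impossible, the kernel tries memory compaction and direct reclaim and may finally invoke the OOM Killer. Analysing the gfp_mask and migrate type of the failed allocation shows why the CMA area could not be used. The article closes with remedies: freeing memory, compacting memory, and tuning `vm.min_free_kbytes`.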

This article is a partial summary of the background and mechanics of the Out of Memory Killer being triggered by a page allocation failure.

Updated: 2022/12/30

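Memory allocations often fail with a message such as xxx: page allocation failure: order:10..., which is the output of warn_alloc().

For the concept of Page Allocation Failure, see this reference ¹.

warn_alloc() is called by the following functions: __alloc_pages_slowpath(), __vmalloc_area_node() and __vmalloc_node_range().

The problem is covered in three parts ²:

When is warn_alloc() triggered?
What does warn_alloc() do, that is, how does it work?
What is the root cause in a real case?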

Trigger conditions

To understand what leads to warn_alloc(), we need to look at the situations in which it is called.

__alloc_pages_slowpath()

The source is taken from here ³

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
{
	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
	struct page *page = NULL;
	unsigned int alloc_flags;
	unsigned long did_some_progress;
	enum compact_priority compact_priority;
	enum compact_result compact_result;
	int compaction_retries;
	int no_progress_loops;
	unsigned int cpuset_mems_cookie;
	unsigned int zonelist_iter_cookie;
	int reserve_flags;

	/*
	 * We also sanity check to catch abuse of atomic reserves being used by
	 * callers that are not in atomic context.
	 */
	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
		gfp_mask &= ~__GFP_ATOMIC;

restart:
	compaction_retries = 0;
	no_progress_loops = 0;
	compact_priority = DEF_COMPACT_PRIORITY;
	cpuset_mems_cookie = read_mems_allowed_begin();
	zonelist_iter_cookie = zonelist_iter_begin();

	/*
	 * The fast path uses conservative alloc_flags to succeed only until
	 * kswapd needs to be woken up, and to avoid the cost of setting up
	 * alloc_flags precisely. So we do that now.
	 */
	alloc_flags = gfp_to_alloc_flags(gfp_mask);

	/*
	 * We need to recalculate the starting point for the zonelist iterator
	 * because we might have used different nodemask in the fast path, or
	 * there was a cpuset modification and we are retrying - otherwise we
	 * could end up iterating over non-eligible zones endlessly.
	 */
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx, ac->nodemask);
	if (!ac->preferred_zoneref->zone)
		goto nopage;	/* no suitable zone found: go to nopage handling */

	/*
	 * Check for insane configurations where the cpuset doesn't contain
	 * any suitable zone to satisfy the request - e.g. non-movable
	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
	 */
	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
		struct zoneref *z = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx,
					&cpuset_current_mems_allowed);
		if (!z->zone)
			goto nopage;
	}

	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);

	/*
	 * The adjusted alloc_flags might result in immediate success, so try
	 * that first
	 */
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	if (page)
		goto got_pg;

	/*
	 * For costly allocations, try direct compaction first, as it's likely
	 * that we have enough base pages and don't need to reclaim. For non-
	 * movable high-order allocations, do that as well, as compaction will
	 * try prevent permanent fragmentation by migrating from blocks of the
	 * same migratetype.
	 * Don't try this for allocations that are allowed to ignore
	 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
	 */
	if (can_direct_reclaim &&
			(costly_order ||
			   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
			&& !gfp_pfmemalloc_allowed(gfp_mask)) {
		page = __alloc_pages_direct_compact(gfp_mask, order,
						alloc_flags, ac,
						INIT_COMPACT_PRIORITY,
						&compact_result);	/* costly or non-movable high-order request: try direct compaction first */
		if (page)
			goto got_pg;

		/*
		 * Checks for costly allocations with __GFP_NORETRY, which
		 * includes some THP page fault allocations
		 */
		if (costly_order && (gfp_mask & __GFP_NORETRY)) {	/* the caller asked for no retries */
			/*
			 * If allocating entire pageblock(s) and compaction
			 * failed because all zones are below low watermarks
			 * or is prohibited because it recently failed at this
			 * order, fail immediately unless the allocator has
			 * requested compaction and reclaim retry.
			 *
			 * Reclaim is
			 *  - potentially very expensive because zones are far
			 *    below their low watermarks or this is part of very
			 *    bursty high order allocations,
			 *  - not guaranteed to help because isolate_freepages()
			 *    may not iterate over freed pages as part of its
			 *    linear scan, and
			 *  - unlikely to make entire pageblocks free on its
			 *    own.
			 */
			if (compact_result == COMPACT_SKIPPED ||
			    compact_result == COMPACT_DEFERRED)
				goto nopage;	/* compaction was skipped or deferred: go to nopage handling */

			/*
			 * Looks like reclaim/compaction is worth trying, but
			 * sync compaction could be very expensive, so keep
			 * using async compaction.
			 */
			compact_priority = INIT_COMPACT_PRIORITY;
		}
	}

retry:
	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);	/* wake the kswapd kernel threads and keep them working */

	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
	if (reserve_flags)
		alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
					  (alloc_flags & ALLOC_KSWAPD);

	/*
	 * Reset the nodemask and zonelist iterators if memory policies can be
	 * ignored. These allocations are high priority and system rather than
	 * user oriented.
	 */
	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
		ac->nodemask = NULL;
		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx, ac->nodemask);
	}

	/* Attempt with potentially adjusted zonelist and alloc_flags */
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	if (page)
		goto got_pg;	/* allocation succeeded: return the struct page pointer */

	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;	/* the caller cannot direct-reclaim (so no direct compaction either) and the freelist had nothing: go to nopage */

	/* Avoid recursion of direct reclaim */
	if (current->flags & PF_MEMALLOC)
		goto nopage;

	/* Try direct reclaim and then allocating */
	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
							&did_some_progress);
	if (page)
		goto got_pg;

	/* Try direct compaction and then allocating */
	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
					compact_priority, &compact_result);
	if (page)
		goto got_pg;

	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		goto nopage;	/* the caller explicitly forbids retry loops */

	/*
	 * Do not retry costly high order allocations unless they are
	 * __GFP_RETRY_MAYFAIL
	 */
	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
		goto nopage;	/* costly (high) order without __GFP_RETRY_MAYFAIL: go to nopage */

	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, &no_progress_loops))
		goto retry;

	/*
	 * It doesn't make any sense to retry for the compaction if the order-0
	 * reclaim is not able to make any progress because the current
	 * implementation of the compaction depends on the sufficient amount
	 * of free memory (see __compaction_suitable)
	 */
	if (did_some_progress > 0 &&
			should_compact_retry(ac, order, alloc_flags,
				compact_result, &compact_priority,
				&compaction_retries))
		goto retry;


	/*
	 * Deal with possible cpuset update races or zonelist updates to avoid
	 * a unnecessary OOM kill.
	 */
	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
	    check_retry_zonelist(zonelist_iter_cookie))
		goto restart;

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
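		goto got_pg;	/* __alloc_pages_may_oom() tries to allocate and decides whether to start the OOM killer; did_some_progress leads to a retry. The OOM killer is not invoked for orders above PAGE_ALLOC_COSTLY_ORDER (3). */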

	/* Avoid allocations with no watermarks from looping endlessly */
	if (tsk_is_oom_victim(current) &&
	    (alloc_flags & ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;

	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}

nopage:
	/*
	 * Deal with possible cpuset update races or zonelist updates to avoid
	 * a unnecessary OOM kill.
	 */
	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
	    check_retry_zonelist(zonelist_iter_cookie))
		goto restart;	/* cpuset or zonelist changed: start over from restart */

	/*
	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
	 * we always retry
	 */
	if (gfp_mask & __GFP_NOFAIL) {
		/*
		 * All existing users of the __GFP_NOFAIL are blockable, so warn
		 * of any new users that actually require GFP_NOWAIT
		 */
		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
			goto fail;

		/*
		 * PF_MEMALLOC request from this context is rather bizarre
		 * because we cannot reclaim anything and only can loop waiting
		 * for somebody to do a work for us
		 */
		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);

		/*
		 * non failing costly orders are a hard requirement which we
		 * are not prepared for much so let's warn about these users
		 * so that we can identify them and convert them to something
		 * else.
		 */
		WARN_ON_ONCE_GFP(costly_order, gfp_mask);

		/*
		 * Help non-failing allocations by giving them access to memory
		 * reserves but do not use ALLOC_NO_WATERMARKS because this
		 * could deplete whole memory reserves which would just make
		 * the situation worse
		 */
		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
		if (page)
			goto got_pg;

		cond_resched();
		goto retry;
	}
fail:
	warn_alloc(gfp_mask, ac->nodemask,	/* a block of the requested order could not be allocated */
			"page allocation failure: order:%u", order);
got_pg:
	return page;
}

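__alloc_pages_slowpath() means the page request has entered the slow path, which implies there is also a fast path.

__alloc_pages_nodemask() (named __alloc_pages() in newer kernels) shows that the fast path is get_page_from_freelist().
__alloc_pages_slowpath() is the fallback that runs when the fast path cannot satisfy the request.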
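As a rough reading aid for the listing above, the sketch below models which exit of the slow path a request takes when every allocation attempt inside it fails. It is a userspace simplification, not kernel code: struct slowpath_req, enum slowpath_exit and slowpath_exit_for() are hypothetical names, and the retry heuristics (should_reclaim_retry(), should_compact_retry()), the PF_MEMALLOC check and the early compaction bail-out are deliberately left out.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* same value as in the kernel */

/* Hypothetical summary of the gfp bits the slow path looks at. */
struct slowpath_req {
	unsigned int order;
	bool direct_reclaim;	/* __GFP_DIRECT_RECLAIM set */
	bool noretry;		/* __GFP_NORETRY set */
	bool retry_mayfail;	/* __GFP_RETRY_MAYFAIL set */
	bool nofail;		/* __GFP_NOFAIL set */
};

enum slowpath_exit {
	EXIT_WARN_NO_RECLAIM,	/* cannot reclaim: nopage, then warn_alloc() */
	EXIT_LOOP_NOFAIL,	/* __GFP_NOFAIL: nopage jumps back to retry */
	EXIT_WARN_NORETRY,	/* __GFP_NORETRY: nopage, then warn_alloc() */
	EXIT_WARN_COSTLY,	/* costly order without __GFP_RETRY_MAYFAIL */
	EXIT_RETRY_OR_OOM,	/* reaches the retry checks and __alloc_pages_may_oom() */
};

/* Assumes that every allocation attempt in the slow path fails. */
static enum slowpath_exit slowpath_exit_for(const struct slowpath_req *r)
{
	if (!r->direct_reclaim)
		return EXIT_WARN_NO_RECLAIM;
	if (r->nofail)
		return EXIT_LOOP_NOFAIL;
	if (r->noretry)
		return EXIT_WARN_NORETRY;
	if (r->order > PAGE_ALLOC_COSTLY_ORDER && !r->retry_mayfail)
		return EXIT_WARN_COSTLY;
	return EXIT_RETRY_OR_OOM;
}
```

For instance, an order-10 request that may not direct-reclaim maps to EXIT_WARN_NO_RECLAIM, while the same order with direct reclaim but without __GFP_RETRY_MAYFAIL maps to EXIT_WARN_COSTLY. Whether a message is actually printed still depends on __GFP_NOWARN and on the rate limit inside warn_alloc().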

__vmalloc_area_node()

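The source is taken from here ⁴

This function belongs to the vmalloc path. When an allocation step fails, __vmalloc_area_node() calls warn_alloc() to log the failure and then cleans up, in most cases through the fail label.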

static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
				 pgprot_t prot, unsigned int page_shift,
				 int node)
{
	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
	bool nofail = gfp_mask & __GFP_NOFAIL;
	unsigned long addr = (unsigned long)area->addr;
	unsigned long size = get_vm_area_size(area);
	unsigned long array_size;
	unsigned int nr_small_pages = size >> PAGE_SHIFT;
	unsigned int page_order;
	unsigned int flags;
	int ret;

	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
	gfp_mask |= __GFP_NOWARN;
	if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
		gfp_mask |= __GFP_HIGHMEM;

	/* Please note that the recursion is strictly bounded. */
	if (array_size > PAGE_SIZE) {
		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
					area->caller);
	} else {
		area->pages = kmalloc_node(array_size, nested_gfp, node);
	}

	if (!area->pages) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, failed to allocated page array size %lu",
			nr_small_pages * PAGE_SIZE, array_size);
		free_vm_area(area);
		return NULL;
	}

	set_vm_area_page_order(area, page_shift - PAGE_SHIFT);
	page_order = vm_area_page_order(area);

	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
		node, page_order, nr_small_pages, area->pages);

	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
	if (gfp_mask & __GFP_ACCOUNT) {
		int i;

		for (i = 0; i < area->nr_pages; i++)
			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
	}

	/*
	 * If not enough pages were obtained to accomplish an
	 * allocation request, free them via __vfree() if any.
	 */
	if (area->nr_pages != nr_small_pages) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, page order %u, failed to allocate pages",
			area->nr_pages * PAGE_SIZE, page_order);
		goto fail;
	}

	/*
	 * page tables allocations ignore external gfp mask, enforce it
	 * by the scope API
	 */
	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
		flags = memalloc_nofs_save();
	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
		flags = memalloc_noio_save();

	do {
		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
			page_shift);
		if (nofail && (ret < 0))
			schedule_timeout_uninterruptible(1);
	} while (nofail && (ret < 0));

	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
		memalloc_nofs_restore(flags);
	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
		memalloc_noio_restore(flags);

	if (ret < 0) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, failed to map pages",
			area->nr_pages * PAGE_SIZE);
		goto fail;
	}

	return area->addr;

fail:
	__vfree(area->addr);
	return NULL;
}

__vmalloc_node_range

The source is taken from here ⁴

/**
 * __vmalloc_node_range - allocate virtually contiguous memory
 * @size:		  allocation size
 * @align:		  desired alignment
 * @start:		  vm area range start
 * @end:		  vm area range end
 * @gfp_mask:		  flags for the page level allocator
 * @prot:		  protection mask for the allocated pages
 * @vm_flags:		  additional vm area flags (e.g. %VM_NO_GUARD)
 * @node:		  node to use for allocation or NUMA_NO_NODE
 * @caller:		  caller's return address
 *
 * Allocate enough pages to cover @size from the page level
 * allocator with @gfp_mask flags. Please note that the full set of gfp
 * flags are not supported. GFP_KERNEL, GFP_NOFS and GFP_NOIO are all
 * supported.
 * Zone modifiers are not supported. From the reclaim modifiers
 * __GFP_DIRECT_RECLAIM is required (aka GFP_NOWAIT is not supported)
 * and only __GFP_NOFAIL is supported (i.e. __GFP_NORETRY and
 * __GFP_RETRY_MAYFAIL are not supported).
 *
 * __GFP_NOWARN can be used to suppress failures messages.
 *
 * Map them into contiguous kernel virtual space, using a pagetable
 * protection of @prot.
 *
 * Return: the address of the area or %NULL on failure
 */
void *__vmalloc_node_range(unsigned long size, unsigned long align,
			unsigned long start, unsigned long end, gfp_t gfp_mask,
			pgprot_t prot, unsigned long vm_flags, int node,
			const void *caller)
{
	struct vm_struct *area;
	void *ret;
	kasan_vmalloc_flags_t kasan_flags = KASAN_VMALLOC_NONE;
	unsigned long real_size = size;
	unsigned long real_align = align;
	unsigned int shift = PAGE_SHIFT;

	if (WARN_ON_ONCE(!size))
		return NULL;

	if ((size >> PAGE_SHIFT) > totalram_pages()) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, exceeds total pages",
			real_size);
		return NULL;
	}

	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
		unsigned long size_per_node;

		/*
		 * Try huge pages. Only try for PAGE_KERNEL allocations,
		 * others like modules don't yet expect huge pages in
		 * their allocations due to apply_to_page_range not
		 * supporting them.
		 */

		size_per_node = size;
		if (node == NUMA_NO_NODE)
			size_per_node /= num_online_nodes();
		if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
			shift = PMD_SHIFT;
		else
			shift = arch_vmap_pte_supported_shift(size_per_node);

		align = max(real_align, 1UL << shift);
		size = ALIGN(real_size, 1UL << shift);
	}

again:
	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
				  VM_UNINITIALIZED | vm_flags, start, end, node,
				  gfp_mask, caller);
	if (!area) {
		bool nofail = gfp_mask & __GFP_NOFAIL;
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, vm_struct allocation failed%s",
			real_size, (nofail) ? ". Retrying." : "");
		if (nofail) {
			schedule_timeout_uninterruptible(1);
			goto again;
		}
		goto fail;
	}

	/*
	 * Prepare arguments for __vmalloc_area_node() and
	 * kasan_unpoison_vmalloc().
	 */
	if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) {
		if (kasan_hw_tags_enabled()) {
			/*
			 * Modify protection bits to allow tagging.
			 * This must be done before mapping.
			 */
			prot = arch_vmap_pgprot_tagged(prot);

			/*
			 * Skip page_alloc poisoning and zeroing for physical
			 * pages backing VM_ALLOC mapping. Memory is instead
			 * poisoned and zeroed by kasan_unpoison_vmalloc().
			 */
			gfp_mask |= __GFP_SKIP_KASAN_UNPOISON | __GFP_SKIP_ZERO;
		}

		/* Take note that the mapping is PAGE_KERNEL. */
		kasan_flags |= KASAN_VMALLOC_PROT_NORMAL;
	}

	/* Allocate physical pages and map them into vmalloc space. */
	ret = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
	if (!ret)
		goto fail;

	/*
	 * Mark the pages as accessible, now that they are mapped.
	 * The condition for setting KASAN_VMALLOC_INIT should complement the
	 * one in post_alloc_hook() with regards to the __GFP_SKIP_ZERO check
	 * to make sure that memory is initialized under the same conditions.
	 * Tag-based KASAN modes only assign tags to normal non-executable
	 * allocations, see __kasan_unpoison_vmalloc().
	 */
	kasan_flags |= KASAN_VMALLOC_VM_ALLOC;
	if (!want_init_on_free() && want_init_on_alloc(gfp_mask) &&
	    (gfp_mask & __GFP_SKIP_ZERO))
		kasan_flags |= KASAN_VMALLOC_INIT;
	/* KASAN_VMALLOC_PROT_NORMAL already set if required. */
	area->addr = kasan_unpoison_vmalloc(area->addr, real_size, kasan_flags);

	/*
	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
	 * flag. It means that vm_struct is not fully initialized.
	 * Now, it is fully initialized, so remove this flag here.
	 */
	clear_vm_uninitialized_flag(area);

	size = PAGE_ALIGN(size);
	if (!(vm_flags & VM_DEFER_KMEMLEAK))
		kmemleak_vmalloc(area, size, gfp_mask);

	return area->addr;

fail:
	if (shift > PAGE_SHIFT) {
		shift = PAGE_SHIFT;
		align = real_align;
		size = real_size;
		goto again;
	}

	return NULL;
}

How it works

warn_alloc() first prints the process and the gfp_mask of the failed allocation, then dumps the stack ³ and finally shows the memory state:

void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;
	static DEFINE_RATELIMIT_STATE(nopage_rs, 10*HZ, 1);

	if ((gfp_mask & __GFP_NOWARN) ||
	     !__ratelimit(&nopage_rs) ||
	     ((gfp_mask & __GFP_DMA) && !has_managed_dma()))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	pr_warn("%s: %pV, mode:%#x(%pGg), nodemask=%*pbl",
			current->comm, &vaf, gfp_mask, &gfp_mask,
			nodemask_pr_args(nodemask));	/* prints the name of the current process */
	va_end(args);	/* ...together with the message passed to warn_alloc() */

	cpuset_print_current_mems_allowed();
	pr_cont("\n");
	dump_stack();	/* prints the stack trace */
	warn_alloc_show_mem(gfp_mask, nodemask);	/* prints the memory state; this is the important part */
}

show_mem() prints detailed memory information:

void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
{
	pg_data_t *pgdat;
	unsigned long total = 0, reserved = 0, highmem = 0;

	printk("Mem-Info:\n");
	__show_free_areas(filter, nodemask, max_zone_idx);

	for_each_online_pgdat(pgdat) {
		int zoneid;

		for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
			struct zone *zone = &pgdat->node_zones[zoneid];
			if (!populated_zone(zone))
				continue;

			total += zone->present_pages;
			reserved += zone->present_pages - zone_managed_pages(zone);

			if (is_highmem_idx(zoneid))
				highmem += zone->present_pages;
		}
	}

	printk("%lu pages RAM\n", total);	/* platform-wide page statistics: total pages, reserved, cma and so on */
	printk("%lu pages HighMem/MovableOnly\n", highmem);
	printk("%lu pages reserved\n", reserved);
#ifdef CONFIG_CMA
	printk("%lu pages cma reserved\n", totalcma_pages);
#endif
#ifdef CONFIG_MEMORY_FAILURE
	printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
#endif
}

show_free_areas() prints free-page information at several levels: all nodes together, each node, each zone, and each order within a zone.

/*
 * Show free area list (used inside shift_scroll-lock stuff)
 * We also calculate the percentage fragmentation. We do this by counting the
 * memory on each free list with the exception of the first item on the list.
 *
 * Bits in @filter:
 * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
 *   cpuset.
 */
void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
{
	unsigned long free_pcp = 0;
	int cpu, nid;
	struct zone *zone;
	pg_data_t *pgdat;

	for_each_populated_zone(zone) {
		if (zone_idx(zone) > max_zone_idx)
			continue;
		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
			continue;

		for_each_online_cpu(cpu)
			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
	}

	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"	/* totals across all nodes */
		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
		" unevictable:%lu dirty:%lu writeback:%lu\n"
		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
		" mapped:%lu shmem:%lu pagetables:%lu\n"
		" sec_pagetables:%lu bounce:%lu\n"
		" kernel_misc_reclaimable:%lu\n"
		" free:%lu free_pcp:%lu free_cma:%lu\n",
		global_node_page_state(NR_ACTIVE_ANON),
		global_node_page_state(NR_INACTIVE_ANON),
		global_node_page_state(NR_ISOLATED_ANON),
		global_node_page_state(NR_ACTIVE_FILE),
		global_node_page_state(NR_INACTIVE_FILE),
		global_node_page_state(NR_ISOLATED_FILE),
		global_node_page_state(NR_UNEVICTABLE),
		global_node_page_state(NR_FILE_DIRTY),
		global_node_page_state(NR_WRITEBACK),
		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
		global_node_page_state(NR_FILE_MAPPED),
		global_node_page_state(NR_SHMEM),
		global_node_page_state(NR_PAGETABLE),
		global_node_page_state(NR_SECONDARY_PAGETABLE),
		global_zone_page_state(NR_BOUNCE),
		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
		global_zone_page_state(NR_FREE_PAGES),
		free_pcp,
		global_zone_page_state(NR_FREE_CMA_PAGES));

	for_each_online_pgdat(pgdat) {	/* statistics for each node in turn */

		if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
			continue;
		if (!node_has_managed_zones(pgdat, max_zone_idx))
			continue;

		printk("Node %d"
			" active_anon:%lukB"
			" inactive_anon:%lukB"
			" active_file:%lukB"
			" inactive_file:%lukB"
			" unevictable:%lukB"
			" isolated(anon):%lukB"
			" isolated(file):%lukB"
			" mapped:%lukB"
			" dirty:%lukB"
			" writeback:%lukB"
			" shmem:%lukB"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
			" shmem_thp: %lukB"
			" shmem_pmdmapped: %lukB"
			" anon_thp: %lukB"
#endif
			" writeback_tmp:%lukB"
			" kernel_stack:%lukB"
#ifdef CONFIG_SHADOW_CALL_STACK
			" shadow_call_stack:%lukB"
#endif
			" pagetables:%lukB"
			" sec_pagetables:%lukB"
			" all_unreclaimable? %s"
			"\n",
			pgdat->node_id,
			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
			K(node_page_state(pgdat, NR_INACTIVE_ANON)),
			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
			K(node_page_state(pgdat, NR_UNEVICTABLE)),
			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
			K(node_page_state(pgdat, NR_FILE_MAPPED)),
			K(node_page_state(pgdat, NR_FILE_DIRTY)),
			K(node_page_state(pgdat, NR_WRITEBACK)),
			K(node_page_state(pgdat, NR_SHMEM)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
			K(node_page_state(pgdat, NR_SHMEM_THPS)),
			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
			K(node_page_state(pgdat, NR_ANON_THPS)),
#endif
			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
			node_page_state(pgdat, NR_KERNEL_STACK_KB),
#ifdef CONFIG_SHADOW_CALL_STACK
			node_page_state(pgdat, NR_KERNEL_SCS_KB),
#endif
			K(node_page_state(pgdat, NR_PAGETABLE)),
			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
				"yes" : "no");
	}

	for_each_populated_zone(zone) {	/* statistics for each zone in turn */
		int i;

		if (zone_idx(zone) > max_zone_idx)
			continue;
		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
			continue;

		free_pcp = 0;
		for_each_online_cpu(cpu)
			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;

		show_node(zone);
		printk(KERN_CONT
			"%s"
			" free:%lukB"
			" boost:%lukB"
			" min:%lukB"
			" low:%lukB"
			" high:%lukB"
			" reserved_highatomic:%luKB"
			" active_anon:%lukB"
			" inactive_anon:%lukB"
			" active_file:%lukB"
			" inactive_file:%lukB"
			" unevictable:%lukB"
			" writepending:%lukB"
			" present:%lukB"
			" managed:%lukB"
			" mlocked:%lukB"
			" bounce:%lukB"
			" free_pcp:%lukB"
			" local_pcp:%ukB"
			" free_cma:%lukB"
			"\n",
			zone->name,
			K(zone_page_state(zone, NR_FREE_PAGES)),
			K(zone->watermark_boost),
			K(min_wmark_pages(zone)),
			K(low_wmark_pages(zone)),
			K(high_wmark_pages(zone)),
			K(zone->nr_reserved_highatomic),
			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
			K(zone->present_pages),
			K(zone_managed_pages(zone)),
			K(zone_page_state(zone, NR_MLOCK)),
			K(zone_page_state(zone, NR_BOUNCE)),
			K(free_pcp),
			K(this_cpu_read(zone->per_cpu_pageset->count)),
			K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
		printk("lowmem_reserve[]:");
		for (i = 0; i < MAX_NR_ZONES; i++)
			printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
		printk(KERN_CONT "\n");
	}

	for_each_populated_zone(zone) {	/* free block counts per order for each zone */
		unsigned int order;
		unsigned long nr[MAX_ORDER], flags, total = 0;
		unsigned char types[MAX_ORDER];

		if (zone_idx(zone) > max_zone_idx)
			continue;
		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
			continue;
		show_node(zone);
		printk(KERN_CONT "%s: ", zone->name);

		spin_lock_irqsave(&zone->lock, flags);
		for (order = 0; order < MAX_ORDER; order++) {	/* walk every order of this zone: nr[] holds the number of free blocks per order, total the number of free pages */
			struct free_area *area = &zone->free_area[order];
			int type;

			nr[order] = area->nr_free;
			total += nr[order] << order;

			types[order] = 0;
			for (type = 0; type < MIGRATE_TYPES; type++) {
				if (!free_area_empty(area, type))
					types[order] |= 1 << type;	/* record which migrate types have free blocks at this order */
			}
		}
		spin_unlock_irqrestore(&zone->lock, flags);
		for (order = 0; order < MAX_ORDER; order++) {
			printk(KERN_CONT "%lu*%lukB ",
			       nr[order], K(1UL) << order);	/* number of free blocks and block size for each order */
			if (nr[order])
				show_migration_types(types[order]);	/* migrate types present at this order */
		}
		printk(KERN_CONT "= %lukB\n", K(total));	/* total free size */
	}

	for_each_online_node(nid) {
		if (show_mem_node_skip(filter, nid, nodemask))
			continue;
		hugetlb_show_meminfo_node(nid);	/* huge page statistics */
	}

	printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));	/* total number of page cache pages */

	show_swap_cache_info();	/* swap cache statistics */
}

Pages have different attributes, and the letters in the warn_alloc() output stand for those migrate types. The main ones are M, U, E and C.

static void show_migration_types(unsigned char type)
{
	static const char types[MIGRATE_TYPES] = {
		[MIGRATE_UNMOVABLE]	= 'U',	/* unmovable */
		[MIGRATE_MOVABLE]	= 'M',	/* movable */
		[MIGRATE_RECLAIMABLE]	= 'E',	/* reclaimable */
		[MIGRATE_HIGHATOMIC]	= 'H',	/* high-order atomic reserve; has the same enum value as MIGRATE_PCPTYPES */
#ifdef CONFIG_CMA
		[MIGRATE_CMA]		= 'C',	/* pages in the CMA area */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
		[MIGRATE_ISOLATE]	= 'I',
#endif
	};
	char tmp[MIGRATE_TYPES + 1];
	char *p = tmp;
	int i;

	for (i = 0; i < MIGRATE_TYPES; i++) {
		if (type & (1 << i))
			*p++ = types[i];
	}

	*p = '\0';
	printk(KERN_CONT "(%s) ", tmp);
}

Worked examples

1.

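The kernel log shown by dmesg reports a page allocation failure ⁵:

[Image: dmesg output]

The words Page Allocation Failure mean that the system could not allocate high-order memory (that is, a large block of physically contiguous memory; for how allocation works, see the Memory Allocation Algorithm section later in this article). Check how free pages are distributed across orders with cat /proc/buddyinfo: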

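[Image: /proc/buddyinfo output]

Memory is badly fragmented: there are plenty of low-order blocks, but no free blocks of 64 KB or larger (the red box in the screenshot marks the counts for 64 KB and above, which are all 0).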
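Since the screenshot is not reproduced here, the following sketch shows how such a line can be read programmatically. It assumes the usual "Node N, zone NAME count0 count1 ..." layout of /proc/buddyinfo, where the i-th count is the number of free blocks of order i; the buddy_* function names and the sample line in the comment are hypothetical and are not the values from the original screenshot.

```c
#include <stdio.h>
#include <stdlib.h>

#define BUDDY_MAX_ORDERS 16	/* generous upper bound for this sketch */

/*
 * Parse one /proc/buddyinfo line such as
 *   "Node 0, zone   Normal   1200 800 300 50 0 0 0 0 0 0 0"
 * into counts[order]. Returns the number of orders read, or -1 if the
 * line does not start with the expected "Node N, zone NAME" prefix.
 */
static int buddy_parse_line(const char *line,
			    unsigned long counts[BUDDY_MAX_ORDERS])
{
	int node, consumed = 0, n = 0;
	char zone[32];
	const char *p;

	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
		return -1;

	p = line + consumed;
	while (n < BUDDY_MAX_ORDERS) {
		char *end;
		unsigned long v = strtoul(p, &end, 10);

		if (end == p)	/* no more numbers on this line */
			break;
		counts[n++] = v;
		p = end;
	}
	return n;
}

/* Highest order that still has at least one free block, or -1 if none. */
static int buddy_highest_free_order(const unsigned long *counts, int n)
{
	for (int order = n - 1; order >= 0; order--)
		if (counts[order])
			return order;
	return -1;
}

/* Total free memory in KiB, given the base page size in KiB. */
static unsigned long buddy_free_kib(const unsigned long *counts, int n,
				    unsigned long page_kib)
{
	unsigned long total = 0;

	for (int order = 0; order < n; order++)
		total += (counts[order] << order) * page_kib;
	return total;
}
```

For the sample line in the comment, buddy_highest_free_order() yields 3, so with 4 KiB pages the largest free block is 32 KiB. That is the kind of picture described above: memory is free, but nothing of 64 KB or more is available in one piece.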

Check overall memory usage with free -h:

[Image: free -h output]

The output shows about 890 MB free, which presumably consists of the low-order pages.

2.

[ 2161.623563] xxxx: page allocation failure: order:10, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)  ----- warn_alloc(): shows which process failed to allocate pages, together with the gfp_mask.
[ 2161.632085] CPU: 0 PID: 179 Comm: AiApp Not tainted 4.9.56 #53  ----- dump_stack(): the stack trace gives the detailed call path.
[ 2161.637947] 
Call Trace:
[<802f63f2>] dump_stack+0x1e/0x3c
[<800f6cf4>] warn_alloc+0x100/0x148
[<800f709c>] __alloc_pages_nodemask+0x2bc/0xb5c
[<801120fe>] kmalloc_order+0x26/0x48
[<80112158>] kmalloc_order_trace+0x38/0x98
[<8012c5d8>] __kmalloc+0xf4/0x12c
[<8048ac78>] alloc_ep_req+0x5c/0x98
[<8048f232>] source_sink_recv+0x2a/0xe0
[<8048f35e>] usb_sourcesink_bulk_read+0x76/0x1c8
[<8048f770>] usb_sourcesink_read+0xfc/0x2c8
[<80134d58>] __vfs_read+0x30/0x108
[<80135c14>] vfs_read+0x94/0x128
[<80136d12>] SyS_read+0x52/0xd4
[<8004a246>] csky_systemcall+0x96/0xe0
[ 2161.689204] Mem-Info:  ----- show_mem()
[ 2161.691518] active_anon:3268 inactive_anon:2 isolated_anon:0  ----- Totals across all nodes.
[ 2161.691518]  active_file:1271 inactive_file:89286 isolated_file:0
[ 2161.691518]  unevictable:0 dirty:343 writeback:0 unstable:0
[ 2161.691518]  slab_reclaimable:2019 slab_unreclaimable:644
[ 2161.691518]  mapped:4282 shmem:4 pagetables:59 bounce:0
[ 2161.691518]  free:62086 free_pcp:199 free_cma:60234
----- There is only one node, so the statistics for node 0 follow.
[ 2161.724334] Node 0 active_anon:13072kB inactive_anon:8kB active_file:5084kB inactive_file:357144kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:17128kB dirty:1372kB writeback:0kB shmem:16kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
----- Statistics for the Normal zone.
[ 2161.748626] Normal free:248344kB min:2444kB low:3052kB high:3660kB active_anon:13072kB inactive_anon:8kB active_file:5084kB inactive_file:357144kB unevictable:0kB writepending:1372kB present:1048572kB managed:734568kB mlocked:0kB slab_reclaimable:8076kB slab_unreclaimable:2576kB kernel_stack:608kB pagetables:236kB bounce:0kB free_pcp:796kB local_pcp:796kB free_cma:240936kB
[ 2161.781670] lowmem_reserve[]: 0 0 0
----- Free blocks per order in the Normal zone, with the migrate types present at each order.
[ 2161.785225] Normal: 4*4kB (UEC) 3*8kB (EC) 3*16kB (UEC) 2*32kB (UE) 2*64kB (UE) 2*128kB (UE) 2*256kB (EC) 1*512kB (E) 3*1024kB (UEC) 3*2048kB (UEC) 58*4096kB (C) = 248344kB
90573 total pagecache pages
----- Page statistics for the whole platform.
[ 2161.803526] 262143 pages RAM
[ 2161.806410] 0 pages HighMem/MovableOnly
[ 2161.810264] 78501 pages reserved
[ 2161.813509] 90112 pages cma reserved

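This means that process xxxx failed to allocate an order-10 block, that is, 2¹⁰ physically contiguous pages.

order tells you how many pages were requested. order:10 counts as high-order because it asks for 2¹⁰ = 1024 contiguous pages, or 4096 KiB with 4 KiB pages.
mode is the set of flags passed to the kernel memory allocator; the flags are defined in include/linux/gfp.h ⁶.

Memory fragmentation can make a page allocation fail even when plenty of pages are free. A failure at order=0, by contrast, means that not even a single page could be allocated, so the system is effectively out of memory.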
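The arithmetic can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the 4 KiB page size is an assumption (it is consistent with the 4*4kB entry at order 0 in the log above).

```c
/* Number of base pages in one block of the given order. */
static unsigned long order_to_pages(unsigned int order)
{
	return 1UL << order;
}

/* Size in KiB of one block of the given order. */
static unsigned long order_to_kib(unsigned int order, unsigned long page_kib)
{
	return order_to_pages(order) * page_kib;
}

/*
 * Free memory outside the CMA area, computed from the free: and free_cma:
 * fields that the log prints for a zone (both in KiB).
 */
static unsigned long non_cma_free_kib(unsigned long free_kib,
				      unsigned long free_cma_kib)
{
	return free_kib > free_cma_kib ? free_kib - free_cma_kib : 0;
}
```

With the Normal zone values from the log (free:248344kB, free_cma:240936kB), non_cma_free_kib() gives 7408 KiB. That is more than the 4096 KiB returned by order_to_kib(10, 4), yet the only order-10 blocks listed in the log, 58*4096kB (C), are CMA blocks, so the free memory outside CMA is too fragmented to serve the request.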

The call trace shows that alloc_ep_req() is where the allocation starts.

struct usb_request *alloc_ep_req(struct usb_ep *ep, size_t len)
{
    struct usb_request      *req;

    req = usb_ep_alloc_request(ep, GFP_ATOMIC);
    if (req) {
        req->length = usb_endpoint_dir_out(ep->desc) ?
            usb_ep_align(ep, len) : len;
        req->buf = kmalloc(req->length, GFP_ATOMIC);
        if (!req->buf) {
            usb_ep_free_request(ep, req);
            req = NULL;
        }
    }
    return req;
}

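GFP_ATOMIC and __GFP_COMP: page allocation flags

The code shows that gfp_mask is GFP_ATOMIC here. GFP_ATOMIC does not include __GFP_DIRECT_RECLAIM, so direct page reclaim is not allowed for this request. The __GFP_COMP bit seen in the log is most likely added on the kmalloc_order() path that appears in the call trace.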

#define GFP_ATOMIC    (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define __GFP_HIGH    ((__force gfp_t)___GFP_HIGH)	/* higher priority */
#define __GFP_ATOMIC    ((__force gfp_t)___GFP_ATOMIC)	/* the caller cannot reclaim pages or sleep and is high priority; typically used in interrupt handlers */
#define __GFP_KSWAPD_RECLAIM    ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake: the kswapd thread is woken during allocation */
#define __GFP_COMP    ((__force gfp_t)___GFP_COMP)	/* compound page flag: two or more pages are treated as one page */

The GFP bit masks are defined as follows:

#define ___GFP_DMA        0x01u
#define ___GFP_HIGHMEM        0x02u
#define ___GFP_DMA32        0x04u
#define ___GFP_MOVABLE        0x08u
#define ___GFP_RECLAIMABLE    0x10u
#define ___GFP_HIGH        0x20u
#define ___GFP_IO        0x40u
#define ___GFP_FS        0x80u
#define ___GFP_COLD        0x100u
#define ___GFP_NOWARN        0x200u
#define ___GFP_REPEAT        0x400u
#define ___GFP_NOFAIL        0x800u
#define ___GFP_NORETRY        0x1000u
#define ___GFP_MEMALLOC        0x2000u
#define ___GFP_COMP        0x4000u
#define ___GFP_ZERO        0x8000u
#define ___GFP_NOMEMALLOC    0x10000u
#define ___GFP_HARDWALL        0x20000u
#define ___GFP_THISNODE        0x40000u
#define ___GFP_ATOMIC        0x80000u
#define ___GFP_ACCOUNT        0x100000u
#define ___GFP_NOTRACK        0x200000u
#define ___GFP_DIRECT_RECLAIM    0x400000u
#define ___GFP_OTHER_NODE    0x800000u
#define ___GFP_WRITE        0x1000000u
#define ___GFP_KSWAPD_RECLAIM    0x2000000u
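To read the mode value printed by warn_alloc(), the table above can be turned into a small decoder. The following is a sketch only: the bit values are copied from the table above and differ between kernel versions, and `decode_gfp` is a hypothetical helper name.

```python
# Bit values copied from the ___GFP_* table above (kernel-version specific).
GFP_BITS = {
    "__GFP_DMA": 0x01, "__GFP_HIGHMEM": 0x02, "__GFP_DMA32": 0x04,
    "__GFP_MOVABLE": 0x08, "__GFP_RECLAIMABLE": 0x10, "__GFP_HIGH": 0x20,
    "__GFP_IO": 0x40, "__GFP_FS": 0x80, "__GFP_COLD": 0x100,
    "__GFP_NOWARN": 0x200, "__GFP_REPEAT": 0x400, "__GFP_NOFAIL": 0x800,
    "__GFP_NORETRY": 0x1000, "__GFP_MEMALLOC": 0x2000, "__GFP_COMP": 0x4000,
    "__GFP_ZERO": 0x8000, "__GFP_NOMEMALLOC": 0x10000,
    "__GFP_HARDWALL": 0x20000, "__GFP_THISNODE": 0x40000,
    "__GFP_ATOMIC": 0x80000, "__GFP_ACCOUNT": 0x100000,
    "__GFP_NOTRACK": 0x200000, "__GFP_DIRECT_RECLAIM": 0x400000,
    "__GFP_OTHER_NODE": 0x800000, "__GFP_WRITE": 0x1000000,
    "__GFP_KSWAPD_RECLAIM": 0x2000000,
}

# GFP_ATOMIC as defined earlier: __GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM
GFP_ATOMIC = (GFP_BITS["__GFP_HIGH"] | GFP_BITS["__GFP_ATOMIC"]
              | GFP_BITS["__GFP_KSWAPD_RECLAIM"])


def decode_gfp(mask: int) -> list:
    """Return the names of all flags set in mask, lowest bit first."""
    return [name for name, bit in GFP_BITS.items() if mask & bit]


print(hex(GFP_ATOMIC | GFP_BITS["__GFP_COMP"]))
print(decode_gfp(GFP_ATOMIC | GFP_BITS["__GFP_COMP"]))
```

Note that __GFP_DIRECT_RECLAIM and __GFP_MOVABLE are both absent from the decoded list, which is what the rest of the analysis relies on.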

From gfp to migratetype to alloc_flags: why can't the CMA area be used?

gfp_mask determines the migratetype of the requested pages; when CMA is present, the migratetype then decides whether the CMA area may be used.

static inline unsigned int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

	/*
	 * __GFP_HIGH is assumed to be the same as ALLOC_HIGH
	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
	 * to save two branches.
	 */
	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);

	/*
	 * The caller may dip into page reserves a bit more if the caller
	 * cannot run direct reclaim, or if the caller has realtime scheduling
	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
	 */
	alloc_flags |= (__force int)
		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));	/* converts __GFP_HIGH to ALLOC_HIGH */

	if (gfp_mask & __GFP_ATOMIC) {
		/*
		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
		 * if it can't schedule.
		 */
		if (!(gfp_mask & __GFP_NOMEMALLOC))
			alloc_flags |= ALLOC_HARDER;
		/*
		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
		 * comment for __cpuset_node_allowed().
		 */
		alloc_flags &= ~ALLOC_CPUSET;
	} else if (unlikely(rt_task(current)) && in_task())
		alloc_flags |= ALLOC_HARDER;

	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);

	return alloc_flags;
}
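The CMA decision itself happens in gfp_to_alloc_flags_cma(), which is not shown here. The sketch below models it under an assumption: that CMA is allowed only when the mobility bits of gfp_mask select the movable migratetype. The bit values come from the table in this article, and the function names are illustrative rather than the kernel's exact code.

```python
# Mobility bits, taken from the ___GFP_* table in this article.
GFP_MOVABLE = 0x08
GFP_RECLAIMABLE = 0x10
GFP_MOVABLE_SHIFT = 3  # bit position of the lower mobility bit

MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE = 0, 1, 2


def gfp_migratetype(gfp_mask: int) -> int:
    """Map the two mobility bits of a gfp mask to a migratetype index."""
    return (gfp_mask & (GFP_MOVABLE | GFP_RECLAIMABLE)) >> GFP_MOVABLE_SHIFT


def may_use_cma(gfp_mask: int) -> bool:
    """Assumed rule: only movable allocations may be served from CMA."""
    return gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE


GFP_ATOMIC_COMP = 0x2084020  # GFP_ATOMIC | __GFP_COMP with the bit values above
print(may_use_cma(GFP_ATOMIC_COMP))
print(may_use_cma(GFP_ATOMIC_COMP | GFP_MOVABLE))
```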

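Putting it together with warn_alloc()

Plenty of memory is free, yet alloc_ep_req() cannot use it.
The gfp_mask that alloc_ep_req() allocates with is GFP_ATOMIC | __GFP_COMP.
Because it does not include __GFP_MOVABLE, the many free 4 MB contiguous blocks are off limits to it: all of those 4 MB blocks belong to CMA.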

[ 2161.785225] Normal: 4*4kB (UEC) 3*8kB (EC) 3*16kB (UEC) 2*32kB (UE) 2*64kB (UE) 2*128kB (UE) 2*256kB (EC) 1*512kB (E) 3*1024kB (UEC) 3*2048kB (UEC) 58*4096kB (C) = 248344kB
----- The 4 MB CMA blocks alone add up to 232 MiB (58 × 4096 kB = 237568 kB); everything else totals only 10776 kB, about 10.5 MiB.
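The split between CMA-only blocks and everything else can be recomputed from the log line itself. This is a rough sketch: the regular expression is an assumption about the line format, and `split_cma` is a made-up helper. An order whose flags are exactly (C) holds only CMA blocks; mixed flags such as (UEC) do not say how many blocks belong to each type, so those orders are counted as "other".

```python
import re

LINE = ("Normal: 4*4kB (UEC) 3*8kB (EC) 3*16kB (UEC) 2*32kB (UE) 2*64kB (UE) "
        "2*128kB (UE) 2*256kB (EC) 1*512kB (E) 3*1024kB (UEC) 3*2048kB (UEC) "
        "58*4096kB (C) = 248344kB")


def split_cma(line: str):
    """Return (cma_only_kb, other_kb) for one buddy free-list line."""
    cma_only = other = 0
    for count, size_kb, types in re.findall(r"(\d+)\*(\d+)kB \(([A-Z]+)\)", line):
        kb = int(count) * int(size_kb)
        if types == "C":   # every free block of this order is CMA
            cma_only += kb
        else:              # mixed or non-CMA orders
            other += kb
    return cma_only, other


cma_kb, other_kb = split_cma(LINE)
print(cma_kb, other_kb, cma_kb + other_kb)
```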

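Why is almost all of the remaining memory CMA?
The free lists of the Normal zone show that the vast majority of free pages are CMA pages, although at boot there were plenty of pages of other types.
Comparing the output of cat /proc/pagetypeinfo before and after shows that the Movable pages have almost all been allocated.

So a memory leak is suspected; the following loop tracks MemFree.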

while true; do cat /proc/meminfo | grep MemFree; sleep 10; done

MemFree keeps dropping, and warn_alloc() fires once it reaches about 260M.
So the root cause is a memory leak.
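The same check can be scripted so that a steady decline is flagged automatically. The sketch below is one possible way to do it; the helper names and the 1024 kB threshold are assumptions, not something from the original investigation.

```python
def mem_free_kb(meminfo_text: str) -> int:
    """Extract the MemFree value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def steadily_falling(samples, min_drop_kb=1024):
    """True if each sample is lower than the previous one and the total
    drop is at least min_drop_kb (threshold chosen arbitrarily here)."""
    if len(samples) < 2:
        return False
    falling = all(b < a for a, b in zip(samples, samples[1:]))
    return falling and samples[0] - samples[-1] >= min_drop_kb
```

In use, one would read /proc/meminfo every 10 seconds, append `mem_free_kb(open("/proc/meminfo").read())` to a list, and call `steadily_falling()` on the most recent samples. A falling MemFree is only a hint: page cache growth also lowers it, so the result has to be read together with /proc/pagetypeinfo.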

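Solutions

Freeing memory

Before freeing memory, run sync manually to write all unwritten system buffers to disk, including modified inodes, delayed block I/O, and read/write mapped files.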

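Free the page cache:
echo 1 > /proc/sys/vm/drop_caches
Free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
Free the page cache, dentries, and inodes together:
echo 3 > /proc/sys/vm/drop_caches

The operations above are non-destructive: only objects that are completely unused are freed. Dirty objects stay in use until they are written to disk, so they are not freed. If repeating echo 3 > /proc/sys/vm/drop_caches frees nothing more, try echo 0 > /proc/sys/vm/drop_caches first (on kernels that accept the value 0) and then run echo 3 > /proc/sys/vm/drop_caches again.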

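Compacting memory

If there are still not enough high-order blocks after freeing memory, compaction can be triggered with echo 1 > /proc/sys/vm/compact_memory. This step is fairly CPU-intensive.

After compaction, a large number of high-order blocks become available again.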

Changing the configuration

Increase vm.min_free_kbytes (/proc/sys/vm/min_free_kbytes)

Concept

min_free_kbytes is described as follows ⁷:

This is used to force the Linux VM to keep a minimum number of kilobytes free.  The VM uses this number to compute a watermark [WMARK_MIN] value for each lowmem zone in the system.  Each lowmem zone gets a number of reserved free pages based proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024 KB, your system will become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

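It is the lower bound on the free memory that the system keeps in reserve.
A default value is computed from the amount of memory at boot, using the rule: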

min_free_kbytes = sqrt(lowmem_kbytes * 16) = 4 * sqrt(lowmem_kbytes)

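Note: lowmem_kbytes can be thought of as the size of system memory.

The computed value is also clamped: at least 128K and at most 64M.
min_free_kbytes therefore does not grow linearly with memory. The kernel comments give the reason: because network bandwidth does not increase linearly with machine size. As memory grows there is no need to reserve proportionally more; enough to cover emergencies is sufficient.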
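The rule and its clamp are easy to try out. This is a minimal sketch of the formula as stated above, with the 128K and 64M limits from this article; the function name is made up, and newer kernels may use a different upper limit.

```python
import math


def default_min_free_kbytes(lowmem_kbytes: int) -> int:
    """min_free_kbytes = sqrt(lowmem_kbytes * 16), clamped to [128, 65536] kB."""
    value = math.isqrt(lowmem_kbytes * 16)
    return max(128, min(value, 65536))


for gib in (1, 4, 16, 64):
    print(gib, "GiB ->", default_min_free_kbytes(gib * 1024 * 1024), "kB")
```

Quadrupling the memory only doubles the reserve, which is the square-root growth described above.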

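The main use of min_free_kbytes is to compute the three parameters that drive memory reclaim, watermark [min / low / high]:

(1) watermark [high] > watermark [low] > watermark [min], and every zone has its own set.
(2) When a zone's free memory falls below watermark [low], the kernel thread kswapd (one per NUMA node) is woken to reclaim memory, and it keeps going until the zone's free memory reaches watermark [high]. If allocations arrive so fast that free memory drops to watermark [min], the kernel performs direct reclaim: it reclaims in the context of the allocating process and then satisfies the request from the reclaimed pages. This blocks the application, adds latency, and may trigger an OOM. Memory below watermark [min] is the system's own reserve for special uses and is not handed out to ordinary user-space allocations.
(3) The three watermarks are computed as: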

watermark[min] = min_free_kbytes
watermark[low] = watermark[min] * 5 / 4
watermark[high] = watermark[min] * 3 / 2

So the buffer between the lines is high - low = low - min = per_zone_min_free_pages * 1/4. Since min_free_kbytes = 4 * sqrt(lowmem_kbytes), this buffer also grows with the square root of memory size.
(4) The watermarks of each zone can be viewed in /proc/zoneinfo:

Node 0, zone      DMA
pages  free     3960
       min      65
       low      81
       high     97
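The three formulas in (3) can be checked against the zoneinfo sample. The sketch below applies them in whole pages with integer division, which is an assumption about rounding; newer kernels also factor in watermark_scale_factor, so their numbers can differ slightly.

```python
def zone_watermarks(min_pages: int):
    """Return (min, low, high) in pages: low = min * 5/4, high = min * 3/2."""
    low = min_pages + min_pages // 4
    high = min_pages + min_pages // 2
    return min_pages, low, high


# The DMA zone above reports min 65.
print(zone_watermarks(65))
```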

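Effect of the min_free_kbytes value

(1) The larger min_free_kbytes is, the higher the watermark lines sit and the wider the buffers between them. kswapd starts reclaiming earlier and reclaims more memory (it only stops at watermark [high]), so the system keeps more memory free than necessary, which to some extent reduces the memory available to applications. In the extreme case, setting min_free_kbytes close to the total memory size leaves applications too little memory and may cause frequent OOMs.
(2) If min_free_kbytes is too small, the system's reserve is too small. kswapd itself performs a small amount of allocation while reclaiming (with the PF_MEMALLOC flag set), and that flag lets it use the reserve. Another case is a process chosen and killed by the OOM killer: if it needs memory while exiting, it may also draw on the reserve. Letting both use the reserve keeps the system from deadlocking.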

Procedure

To change this value (here, lowering it to 16 MiB) ⁸:

echo "vm.min_free_kbytes=16384" >> /etc/sysctl.conf

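The entry in /etc/sysctl.conf only takes effect after a reboot or after running sysctl -p. To apply the value immediately, run:

sysctl -w vm.min_free_kbytes=16384

and then read it back with sysctl vm.min_free_kbytes to confirm that the change has taken effect.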

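When the system's free memory (excluding buffers and cache) falls below the watermarks derived from this value, the kernel thread kswapd is started to reclaim memory. If the OOM killer is triggered anyway, either memory really is insufficient, or the OOM killer fired before or during reclaim ⁹.

In practice this only reduces how often the kernel prints the message. Raising vm.min_free_kbytes raises the watermarks (low, high), so kswapd can start reclaiming and freeing memory earlier, but the error still appears when a large amount of memory is suddenly needed ¹⁰.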

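Linux memory allocation

Memory zones

Common memory zones in linux ¹⁰:

Zone	Description
`ZONE_DMA`	Pages in this zone are used for `DMA` (direct memory access) operations
`ZONE_DMA32`	Like `ZONE_DMA`, except that its pages are the ones reachable by devices limited to `32`-bit addressing
`ZONE_NORMAL`	Normally mapped pages; user-space programs use pages from this zone
`ZONE_HIGHMEM`	High memory, whose pages cannot be permanently mapped into the kernel address space. On `64`-bit architectures all memory can be mapped, so `x86-64` machines have no high memory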

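When allocating physical memory, the kernel searches from the highest zone (HIGHMEM) down to the lowest (DMA) for one with enough free memory, and then maps the memory it finds to virtual addresses for the program to use. The low zones are small, though: once they fill up, oom or page allocation failure can still occur even if a lot of physical memory remains.

The DMA zone is tiny, so the DMA32 and NORMAL zones are the ones to watch. The following shows the zone details for NODE0 and NODE1, the memory nodes of the two CPUs: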

Node 0 DMA free:15920kB min:4kB low:4kB high:4kB dirty:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
Node 0 DMA32 free:120772kB min:1612kB low:2012kB high:2416kB dirty:0kB shmem:0kB slab_reclaimable:137848kB slab_unreclaimable:5708kB
Node 0 Normal free:36276kB min:14608kB low:18260kB high:21912kB dirty:0kB shmem:4kB slab_reclaimable:222264kB slab_unreclaimable:27504kB
Node 1 Normal free:19976kB min:16260kB low:20324kB high:24388kB dirty:0kB shmem:0kB slab_reclaimable:371916kB slab_unreclaimable:22880kB

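vm.min_free_kbytes is 32496 on this host. From this parameter linux computes the (low, high) watermarks of every zone on every node. When free memory drops below low, kswapd starts reclaiming; when it drops below min, the kernel reclaims directly.

The difference is that kswapd reclaims in the background while direct reclaim runs in the foreground, so the system may stall when free memory is below min.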
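Applied to the four zone lines above, the rule gives a quick health check per zone. This is a sketch under the simplified rule stated in this section (below low: kswapd; below min: direct reclaim); the parsing regex and the `zone_state` helper are assumptions for illustration.

```python
import re

ZONE_LINES = [
    "Node 0 DMA free:15920kB min:4kB low:4kB high:4kB",
    "Node 0 DMA32 free:120772kB min:1612kB low:2012kB high:2416kB",
    "Node 0 Normal free:36276kB min:14608kB low:18260kB high:21912kB",
    "Node 1 Normal free:19976kB min:16260kB low:20324kB high:24388kB",
]


def zone_state(line: str):
    """Return (zone name, state) using the free/min/low fields of one line."""
    name = line.split(" free:")[0]
    kb = {k: int(v) for k, v in re.findall(r"\b(free|min|low|high):(\d+)kB", line)}
    if kb["free"] < kb["min"]:
        state = "direct reclaim"
    elif kb["free"] < kb["low"]:
        state = "kswapd reclaim"
    else:
        state = "ok"
    return name, state


for line in ZONE_LINES:
    print(zone_state(line))
```

With these numbers only the Normal zone of Node 1 is below its low watermark, so kswapd would already be reclaiming there.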

HighMem / LowMem

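See here ⁹

The highmem concept is only needed when physical memory exceeds the range of the kernel's address space.
On x86, linux by default splits a process's 4G virtual address space 3:1: 0-3G is user space, mapped through page tables, and 3G-4G is kernel space, linearly mapped at the top of the address space. In other words, an x86 machine needs highmem once its physical memory exceeds roughly 1G (slightly less in practice, because part of the kernel range is reserved for other mappings).
The kernel cannot directly access physical memory above that limit, because it cannot be mapped into the kernel's address space. To access it, the kernel temporarily maps the high physical memory into an address range it can reach.
Once lowmem is full, oom can occur even if a lot of physical memory remains. On linux2.6, the kernel then kills a process chosen by score (the oom topic is not expanded here).
On x86_64, the address space available to the kernel is far larger than physical memory, so the highmem problem above does not arise and system memory can be regarded as all lowmem.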

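Buddy system

Linux uses a memory allocation algorithm called the buddy system. All free pages (4K each) are linked into an array of 11 elements; each element is a list of blocks of contiguous pages of one size, and the block sizes are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pages. The largest contiguous allocation in one request is therefore 1024 contiguous 4k pages, i.e. 4MB.

Suppose you request a block of 256 pages. The system first looks at the 9th list in the array (the one for 256-page blocks). If that list is empty it moves on to the 512-page list; if a block is found there, its 512 pages are split into two halves of 256, one handed to the requester and the other put on the 256-page list. If the 512-page list is empty as well, the search continues with the 1024-page list, and if that is empty too an error is returned. When a block is freed and its adjacent buddy is also free, the two are merged into one larger block.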
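The search-and-split walk described above can be sketched in a few lines. This toy model tracks only how many free blocks exist per order, not their addresses, so it illustrates the splitting logic and nothing more; `buddy_alloc` is a made-up name.

```python
MAX_ORDER = 11  # orders 0..10, i.e. blocks of 1..1024 pages


def buddy_alloc(free_counts, order):
    """Take one block of 2**order pages from free_counts, where
    free_counts[i] is the number of free blocks of order i.
    Returns True on success, splitting a larger block when needed."""
    for cur in range(order, MAX_ORDER):
        if free_counts[cur] > 0:
            free_counts[cur] -= 1
            # Split downwards: each split leaves one free buddy one order lower.
            while cur > order:
                cur -= 1
                free_counts[cur] += 1
            return True
    return False  # no block of this order or larger is free


free = [0] * MAX_ORDER
free[10] = 1                 # a single free 1024-page block
print(buddy_alloc(free, 8))  # request 256 pages (order 8)
print(free)
```

After the call, the 1024-page block has been split into a free 512-page block, a free 256-page block, and the 256-page block that was handed out.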

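Allocation algorithm

When pages are requested and cannot be obtained from the buddy free lists, the allocator enters the slow path. It first tries again against the low watermark; if that fails, memory is slightly short, so the page allocator wakes the kswapd thread to reclaim pages asynchronously and then retries against the min watermark. If that fails too, memory is seriously short: asynchronous compaction is run first; if pages still cannot be allocated, direct reclaim is performed; if the reclaimed pages still do not satisfy the request, direct compaction follows; and if direct reclaim did not recover a single page, the oom killer is invoked to free memory.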

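OOM Killer mechanism

See here ⁹

Linux allows memory overcommit: whoever asks for memory gets it, in the hope that processes will not actually use all they asked for. But what if they do? Then something like a bank run happens: there is not enough cash (memory).

Linux has an OOM killer mechanism ( OOM = out-of-memory ) for this kind of crisis: it picks a process and kills it to free some memory, and keeps killing if that is not enough.
Alternatively, the kernel parameter vm.panic_on_oom can make the system restart automatically when OOM occurs. Both mechanisms carry risk: a restart can interrupt service, and so can killing a process.
For this reason, Linux 2.6 and later allow memory overcommit to be disabled through the kernel parameter vm.overcommit_memory.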

overcommit

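The kernel parameter vm.overcommit_memory accepts three values:

0 – Heuristic overcommit handling
This is the default. Overcommit is allowed, but blatant overcommit is refused, for example a single malloc larger than the system's total memory. Heuristic means that the kernel uses an algorithm to guess whether a request is reasonable and refuses it if it judges that it is not.
A single request may not exceed free memory + free swap + page cache + the reclaimable part of SLAB; otherwise it fails.
1 – Always overcommit
Overcommit is allowed and no request is turned away; the kernel performs no overcommit handling. This setting raises the chance of overloading memory, but it can also improve the performance of memory-intensive tasks.
2 – Don’t overcommit
Overcommit is forbidden. The kernel refuses requests that would bring the total to or above the sum of total swap and the share of physical RAM given by overcommit_ratio. This is the best setting if you want to reduce the risk of memory overcommit.
What counts as overcommit, or in other words, as reaching the sum of total swap and the share of physical RAM given by overcommit_ratio?
The kernel keeps a threshold, and the total requested memory exceeding it counts as overcommit. The threshold is shown in /proc/meminfo: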

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB

CommitLimit is the overcommit threshold: if the total requested memory exceeds CommitLimit, that is overcommit. It is computed as:

CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap

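vm.overcommit_ratio is a kernel parameter whose default is 50, meaning 50% of physical memory. If you would rather not use a ratio, you can give the amount directly, in kilobytes, through another kernel parameter, vm.overcommit_kbytes.

If huge pages are in use, they must be subtracted from physical memory, and the formula becomes:
CommitLimit = ([total RAM] - [total huge TLB RAM]) * vm.overcommit_ratio / 100 + swap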

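Committed_AS in /proc/meminfo is the total memory that all processes have requested (requested, not actually allocated). If Committed_AS exceeds CommitLimit, overcommit has occurred, and the larger the excess the more severe it is. Put another way, Committed_AS is how much physical memory would be needed to guarantee that OOM (out of memory) never happens.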
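Both formulas and the comparison against Committed_AS fit in a short sketch. The RAM and swap figures in the example call are invented for illustration; only the two /proc/meminfo values come from the sample above, and the function names are hypothetical.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """CommitLimit = (RAM - huge TLB RAM) * overcommit_ratio / 100 + swap."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


def is_overcommitted(committed_as_kb, limit_kb):
    """Overcommit means Committed_AS has gone past CommitLimit."""
    return committed_as_kb > limit_kb


# Hypothetical machine: 8 GiB RAM, 2 GiB swap, default ratio of 50.
print(commit_limit_kb(8 * 1024 * 1024, 2 * 1024 * 1024))

# Values from the /proc/meminfo sample above.
print(is_overcommitted(5363236, 5967744))
```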

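sar -r is a common tool for checking memory usage. Two fields of its output relate to overcommit, kbcommit and %commit:
kbcommit corresponds to Committed_AS in /proc/meminfo;
%commit does not use CommitLimit as the denominator. It is Committed_AS / ( MemTotal + SwapTotal ), i.e. requested memory as a percentage of physical memory plus swap.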

% sar -r 
05:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:10:01 PM    160576   3648460     95.78         0   1846212   4939368     62.74   1390292   1854880         4

panic on OOM

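This parameter (vm.panic_on_oom) decides what the system does when oom occurs.

It accepts three values:

0 - default
When oom occurs, the oom killer is triggered.
1
If the OOM is caused by a cpuset, memory policy, or memcg constraint, the kernel may skip the panic and start the OOM killer instead. In all other cases a kernel panic is triggered, i.e. the system restarts.
2
When oom occurs, a kernel panic is triggered unconditionally, i.e. the system restarts.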
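The three settings amount to a small decision table. The sketch below models it as described in this section; `oom_action` and its `constrained` argument (True when the OOM comes from a cpuset, memory policy, or memcg limit) are illustrative names, not kernel symbols.

```python
def oom_action(panic_on_oom: int, constrained: bool) -> str:
    """Return "panic" or "oom_killer" for a given vm.panic_on_oom value."""
    if panic_on_oom == 2:
        return "panic"            # always panic
    if panic_on_oom == 1 and not constrained:
        return "panic"            # system-wide OOM panics
    return "oom_killer"           # value 0, or value 1 with a constrained OOM


for value in (0, 1, 2):
    print(value, oom_action(value, constrained=False), oom_action(value, constrained=True))
```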

kill

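Strictly speaking, these parameters are per process, so they live under /proc/xxx/ ( xxx is the process ID ). Suppose we choose to kill a process when OOM occurs; the obvious question is which one. The kernel's algorithm is quite simple: score every process ( oom_score, which is read only ) and pick the one with the highest score. How is the score computed? See the kernel function oom_badness: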

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, 
              const nodemask_t *nodemask, unsigned long totalpages) 
{……
    adj = (long)p->signal->oom_score_adj; 
    if (adj == OOM_SCORE_ADJ_MIN) {	/* (1) */
        task_unlock(p); 
        return 0;	/* (2) */
    }
    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + 
        atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);	/* (3) */
    task_unlock(p);

    if (has_capability_noaudit(p, CAP_SYS_ADMIN))	/* (4) */
        points -= (points * 3) / 100;
    adj *= totalpages / 1000;	/* (5) */
    points += adj;  

    return points > 0 ? points : 1; 
}

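(1) The score of a task ( oom_score ) has two parts. One is the system's score, based mainly on the task's memory usage. The other is the user's score, oom_score_adj. The final score combines the two. If the user sets a task's oom_score_adj to OOM_SCORE_ADJ_MIN ( -1000 ), the OOM killer is in effect forbidden from killing that process.
(2) Returning 0 here tells the OOM killer that this is a good process that must not be killed. As shown further down, the lowest score an ordinary calculation can yield is 1.
(3) As mentioned, the system's score reflects physical memory consumption. It has three parts: RSS, the memory held on a swap file or swap device, and the memory used by page tables.
(4) A root process gets a 3% allowance on memory use, so that amount is subtracted here.
(5) How can the user adjust oom_score? oom_score_adj ranges from -1000 to 1000. 0 means no adjustment; a negative value applies a discount to the computed score; a positive value penalises the task, i.e. raises its oom_score. The adjustment is scaled by the memory available to this allocation (all usable memory in the system when there is no allocation constraint, or the actual quota of the cpuset when cpusets are in use). oom_badness takes a parameter totalpages, which is that upper bound at the time. The actual score ( points ) is adjusted according to oom_score_adj: for example, with oom_score_adj set to -500 the score gets a 50% discount with totalpages as the base, that is, half of the allocatable upper bound is subtracted from the memory the task actually uses.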
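The five notes can be followed step by step in a userspace model of the function. This is a simplified sketch of the excerpt above, not the kernel's code: the page-table counters are folded into one argument, and all input numbers are invented.

```python
OOM_SCORE_ADJ_MIN = -1000


def oom_badness(rss, swapents, pagetable_pages, totalpages,
                oom_score_adj=0, is_root=False):
    """Model of the scoring above; all memory quantities are in pages."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0                                    # (1)(2): never kill this task
    points = rss + swapents + pagetable_pages       # (3): memory footprint
    if is_root:
        points -= (points * 3) // 100               # (4): 3% allowance for root
    points += oom_score_adj * (totalpages // 1000)  # (5): user adjustment
    return points if points > 0 else 1


TOTAL = 1_000_000  # hypothetical allocatable pages
print(oom_badness(100_000, 0, 0, TOTAL))
print(oom_badness(100_000, 0, 0, TOTAL, is_root=True))
print(oom_badness(100_000, 0, 0, TOTAL, oom_score_adj=-500))
print(oom_badness(100_000, 0, 0, TOTAL, oom_score_adj=-1000))
```

With oom_score_adj = -500 the adjustment removes half of totalpages, which exceeds this task's footprint, so the score bottoms out at 1 rather than 0; only OOM_SCORE_ADJ_MIN yields 0.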

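With oom_score_adj and oom_score understood, the picture is complete. oom_adj is an older interface with a similar function to oom_score_adj. It is kept for compatibility, and when it is written the kernel converts the value to oom_score_adj. Interested readers can look into it themselves; it is not covered further here.

Any process spawned by an adjusted process inherits that process's oom_score_adj. For example, if the sshd process is exempt from the oom_killer, every process started from an SSH session is exempt as well. This can reduce the oom_killer's ability to rescue the system when OOM occurs.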