In Linux, the sparse memory model (SPARSEMEM) means the physical address space consists of a number of address ranges, with holes allowed between them. It suits systems whose populated address ranges are sparse, and it is the model used to support memory hotplug.
SPARSEMEM divides the whole physical address space into sections. Each section is described by a struct mem_section, whose most important field is section_mem_map: it encodes the NUMA node id (during early boot), the address of the section's page-descriptor array, and a few flag bits.
Declaration of the struct mem_section array:
/*
* Permanent SPARSEMEM data:
*
* 1) mem_section - memory sections, mem_map's for valid memory
*/
#ifdef CONFIG_SPARSEMEM_EXTREME
struct mem_section **mem_section;
#else
struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
	____cacheline_internodealigned_in_smp;
#endif
EXPORT_SYMBOL(mem_section);
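With CONFIG_SPARSEMEM_EXTREME the array is two-level: a dynamically allocated array of root pointers, each root covering SECTIONS_PER_ROOT sections, so fully absent ranges cost only a NULL root pointer. A minimal sketch of the lookup that __nr_to_section() performs (the SECTIONS_PER_ROOT value here is hypothetical; the real one depends on PAGE_SIZE and sizeof(struct mem_section)):

```c
#include <stddef.h>

/* hypothetical value for illustration; under SPARSEMEM_EXTREME the
 * kernel derives it as PAGE_SIZE / sizeof(struct mem_section) */
#define SECTIONS_PER_ROOT	256
#define SECTION_NR_TO_ROOT(nr)	((nr) / SECTIONS_PER_ROOT)
#define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)

struct mem_section { unsigned long section_mem_map; };

/* mirrors the idea of __nr_to_section(): the first index selects a
 * root, the masked remainder selects the slot within that root */
static struct mem_section *nr_to_section(struct mem_section **roots,
					 unsigned long nr)
{
	if (!roots[SECTION_NR_TO_ROOT(nr)])
		return NULL;	/* hole: no memory in this range */
	return &roots[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}
```

This is why the WARNING in the struct below insists that sizeof(struct mem_section) be a power of two: only then does the masked-remainder indexing with SECTION_ROOT_MASK stay valid.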
Definition of struct mem_section:
struct mem_section {
	/*
	 * This is, logically, a pointer to an array of struct
	 * pages.  However, it is stored with some other magic.
	 * (see sparse.c::sparse_init_one_section())
	 *
	 * Additionally during early boot we encode node id of
	 * the location of the section here to guide allocation.
	 * (see sparse.c::memory_present())
	 *
	 * Making it a UL at least makes someone do a cast
	 * before using it wrong.
	 */
	unsigned long section_mem_map;

	/* See declaration of similar field in struct zone */
	unsigned long *pageblock_flags;
#ifdef CONFIG_PAGE_EXTENSION
	/*
	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
	 * section. (see page_ext.h about this.)
	 */
	struct page_ext *page_ext;
	unsigned long pad;
#endif
	/*
	 * WARNING: mem_section must be a power-of-2 in size for the
	 * calculation and use of SECTION_ROOT_MASK to make sense.
	 */
};
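The "magic" the comment alludes to can be sketched like this: the stored value is the mem_map pointer biased down by the section's first pfn, with flag bits OR-ed into the low bits. The flag values below are simplified assumptions for illustration; the real encoding lives in sparse_encode_mem_map() and __section_mem_map_addr() in mm/sparse.c:

```c
#include <stddef.h>

/* simplified flag layout, for illustration only */
#define SECTION_MARKED_PRESENT	(1UL << 0)
#define SECTION_HAS_MEM_MAP	(1UL << 1)
#define SECTION_MAP_LAST_BIT	(1UL << 2)
#define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT - 1))

struct page { unsigned long flags; };

/* bias the pointer by the section's first pfn, so that after
 * decoding, "base + pfn" yields the page descriptor directly */
static unsigned long encode_mem_map(struct page *mem_map,
				    unsigned long start_pfn)
{
	return (unsigned long)(mem_map - start_pfn);
}

/* strip the flag bits to recover the biased base pointer */
static struct page *decode_mem_map(unsigned long section_mem_map)
{
	return (struct page *)(section_mem_map & SECTION_MAP_MASK);
}
```

The bias is what lets pfn_to_page() avoid subtracting the section's start pfn on every lookup; the flag bits fit in the low bits because the biased pointer keeps the alignment of struct page.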
The excerpt below (kernel 4.19) initializes every valid section belonging to one node in the struct mem_section array. sparse_buffer_init() preallocates the node's struct page array space; pnum is the index into the struct mem_section array flattened to one dimension, so it identifies one section and, equivalently, one struct page sub-array; sparse_mem_map_populate() returns the start address of the struct page sub-array for pnum; and sparse_init_one_section() sets the section_mem_map value in the mem_section corresponding to pnum.
/*
 * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
 * And number of present sections in this node is map_count.
 */
static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
				   unsigned long pnum_end,
				   unsigned long map_count)
{
	unsigned long pnum, usemap_longs, *usemap;
	struct page *map;
	...
	sparse_buffer_init(map_count * section_map_size(), nid);
	for_each_present_section_nr(pnum_begin, pnum) {
		if (pnum >= pnum_end)
			break;
		map = sparse_mem_map_populate(pnum, nid, NULL);
		...
		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
		...
	}
	sparse_buffer_fini();
	return;
	...
}
If every section of a node is valid, the section_mem_map values of all its sections should be equal. sparse_buffer_init() has already preallocated the node's entire struct page array as one contiguous block, and for each section the loop takes the start address of the struct page sub-array for pnum, lightly transforms it, and stores it into section_mem_map. That transformation is exactly what guarantees the equality: section_mem_map = (map - section_nr_to_pfn(pnum)). Because map and pnum advance in lockstep (each successive section adds PAGES_PER_SECTION both to the start pfn and to the offset into the struct page array), this difference is the same for every (map, pnum) pair.
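The constancy claim is easy to check with a toy computation. Assume a contiguous node memmap whose first section is pnum_begin; PAGES_PER_SECTION is shrunk to a tiny illustrative value so the arithmetic stays inside a small array:

```c
#include <stddef.h>

#define PAGES_PER_SECTION 4UL	/* tiny illustrative value */

struct page { unsigned long flags; };

/* computes (map - section_nr_to_pfn(pnum)) for the pnum-th section of
 * a node whose memmap is one contiguous array starting at pnum_begin */
static unsigned long biased_base(struct page *node_mem_map,
				 unsigned long pnum_begin,
				 unsigned long pnum)
{
	/* the section's struct page sub-array starts here... */
	struct page *map = node_mem_map +
			   (pnum - pnum_begin) * PAGES_PER_SECTION;
	/* ...and its first pfn is this (stands in for section_nr_to_pfn) */
	unsigned long start_pfn = pnum * PAGES_PER_SECTION;
	return (unsigned long)(map - start_pfn);
}
```

Both map and start_pfn grow by PAGES_PER_SECTION per section, so biased_base() returns the same value for every present pnum of the node.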
In practice, however, the last section's section_mem_map often turned out different. Debugging showed that sparse_buffer_init() allocates its buffer with only page-size alignment, while the allocation in sparse_mem_map_populate() is aligned to sizeof(struct page) * PAGES_PER_SECTION (PAGES_PER_SECTION being the number of pages in one section). The latter is generally larger than the former, so by the time the last section is processed the preallocated struct page array no longer has enough room; the last section falls back to a separate allocation, and the part of the preallocated space lost to alignment is wasted.
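A back-of-the-envelope sketch of the waste, using hypothetical sizes: the buffer from sparse_buffer_init() is only PAGE_SIZE-aligned, so rounding its start up to the section_map_size() boundary can consume almost one full section-map worth of the preallocation:

```c
#define PAGE_SIZE	 4096UL
#define SECTION_MAP_SIZE (512UL * 1024)	/* hypothetical section_map_size() */

/* bytes lost at the front of a PAGE_SIZE-aligned buffer when its
 * first chunk must be rounded up to a SECTION_MAP_SIZE boundary */
static unsigned long leading_waste(unsigned long buf_start)
{
	unsigned long aligned = (buf_start + SECTION_MAP_SIZE - 1) &
				~(SECTION_MAP_SIZE - 1);
	return aligned - buf_start;
}
```

With these numbers, a buffer starting one page past a section-map boundary loses 0x7f000 bytes up front, which is why the last of the node's sections finds too little space left and needs its own allocation.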
A look at the latest kernel code shows this problem was fixed by commit 09dbcf422e9b791d2d43cad8c283d9bdaef019a9:
commit 09dbcf422e9b791d2d43cad8c283d9bdaef019a9
Author: Michal Hocko <mhocko@suse.com>
Date: Sat Nov 30 17:54:27 2019 -0800
mm/sparse.c: do not waste pre allocated memmap space
Vincent has noticed [1] that there is something unusual with the memmap
allocations going on on his platform
: I noticed this because on my ARM64 platform, with 1 GiB of memory the
: first [and only] section is allocated from the zeroing path while with
: 2 GiB of memory the first 1 GiB section is allocated from the
: non-zeroing path.
The underlying problem is that although sparse_buffer_init allocates
enough memory for all sections on the node sparse_buffer_alloc is not
able to consume them due to mismatch in the expected allocation
alignement. While sparse_buffer_init preallocation uses the PAGE_SIZE
alignment the real memmap has to be aligned to section_map_size() this
results in a wasted initial chunk of the preallocated memmap and
unnecessary fallback allocation for a section.
While we are at it also change __populate_section_memmap to align to the
requested size because at least VMEMMAP has constrains to have memmap
properly aligned.
[1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
[akpm@linux-foundation.org: tweak layout, per David]
Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Debugged-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>