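Contents
- 1. Introduction to memory compaction
- 2. Memory compaction call flow
- 3. Memory compaction source code analysis
- 3.1 Key data structures for memory compaction
- 3.2 Members of struct zone related to memory compaction
- 3.3 pageblock, the basic unit for scanning a zone during compaction
- 3.4 Fragmentation index
- 3.5 The __alloc_pages_direct_compact function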
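1. Introduction to memory compaction
Memory fragmentation is a long-standing problem in Linux memory management. Severe fragmentation can badly weaken the kernel's ability to allocate large blocks of contiguous memory. The free command often shows plenty of free memory, yet a request for a large contiguous block still fails, and fragmentation is the most likely cause.
Linux offers several ways to reduce fragmentation, and memory compaction is one of the most effective.
Memory compaction moves the contents of in-use movable (MIGRATE_MOVABLE) pages into other pages in order to obtain larger contiguous free blocks.
Memory fragmentation comes in two kinds:
- Internal fragmentation: wasted space inside a page.
- External fragmentation: gaps between pages, so that enough memory is free in total but a contiguous block cannot be allocated.
For physical pages, the kernel describes how a page may be moved with its migrate type:
- MIGRATE_UNMOVABLE: unmovable pages, mostly pages allocated by the kernel itself. Many kernel pages are linearly mapped, that is, their virtual and physical addresses differ by a fixed offset, so their contents cannot be moved to another free physical page.
- MIGRATE_MOVABLE: movable pages, whose contents can be migrated to another physical page; mostly pages allocated for user space.
- MIGRATE_RECLAIMABLE: reclaimable pages, which cannot be migrated but can be reclaimed.
How does Linux implement memory compaction?
For each zone, Linux maintains two scanners: a free-page scanner and a migration scanner. Both scan the zone in units of one pageblock (a block of order pageblock_order, which is MAX_ORDER-1 when huge pages are not configured). The migration scanner starts at the low end of the zone and moves toward higher addresses, while the free-page scanner starts at the high end and moves toward lower addresses. While scanning, the migration scanner collects in-use movable pages on its private list, and the free-page scanner collects free pages on its private list. When the two scanners meet somewhere in the middle of the zone, the scan is finished.
The pages on the migration scanner's list are migrated, one by one, into the pages on the free scanner's list (page migration). In-use pages thus gather at the high end of the zone and a contiguous run of free physical memory forms at the low end, which reduces the zone's fragmentation.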
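The two-scanner idea can be tried out in user space. The following is a minimal sketch, not kernel code: the `toy_` names and the page-state enum are made up for illustration, a "page" is just an array slot, and the migration scanner walks up from the start of the zone while the free scanner walks down from the end (the directions stated in the kernel's compact_control comment).

```c
#include <stddef.h>

/* Hypothetical page states for a toy zone; these are not kernel types. */
enum toy_page { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/* Length of the longest run of free pages, a rough fragmentation measure. */
size_t toy_longest_free_run(const enum toy_page *zone, size_t nr_pages)
{
	size_t best = 0, run = 0, i;

	for (i = 0; i < nr_pages; i++) {
		run = (zone[i] == TOY_FREE) ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}

/*
 * Compact a toy zone in place and return the number of pages "migrated".
 * migrate_pfn walks up looking for movable pages; free_pfn walks down
 * looking for free pages; the loop stops when the two scanners meet.
 */
size_t toy_compact_zone(enum toy_page *zone, size_t nr_pages)
{
	size_t migrate_pfn = 0;
	size_t free_pfn = nr_pages;	/* one past the free scanner's candidate */
	size_t moved = 0;

	while (migrate_pfn < free_pfn) {
		if (zone[migrate_pfn] != TOY_MOVABLE) {
			migrate_pfn++;
			continue;
		}
		/* Free scanner: find a free page strictly above migrate_pfn. */
		while (free_pfn > migrate_pfn + 1 && zone[free_pfn - 1] != TOY_FREE)
			free_pfn--;
		if (free_pfn <= migrate_pfn + 1)
			break;	/* scanners met, nothing left to do */

		/* "Migrate": the movable page now lives in the former free slot. */
		zone[free_pfn - 1] = TOY_MOVABLE;
		zone[migrate_pfn] = TOY_FREE;
		free_pfn--;
		migrate_pfn++;
		moved++;
	}
	return moved;
}
```

For a zone laid out as movable, free, movable, free, movable, free, unmovable, free, the longest free run is one page before compaction and four pages afterwards; the unmovable page stays where it is, which is exactly why unmovable pages limit what compaction can achieve.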
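Page migration moves an in-use movable page onto a free page. Linux implements it in __unmap_and_move; the rough idea is as follows:
Page migration
Steps:
1. Use reverse mapping to find every virtual address space that maps the page being migrated.
2. Call try_to_unmap to remove the mappings between the page and those virtual address spaces.
3. Copy the page contents and metadata into the free page, then point the affected virtual addresses at the new physical page (updating each PTE).
Where it is used:
1. Copy-on-write after a child process is forked
2. KSM pages
3. Migration caused by CMA
4. Migration caused by transparent huge pages (THP)
5. Migration caused by NUMA balancing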
2. Memory compaction call flow
Common conditions that trigger memory compaction in the Linux kernel:
- Direct compaction. When the buddy allocator enters the slow path (__alloc_pages_slowpath()) and memory is still tight after lowering the watermark and waking kswapd for reclaim, the allocator triggers memory compaction.
__alloc_pages_slowpath() ----------> __alloc_pages_direct_compact() ----------> try_to_compact_pages() ----------> compact_zone_order() ----------> compact_zone()
- Memory is tight and the kcompactd thread is woken up to run compaction.
kcompactd_do_work() ----------> compact_zone()
- Manual trigger.
echo 1 > /sys/devices/system/node/nodexx/compact
sysfs_compact_node() ----------> compact_node() ----------> compact_zone()
echo 1 > /proc/sys/vm/compact_memory
sysctl_compaction_handler() ----------> compact_nodes() ----------> compact_zone()
All of these paths have one thing in common: they end in compact_zone(), which compacts one particular zone.
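The manual trigger can also be driven from a program rather than from the shell. The sketch below is only an illustration: `trigger_compaction` is a hypothetical helper name, and it simply writes "1" to whatever path it is given. Pointing it at /proc/sys/vm/compact_memory requires root and a kernel built with CONFIG_COMPACTION.

```c
#include <stdio.h>

/*
 * Write "1" to a compaction trigger file.
 * Returns 0 on success and -1 on any error (file missing, no permission).
 * The path is passed in so the helper can be exercised on an ordinary file.
 */
int trigger_compaction(const char *path)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a failed write shows up here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would use `trigger_compaction("/proc/sys/vm/compact_memory")` to compact all zones; the per-node sysfs file shown above can be used in the same way for a single node.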
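Put another way, memory compaction is invoked in four situations:
1. The kernel tries to get contiguous page frames from the buddy system at the min watermark, but not enough contiguous frames are available.
2. Contiguous page frames are requested from a specific range, but some frames inside that range are in use.
3. kswapd is woken because memory is short; compaction runs after reclaim (newer kernels hand this work to kcompactd).
4. 1 is written to /proc/sys/vm/compact_memory, which makes the system compact every zone.
3. Memory compaction source code analysis
Because compaction has many triggers, this section only follows the path where the buddy allocator triggers compaction because memory is tight.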
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
...
enum compact_priority compact_priority;
enum compact_result compact_result;
...
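/*
 * The allocation attempts above failed, possibly because memory is badly
 * fragmented, so call __alloc_pages_direct_compact to start compaction.
 * Conditions:
 * (1) the request allows direct reclaim (gfp_mask & __GFP_DIRECT_RECLAIM is set)
 * (2) the order is greater than PAGE_ALLOC_COSTLY_ORDER (3 by default);
 *     low-order failures are usually unrelated to fragmentation
 * (3) the request must not be entitled to the pfmemalloc reserves (PF_MEMALLOC
 *     lets a task ignore the watermarks when it allocates pages)
 */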
if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER &&
!gfp_pfmemalloc_allowed(gfp_mask)) {
page = __alloc_pages_direct_compact(gfp_mask, order,
alloc_flags, ac,
INIT_COMPACT_PRIORITY,
&compact_result);
if (page)
goto got_pg;
/*
* Checks for costly allocations with __GFP_NORETRY, which
* includes THP page fault allocations
*/
if (gfp_mask & __GFP_NORETRY) {
/*
* If compaction is deferred for high-order allocations,
* it is because sync compaction recently failed. If
* this is the case and the caller requested a THP
* allocation, we do not want to heavily disrupt the
* system, so we fail the allocation instead of entering
* direct reclaim.
*/
if (compact_result == COMPACT_DEFERRED)
goto nopage;
/*
* Looks like reclaim/compaction is worth trying, but
* sync compaction could be very expensive, so keep
* using async compaction.
*/
compact_priority = INIT_COMPACT_PRIORITY;
}
}
...
retry:
...
//second compaction attempt
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
compact_priority, &compact_result);
if (page)
goto got_pg;
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;
...
/* check whether it is worth retrying compaction */
if (did_some_progress > 0 &&
should_compact_retry(ac, order, alloc_flags,
compact_result, &compact_priority,
&compaction_retries))
goto retry;
...
nopage:
/*
 * If a cpuset update raced with this allocation, jump to retry_cpuset and
 * run through the allocation again
 */
if (read_mems_allowed_retry(cpuset_mems_cookie))
goto retry_cpuset;
warn_alloc(gfp_mask,
"page allocation failure: order:%u", order);
got_pg:
return page;
}
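If the slow path still cannot obtain a block of the requested order after all these adjustments, severe fragmentation may be the reason, so __alloc_pages_direct_compact() is called to compact the candidate zones and reduce fragmentation.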
3.1 Key data structures for memory compaction
Reference: https://www.cnblogs.com/tolimit/p/5286663.html
Before reading the source of __alloc_pages_direct_compact(), it helps to know a few data structures used by compaction.
- enum compact_priority: describes the compaction mode (priority)
/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogically to reclaim priority.
 * 1. Priority: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC
 * 2. Cost of compaction: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC
 * 3. COMPACT_PRIO_SYNC_FULL (fully synchronous) has the highest success rate
 */
enum compact_priority {
	//the whole compaction is synchronous (may block, may write dirty pages back to storage and wait for completion)
	COMPACT_PRIO_SYNC_FULL,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
	//light sync mode: most blocking is allowed, but dirty pages are not written back because the wait is too long
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	//the whole compaction is asynchronous and never blocks
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};
- enum compact_result: describes the state in which a compaction run ended
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
/*
 * 1. COMPACT_SKIPPED: the zone was skipped, probably because it is not suitable
 * 2. COMPACT_DEFERRED: compaction of the zone did not start because it failed recently
 * 3. COMPACT_CONTINUE: keep compacting
 * 4. COMPACT_COMPLETE: the whole zone was scanned but no suitable page was produced
 * 5. COMPACT_PARTIAL_SKIPPED: part of the zone was scanned but no suitable page was found
 * 6. COMPACT_SUCCESS: compaction succeeded and a suitable free block was formed
 */
enum compact_result {
	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NOT_SUITABLE_ZONE,
	/*
	 * compaction didn't start as it was not possible or direct reclaim
	 * was more suitable
	 */
	COMPACT_SKIPPED,
	/* compaction didn't start as it was deferred due to past failures */
	COMPACT_DEFERRED,
	/* compaction not active last round */
	COMPACT_INACTIVE = COMPACT_DEFERRED,
	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NO_SUITABLE_PAGE,
	/* compaction should continue to another pageblock */
	COMPACT_CONTINUE,
	/*
	 * The full zone was compacted scanned but wasn't successfull to compact
	 * suitable pages.
	 */
	COMPACT_COMPLETE,
	/*
	 * direct compaction has scanned part of the zone but wasn't successfull
	 * to compact suitable pages.
	 */
	COMPACT_PARTIAL_SKIPPED,
	/* compaction terminated prematurely due to lock contentions */
	COMPACT_CONTENDED,
	/*
	 * direct compaction terminated after concluding that the allocation
	 * should now succeed
	 */
	COMPACT_SUCCESS,
};
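- enum migrate_mode: describes the mode used for page migration
/*
 * MIGRATE_ASYNC means never block
 * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
 *	on most operations but not ->writepage as the potential stall time
 *	is too significant
 * MIGRATE_SYNC will block when migrating pages
 * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
 *	with the CPU. Instead, page copy happens outside the migratepage()
 *	callback and is likely using a DMA engine. See migrate_vma() and HMM
 *	(mm/hmm.c) for users of this mode.
 */
enum migrate_mode {
	/*
	 * The mode compaction uses most (and the initial default). It never
	 * blocks (although the task may still be rescheduled when its time
	 * slice expires), which means file pages (pages that map file data)
	 * are not handled. It puts the least pressure on the system.
	 */
	MIGRATE_ASYNC,
	/*
	 * When async mode cannot compact any more memory, light sync mode is
	 * tried in two cases: 1. the allocation is explicitly not a transparent
	 * huge page; 2. the current task is a kernel thread. Most operations may
	 * block (for example waiting a while because too many pages are
	 * isolated). Both anonymous and file pages are handled, but dirty file
	 * pages are not written back, and a page that is currently under
	 * writeback is not waited for.
	 */
	MIGRATE_SYNC_LIGHT,
	/*
	 * Every operation may block. Pages under writeback are waited for, and
	 * file and anonymous pages are written back to disk, so this mode costs
	 * the most and puts the most pressure on the system. It is used in
	 * three cases:
	 * 1. allocating memory from CMA;
	 * 2. alloc_contig_range() trying to allocate contiguous page frames
	 *    between a given start and end page frame number;
	 * 3. writing 1 to /proc/sys/vm/compact_memory to compact manually.
	 * Sync mode raises the deferral threshold, and compact_control is set up
	 * so that the PB_migrate_skip mark on pageblocks is ignored.
	 */
	MIGRATE_SYNC,
	//synchronous, but the CPU does not copy the page; the copy happens outside the migratepage() callback and may use a DMA engine
	MIGRATE_SYNC_NO_COPY,
};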
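**ps:** compaction calls migrate_pages() and sets the migration mode in compact_control->mode. Compaction triggered manually through sysfs or /proc uses MIGRATE_SYNC; compaction run by kcompactd uses MIGRATE_SYNC_LIGHT; compaction triggered from the allocator slow path uses MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT depending on the compact priority.
When compaction is triggered because contiguous page frames cannot be allocated, asynchronous compaction runs first. If that still does not yield contiguous frames (which happens when many of the scattered pages are MIGRATE_RECLAIMABLE), and either gfp_mask explicitly rules out a transparent huge page allocation or the current task is a kernel thread, light synchronous compaction follows.
On kernels where kswapd runs compaction itself, kswapd only ever compacts asynchronously, never synchronously, and it skips pageblocks marked PB_migrate_skip. Compaction outside kswapd, by contrast, clears the PB_migrate_skip marks once the number of deferrals exceeds the deferral threshold, so pageblocks that were marked are scanned again.
Synchronous compaction ignores the PB_migrate_skip marks altogether and scans every pageblock in the range (except MIGRATE_UNMOVABLE pageblocks, of course).
Asynchronous mode is used most and is the fastest, because it only handles MIGRATE_MOVABLE and MIGRATE_CMA pages, does not deal with dirty pages or blocking, and returns as soon as it would have to block. Light sync mode is used when async mode cannot compact enough memory; it handles MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE and MIGRATE_CMA page frames and waits in some blocking situations (for example when the disk is busy with writeback or the page to move is under writeback), but it does not write back dirty file pages. Full sync mode is light sync plus writeback of dirty file pages.
Note that pages not backed by a file can also be treated as dirty: an anonymous page is marked dirty once it is added to the swap cache. Compaction never writes back a dirty anonymous page, though. That only happens during reclaim, when the page is written to the swap partition, and once that writeback finishes the page frame is usually about to be freed back to the buddy system as a free frame.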
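The mode selection described above can be condensed into one small function. This is a sketch under stated assumptions: the two enums are local copies listed in the same order as in the text, while `enum toy_trigger` and `toy_pick_mode` are hypothetical names that do not exist in the kernel (the kernel sets cc->mode separately in each caller).

```c
/* Local copies of the enums from the text, in the same order. */
enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC
};

enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_SYNC_NO_COPY
};

/* Hypothetical tag for who started the compaction run. */
enum toy_trigger { TOY_DIRECT, TOY_KCOMPACTD, TOY_MANUAL };

/* Pick the migration mode the way the text describes it. */
enum migrate_mode toy_pick_mode(enum toy_trigger trigger,
				enum compact_priority prio)
{
	switch (trigger) {
	case TOY_MANUAL:	/* sysfs or /proc trigger */
		return MIGRATE_SYNC;
	case TOY_KCOMPACTD:
		return MIGRATE_SYNC_LIGHT;
	default:		/* direct compaction from the slow path */
		return prio == COMPACT_PRIO_ASYNC ? MIGRATE_ASYNC
						  : MIGRATE_SYNC_LIGHT;
	}
}
```

Note that in the direct path even COMPACT_PRIO_SYNC_FULL maps to MIGRATE_SYNC_LIGHT; the extra effort of the highest priority comes from scanning the whole zone and ignoring skip hints, not from a heavier migration mode.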
- struct compact_control: the controller of a compaction run. It mainly maintains the two scanners' lists (freepages and migratepages); in the end the pages on migratepages are migrated into the pages on freepages.
/*
 * located in mm/internal.h
 * compact_control is used to track pages being migrated and the free pages
 * they are being migrated to during memory compaction. The free_pfn starts
 * at the end of a zone and migrate_pfn begins at the start. Movable pages
 * are moved to the end of a zone during a compaction run and the run
 * completes when free_pfn <= migrate_pfn
 */
struct compact_control {
	//list of the free pages found by the free scanner
	struct list_head freepages;	/* List of free pages to migrate to */
	//list of the movable pages found by the migration scanner
	struct list_head migratepages;	/* List of pages being migrated */
	//number of pages on the free list
	unsigned long nr_freepages;	/* Number of isolated free pages */
	//number of pages on the migrate list
	unsigned long nr_migratepages;	/* Number of pages to migrate */
	//page frame number where the free scanner currently is
	unsigned long free_pfn;		/* isolate_freepages search base */
	//page frame number where the migration scanner currently is
	unsigned long migrate_pfn;	/* isolate_migratepages search base */
	unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
	//migration mode: sync, light sync or async
	enum migrate_mode mode;		/* Async or sync migration mode */
	/*
	 * Whether to ignore the PB_migrate_skip mark and scan pageblocks that
	 * would otherwise be skipped (pageblocks are not marked skip either).
	 * Only used in two cases:
	 * 1. alloc_contig_range() trying to allocate contiguous page frames
	 *    between a given start and end page frame number;
	 * 2. writing 1 to /proc/sys/vm/compact_memory to compact manually.
	 */
	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
	bool ignore_block_suitable;	/* Scan blocks considered unsuitable */
	bool direct_compaction;		/* False from kcompactd or /proc/... */
	bool whole_zone;		/* Whole zone should/has been scanned */
	int order;			/* order a direct compactor needs */
	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
	const int classzone_idx;	/* zone index of a direct compactor */
	struct zone *zone;
	//result flag, e.g. whether an async run ended because it would have had to block
	bool contended;			/* Signal lock or sched contention */
};
3.2 Members of struct zone related to memory compaction
struct zone
{
......
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
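//the next two members cache the positions where the two compaction scanners should start
/*
 * pfn where compaction free scanner should start
 * Cached start position of the free-page scanner; by default the end of the zone.
 * When the free scanner finishes a pageblock without isolating any free page,
 * the cached position is moved past that pageblock so that later runs do not
 * rescan it.
 */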
unsigned long compact_cached_free_pfn;
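/*
 * pfn where async and sync compaction migration scanner should start
 * Index 0 is used by async runs and index 1 by sync runs; both cache the
 * start position of the migration scanner, by default the first page frame
 * of the zone. When the migration scanner finishes a pageblock without
 * isolating any movable page, the cached position is moved past that
 * pageblock.
 */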
unsigned long compact_cached_migrate_pfn[2];
#endif
#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
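/*
 * The next two members decide whether a compaction run on this zone should
 * be deferred. They are only consulted when order >= compact_order_failed;
 * a run with a smaller order is never deferred:
 * (1) compact_considered: number of runs considered since the last failure.
 *     It is incremented on every check; while it is below
 *     1UL << compact_defer_shift the run is skipped, and once it reaches
 *     that limit compaction is carried out.
 * (2) compact_defer_shift: exponent of the deferral limit. It is reset to 0
 *     when an allocation from the zone succeeds after compaction, never
 *     exceeds COMPACT_MAX_DEFER_SHIFT, and is only incremented when a sync
 *     or light sync run ends with the zone's free page frames still below
 *     (low watermark + (1 << order) + reserved pages).
 */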
unsigned int compact_considered;
unsigned int compact_defer_shift;
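/*
 * The lowest order at which compaction of this zone has recently failed;
 * it is also part of the deferral mechanism:
 * (1) a run whose order is below this value is allowed immediately;
 *     otherwise the deferral counter is consulted
 * (2) when compaction finishes successfully with an order >= this value, it
 *     becomes order + 1, meaning that from now on only larger orders are
 *     expected to fail and will be considered for deferral
 * (3) when compaction fails with an order below this value, it becomes that
 *     order, meaning that runs of this order may fail
 */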
int compact_order_failed;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
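/* Set to true when the PG_migrate_skip bits should be cleared
 * (1) synchronous compaction always ignores the PB_migrate_skip mark, that
 *     is, it also scans pageblocks that were marked to be skipped
 * (2) when the migration scanner and the free scanner meet,
 *     zone->compact_blockskip_flush is set; it makes kswapd clear
 *     PB_migrate_skip on every pageblock of the zone before going to sleep
 */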
bool compact_blockskip_flush;
#endif
......
}
3.3 pageblock, the basic unit for scanning a zone during compaction
A brief introduction to the pageblock, the unit in which compaction scans a zone:
- A pageblock is usually a fairly large block of contiguous memory (2^order pages). Linux describes its size with two macros:
  - pageblock_order: the order of the block
  - pageblock_nr_pages: the number of pages in the block
struct zone contains a field pageblock_flags that tracks the attributes of each region of pageblock_nr_pages pages.
struct zone {
#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
};
- During initialization the kernel makes sure that pageblock_flags has room for NR_PAGEBLOCK_BITS bits for every pageblock. Each pageblock in the zone uses its NR_PAGEBLOCK_BITS bits to record its migrate type (and the skip hint). set_pageblock_migratetype sets the migrate type of the pageblock that contains a given page. Because the bits are allocated in advance they are always available, and when a page is freed it must be returned to the free list of the right migrate type. get_pageblock_migratetype reads the migrate type for a given struct page.
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
	PB_migrate,
	PB_migrate_end = PB_migrate + 3 - 1,
			/* 3 bits required for migrate types */
	PB_migrate_skip,/* If set the block is skipped by compaction */
	/*
	 * Assume the bits will always align on a word. If this assumption
	 * changes then get/set pageblock needs updating.
	 */
	NR_PAGEBLOCK_BITS
};
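To make the per-pageblock bits concrete, here is a user-space sketch of a bitmap that stores four bits per pageblock: three for the migrate type and one for the skip hint, mirroring the enum above. Everything prefixed `toy_`/`TOY_` is made up for illustration; the pageblock order of 9 and the bit layout inside a word are assumptions, and the kernel's real accessors differ in detail.

```c
#include <limits.h>

#define TOY_PAGEBLOCK_ORDER	9	/* assumption: 512 pages per pageblock */
#define TOY_PAGEBLOCK_NR_PAGES	(1UL << TOY_PAGEBLOCK_ORDER)
#define TOY_NR_PAGEBLOCK_BITS	4UL	/* 3 migrate type bits + 1 skip bit */
#define TOY_MIGRATETYPE_MASK	0x7UL
#define TOY_SKIP_BIT		0x8UL
#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Local stand-ins for the kernel's migrate types. */
enum toy_migratetype { TOY_UNMOVABLE = 0, TOY_MOVABLE = 1, TOY_RECLAIMABLE = 2 };

/* Read the 4-bit group of the pageblock that contains pfn. */
static unsigned long toy_get_bits(const unsigned long *map, unsigned long pfn)
{
	unsigned long bit = (pfn / TOY_PAGEBLOCK_NR_PAGES) * TOY_NR_PAGEBLOCK_BITS;

	return (map[bit / TOY_BITS_PER_LONG] >> (bit % TOY_BITS_PER_LONG)) & 0xFUL;
}

/* Overwrite the 4-bit group; 4 divides the word size, so it never straddles words. */
static void toy_set_bits(unsigned long *map, unsigned long pfn, unsigned long bits)
{
	unsigned long bit = (pfn / TOY_PAGEBLOCK_NR_PAGES) * TOY_NR_PAGEBLOCK_BITS;
	unsigned long *word = &map[bit / TOY_BITS_PER_LONG];
	unsigned long shift = bit % TOY_BITS_PER_LONG;

	*word = (*word & ~(0xFUL << shift)) | ((bits & 0xFUL) << shift);
}

enum toy_migratetype toy_get_pageblock_migratetype(const unsigned long *map,
						   unsigned long pfn)
{
	return (enum toy_migratetype)(toy_get_bits(map, pfn) & TOY_MIGRATETYPE_MASK);
}

/* Change the migrate type while preserving the skip bit. */
void toy_set_pageblock_migratetype(unsigned long *map, unsigned long pfn,
				   enum toy_migratetype mt)
{
	toy_set_bits(map, pfn, (toy_get_bits(map, pfn) & TOY_SKIP_BIT) | (unsigned long)mt);
}

void toy_set_pageblock_skip(unsigned long *map, unsigned long pfn)
{
	toy_set_bits(map, pfn, toy_get_bits(map, pfn) | TOY_SKIP_BIT);
}

int toy_get_pageblock_skip(const unsigned long *map, unsigned long pfn)
{
	return (toy_get_bits(map, pfn) & TOY_SKIP_BIT) != 0;
}
```

Any page frame number inside a pageblock maps to the same four bits, which is why a single skip mark makes the scanners pass over all pages of that block.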
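3.4 Fragmentation index
A request for contiguous memory from the buddy system fails for one of two main reasons: (1) not enough memory, or (2) severe fragmentation. To tell which one applies, Linux defines the fragmentation index (fragindex), with values in [0, 1000].
- The closer fragindex is to 0, the more likely the failure is caused by lack of memory.
- The closer fragindex is to 1000, the more likely the failure is caused by external fragmentation.
Linux also has a threshold for the index, int sysctl_extfrag_threshold. It is 500 by default and can be read or changed through /proc/sys/vm/extfrag_threshold.
/ # cat /proc/sys/vm/extfrag_threshold
500
The larger extfrag_threshold is, the more allocation failures Linux attributes to lack of memory in the zone; the smaller it is, the more failures are attributed to fragmentation of the zone. If extfrag_threshold is set too low, memory compaction of the zone is triggered frequently and the system load becomes too high.
Linux obtains the fragmentation index of a zone for a given order with fragmentation_index(zone, order).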
//mm/vmstat.c
/*
* Calculate the number of free pages in a zone, how many contiguous
* pages are free and how many are large enough to satisfy an allocation of
* the target size. Note that this function makes no attempt to estimate
* how many suitable free blocks there *might* be if MOVABLE pages were
* migrated. Calculating that is possible, but expensive and can be
* figured out from userspace
*/
static void fill_contig_page_info(struct zone *zone,
unsigned int suitable_order,
struct contig_page_info *info)
{
unsigned int order;
info->free_pages = 0;
info->free_blocks_total = 0;
info->free_blocks_suitable = 0;
for (order = 0; order < MAX_ORDER; order++) {
unsigned long blocks;
/* Count number of free blocks */
//count all free blocks of the zone
blocks = zone->free_area[order].nr_free;
info->free_blocks_total += blocks;
/* Count free base pages */
//count all free pages of the zone
info->free_pages += blocks << order;
/* Count the suitable free blocks */
//count the blocks that can satisfy this allocation (blocks of order suitable_order)
if (order >= suitable_order)
info->free_blocks_suitable += blocks <<
(order - suitable_order);
}
}
/*
* A fragmentation index only makes sense if an allocation of a requested
* size would fail. If that is true, the fragmentation index indicates
* whether external fragmentation or a lack of memory was the problem.
* The value can be used to determine if page reclaim or compaction
* should be used
*/
static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
{
unsigned long requested = 1UL << order;
//no free memory at all: the allocation fails and the index is 0
if (!info->free_blocks_total)
return 0;
/* Fragmentation index only makes sense when a request would fail */
//free_blocks_suitable > 0 means the allocation can succeed
if (info->free_blocks_suitable)
return -1000;
/*
* Index is between 0 and 1 so return within 3 decimal places
*
* 0 => allocation would fail due to lack of memory
* 1 => allocation would fail due to fragmentation
*/
/*
 * a return value near 0 means the allocation fails for lack of memory
 * a return value near 1000 means the allocation fails because of fragmentation
 */
return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
}
/* Same as __fragmentation index but allocs contig_page_info on stack */
int fragmentation_index(struct zone *zone, unsigned int order)
{
struct contig_page_info info;
fill_contig_page_info(zone, order, &info);
return __fragmentation_index(order, &info);
}
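The formula is easier to follow with numbers. The sketch below re-implements the two functions above in user space; `toy_fragmentation_index` and `TOY_MAX_ORDER` are illustrative names, and the free-list counts are passed in as a plain array instead of being read from a real zone.

```c
#include <stdint.h>

#define TOY_MAX_ORDER 11	/* assumption: orders 0..10 */

/*
 * nr_free[o] is the number of free blocks of order o.
 * Returns 0 when nothing is free, -1000 when a block of the requested
 * order (or larger) exists, otherwise the index in thousandths.
 */
int toy_fragmentation_index(const unsigned long nr_free[TOY_MAX_ORDER],
			    unsigned int order)
{
	uint64_t free_pages = 0, blocks_total = 0, blocks_suitable = 0;
	uint64_t requested = 1ULL << order;
	unsigned int o;

	for (o = 0; o < TOY_MAX_ORDER; o++) {
		blocks_total += nr_free[o];
		free_pages += (uint64_t)nr_free[o] << o;
		if (o >= order)
			blocks_suitable += (uint64_t)nr_free[o] << (o - order);
	}

	if (!blocks_total)
		return 0;
	if (blocks_suitable)
		return -1000;	/* the request would succeed, index is meaningless */

	/* Same integer arithmetic as the kernel expression. */
	return 1000 - (int)((1000 + free_pages * 1000 / requested) / blocks_total);
}
```

Worked example: with 1000 free order-0 pages and an order-4 request (16 pages), the result is 1000 - (1000 + 1000 * 1000 / 16) / 1000 = 1000 - 63500 / 1000 = 937, well above the default threshold of 500, so fragmentation is to blame. With only two free order-3 blocks (16 pages in total) the result is 1000 - (1000 + 1000) / 2 = 0, which points at lack of memory instead.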
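3.5 The __alloc_pages_direct_compact function
__alloc_pages_direct_compact is the entry point of direct compaction. It first compacts memory with try_to_compact_pages to reduce fragmentation and then retries the allocation with get_page_from_freelist.
//the initial compact_priority is INIT_COMPACT_PRIORITY, i.e. asynchronous compaction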
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page;
unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
//order 0 never needs compaction
if (!order)
return NULL;
//set PF_MEMALLOC on the current task while compacting, to avoid deadlocks during page migration
current->flags |= PF_MEMALLOC;
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
/*
* At least in one zone compaction wasn't deferred or skipped, so let's
* count a compaction stall
*/
count_vm_event(COMPACTSTALL);
//try the allocation again
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
/*
 * If get_page_from_freelist now returns a contiguous block of the requested
 * order from some zone, the following is done:
 * (1) the zone's compact_blockskip_flush is set to false
 * (2) compaction_defer_reset sets compact_considered and compact_defer_shift
 *     of that zone to 0; compact_order_failed was already raised to
 *     order + 1 inside try_to_compact_pages when compaction reported success
 */
if (page) {
struct zone *zone = page_zone(page);
/*
 * While zone->compact_blockskip_flush is set, kswapd clears PB_migrate_skip
 * on every pageblock of the zone before it goes to sleep; the flag is
 * cleared here.
 */
zone->compact_blockskip_flush = false;
//reset the zone's deferral-related members
compaction_defer_reset(zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
/*
* It's bad if compaction run occurs and fails. The most likely reason
* is that pages exist, but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
//voluntarily give up the CPU if needed
cond_resched();
return NULL;
}
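try_to_compact_pages is the main function of direct compaction. It receives the allocation context ac of the buddy allocator and walks the zones in ac's zonelist. Before compacting a zone it calls compaction_deferred to check whether compaction of that zone should be deferred; if so the zone is skipped, otherwise compact_zone_order compacts it.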
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio)
{
/* the caller allows filesystem operations */
int may_enter_fs = gfp_mask & __GFP_FS;
/* the caller allows disk I/O */
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
enum compact_result rc = COMPACT_SKIPPED;
/* Check if the GFP flags allow compaction */
//skip compaction if filesystem or disk I/O is not allowed: compaction may need them, and using them against the caller's wishes could deadlock
if (!may_enter_fs || !may_perform_io)
return COMPACT_SKIPPED;
trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
/* Compact each zone in the list */
//compact every zone in ac's zonelist
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
enum compact_result status;
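/*
 * Unless the priority is full sync, check whether compaction of this zone
 * should be deferred (skip this zone and compact the others; every skip
 * increments zone->compact_considered):
 * (1) prio starts as COMPACT_PRIO_ASYNC, i.e. asynchronous compaction
 * (2) compaction_deferred() decides whether to skip the zone (see the
 *     deferral mechanism below):
 *     a. if order < zone->compact_order_failed the zone is not skipped
 *     b. otherwise zone->compact_considered is incremented and compared
 *        with 1UL << zone->compact_defer_shift; while it is smaller, the
 *        run on this zone is deferred
 *     zone->compact_considered and zone->compact_defer_shift are reset to 0
 *     only after compaction has finished and 1 << order contiguous page
 *     frames were actually obtained from this zone.
 */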
if (prio > MIN_COMPACT_PRIORITY
&& compaction_deferred(zone, order)) {
//update the overall compact_result
rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
}
//compact the current zone with compact_zone_order
status = compact_zone_order(zone, order, gfp_mask, prio,
alloc_flags, ac_classzone_idx(ac));
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
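/*
 * Compaction of this zone succeeded, so the zone is expected to be able to
 * satisfy the order-order allocation, but this has not been verified yet.
 * Only compact_order_failed is raised to order + 1 here; compact_considered
 * and compact_defer_shift are reset to 0 later, once the block has actually
 * been allocated from the zone.
 */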
if (status == COMPACT_SUCCESS) {
/*
* We think the allocation will succeed in this zone,
* but it is not certain, hence the false. The caller
* will repeat this with true if allocation indeed
* succeeds in this zone.
*/
compaction_defer_reset(zone, order, false);
break;
}
/*
 * If compaction of this zone failed:
 * 1. For non-async compaction, defer_compaction() updates the zone:
 *    a. compact_considered is set to 0
 *    b. zone->compact_defer_shift is incremented, but never beyond
 *       COMPACT_MAX_DEFER_SHIFT
 *    c. if order < zone->compact_order_failed, then
 *       zone->compact_order_failed = order
 *    The loop then goes on to the other zones in the zonelist.
 * 2. Afterwards, for any priority:
 *    a. in async mode, if need_resched() says the current task should
 *       yield, break and end the whole compaction
 *    b. if the current task has a fatal signal pending, end the whole
 *       compaction
 *    c. otherwise continue with the other zones in the zonelist
 */
if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
status == COMPACT_PARTIAL_SKIPPED))
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
* succeeding after all, it will be reset.
*/
defer_compaction(zone, order);
/*
* We might have stopped compacting due to need_resched() in
* async compaction, or due to a fatal signal detected. In that
* case do not try further zones
*/
if ((prio == COMPACT_PRIO_ASYNC && need_resched())
|| fatal_signal_pending(current))
break;
}
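return rc;
}
3.5.1 The compaction_deferred function (compaction deferral mechanism)
This function decides whether the given zone should skip the current compaction run.
Why does compaction of a zone need to be deferred?
If the last compaction of a zone failed, another attempt shortly afterwards is very likely to fail as well, which only adds load to the system. To avoid this, the memory manager defers compaction of such a zone, and struct zone defines three members to maintain the deferral mechanism (see section 3.2, Members of struct zone related to memory compaction).
Compaction calls compaction_deferred to decide whether the zone should be compacted:
/*
 * Before a zone is compacted, check whether this run should be deferred. If
 * the order of the run is lower than the zone's compact_order_failed, the run
 * is not deferred and compaction starts right away. If the order is greater
 * than or equal to compact_order_failed, the deferral counter is incremented;
 * while the counter is below the deferral threshold the run is skipped, and
 * once it reaches the threshold compaction has to be carried out.
 */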
bool compaction_deferred(struct zone *zone, int order)
{
unsigned long defer_limit = 1UL << zone->compact_defer_shift;
if (order < zone->compact_order_failed)
return false;
/* Avoid possible overflow */
if (++zone->compact_considered > defer_limit)
zone->compact_considered = defer_limit;
if (zone->compact_considered >= defer_limit)
return false;
trace_mm_compaction_deferred(zone, order);
return true;
}
- If the order of this run is less than zone->compact_order_failed, the run is not deferred.
- If the order of this run is greater than or equal to zone->compact_order_failed, the run is a candidate for deferral, and the final decision depends on compact_considered and compact_defer_shift:
  - compact_considered is first incremented (it is capped at 1UL << zone->compact_defer_shift), then:
    - if zone->compact_considered is greater than or equal to 1UL << zone->compact_defer_shift, the run is not deferred
    - if zone->compact_considered is less than 1UL << zone->compact_defer_shift, the run is deferred
When compaction of a zone fails, defer_compaction adjusts the three deferral-related members of the zone:
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
/*
* Compaction is deferred when compaction fails to result in a page
* allocation success. 1 << compact_defer_limit compactions are skipped up
* to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
*/
void defer_compaction(struct zone *zone, int order)
{
zone->compact_considered = 0;
zone->compact_defer_shift++;
if (order < zone->compact_order_failed)
zone->compact_order_failed = order;
if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
trace_mm_compaction_defer_compaction(zone, order);
}
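- zone->compact_considered is set to 0.
- zone->compact_defer_shift is incremented; if it becomes greater than 6 it is set back to 6. The deferral limit of a zone is therefore at most 64 ((1 << 6) = 64); once that many runs have been considered, compaction can no longer be deferred.
- If order < zone->compact_order_failed, then zone->compact_order_failed = order (order is the order of the allocation that triggered this compaction).
When compaction of a zone succeeds, compaction_defer_reset adjusts the three deferral-related members of the zone: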
/*
* Update defer tracking counters after successful compaction of given order,
* which means an allocation either succeeded (alloc_success == true) or is
* expected to succeed.
*/
void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success)
{
if (alloc_success) {
zone->compact_considered = 0;
zone->compact_defer_shift = 0;
}
if (order >= zone->compact_order_failed)
zone->compact_order_failed = order + 1;
trace_mm_compaction_defer_reset(zone, order);
}
- If the allocation from the zone succeeds after compaction, zone->compact_considered and zone->compact_defer_shift are both set to 0.
- If the order of the allocation is >= zone->compact_order_failed, then zone->compact_order_failed = order + 1.
Compaction of a zone generally ends in one of three ways:
- After compaction the zone's free page frames reach (low watermark + (1 << order) + reserved page frames). This is called a partial success.
- After compaction 1 << order contiguous page frames are actually obtained from the zone. This is called a success.
- After compaction the zone's free page frames are still below (low watermark + (1 << order) + reserved page frames). This is called a failure.
From the compaction code above:
- On partial success, compaction_defer_reset(zone, order, false) is called: if the order used is greater than or equal to zone->compact_order_failed, zone->compact_order_failed is set to order + 1.
- On success, compaction_defer_reset(zone, order, true) is called: the zone's deferral counter (compact_considered) and deferral shift (compact_defer_shift) are reset to 0, and if the order used is greater than or equal to zone->compact_order_failed, zone->compact_order_failed is set to order + 1.
- On failure in light sync or sync mode (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE || status == COMPACT_PARTIAL_SKIPPED)):
  - The deferral counter zone->compact_considered is reset to 0.
  - The deferral shift zone->compact_defer_shift is incremented. Because the deferral limit is computed as 1 << zone->compact_defer_shift, this increment doubles the limit.
  - Finally, if the order of this run is less than zone->compact_order_failed, zone->compact_order_failed is set to order.
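The three functions form an exponential backoff, which a small user-space model makes visible. The bodies below follow the kernel functions quoted above with the tracepoints removed; `struct toy_zone`, the `toy_` prefixes and the helper `toy_count_skips` are illustrative additions, not kernel code.

```c
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Only the three deferral-related members of struct zone. */
struct toy_zone {
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
};

bool toy_compaction_deferred(struct toy_zone *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	if (order < zone->compact_order_failed)
		return false;
	/* Avoid possible overflow */
	if (++zone->compact_considered > defer_limit)
		zone->compact_considered = defer_limit;
	return zone->compact_considered < defer_limit;
}

void toy_defer_compaction(struct toy_zone *zone, int order)
{
	zone->compact_considered = 0;
	zone->compact_defer_shift++;
	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;
	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

void toy_compaction_defer_reset(struct toy_zone *zone, int order, bool alloc_success)
{
	if (alloc_success) {
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}

/* Hypothetical helper: how many runs are skipped before one is allowed. */
unsigned int toy_count_skips(struct toy_zone *zone, int order)
{
	unsigned int skipped = 0;

	while (toy_compaction_deferred(zone, order))
		skipped++;
	return skipped;
}
```

In this model one failure causes one run to be skipped, two consecutive failures cause three, and after six or more consecutive failures 63 runs are skipped before the next attempt; in general (1 << compact_defer_shift) - 1 runs are skipped. A successful allocation resets the backoff completely.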
3.5.2 The compact_zone_order function
This function initializes struct compact_control, the control structure of a zone's compaction run, and then calls compact_zone() to compact the zone.
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int classzone_idx)
{
enum compact_result ret;
struct compact_control cc = {
//number of isolated free page frames, initially 0
.nr_freepages = 0,
//number of isolated page frames to migrate, initially 0
.nr_migratepages = 0,
.order = order,
//GFP mask of the allocation that triggered compaction; the migrate type (movable or reclaimable) is derived from it
.gfp_mask = gfp_mask,
//zone to compact
.zone = zone,
//page migration mode, chosen from compact_priority
.mode = (prio == COMPACT_PRIO_ASYNC) ?
MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
.direct_compaction = true,
//with full sync priority (COMPACT_PRIO_SYNC_FULL), scan the whole zone
.whole_zone = (prio == MIN_COMPACT_PRIORITY),
//with full sync priority, scan every pageblock even if it is marked skip
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
//with full sync priority, scan every pageblock even if it is considered unsuitable
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
// list->next = list->prev = list;
INIT_LIST_HEAD(&cc.freepages);
// list->next = list->prev = list;
INIT_LIST_HEAD(&cc.migratepages);
//compact the zone under the control of cc
ret = compact_zone(zone, &cc);
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
return ret;
}
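3.5.3 The compact_zone function
compact_zone compacts one particular zone under the control of a compact_control (read this together with the source below).
- compaction_suitable decides whether the zone is suitable for compaction. Only when it returns COMPACT_CONTINUE does compaction of the zone proceed; otherwise either the whole compaction task ends (COMPACT_SUCCESS) or this zone is skipped and compact_zone is called for the other zones (COMPACT_SKIPPED).
  - order == -1 means that compaction of the zone was forced through /proc/sys/vm/compact_memory, so the zone is compacted without further checks and compaction_suitable returns COMPACT_CONTINUE.
  - If zone_watermark_ok shows that the zone can already provide a free block of the requested order, the whole compaction task ends (the other zones are not compacted either) and compaction_suitable returns COMPACT_SUCCESS.
  - If the zone's free pages minus twice the requested pages fall below the watermark, the zone is not suitable for compaction: the function returns COMPACT_SKIPPED, this zone is skipped and compaction moves on to the other zones. If the result is above the watermark, a further check decides whether the zone is suitable:
    1. If the requested order <= PAGE_ALLOC_COSTLY_ORDER, the zone is suitable and the function returns COMPACT_CONTINUE.
    2. If order > PAGE_ALLOC_COSTLY_ORDER, the fragmentation index (fragindex) decides:
       a. If 0 <= fragindex <= sysctl_extfrag_threshold, the order-order allocation fails for lack of memory, compacting the zone would not help, so the zone is skipped and the function returns COMPACT_SKIPPED.
       b. If fragindex > sysctl_extfrag_threshold, the zone is badly fragmented and the failure is caused by fragmentation, so the zone is suitable and the function returns COMPACT_CONTINUE.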
- compaction_restarting checks whether compaction of this zone at this order or above has failed too many times. If so, __reset_isolation_suitable is called to refresh the zone's compaction-related data so that this run has a better chance of success; if there were only a few failures, __reset_isolation_suitable is not needed.
  - compaction_restarting:
    - It returns true when the following condition holds, which means compaction at this order or above has failed too often; __reset_isolation_suitable is then called to clear PB_migrate_skip on all pageblocks of the zone and to update the zone's compaction-related members.
      ( order >= zone->compact_order_failed && zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT && zone->compact_considered >= 1UL << zone->compact_defer_shift )
    - Otherwise it returns false and __reset_isolation_suitable is not executed.
  - __reset_isolation_suitable:
    - clears the PB_migrate_skip mark on every pageblock of the zone,
    - sets the zone's compact_blockskip_flush to false (the PB_migrate_skip marks of the zone no longer need to be cleared later),
    - calls reset_cached_positions(zone) to reassign the zone's compact_cached_migrate_pfn and compact_cached_free_pfn.
static void reset_cached_positions(struct zone *zone)
{
	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
	//compact_cached_free_pfn is the first page frame of the zone's last pageblock
	zone->compact_cached_free_pfn =
				pageblock_start_pfn(zone_end_pfn(zone) - 1);
}
In short, __reset_isolation_suitable widens the range of pages that compaction scans in the zone (nothing is left out) so that this run has the best chance of succeeding.
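- Initialize the start positions of the two scanners for this run (there are two cases, full sync and everything else).
  - If the whole zone is to be scanned (cc->whole_zone is true, which is the full sync case):
    - the migration scanner starts (cc->migrate_pfn) at the first page frame of the zone
    - the free scanner starts (cc->free_pfn) at the first page of the zone's last pageblock
  - If the whole zone is not to be scanned (anything other than full sync):
    - the migration scanner starts at zone->compact_cached_migrate_pfn[sync], that is, index 0 for async migration and index 1 otherwise
    - the free scanner starts at zone->compact_cached_free_pfn
  After the assignment both start positions are range-checked. A position outside the zone is initialized as in the whole-zone case, and cc->whole_zone is set to true.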
-
-
调用migrate_prep_local()函数将当前CPU的LRU缓冲pagevec中的页进行刷新(将处于pagevec中的页都放回原本所属的lru中)
-
执行while循环对整个zone的页进行扫描和迁移操作,最终目的是尽量将zone头部可迁移页面的内容依次迁移到zone尾部的空闲页面中。
-
循环结束的标志通过compact_finished函数来判断(参考后面关于compact_finished函数的详细分析)。
-
每次循环执行的操作:
-
先调用isolate_migratepages函数,在zone中从cc->migrate_pfn页框开始往高地址的页框进行扫描,直到找到第一个含有可迁移页面的pageblock,然后将该pageblock中的所有适合迁移的页面从lru链表中取出隔离到 cc->migratepages链表中,以此作为本次循环的迁移页面集,函数最后会将cc->migrate_pfn赋值为含有可迁移页面的pageblock的最后一个页的页框号,下次循环执行该函数扫描该zone的起始位置就会从当前pageblock的下一个相邻pageblock起始位置开始。
1.从cc->migrate_pfn开始往高地址页方向地扫描的过程中该zone的有些pageblock的 PB_migrate_skip标志被置位还有些pageblock中不存在适合迁移的页,这些pageblock都会直接 在isolate_migratepages函数的扫描中被跳过,但函数扫描的总页框有数量和边界限制详情见代 码分析 2.每次循环进行隔离可移动页框是以一个pageblock为单位,也就是从一个pageblock中将可以移动 页进行隔离,因此cc->migratepages一次循环最多也就只能对一个pageblock进行隔离,.
执行完isolate_migratepages函数后,本次循环走向由该函数返回值决定:
1.若返回ISOLATE_ABORT,表明此次循环寻找适合迁移页面失败,直接退出该zone的整个规整操作 2.若返回ISOLATE_NONE,表明此次循环未找到适合迁移的页面可能是未刷新某些缓存,进行相关操 作后,执行下一次循环 3.若返回ISOLATE_SUCCESS表明isolate_migratepages找到了适合迁移的页,继续执行本次循环后 续的迁移操作.
-
执行到调用migrate_pages处说明isolate_migratepages找到了适合迁移的页,并隔离在cc->migratepages链表中.函数migrate_pages的工作是将cc->migratepages可移动页的内容迁移到该zone尾部对应的空闲页中。
迁移方式:遍历cc->migratepages迁移扫描器中所有的可迁移页,每次对一个页进行迁移操作,每个页的迁移步骤如下: 1.通过compaction_alloc函数给需要迁移的页分配空闲页: a.compaction_alloc函数先判断cc->freepages链表中是否为空,若为空则利用 isolate_freepages将zone高端的空闲页隔离到cc->freepages链表中(数量尽量等于 cc-> cc->nr_migratepages) b.然后从cc->freepages链表中拿出一个空闲页newpage给cc->migratepages中某一页迁 移用。 2.然后对需要迁移的页page和空闲页newpage做一系列判断,看能否进行迁移操作。 3.若满足迁移要求,调用__unmap_and_move函数将page迁移到newpage中去
-
-
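The scanner initialization described above can be sketched in user-space C. This is only an illustration under stated assumptions: struct toy_zone, struct toy_cc and init_scanners are hypothetical, reduced stand-ins for the kernel's struct zone, struct compact_control and the start of compact_zone, and a pageblock is assumed to hold 1024 pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for illustration: 1024 pages per pageblock (pageblock_order = 10). */
#define PAGEBLOCK_NR_PAGES 1024UL

/* Hypothetical, reduced stand-in for the compaction fields of struct zone. */
struct toy_zone {
    unsigned long start_pfn;             /* first pfn of the zone */
    unsigned long end_pfn;               /* one past the last pfn of the zone */
    unsigned long cached_migrate_pfn[2]; /* [0] async, [1] sync */
    unsigned long cached_free_pfn;
};

/* Hypothetical, reduced stand-in for struct compact_control. */
struct toy_cc {
    unsigned long migrate_pfn;
    unsigned long free_pfn;
    bool whole_zone;
};

static unsigned long pageblock_start(unsigned long pfn)
{
    return pfn - (pfn % PAGEBLOCK_NR_PAGES);
}

/* Pick the start positions of both scanners, mirroring the steps in the text. */
static void init_scanners(struct toy_zone *z, struct toy_cc *cc, bool sync)
{
    if (cc->whole_zone) {
        cc->migrate_pfn = z->start_pfn;
        cc->free_pfn = pageblock_start(z->end_pfn - 1);
        return;
    }

    /* Resume from the cached positions, then range-check them. */
    cc->migrate_pfn = z->cached_migrate_pfn[sync];
    cc->free_pfn = z->cached_free_pfn;

    if (cc->free_pfn < z->start_pfn || cc->free_pfn >= z->end_pfn) {
        cc->free_pfn = pageblock_start(z->end_pfn - 1);
        z->cached_free_pfn = cc->free_pfn;
    }
    if (cc->migrate_pfn < z->start_pfn || cc->migrate_pfn >= z->end_pfn) {
        cc->migrate_pfn = z->start_pfn;
        z->cached_migrate_pfn[0] = cc->migrate_pfn;
        z->cached_migrate_pfn[1] = cc->migrate_pfn;
    }
    if (cc->migrate_pfn == z->start_pfn)
        cc->whole_zone = true;
}
```

With a zone covering pfns 4096 to 14435, a stale cached free position is replaced by 14336 (the first page of the last pageblock), while a valid cached migrate position is kept as it is.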
Source of compact_zone in detail:
static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
enum compact_result ret;
//first page frame number (pfn) of the zone
unsigned long start_pfn = zone->zone_start_pfn;
//pfn just past the end of the zone
unsigned long end_pfn = zone_end_pfn(zone);
//migrate type derived from the gfp flags of the allocation (__GFP_RECLAIMABLE, __GFP_MOVABLE)
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
//true for the sync modes, false for MIGRATE_ASYNC
const bool sync = cc->mode != MIGRATE_ASYNC;
//check whether this zone is suitable for compaction
ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
/* Compaction is likely to fail */
/*
*COMPACT_SUCCESS--> the zone can already satisfy the allocation, so compaction is skipped
*COMPACT_SKIPPED--> too few free pages to run compaction
*/
if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
return ret;
/* huh, compaction_suitable is returning something unexpected */
VM_BUG_ON(ret != COMPACT_CONTINUE);
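/*
*compaction_restarting checks whether compaction of this zone at cc->order or above has failed too many times
*in the past. If so, __reset_isolation_suitable resets the compaction state kept in the zone so that this
*run scans a wider range of the zone, which raises the chance that it succeeds
*/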
if (compaction_restarting(zone, cc->order))
__reset_isolation_suitable(zone);
/*
* Setup to move all movable pages to the end of the zone. Used cached
* information on where the scanners should start (unless we explicitly
* want to compact the whole zone), but check that it is initialised
* by ensuring the values are within zone boundaries.
*/
if (cc->whole_zone) {//scan the whole zone from both ends
cc->migrate_pfn = start_pfn;
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
} else {//resume from the scanner positions cached in the zone
cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
cc->free_pfn = zone->compact_cached_free_pfn;
/* Check cc->free_pfn: if the free scanner's start pfn is outside the zone, reset it to the first page of
*the zone's last pageblock and store the same value in zone->compact_cached_free_pfn
*/
if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
zone->compact_cached_free_pfn = cc->free_pfn;
}
/* Likewise check cc->migrate_pfn: if the migration scanner's start pfn is outside the zone, reset it
*to the first page frame of the zone and store the same value in zone->compact_cached_migrate_pfn
*/
if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
cc->migrate_pfn = start_pfn;
//index 0 is used by async compaction, index 1 by sync compaction
zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
if (cc->migrate_pfn == start_pfn)
cc->whole_zone = true;
}
cc->last_migrated_pfn = 0;
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync);
// drain this CPU's pagevecs so that cached pages go back to their LRU lists
migrate_prep_local();
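/*
*The while loop scans and migrates pages across the zone; the goal is to move the contents of movable pages near
*the head of the zone into free pages near the tail of the zone.
*1.compact_finished decides when the loop (and with it this compaction run of the zone) ends:
* (a)compaction was contended (e.g. async compaction hit lock contention or must reschedule) or a fatal signal is pending
* (b)the migration scanner and the free scanner have met
* (c)for direct compaction (triggered by a failed allocation): the zone passes the watermark check selected by
* cc->alloc_flags, and the free lists of the requested or a fallback migrate type hold a block of order >= cc->order
* PS: compaction forced through /proc/sys/vm/compact_memory ends only on (a) or (b); condition (c) is not checked
*2.Each iteration: scan from page frame cc->migrate_pfn toward higher addresses until a pageblock with at least one
* suitable page is found, isolate the suitable pages of that pageblock on cc->migratepages, then migrate the
* pages on cc->migratepages with migrate_pages into free pages at the high end of the zone, and finally
* update the fields of cc for the next iteration.
*/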
while ((ret = compact_finished(zone, cc, migratetype)) ==
COMPACT_CONTINUE) {
int err;
switch (isolate_migratepages(zone, cc)) {
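//aborted: put the pages left on cc->migratepages back on the LRU (or wherever they came from) and stop compacting this zone
case ISOLATE_ABORT:
ret = COMPACT_CONTENDED;
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
goto out;
//nothing was isolated in this round; pages freed by earlier migrations may still need to be drained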
case ISOLATE_NONE:
/*
* We haven't isolated and migrated anything, but
* there might still be unflushed migrations from
* previous cc->order aligned block.
*/
goto check_drain;
case ISOLATE_SUCCESS:
;
}
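/*
*At this point cc->migratepages holds at least one page suitable for migration. migrate_pages moves the
*pages on cc->migratepages into free pages at the high end of the zone.
*The free pages are obtained through compaction_alloc (see the analysis of that function)
*/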
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION);
trace_mm_compaction_migratepages(cc->nr_migratepages, err,
&cc->migratepages);
/* All pages were either migrated or will be released */
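//migration finished: reset the count of isolated movable pages
cc->nr_migratepages = 0;
//a non-zero err means some pages on cc->migratepages were not migrated, or an error such as -ENOMEM occurred
if (err) {
//put the pages that were not migrated back where they came from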
putback_movable_pages(&cc->migratepages);
/*
* migrate_pages() may return -ENOMEM when scanners meet
* and we want compact_finished() to detect it
* -ENOMEM before the scanners have met means migration failed abnormally: end compaction with COMPACT_CONTENDED
*/
if (err == -ENOMEM && !compact_scanners_met(cc)) {
ret = COMPACT_CONTENDED;
goto out;
}
/*
* We failed to migrate at least one page in the current
* order-aligned block, so skip the rest of it.
*/
if (cc->direct_compaction &&
(cc->mode == MIGRATE_ASYNC)) {
cc->migrate_pfn = block_end_pfn(
cc->migrate_pfn - 1, cc->order);
/* Draining pcplists is useless in this case */
cc->last_migrated_pfn = 0;
}
}
check_drain:
/*
* Has the migration scanner moved away from the previous
* cc->order aligned block where we migrated from? If yes,
* flush the pages that were freed, so that they can merge and
* compact_finished() can detect immediately if allocation
* would succeed.
*/
if (cc->order > 0 && cc->last_migrated_pfn) {
int cpu;
unsigned long current_block_start =
block_start_pfn(cc->migrate_pfn, cc->order);
if (cc->last_migrated_pfn < current_block_start) {
cpu = get_cpu_light();
local_lock_irq(swapvec_lock);
lru_add_drain_cpu(cpu);
local_unlock_irq(swapvec_lock);
drain_local_pages(zone);
put_cpu_light();
/* No more flushing until we migrate again */
cc->last_migrated_pfn = 0;
}
}
}
out:
/*
* Release free pages and update where the free scanner should restart,
* so we don't leave any returned pages behind in the next attempt.
* Free pages still held by the control structure go back to the buddy allocator
*/
if (cc->nr_freepages > 0) {
unsigned long free_pfn = release_freepages(&cc->freepages);
cc->nr_freepages = 0;
VM_BUG_ON(free_pfn == 0);
/* The cached pfn is always the first in a pageblock */
free_pfn = pageblock_start_pfn(free_pfn);
/*
* Only go back, not forward. The cached pfn might have been
* already reset to zone end in compact_finished()
*/
if (free_pfn > zone->compact_cached_free_pfn)
zone->compact_cached_free_pfn = free_pfn;
}
trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync, ret);
return ret;
}
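The overall effect of the loop in compact_zone can be illustrated with a toy model. The sketch below is not kernel code: it treats the zone as a small array of page states, moves one page at a time, and ignores pageblocks, batching, locking and watermarks. The names toy_state and toy_compact are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/*
 * Toy model of the two scanners: the migration scanner walks up from index 0,
 * the free scanner walks down from the last index. Each movable page found by
 * the migration scanner is "migrated" into the highest free slot that is still
 * above it. Returns the number of pages moved.
 */
static size_t toy_compact(enum toy_state *zone, size_t n)
{
    size_t migrate = 0;   /* next index the migration scanner looks at */
    size_t free_pos = n;  /* the free scanner looks at free_pos - 1 next */
    size_t moved = 0;

    while (migrate < free_pos) {
        if (zone[migrate] != TOY_MOVABLE) {
            migrate++;
            continue;
        }
        /* Free scanner: search downward for a free slot above 'migrate'. */
        while (free_pos > migrate + 1 && zone[free_pos - 1] != TOY_FREE)
            free_pos--;
        if (free_pos <= migrate + 1)
            break;        /* the scanners met: nothing left to do */

        zone[free_pos - 1] = TOY_MOVABLE; /* copy the page to the free slot */
        zone[migrate] = TOY_FREE;         /* the old location becomes free */
        free_pos--;
        migrate++;
        moved++;
    }
    return moved;
}
```

Starting from the layout M F U M F F M F (M movable, F free, U unmovable), the model ends with F F U F F M M M: the movable pages are packed at the high end, and the unmovable page is what still splits the free space at the low end.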
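3.5.3.1 Deciding whether a zone is suitable for compaction (compaction_suitable)
Whether a zone is suitable for compaction is decided by the following checks, in order:
- If order == -1, compaction was forced through /proc/sys/vm/compact_memory, so the zone is compacted without any further restriction.
- If zone_watermark_ok shows that the zone can already provide a free block of the requested order, the whole compaction task ends (the other zones are not compacted either).
- If the zone's free pages, reduced by twice the requested size (compact_gap(order)), fall below the watermark, the zone is not suitable: it is skipped and compaction moves on to the other zones. If they stay above the watermark, further checks decide:
- If the requested order <= PAGE_ALLOC_COSTLY_ORDER, the zone is suitable for compaction.
- If order > PAGE_ALLOC_COSTLY_ORDER, the fragmentation index (fragindex) decides:
- If 0 <= fragindex <= sysctl_extfrag_threshold, the order-sized allocation fails because memory is short, not because it is fragmented; compacting the zone would not help, so the zone is skipped.
- If fragindex > sysctl_extfrag_threshold, the zone is badly fragmented and the allocation fails because of fragmentation, so the zone is suitable for compaction (a negative index also lets compaction proceed).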
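The decision sequence above can be written down as one small function. This is a simplified sketch, not the kernel implementation: struct toy_input and toy_compaction_suitable are hypothetical names, the high-order watermark check is reduced to a single flag, and lowmem reserves are ignored. The constant 3 matches the kernel's PAGE_ALLOC_COSTLY_ORDER, and the gap of 2 << order pages matches compact_gap().

```c
#include <assert.h>

#define TOY_COSTLY_ORDER 3 /* same value as PAGE_ALLOC_COSTLY_ORDER */

enum toy_result { TOY_SKIPPED, TOY_SUCCESS, TOY_CONTINUE };

/* Hypothetical bundle of the inputs the decision depends on. */
struct toy_input {
    int order;                /* requested order, -1 for forced compaction */
    int can_allocate_now;     /* stand-in for zone_watermark_ok() at 'order' */
    unsigned long free_pages; /* free pages in the zone */
    unsigned long min_wmark;  /* min watermark in pages */
    unsigned long low_wmark;  /* low watermark in pages */
    int fragindex;            /* fragmentation index, -1000..1000 */
    int extfrag_threshold;    /* vm.extfrag_threshold */
};

static enum toy_result toy_compaction_suitable(const struct toy_input *in)
{
    unsigned long watermark;

    if (in->order == -1)
        return TOY_CONTINUE;          /* forced through compact_memory */
    if (in->can_allocate_now)
        return TOY_SUCCESS;           /* nothing to compact for */

    /* Costly orders use the low watermark, the others the min watermark. */
    watermark = in->order > TOY_COSTLY_ORDER ? in->low_wmark : in->min_wmark;
    watermark += 2UL << in->order;    /* twice the requested size */
    if (in->free_pages <= watermark)
        return TOY_SKIPPED;           /* too few free pages as migration targets */

    /* Only costly orders consult the fragmentation index. */
    if (in->order > TOY_COSTLY_ORDER &&
        in->fragindex >= 0 && in->fragindex <= in->extfrag_threshold)
        return TOY_SKIPPED;           /* failure is due to lack of memory */

    return TOY_CONTINUE;
}
```

For an order-4 request with a threshold of 500, a fragindex of 300 yields TOY_SKIPPED and a fragindex of 800 yields TOY_CONTINUE; an order-2 request ignores the index entirely.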
Source of compaction_suitable in detail:
enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int classzone_idx)
{
enum compact_result ret;
int fragindex;
ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
* low memory or external fragmentation
*
* index of -1000 would imply allocations might succeed depending on
* watermarks, but we already failed the high-order watermark check
* index towards 0 implies failure is due to lack of memory
* index towards 1000 implies failure is due to fragmentation
*
* Only compact if a failure would be due to fragmentation. Also
* ignore fragindex for non-costly orders where the alternative to
* a successful reclaim/compaction is OOM. Fragindex and the
* vm.extfrag_threshold sysctl is meant as a heuristic to prevent
* excessive compaction for costly orders, but it should not be at the
* expense of system stability.
*Once __compaction_suitable has accepted the zone, requests with order > PAGE_ALLOC_COSTLY_ORDER are also
*checked against the zone's fragmentation index, fragindex:
* 1.fragindex > sysctl_extfrag_threshold: the allocation fails because of fragmentation, so the zone is
* suitable for compaction
* 2.0 <= fragindex <= sysctl_extfrag_threshold: the allocation fails because memory is short, which
* compaction cannot fix
*/
if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
fragindex = fragmentation_index(zone, order);
if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
ret = COMPACT_NOT_SUITABLE_ZONE;
}
trace_mm_compaction_suitable(zone, order, ret);
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED;
return ret;
}
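/*
*Is this zone suitable for compaction?
*Return values:
* COMPACT_SKIPPED: the zone lacks the free pages needed to run compaction; skip it
* COMPACT_SUCCESS: the zone can already provide a block of the requested order; no compaction is needed, and the other zones are not compacted either
* COMPACT_CONTINUE: the zone is suitable for compaction.
*/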
static enum compact_result __compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int classzone_idx,
unsigned long wmark_target)
{
unsigned long watermark;
/*
* order == -1 means compaction was forced through /proc/sys/vm/compact_memory; it runs unconditionally,
* so return COMPACT_CONTINUE
*/
if (is_via_compact_memory(order))
return COMPACT_CONTINUE;
watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
/*
*the zone can already satisfy an order-sized allocation, so no compaction is needed: return COMPACT_SUCCESS and end the whole compaction attempt
*/
if (zone_watermark_ok(zone, order, watermark, classzone_idx,
alloc_flags))
return COMPACT_SUCCESS;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
* watermark and alloc_flags have to match, or be more pessimistic than
* the check in __isolate_free_page(). We don't use the direct
* compactor's alloc_flags, as they are not relevant for freepage
* isolation. We however do use the direct compactor's classzone_idx to
* skip over zones where lowmem reserves would prevent allocation even
* if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
* ALLOC_CMA is used, as pages in CMA pageblocks are considered
* suitable migration targets
*/
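//use the min watermark for order <= PAGE_ALLOC_COSTLY_ORDER and the low watermark for order > PAGE_ALLOC_COSTLY_ORDER
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
//add compact_gap(order), twice the requested size; a zone that cannot pass the raised watermark lacks the free pages needed as migration targets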
watermark += compact_gap(order);
if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
ALLOC_CMA, wmark_target))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
}
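In the end compaction_suitable returns one of three states:
- COMPACT_SKIPPED:
- The zone does not have enough free memory for compaction; skip it and compact the other zones.
- The other conditions hold, but the fragmentation index shows that the allocation failure is caused by a lack of memory rather than by severe fragmentation; skip the zone and compact the other zones.
- COMPACT_SUCCESS: the zone can already provide a contiguous block of the requested order; no compaction is needed, and the other zones are not compacted either.
- COMPACT_CONTINUE: the zone is suitable for compaction.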
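3.5.3.2 When the compaction loop of a zone ends (compact_finished)
compact_finished decides whether compaction of the zone is finished. The criteria are:
-
Compaction was contended (for example async compaction ran into lock contention or has to reschedule), or a fatal signal is pending.
-
The migration scanner and the free scanner have met. Before compaction of the zone ends, reset_cached_positions resets the cached positions as follows, so that the next scan of the zone starts again from both ends of the zone, unaffected by this run:
zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
zone->compact_cached_free_pfn = pageblock_start_pfn(zone_end_pfn(zone) - 1);
-
For direct compaction (triggered by a failed allocation): the zone passes the watermark check for the requested order, and the buddy free lists of the requested migrate type, or of one of its fallback types, hold a contiguous free block of that order or larger.
PS: compaction forced through /proc/sys/vm/compact_memory skips the last criterion; it ends only when one of the first two holds.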
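The order in which these criteria are evaluated can be summarized in a short sketch. It is a simplified model with hypothetical names (struct toy_state, toy_compact_finished): the watermark check is reduced to a flag, and the free lists of the requested and fallback migrate types are folded into one array of per-order block counts.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11 /* assumption: orders 0..10, as with MAX_ORDER = 11 */

enum toy_finish {
    TOY_CONTINUE, TOY_CONTENDED, TOY_COMPLETE, TOY_PARTIAL_SKIPPED, TOY_SUCCESS
};

/* Hypothetical snapshot of what the end-of-loop check looks at. */
struct toy_state {
    bool contended;     /* lock contention, need to reschedule, or fatal signal */
    bool scanners_met;  /* migration scanner reached the free scanner */
    bool whole_zone;    /* this run started at the beginning of the zone */
    int order;          /* requested order, -1 for forced compaction */
    bool watermark_ok;  /* stand-in for zone_watermark_ok() at 'order' */
    unsigned long nr_free[TOY_MAX_ORDER]; /* usable free blocks per order */
};

static enum toy_finish toy_compact_finished(const struct toy_state *s)
{
    if (s->contended)
        return TOY_CONTENDED;
    if (s->scanners_met)
        return s->whole_zone ? TOY_COMPLETE : TOY_PARTIAL_SKIPPED;
    if (s->order == -1)
        return TOY_CONTINUE;   /* forced compaction runs until the scanners meet */
    if (!s->watermark_ok)
        return TOY_CONTINUE;
    /* A free block of the requested order or larger ends the run. */
    for (int o = s->order; o < TOY_MAX_ORDER; o++)
        if (s->nr_free[o] > 0)
            return TOY_SUCCESS;
    return TOY_CONTINUE;
}
```

Note how a forced run (order -1) never reaches the free-list check, which matches the PS above.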
static enum compact_result compact_finished(struct zone *zone,
struct compact_control *cc,
const int migratetype)
{
int ret;
ret = __compact_finished(zone, cc, migratetype);
trace_mm_compaction_finished(zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE;
return ret;
}
static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
unsigned long watermark;
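//stop when compaction was contended (e.g. async compaction hit lock contention or must reschedule) or a fatal signal is pending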
if (cc->contended || fatal_signal_pending(current))
return COMPACT_CONTENDED;
/* Compaction run completes if the migrate and free scanner meet */
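//the migration scanner and the free scanner run from the two ends of the zone toward the middle; when they meet, compaction of the zone stops
if (compact_scanners_met(cc)) {
/* reset the cached scanner positions stored in the zone */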
reset_cached_positions(zone);
/*
* Mark that the PG_migrate_skip information should be cleared
* by kswapd when it goes to sleep. kcompactd does not set the
* flag itself as the decision to be clear should be directly
* based on an allocation request.
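* When direct compaction (triggered by a failed allocation) ends because the two
* scanners met, zone->compact_blockskip_flush is set to true; kswapd then clears
* all PG_migrate_skip bits of the zone when it goes to sleep. Together with the
* reset of the cached scanner positions to the two ends of the zone, this lets
* later runs scan the whole zone again, unaffected by this run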
*/
if (cc->direct_compaction)
zone->compact_blockskip_flush = true;
//the whole zone was scanned but no suitable block was produced
if (cc->whole_zone)
return COMPACT_COMPLETE;
else //only part of the zone was scanned and no suitable block was produced
return COMPACT_PARTIAL_SKIPPED;
}
/*
*cc->order == -1: compaction was forced through /proc/sys/vm/compact_memory and ignores the zone's free pages
*and watermarks. Apart from the contention check above, such a run only ends when the two scanners meet
*/
if (is_via_compact_memory(cc->order))
return COMPACT_CONTINUE;
/* Compaction run is not finished if the watermark is not met */
watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];
//the zone does not yet pass the watermark check selected by cc->alloc_flags for cc->order: keep compacting
if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
cc->alloc_flags))
return COMPACT_CONTINUE;
/* Direct compactor: Is a suitable page free? */
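/*
*For direct compaction (triggered by a failed allocation): once the buddy free lists of the requested migrate type,
*or of a fallback migrate type, hold a block of order >= cc->order, compaction of this zone has succeeded and stops.
*/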
for (order = cc->order; order < MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[order];
bool can_steal;
/* Job done if page is free of the right migratetype */
//the free list of the requested migrate type holds a block of order >= cc->order: compaction of this zone succeeded
if (!list_empty(&area->free_list[migratetype]))
return COMPACT_SUCCESS;
#ifdef CONFIG_CMA
/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
if (migratetype == MIGRATE_MOVABLE &&
!list_empty(&area->free_list[MIGRATE_CMA]))
return COMPACT_SUCCESS;
#endif
/*
* Job done if allocation would steal freepages from
* other migratetype buddy lists.
* a fallback migrate type holds a block of order >= cc->order that the allocation could take: compaction of this zone succeeded
*/
if (find_suitable_fallback(area, order, migratetype,
true, &can_steal) != -1)
return COMPACT_SUCCESS;
}
return COMPACT_NO_SUITABLE_PAGE;
}
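Summary:
Only when the migration scanner and the free scanner meet has the zone been compacted completely. A complete pass is not necessarily achieved by a single compaction run; it may take several runs, because a run can also end for the other two reasons. When a complete pass is reached, the two cached start positions are reset to the first page of the zone and to the first page of the zone's last pageblock, and, if the run was a direct compaction, zone->compact_blockskip_flush is set to true. When kswapd is about to sleep and finds this flag set, it clears the PB_migrate_skip bit of every pageblock in the zone.
3.5.3.3 Scanning the zone for migratable pages (isolate_migratepages)
isolate_migratepages uses the control structure cc (struct compact_control) to scan the given zone and isolates the migratable pages it finds on the cc->migratepages list. The kernel tracks the migrate type per pageblock (a pageblock is a contiguous block of 2^pageblock_order pages; pageblock_order is MAX_ORDER-1 = 10 here and can be smaller when huge page support sets it to the huge page order), so pages are isolated one pageblock at a time: each call picks suitable pages from a single pageblock of the zone. The steps are:
The function scans the zone from page frame cc->migrate_pfn toward higher addresses until it finds the first pageblock that contains migratable pages, takes the suitable pages of that pageblock off the LRU lists, and isolates them on cc->migratepages as the migration set of this iteration. Before returning it sets cc->migrate_pfn to the first page frame that was not scanned, so the next call continues from there (normally the start of the adjacent pageblock). This keeps the migration scan continuous across the zone.
The code and comments below cover the details of the scan.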
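The pageblock arithmetic that the scanner relies on is easy to check in isolation. The helpers below are user-space stand-ins with hypothetical toy_ names; they assume a pageblock order of 10 (1024 pages) and use 32 for SWAP_CLUSTER_MAX, as in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGEBLOCK_ORDER    10   /* assumption: 1024 pages per pageblock */
#define TOY_PAGEBLOCK_NR_PAGES (1UL << TOY_PAGEBLOCK_ORDER)
#define TOY_SWAP_CLUSTER_MAX   32UL /* same value as SWAP_CLUSTER_MAX */

/* First pfn of the pageblock that contains pfn. */
static unsigned long toy_pageblock_start_pfn(unsigned long pfn)
{
    return pfn & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

/* One past the last pfn of the pageblock that contains pfn. */
static unsigned long toy_pageblock_end_pfn(unsigned long pfn)
{
    return toy_pageblock_start_pfn(pfn) + TOY_PAGEBLOCK_NR_PAGES;
}

/* Number of pageblocks the migration scanner may visit before it would
 * cross the free scanner, following the loop condition
 * "block_end_pfn <= cc->free_pfn". */
static unsigned long toy_blocks_before_free_scanner(unsigned long migrate_pfn,
                                                    unsigned long free_pfn)
{
    unsigned long n = 0;
    unsigned long end;

    for (end = toy_pageblock_end_pfn(migrate_pfn); end <= free_pfn;
         end += TOY_PAGEBLOCK_NR_PAGES)
        n++;
    return n;
}

/* True at the pfns where the scan loop calls compact_should_abort():
 * once every SWAP_CLUSTER_MAX pageblocks. */
static bool toy_is_resched_checkpoint(unsigned long low_pfn)
{
    return (low_pfn % (TOY_SWAP_CLUSTER_MAX * TOY_PAGEBLOCK_NR_PAGES)) == 0;
}
```

For example, with cc->migrate_pfn = 5000 the current pageblock is [4096, 5120), and with cc->free_pfn = 8192 the scanner can visit four pageblocks before it reaches the free scanner.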
/*
* Allow userspace to control policy on scanning the unevictable LRU for
* compactable pages.
*/
int sysctl_compact_unevictable_allowed __read_mostly = 1;
/*
* Isolate all pages that can be migrated from the first suitable block,
* starting at the block pointed to by the migrate scanner pfn within
* compact_control.
*/
static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long block_start_pfn;
unsigned long block_end_pfn;
unsigned long low_pfn;
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
/*
* Start at where we last stopped, or beginning of the zone as
* initialized by compact_zone()
* From cc->migrate_pfn, work out the first pfn (block_start_pfn) and the end pfn (block_end_pfn) of the pageblock
* it lies in, clamping the values to the zone boundaries where needed
*/
low_pfn = cc->migrate_pfn;
block_start_pfn = pageblock_start_pfn(low_pfn);
if (block_start_pfn < zone->zone_start_pfn)
block_start_pfn = zone->zone_start_pfn;
/* Only scan within a pageblock boundary */
block_end_pfn = pageblock_end_pfn(low_pfn);
/*
* Iterate over whole pageblocks until we find the first suitable.
* Do not cross the free scanner.
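* Starting at low_pfn, the for loop steps through the zone one pageblock at a time toward higher addresses until it
* finds the first suitable pageblock (not marked to be skipped and, for async compaction, of a suitable migrate type). It then calls
* isolate_migratepages_block to take the migratable pages of that pageblock off the LRU and isolate them on cc->migratepages, and leaves the loop.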
*/
for (; block_end_pfn <= cc->free_pfn;
low_pfn = block_end_pfn,
block_start_pfn = block_end_pfn,
block_end_pfn += pageblock_nr_pages) {
/*
* This can potentially iterate a massively long zone with
* many pageblocks unsuitable, so periodically check if we
* need to schedule, or even abort async compaction.
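* This is a periodic checkpoint rather than a hard limit: the condition below is true
* once every SWAP_CLUSTER_MAX (32) pageblocks, i.e. every 32*1024 page frames with
* 1024-page pageblocks. At each checkpoint compact_should_abort() runs:
* (1)sync compaction:
* (a)if need_resched() says the task should be rescheduled, it calls cond_resched() to give up
* the CPU and continues the scan when it runs again
* (b)otherwise the scan simply continues
* (2)async compaction:
* (a)if need_resched() says the task should be rescheduled, break out of the scan loop
* (b)otherwise the scan continues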
*/
if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
&& compact_should_abort(cc))
break;
//get the first page of this pageblock and check that the block lies in this zone; if the check fails, skip the pageblock
page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,zone);
if (!page)
continue;
/* If isolation recently failed, do not retry */
//unless cc->ignore_skip_hint is set, skip the pageblock when its PB_migrate_skip bit is set
if (!isolation_suitable(cc, page))
continue;
/*
* For async compaction, also only scan in MOVABLE blocks.
* Async compaction is optimistic to see if the minimum amount
* of work satisfies the allocation.
* In async compaction a pageblock whose migrate type is neither MIGRATE_MOVABLE nor MIGRATE_CMA
* is skipped (async compaction does not scan RECLAIMABLE or UNMOVABLE pageblocks)
*/
if (cc->mode == MIGRATE_ASYNC &&
!migrate_async_suitable(get_pageblock_migratetype(page)))
continue;
/*
*Perform the isolation
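*isolate_migratepages_block is the core function that isolates the migratable pages of a pageblock:
* after it returns, the suitable pages of the pageblock have been taken off the LRU lists and added to
* cc->migratepages, and the pfn of the first page that was not scanned is returned
* and assigned to low_pfn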
*/
low_pfn = isolate_migratepages_block(cc, low_pfn,
block_end_pfn, isolate_mode);
//compaction terminated prematurely (fatal signal, abort or lock contention): stop this compaction run
if (!low_pfn || cc->contended)
return ISOLATE_ABORT;
/*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
* continue or not.
*/
break;
}
/*
*Record where migration scanner will be restarted.
*Store the position where the scan above stopped in cc->migrate_pfn, so that the next call continues from there
*(normally the adjacent pageblock) and the migration scanner covers the zone without gaps.
*/
cc->migrate_pfn = low_pfn;
/*
*If the number of isolated pages is greater than 0, migratable pages were found: return ISOLATE_SUCCESS. If it is 0,
*this scan found no suitable page in the zone: return ISOLATE_NONE
*/
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
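Isolating all migratable pages from one pageblock of the zone (isolate_migratepages_block)
isolate_migratepages_block takes the pages that can be migrated in the given pageblock off the LRU lists and isolates them on cc->migratepages. The function is long, but each part is easy to follow.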
/**
* isolate_migratepages_block() - isolate all migrate-able pages within
* a single pageblock
* @cc: Compaction control structure.
* @low_pfn: The first PFN to isolate
* @end_pfn: The one-past-the-last PFN to isolate, within same pageblock
* @isolate_mode: Isolation mode to be used.
*
* Isolate all pages that can be migrated from the range specified by
* [low_pfn, end_pfn). The range is expected to be within same pageblock.
* Returns zero if there is a fatal signal pending, otherwise PFN of the
* first page that was not scanned (which may be both less, equal to or more
* than end_pfn).
*
* The pages are isolated on cc->migratepages list (not required to be empty),
* and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
* is neither read nor updated.
*/
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
{
struct zone *zone = cc->zone;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
/*
* Ensure that there are not too many pages isolated from the LRU
* list by either parallel reclaimers or compaction. If there are,
* delay for some time until fewer pages are isolated
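* Check whether too many pages are already isolated. If the number of isolated
* pages exceeds half of the pages on the LRU lists:
* (1)async compaction (MIGRATE_ASYNC) gives up on this pageblock and returns at once.
* (2)the other modes sleep, after which one of two things happens:
* a.the task has a fatal signal pending and returns
* b.the number of isolated pages has dropped below the limit; the while loop ends and isolation proceeds.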
*/
while (unlikely(too_many_isolated(zone))) {
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
return 0;
// sleep for up to 100ms (HZ/10) and wait for congestion to ease
congestion_wait(BLK_RW_ASYNC, HZ/10);
if (fatal_signal_pending(current))
return 0;
}
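/*
*compact_should_abort returns true only for async compaction when the current task needs to be rescheduled.
*In the other modes a task that needs rescheduling calls cond_resched() to give up the CPU, and false is returned once it runs again
*
*/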
if (compact_should_abort(cc))
return 0;
if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
skip_on_failure = true;
next_skip_pfn = block_end_pfn(low_pfn, cc->order);
}
/* Time to isolate some pages for migration */
//walk every pfn in [low_pfn, end_pfn)
for (; low_pfn < end_pfn; low_pfn++) {
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
* We have isolated all migration candidates in the
* previous order-aligned block, and did not skip it due
* to failure. We should migrate the pages now and
* hopefully succeed compaction.
*/
if (nr_isolated)
break;
/*
* We failed to isolate in the previous order-aligned
* block. Set the new boundary to the end of the
* current block. Note we can't simply increase
* next_skip_pfn by 1 << order, as low_pfn might have
* been incremented by a higher number due to skipping
* a compound or a high-order buddy page in the
* previous loop iteration.
*/
next_skip_pfn = block_end_pfn(low_pfn, cc->order);
}
/*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort async compaction
* if contended.
* (the check runs every SWAP_CLUSTER_MAX pfns and drops zone->lru_lock if it is held)
*/
if (!(low_pfn % SWAP_CLUSTER_MAX)
&& compact_unlock_should_abort(zone_lru_lock(zone), flags,
&locked, cc))
break;
//skip pfns that have no valid struct page
if (!pfn_valid_within(low_pfn))
goto isolate_fail;
//count scanned pages
nr_scanned++;
//look up the page descriptor for this pfn
page = pfn_to_page(low_pfn);
//remember the first valid page scanned
if (!valid_page)
valid_page = page;
/*
* Skip if free. We read page order here without zone lock
* which is generally unsafe, but the race window is small and
* the worst thing that can happen is that we skip some
* potential isolation targets.
*PageBuddy tests whether the page is free in the buddy allocator (page->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE).
*If so, skip the whole free block the page belongs to, not just this single page
*/
if (PageBuddy(page)) {
//order of the free buddy block that contains this page (read without the zone lock)
unsigned long freepage_order = page_order_unsafe(page);
/*
* Without lock, we cannot be sure that what we got is
* a valid page order. Consider only values in the
* valid order range to prevent low_pfn overflow.
*skip the whole free block (a free buddy block consists of 2^freepage_order contiguous free pages)
*/
if (freepage_order > 0 && freepage_order < MAX_ORDER)
low_pfn += (1UL << freepage_order) - 1;
continue;
}
/*
* Regardless of being on LRU, compound pages such as THP and
* hugetlbfs are not to be compacted. We can potentially save
* a lot of iterations if we skip them at once. The check is
* racy, but we can consider only valid values and the only
* danger is skipping too much.
*huge pages and transparent huge pages are skipped as a whole, several pages at a time.
*/
if (PageCompound(page)) {
unsigned int comp_order = compound_order(page);
if (likely(comp_order < MAX_ORDER))
low_pfn += (1UL << comp_order) - 1;
goto isolate_fail;
}
/*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU and non-lru movable pages.
* Skip any other type of page
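* Two kinds of pages can be migrated: (1) movable pages on the LRU and (2) non-lru movable pages. No other
* page can be migrated, so pages that fit neither kind are skipped here. (Isolated pages, free buddy pages and
* non-lru movable pages are not on the LRU.)
*/
if (!PageLRU(page)) {//pages off the LRU are skipped unless they are non-lru movable pages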
/*
* __PageMovable can return false positive so we need
* to verify it under page_lock.
*/
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
if (locked) {
spin_unlock_irqrestore(zone_lru_lock(zone),
flags);
locked = false;
}
if (isolate_movable_page(page, isolate_mode))
goto isolate_success;
}
goto isolate_fail;
}
/*
* Migration will fail if an anonymous page is pinned in memory,
* so avoid taking lru_lock and isolating it unnecessarily in an
* admittedly racy check.
*An anonymous page whose reference count exceeds its map count is probably pinned in memory
*and would fail to migrate, so skip it
*/
if (!page_mapping(page) &&
page_count(page) > page_mapcount(page))
goto isolate_fail;
/* If we already hold the lock, we can skip some rechecking */
//check whether zone->lru_lock is already held
if (!locked) {
locked = compact_trylock_irqsave(zone_lru_lock(zone),
&flags, cc);
if (!locked)
break;
/* Recheck PageLRU and PageCompound under lock */
//the lock was just taken: recheck that the page is still on the LRU, otherwise skip it
if (!PageLRU(page))
goto isolate_fail;
/*
* Page become compound since the non-locked check,
* and it's on LRU. It can only be a THP so the order
* is safe to read and it's 0 for tail pages.
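* After the recheck under the lock: if the page has become part of a compound page, skip past it
*/
if (unlikely(PageCompound(page))) {
//advance low_pfn past the compound page (the order reads as 0 for a tail page, so then only this page is skipped)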
low_pfn += (1UL << compound_order(page)) - 1;
goto isolate_fail;
}
}
lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
/* Try isolate the page */
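/*
*Take the page off the LRU, after checking that it may be isolated. The page is rejected when:
*(1)PageUnevictable(page) && !(isolate_mode & ISOLATE_UNEVICTABLE)
*(2)PageWriteback(page) && (isolate_mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE))
*(3)PageDirty(page) && (isolate_mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)), subject to further checks in the function
*In these cases the page is not migrated and the code jumps to isolate_fail. See
*__isolate_lru_page for the exact conditions
*/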
if (__isolate_lru_page(page, isolate_mode) != 0)
goto isolate_fail;
VM_BUG_ON_PAGE(PageCompound(page), page);
/* Successfully isolated */
del_page_from_lru_list(page, lruvec, page_lru(page));
inc_node_page_state(page,
NR_ISOLATED_ANON + page_is_file_cache(page));
isolate_success:
//add the isolated page to cc->migratepages
list_add(&page->lru, &cc->migratepages);
//one more page on the migration list
cc->nr_migratepages++;
//one more page isolated by this call
nr_isolated++;
/*
* Record where we could have freed pages by migration and not
* yet flushed them to buddy allocator.
* - this is the lowest page that was isolated and likely be
* then freed by migration.
*/
if (!cc->last_migrated_pfn)
cc->last_migrated_pfn = low_pfn;
/* Avoid isolating too much */
//COMPACT_CLUSTER_MAX caps the number of pages isolated for one migration batch; stop here to avoid isolating too many
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
break;
}
continue;
isolate_fail:
if (!skip_on_failure)
continue;
/*
* We have isolated some pages, but then failed. Release them
* instead of migrating, as we cannot form the cc->order buddy
* page anyway.
*/
if (nr_isolated) {
if (locked) {
spin_unlock_irqrestore(zone_lru_lock(zone), flags);
locked = false;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
cc->last_migrated_pfn = 0;
nr_isolated = 0;
}
if (low_pfn < next_skip_pfn) {
low_pfn = next_skip_pfn - 1;
/*
* The check near the loop beginning would have updated
* next_skip_pfn too, but this is a bit simpler.
*/
next_skip_pfn += 1UL << cc->order;
}
}
/*
* The PageBuddy() check could have potentially brought us outside
* the range to be scanned.
*/
if (unlikely(low_pfn > end_pfn))
low_pfn = end_pfn;
//release zone->lru_lock if it is still held
if (locked)
spin_unlock_irqrestore(zone_lru_lock(zone), flags);
/*
* Update the pageblock-skip information and cached scanner pfn,
* if the whole pageblock was scanned without isolating any page.
* If the whole pageblock was scanned and no page was isolated, mark the pageblock PB_migrate_skip and
* update the cached scanner positions in the zone roughly as follows (pfn is the page frame number of valid_page):
* if (pfn > zone->compact_cached_migrate_pfn[0])
* zone->compact_cached_migrate_pfn[0] = pfn;
* if (cc->mode != MIGRATE_ASYNC && pfn > zone->compact_cached_migrate_pfn[1])
* zone->compact_cached_migrate_pfn[1] = pfn;
*valid_page is the first valid page scanned in this pageblock
*/
if (low_pfn == end_pfn)
update_pageblock_skip(cc, valid_page, nr_isolated, true);
//tracing and statistics: number of pages scanned and isolated
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
nr_scanned, nr_isolated);
count_compact_events(COMPACTMIGRATE_SCANNED, nr_scanned);
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
return low_pfn;
}
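From the code above, the pages that are not isolated onto cc->migratepages are:
1. Free pages held by the buddy allocator.
2. Pages that are not on an LRU list (with the exception of non-lru movable pages, which can be isolated).
3. Huge pages and transparent huge pages.
4. Pages that are already isolated (they are not on an LRU list).
5. Of the pages on the LRU lists only some can be isolated; __isolate_lru_page decides according to isolate_mode (of type isolate_mode_t):
*(1)PageUnevictable(page) && !(isolate_mode & ISOLATE_UNEVICTABLE)
*(2)PageWriteback(page) && (isolate_mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE))
*(3)PageDirty(page) && (isolate_mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)), subject to further checks in the function
When one of these three conditions rejects the page, it is not isolated and the code jumps to isolate_fail. See __isolate_lru_page for the exact conditions.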
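The mode checks listed above can be sketched as a predicate. This is a reconstruction from memory of how __isolate_lru_page behaves in kernels of this era, not a copy of it, so treat the details as assumptions: the flag values and struct toy_page are hypothetical, and mapping_can_migrate stands in for the check that a dirty page's mapping provides a migratepage operation (pages without a mapping count as migratable here).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel defines its own isolate_mode_t bits. */
#define TOY_ISOLATE_CLEAN         0x1u
#define TOY_ISOLATE_ASYNC_MIGRATE 0x2u
#define TOY_ISOLATE_UNEVICTABLE   0x4u

/* Hypothetical summary of the page state the filter looks at. */
struct toy_page {
    bool on_lru;
    bool unevictable;
    bool writeback;
    bool dirty;
    bool mapping_can_migrate; /* assumed stand-in for a_ops->migratepage */
};

/* Returns true if the page may be isolated under the given mode. */
static bool toy_can_isolate(const struct toy_page *p, unsigned int mode)
{
    if (!p->on_lru)
        return false;
    if (p->unevictable && !(mode & TOY_ISOLATE_UNEVICTABLE))
        return false;

    if (mode & (TOY_ISOLATE_CLEAN | TOY_ISOLATE_ASYNC_MIGRATE)) {
        /* Pages under writeback would block the caller: reject them. */
        if (p->writeback)
            return false;
        if (p->dirty) {
            /* ISOLATE_CLEAN never takes dirty pages. */
            if (mode & TOY_ISOLATE_CLEAN)
                return false;
            /* Async migration takes dirty pages only if they can be
             * migrated without writing them out first. */
            if (!p->mapping_can_migrate)
                return false;
        }
    }
    return true;
}
```

In this model a dirty page is accepted by async compaction only when it can be migrated without blocking, and sync compaction (no ASYNC_MIGRATE bit) accepts pages under writeback because it is allowed to wait.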
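**PS:** The pages on the LRU lists are user-space pages, usually allocated from pageblocks whose migrate type is movable or reclaimable, so pages to migrate are normally taken from the movable pages on the LRU. Yet the migration scanner code above also isolates some non-lru pages. Why?
Reason: in-use non-lru pages are usually allocated by the kernel, and most of them are unmovable. To allow some kernel pages used by drivers to be migrated as well, commit bda807d44 ("mm: migrate: support non-lru movable page migration") introduced PageMovable(), which lets drivers mark such in-use pages as movable, so these special kernel pages can be migrated too.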
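3.5.3.4 Migrating the pages on cc->migratepages (migrate_pages)
First, this is how each iteration of compact_zone calls migrate_pages:
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION);
The call migrates the isolated pages on cc->migratepages into free pages at the high end of the zone. compaction_alloc is a callback that supplies a suitable free page for each migration. compaction_free is also a callback; it puts a free page obtained from compaction_alloc back on cc->freepages when the migration fails.
Before turning to migrate_pages itself, the two callbacks passed to it, compaction_alloc and compaction_free, are analyzed.
Allocating free pages for compaction (compaction_alloc)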
/*
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
static struct page *compaction_alloc(struct page *migratepage,
unsigned long data,
int **result)
{
//recover the compact_control pointer passed in through data
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage;
/*
* Isolate free pages if necessary, and if we are not aborting due to
* contention.
*If cc->freepages is empty, the free scanner has to refill it with enough free pages
*/
if (list_empty(&cc->freepages)) {
//only scan for free pages if compaction has not been contended
if (!cc->contended)
isolate_freepages(cc);
//still empty after the free scanner ran: return NULL, no free page could be obtained
if (list_empty(&cc->freepages))
return NULL;
}
//take one isolated free page off cc->freepages to serve as the migration target
freepage = list_entry(cc->freepages.next, struct page, lru);
list_del(&freepage->lru);
cc->nr_freepages--;
//return the free page obtained for migration
return freepage;
}
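The interplay between the alloc callback, the free callback and the refill step can be modelled with a plain singly linked list. The sketch below uses hypothetical names throughout (struct toy_page, struct toy_cc, toy_isolate_freepages and so on); the pool field stands in for the free pages that the free scanner could still find in the zone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor for the toy model. */
struct toy_page {
    struct toy_page *next;
    unsigned long pfn;
};

/* Hypothetical, reduced stand-in for struct compact_control. */
struct toy_cc {
    struct toy_page *freepages; /* isolated free pages, ready to hand out */
    unsigned long nr_freepages;
    struct toy_page *pool;      /* pages the "free scanner" can still find */
};

/* Stand-in for isolate_freepages(): move every page from pool to freepages. */
static void toy_isolate_freepages(struct toy_cc *cc)
{
    while (cc->pool) {
        struct toy_page *p = cc->pool;

        cc->pool = p->next;
        p->next = cc->freepages;
        cc->freepages = p;
        cc->nr_freepages++;
    }
}

/* Alloc callback: refill the list if it is empty, then pop one page. */
static struct toy_page *toy_compaction_alloc(struct toy_cc *cc)
{
    struct toy_page *p;

    if (!cc->freepages)
        toy_isolate_freepages(cc);
    if (!cc->freepages)
        return NULL; /* the free scanner found nothing */

    p = cc->freepages;
    cc->freepages = p->next;
    p->next = NULL;
    cc->nr_freepages--;
    return p;
}

/* Free callback: a migration failed, so the target page goes back on the list. */
static void toy_compaction_free(struct toy_page *page, struct toy_cc *cc)
{
    page->next = cc->freepages;
    cc->freepages = page;
    cc->nr_freepages++;
}
```

The first call to toy_compaction_alloc triggers the refill; later calls are served from the list until it runs dry, at which point NULL tells the caller that no migration target is available.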
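Scanning the zone for free pages to migrate into (isolate_freepages)
The code above shows that allocating a free page during migration is a simple flow; its core is the free scanner function isolate_freepages. Its purpose:
While a zone is being compacted and the control structure holds no isolated free pages (cc->freepages is empty), isolate_freepages scans pageblock by pageblock from cc->free_pfn toward the low addresses of the zone and isolates the suitable free pages it meets onto the cc->freepages list. The scan ends when either of the following holds:
- The free scanner reaches the pageblock the migration scanner is working on (cc->migrate_pfn).
- The number of isolated free pages has reached the number of isolated migratable pages (cc->nr_freepages >= cc->nr_migratepages).
Note that the free scanner moves through the zone pageblock by pageblock, starting at the pageblock that contains cc->free_pfn at the high end of the zone and moving toward lower addresses, but inside each pageblock it scans from the first page toward the last page.
As described earlier, one call of the migration scanner (isolate_migratepages) isolates pages from a single pageblock, so the pages on cc->migratepages belong to the same pageblock of the zone. The free scanner, in contrast, may cross pageblocks in one call, so the pages on cc->freepages can come from different pageblocks. Because cc->nr_freepages >= cc->nr_migratepages is checked only after pages have been isolated, the free scanner may end up holding somewhat more pages than the migration scanner isolated, or fewer if the two scanners meet first.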
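The scan order of the free scanner (pageblocks from high to low, pages inside a pageblock from low to high) can be seen in a small model. The function below is a sketch with hypothetical names; it uses a tiny pageblock of 4 pages, expects free_pfn and low_pfn to be pageblock aligned, and checks the stop condition only after a whole pageblock has been scanned, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK 4UL /* tiny pageblock size, for illustration only */

/*
 * Collect free pfns the way the free scanner walks the zone: pageblocks from
 * free_pfn downward, pages inside a pageblock upward. Stops once at least
 * 'wanted' pages were collected or the block at low_pfn has been scanned.
 * is_free[pfn] is non-zero for a free page. Returns the number collected.
 */
static size_t toy_free_scan(const int *is_free, unsigned long free_pfn,
                            unsigned long low_pfn, size_t wanted,
                            unsigned long *out, size_t out_cap)
{
    size_t n = 0;
    unsigned long block_start = free_pfn - (free_pfn % TOY_BLOCK);

    if (block_start < low_pfn)
        return 0;

    for (;;) {
        unsigned long pfn;

        /* Inside a pageblock the scan runs from the first page upward. */
        for (pfn = block_start; pfn < block_start + TOY_BLOCK && n < out_cap; pfn++)
            if (is_free[pfn])
                out[n++] = pfn;

        if (n >= wanted)
            break;                        /* enough migration targets */
        if (block_start < low_pfn + TOY_BLOCK)
            break;                        /* next block would be below low_pfn */
        block_start -= TOY_BLOCK;         /* move to the next lower pageblock */
    }
    return n;
}
```

With free pages at pfns 8 and 11 in the top block and 5 and 6 in the block below, asking for three pages yields 8, 11, 5, 6: the scan crosses into a second pageblock and ends up with one page more than requested.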
/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
static void isolate_freepages(struct compact_control *cc)
{
struct zone *zone = cc->zone;
struct page *page;
unsigned long block_start_pfn; /* start of current pageblock */
unsigned long isolate_start_pfn; /* exact pfn we start at */
unsigned long block_end_pfn; /* end of current pageblock */
unsigned long low_pfn; /* lowest pfn scanner is able to scan */
struct list_head *freelist = &cc->freepages;
/*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
* zone when isolating for the first time. For looping we also need
* this pfn aligned down to the pageblock boundary, because we do
* block_start_pfn -= pageblock_nr_pages in the for loop.
* For ending point, take care when isolating in last pageblock of a
* a zone which ends in the middle of a pageblock.
* The low boundary is the end of the pageblock the migration scanner
* is using.
* Get the first pageblock to scan (the pageblock containing page frame cc->free_pfn)
*/
isolate_start_pfn = cc->free_pfn;
block_start_pfn = pageblock_start_pfn(cc->free_pfn);
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
//low_pfn is the end of the pageblock the migration scanner is working in (cc->migrate_pfn rounded up to a pageblock boundary)
low_pfn = pageblock_end_pfn(cc->migrate_pfn);
/*
* Isolate free pages until enough are available to migrate the
* pages on cc->migratepages. We stop searching if the migrate
* and free page scanners meet or enough free pages are isolated.
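* 1. Scan from the pageblock containing cc->free_pfn (the last pageblock of the zone on the first run) down toward the pageblock containing
* cc->migrate_pfn, isolating the free pages found along the way on the cc->freepages list.
* 2. Each iteration scans exactly one pageblock: block_start_pfn is the page frame number of the first page of that pageblock,
* and block_end_pfn is the page frame number just past its last page.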
*/
for (; block_start_pfn >= low_pfn;
block_end_pfn = block_start_pfn,
block_start_pfn -= pageblock_nr_pages,
isolate_start_pfn = block_start_pfn) {
/*
* This can iterate a massively long zone without finding any
* suitable migration targets, so periodically check if we need
* to schedule, or even abort async compaction.
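* The check below runs once every SWAP_CLUSTER_MAX (32) pageblocks, so that a long scan
* cannot hold the CPU for too long. When it runs:
* (1) In the synchronous compaction modes:
* (a) if need_resched() reports that the task should be rescheduled, it calls cond_resched() to give up
* the CPU and carries on with the rest of this loop once it runs again
* (b) if need_resched() reports that no reschedule is needed, the loop simply continues
* (2) In asynchronous compaction mode:
* (a) if need_resched() reports that the task should be rescheduled, the free page scan loop is left with break
* (b) if need_resched() reports that no reschedule is needed, the loop simply continues
*/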
if (!(block_start_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
&& compact_should_abort(cc))
break;
/*
*Validate block_start_pfn and block_end_pfn first, then return the struct page of the first page of this iteration's pageblock in page
*/
page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
zone);
//If page is NULL, skip this pageblock and move on to the next one
if (!page)
continue;
/* Check the block is suitable for migration
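* Decide whether this pageblock can receive migrated pages (three checks, see suitable_migration_target for details):
* (1) If cc->ignore_block_suitable is true, no check is made: every pageblock counts as a suitable target and the function returns true
* (2) If page is in the buddy system and the free block it heads has an order greater than or equal to the pageblock order, the block is
* not a suitable target and the function returns false
* (3) Otherwise the pageblock must have migrate type MIGRATE_MOVABLE or MIGRATE_CMA; any other type, such as MIGRATE_RECLAIMABLE,
* is not suitable and the function returns false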
*/
if (!suitable_migration_target(cc, page))
continue;
/* If isolation recently failed, do not retry */
if (!isolation_suitable(cc, page))
continue;
/*
*Found a block suitable for isolating free pages from.
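*Once the pageblock has passed the checks above, scan it for free pages and isolate them:
* (1) Scan from isolate_start_pfn toward the high addresses of the zone, stopping at block_end_pfn.
* (2) In the first iteration isolate_start_pfn is cc->free_pfn, which may lie inside the pageblock if an earlier run stopped part-way
* through it; in that case only the remainder of the block is scanned.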
*/
isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn,
freelist, false);
/*
* If we isolated enough freepages, or aborted due to lock
* contention, terminate.
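*If the scan has isolated at least as many free pages as there are isolated pages on the cc->migratepages list, or it was aborted
*because of lock contention, leave the scan loop.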
*/
if ((cc->nr_freepages >= cc->nr_migratepages)
|| cc->contended) {
if (isolate_start_pfn >= block_end_pfn) {
/*
* Restart at previous pageblock if more
* freepages can be isolated next time.
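*After the loop exits, cc->free_pfn is set from isolate_start_pfn, so the next run of the free page scanner will most likely
*start at the page isolate_start_pfn points to. If isolate_start_pfn were not updated here, the next run could rescan pages
*that an earlier run has already scanned (the scan as a whole moves from high to low addresses across pageblocks,
*but from low to high addresses inside each pageblock)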
*/
isolate_start_pfn =
block_start_pfn - pageblock_nr_pages;
}
break;
} else if (isolate_start_pfn < block_end_pfn) {
/*
* If isolation failed early, do not continue
* needlessly.
*/
break;
}
}
/* __isolate_free_page() does not map the pages */
map_pages(freelist);
/*
* Record where the free scanner will restart next time. Either we
* broke from the loop and set isolate_start_pfn based on the last
* call to isolate_freepages_block(), or we met the migration scanner
* and the loop terminated due to isolate_start_pfn < low_pfn
*Record in cc->free_pfn where the free page scanner stopped this time (the next scan may resume from there)
*/
cc->free_pfn = isolate_start_pfn;
}
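The two ways the loop above can end (enough free pages isolated, or the free scanner meeting the migration scanner) and the "restart at the previous pageblock" adjustment can be sketched in user space. This is a simplified model under stated assumptions, not the kernel implementation: `struct toy_scan_state`, `toy_isolate_freepages` and the per-block free page counts are hypothetical, and pageblocks are identified by index rather than by pfn.

```c
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct compact_control used here. */
struct toy_scan_state {
    unsigned long nr_freepages;    /* free pages isolated so far */
    unsigned long nr_migratepages; /* pages waiting to be migrated */
    long free_block;               /* pageblock index the free scanner starts at */
    long migrate_block;            /* pageblock index the migration scanner is in */
};

/*
 * free_in_block[i] is the number of free pages toy pageblock i can provide.
 * Walk from free_block downwards, stopping above migrate_block.
 */
void toy_isolate_freepages(struct toy_scan_state *st,
                           const unsigned long *free_in_block)
{
    long blk;

    for (blk = st->free_block; blk > st->migrate_block; blk--) {
        st->nr_freepages += free_in_block[blk];

        if (st->nr_freepages >= st->nr_migratepages) {
            /*
             * This block was fully consumed, so the next run should
             * restart at the previous (lower) pageblock.
             */
            blk--;
            break;
        }
    }
    /* Remember where to resume, like cc->free_pfn in the kernel code. */
    st->free_block = blk;
}
```

In this model the free page count can end up larger than `nr_migratepages`, because a whole block's worth of pages is added at once; when the loop runs out of blocks instead, `free_block` ends up equal to `migrate_block`, which corresponds to the scanners meeting.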
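Scanning a single pageblock for free pages to use as migration targets (isolate_freepages_block)
In the compaction flow, isolate_freepages_block scans a single pageblock and isolates the free pages in it that can serve as migration targets. The isolated free pages are placed on the cc->freepages list. See the code and comments below for the details.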
/*
* Isolate free pages onto a private freelist. If @strict is true, will abort
* returning 0 on any invalid PFNs or non-free pages inside of the pageblock
* (even though it may still end up isolating some pages).
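*Scan the range [start_pfn, end_pfn) of the zone from low to high addresses and isolate the free pages usable as migration
*targets onto the freelist list (cc->freepages). The return value is the total number of pages isolated from the range:
* (1) If strict is false, the scan loop breaks as soon as the free pages held by cc are at least as many as the pages on cc->migratepages
* (2) If strict is true, the scan loop breaks as soon as a single page in the range cannot be isolated as a free
* page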
*/
static unsigned long isolate_freepages_block(struct compact_control *cc,
unsigned long *start_pfn,
unsigned long end_pfn,
struct list_head *freelist,
bool strict)
{
int nr_scanned = 0, total_isolated = 0;
struct page *cursor, *valid_page = NULL;
unsigned long flags = 0;
bool locked = false;
unsigned long blockpfn = *start_pfn;
unsigned int order;
cursor = pfn_to_page(blockpfn);
/* Isolate free pages. */
//Scan from the first page frame toward the last; every suitable free page is taken out of the buddy system and isolated on freelist
for (; blockpfn < end_pfn; blockpfn++, cursor++) {
int isolated;
//struct page of the current page
struct page *page = cursor;
/*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort if fatal signal
* pending or async compaction detects need_resched()
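*Every SWAP_CLUSTER_MAX pages, drop the lock (if held) to give IRQs a chance. After releasing it:
* 1. If a fatal signal is pending for the task, break out of the loop
* 2. Otherwise:
* (1) If compaction is not in asynchronous mode, check whether the task needs rescheduling:
* a. if so, call cond_resched() to give up the CPU; once the task is scheduled again it carries on
* with the scan
* b. if not, carry straight on with the scan
* (2) If compaction is in asynchronous mode, check whether the task needs rescheduling:
* a. if so, break out of the loop
* b. if not, carry straight on with the scan
*/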
if (!(blockpfn % SWAP_CLUSTER_MAX)
&& compact_unlock_should_abort(&cc->zone->lock, flags,
&locked, cc))
break;
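//Count the pages this call has scanned
nr_scanned++;
//Check that the current page frame number is valid
if (!pfn_valid_within(blockpfn))
goto isolate_fail;
//valid_page records the struct page of the first valid page seen by this call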
if (!valid_page)
valid_page = page;
/*
* For compound pages such as THP and hugetlbfs, we can save
* potentially a lot of iterations if we skip them at once.
* The check is racy, but we can consider only valid values
* and the only danger is skipping too much.
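*If the page belongs to a compound page (a huge page or transparent huge page), skip the whole compound page (many ordinary 4K pages) at once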
*/
if (PageCompound(page)) {
unsigned int comp_order = compound_order(page);
if (likely(comp_order < MAX_ORDER)) {
blockpfn += (1UL << comp_order) - 1;
cursor += (1UL << comp_order) - 1;
}
goto isolate_fail;
}
//If the page is not free (not in the buddy system), skip it
if (!PageBuddy(page))
goto isolate_fail;
/*
* If we already hold the lock, we can skip some rechecking.
* Note that if we hold the lock now, checked_pageblock was
* already set in some previous iteration (or strict is true),
* so it is correct to skip the suitable migration target
* recheck as well.
*Take the zone lock if it is not held yet
*/
if (!locked) {
/*
* The zone lock must be held to isolate freepages.
* Unfortunately this is a very coarse lock and can be
* heavily contended if there are parallel allocations
* or parallel compactions. For async compaction do not
* spin on the lock and we acquire the lock as late as
* possible.
*/
locked = compact_trylock_irqsave(&cc->zone->lock,
&flags, cc);
if (!locked)
break;
/* Recheck this is a buddy page under lock */
if (!PageBuddy(page))
goto isolate_fail;
}
/* Found a free page, will break it into order-0 pages */
/*
*The statements below remove the free block headed by page from the buddy system
*and isolate the pages of that block on freelist
*/
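//Get the order of the free block
order = page_order(page);
//Remove the block headed by page from the buddy system; isolated receives the number of pages in it (taking free pages must still respect the watermark)
isolated = __isolate_free_page(page, order);
//No page was obtained, most likely because the watermark check failed; break and end the scan of this pageblock
if (!isolated)
break;
//The block just removed from the buddy system is still a single unit headed by page; record its order in page->private again
set_page_private(page, order);
//Update the number of free pages isolated by this call
total_isolated += isolated;
//Update the number of free pages held by the control structure cc
cc->nr_freepages += isolated;
//Add the block headed by page to the freelist list
list_add_tail(&page->lru, freelist);
//When strict is false and cc holds at least as many isolated free pages as isolated migration pages, break and stop scanning
if (!strict && cc->nr_migratepages <= cc->nr_freepages) {
//Advance blockpfn past the block just isolated
blockpfn += isolated;
break;
}
/* Advance to the end of split page */
//Move blockpfn to the last page of the block just isolated
blockpfn += isolated - 1;
//Move the page descriptor cursor to match
cursor += isolated - 1;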
continue;
isolate_fail:
//strict is true, so the scan is strict: one page in the range that cannot be isolated as a free page ends the scan of the whole pageblock
if (strict)
break;
else
continue;
}
//Release the lock
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
/*
* There is a tiny chance that we have read bogus compound_order(),
* so be careful to not go outside of the pageblock.
*Guard against running past the end of the range
*/
if (unlikely(blockpfn > end_pfn))
blockpfn = end_pfn;
trace_mm_compaction_isolate_freepages(*start_pfn, blockpfn,
nr_scanned, total_isolated);
/* Record how far we have got within the block */
//Record the page frame number the free page scanner has reached
*start_pfn = blockpfn;
/*
* If strict isolation is requested by CMA then check that all the
* pages requested were isolated. If there were any failures, 0 is
* returned and CMA will fail.
*strict is true: if even one page in the range could not be isolated, isolation of the whole range counts as failed and 0 is returned.
*/
if (strict && blockpfn < end_pfn)
total_isolated = 0;
/*
*Update the pageblock-skip if the whole pageblock was scanned
*The whole pageblock was scanned; if no free page was isolated from it, mark the pageblock to be skipped
*/
if (blockpfn == end_pfn)
update_pageblock_skip(cc, valid_page, total_isolated, false);
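//Statistics: add the pages scanned by the free page scanner to the compaction counters
count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
if (total_isolated)
//Statistics: add the pages isolated by compaction to the counters
count_compact_events(COMPACTISOLATED, total_isolated);
//Return the total number of pages isolated by this scan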
return total_isolated;
}
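The difference between the strict and non-strict modes of isolate_freepages_block can be made concrete with a user-space model. The sketch below is only an illustration under stated assumptions: `toy_isolate_block` and `TOY_PAGE_IN_USE` are hypothetical, each page is represented by an int that holds the buddy order when the page heads a free block and -1 when the page is in use, and locking, compound pages and watermark checks are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_IN_USE (-1)

/*
 * Scan [*start_pfn, end_pfn). page_order[pfn] is the order of the free block
 * headed by pfn, or TOY_PAGE_IN_USE. *nr_have counts free pages already held.
 * Returns the pages isolated by this call (0 in strict mode on any failure)
 * and stores in *start_pfn how far the scan got.
 */
unsigned long toy_isolate_block(const int *page_order, unsigned long *start_pfn,
                                unsigned long end_pfn, unsigned long nr_wanted,
                                unsigned long *nr_have, bool strict)
{
    unsigned long pfn = *start_pfn;
    unsigned long total = 0;

    for (; pfn < end_pfn; pfn++) {
        unsigned long isolated;

        if (page_order[pfn] == TOY_PAGE_IN_USE) {
            if (strict)
                break;      /* strict: one bad page ends the scan */
            continue;       /* non-strict: just skip it */
        }

        isolated = 1UL << page_order[pfn];
        total += isolated;
        *nr_have += isolated;

        /* Non-strict callers stop once they hold enough free pages. */
        if (!strict && *nr_have >= nr_wanted) {
            pfn += isolated;
            break;
        }
        /* Jump to the last page of the block; the loop adds the final 1. */
        pfn += isolated - 1;
    }

    if (pfn > end_pfn)
        pfn = end_pfn;
    *start_pfn = pfn;

    /* Strict mode reports failure unless the whole range was isolated. */
    if (strict && pfn < end_pfn)
        total = 0;
    return total;
}
```

With a toy block laid out as in use, free order 1, (tail), in use, free, in use, free, free, a non-strict scan collects 5 pages, while a strict scan of the same range returns 0 because the very first page is in use.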
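Returning a free page to the cc->freepages list after a failed migration (compaction_free)
compaction_free is simple. During page migration, compaction_alloc is called first to take one isolated free page from cc->freepages as the migration target. If that migration fails, compaction_free puts the free page that was just taken back onto the cc->freepages list, where it stays isolated.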
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
* special handling needed for NUMA.
*/
static void compaction_free(struct page *page, unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
list_add(&page->lru, &cc->freepages);
cc->nr_freepages++;
}
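The way compaction_alloc and compaction_free cooperate on the isolated free list can be sketched in user space with a singly linked list. This is a minimal model, not the kernel code: `struct toy_free_page`, `struct toy_free_pool`, `toy_compaction_alloc` and `toy_compaction_free` are hypothetical names, and the rescan that the kernel performs when the list is empty is only mentioned in a comment.

```c
#include <stddef.h>

/* Hypothetical stand-in for a struct page linked through its lru field. */
struct toy_free_page {
    struct toy_free_page *next;
};

/* Hypothetical stand-in for cc->freepages and cc->nr_freepages. */
struct toy_free_pool {
    struct toy_free_page *head;
    unsigned long nr_freepages;
};

/* Take the first isolated free page, or NULL when none is left. */
struct toy_free_page *toy_compaction_alloc(struct toy_free_pool *pool)
{
    struct toy_free_page *page = pool->head;

    /* The kernel would run isolate_freepages() here before giving up. */
    if (!page)
        return NULL;

    pool->head = page->next;
    page->next = NULL;
    pool->nr_freepages--;
    return page;
}

/* Put an unused target page back at the head of the isolated list. */
void toy_compaction_free(struct toy_free_pool *pool, struct toy_free_page *page)
{
    page->next = pool->head;
    pool->head = page;
    pool->nr_freepages++;
}
```

Because a returned page goes back to the head of the list, the next allocation hands out the same page again, which mirrors the list_add in compaction_free and the use of cc->freepages.next in compaction_alloc.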
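With the two callbacks passed to migrate_pages covered, the analysis of migrate_pages itself follows.
Code walkthrough of migrate_pages
Within compaction, this function migrates the contents and state of every migratable page on the cc->migratepages list, page by page, into free pages taken from cc->freepages. (If there are not enough free pages, the isolate_freepages scanner first scans the zone and adds suitable free pages (page blocks) to cc->freepages; compaction_alloc then takes a page from cc->freepages and hands it out.)
Setting compaction aside, the rest of this section describes the page migration function migrate_pages in detail.
The function is declared as follows: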
int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
enum migrate_mode mode, int reason)
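- The function migrates every page on the from list, one page at a time, into new pages allocated by the get_new_page callback.
- The put_new_page parameter is also a callback: when migration into a page obtained from get_new_page fails, it releases that new page to a place chosen by the caller. (During compaction the page goes back onto the cc->freepages list and stays isolated.)
- The private parameter carries caller-specific data from the caller of migrate_pages; get_new_page uses it to find a new page that meets the caller's requirements. (During compaction it is cast back to the compaction control pointer struct compact_control *cc.)
- The mode parameter gives the migration mode, an enum migrate_mode. (During compaction, mode equals cc->mode.)
- The reason parameter gives the reason for the migration, an enum migrate_reason. (For compaction it is MR_COMPACTION.)
enum migrate_reason {
MR_COMPACTION,
MR_MEMORY_FAILURE,
MR_MEMORY_HOTPLUG,
MR_SYSCALL, /* also applies to cpusets */
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CMA,
MR_TYPES
};
In migrate_pages, before each page is processed the function first checks whether the current task needs to be rescheduled, and only then migrates the page.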
/*
* migrate_pages - migrate the pages specified in a list, to the free pages
* supplied as the target for the page migration
*
* @from: The list of pages to be migrated.
* @get_new_page: The function used to allocate free pages to be used
* as the target of the page migration.
* @put_new_page: The function used to free target pages if migration
* fails, or NULL if no special handling is necessary.
* @private: Private data to be passed on to get_new_page()
* @mode: The migration mode that specifies the constraints for
* page migration, if any.
* @reason: The reason for page migration.
*
* The function returns after 10 attempts or if no pages are movable any more
* because the list has become empty or no retryable pages exist any more.
* The caller should call putback_movable_pages() to return pages to the LRU
* or free list only if ret != 0.
*
* Returns the number of pages that were not migrated, or an error code.
*/
int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
enum migrate_mode mode, int reason)
{
int retry = 1;
int nr_failed = 0;
int nr_succeeded = 0;
int pass = 0;
struct page *page;
struct page *page2;
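//Whether the current task is already allowed to write pages to swap
int swapwrite = current->flags & PF_SWAPWRITE;
int rc;
if (!swapwrite)//If the task does not have PF_SWAPWRITE set, set it for the duration of the migration
current->flags |= PF_SWAPWRITE;
//Make up to 10 passes over the list so that pages that were temporarily busy get retried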
for(pass = 0; pass < 10 && retry; pass++) {
retry = 0;
//Walk every page on the from list; page is the current page and page2 is the page that follows it on the list
list_for_each_entry_safe(page, page2, from, lru) {
/*
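*Check whether the current task needs to be rescheduled and, if so, give up the CPU voluntarily until it is scheduled again.
*The list can be long and migrating each page is expensive, so this keeps the loop from holding the CPU for too long.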
*/
cond_resched();
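//Check whether page is an ordinary (hugetlbfs) huge page; transparent huge pages are not included
if (PageHuge(page))
//Migration of a huge page (not analysed here, because migration triggered by compaction does not migrate huge pages or transparent huge pages)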
rc = unmap_and_move_huge_page(get_new_page,
put_new_page, private, page,
pass > 2, mode, reason);
else
//Migration of a normal page: obtain a new free page through get_new_page and migrate page into it
rc = unmap_and_move(get_new_page, put_new_page,
private, page, pass > 2, mode,
reason);
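//Handle the return value of unmap_and_move
switch(rc) {
//Migration of page failed because no new target page could be allocated; give up on the rest of the list
case -ENOMEM:
nr_failed++;
goto out;
//Migration of page failed temporarily (for example the page lock could not be taken). The break only leaves the switch; the page stays on the list and is retried in the next pass, for at most 10 passes
case -EAGAIN:
retry++;
break;
//Migration succeeded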
case MIGRATEPAGE_SUCCESS:
nr_succeeded++;
break;
default:
/*
* Permanent failure (-EBUSY, -ENOSYS, etc.):
* unlike -EAGAIN case, the failed page is
* removed from migration page list and not
* retried in the next outer loop.
*/
nr_failed++;
break;
}
}
}
nr_failed += retry;
rc = nr_failed;
out:
if (nr_succeeded)
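//Account the number of pages migrated successfully
count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
if (nr_failed)
//Account the number of pages that failed to migrate
count_vm_events(PGMIGRATE_FAIL, nr_failed);
trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
//Clear PF_SWAPWRITE again if this function was the one that set it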
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;
return rc;
}
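The retry policy of migrate_pages (up to 10 passes, -EAGAIN pages retried, other errors counted as permanent failures) can be modelled in user space. The sketch below rests on stated assumptions: `struct toy_mpage`, `toy_move_one` and `toy_migrate_pages` are hypothetical, a `done` flag replaces removal from the list, the -ENOMEM early exit is left out, and the errno constants assume a POSIX `<errno.h>`.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PASSES 10

/* Hypothetical description of how one page behaves when migrated. */
struct toy_mpage {
    int busy_attempts;   /* attempts that return -EAGAIN before success */
    int permanent_error; /* nonzero: every attempt fails with -EBUSY */
    int done;            /* set once the page has left the list */
};

/* Stand-in for unmap_and_move(): 0 on success, negative errno otherwise. */
static int toy_move_one(struct toy_mpage *page)
{
    if (page->permanent_error)
        return -EBUSY;
    if (page->busy_attempts > 0) {
        page->busy_attempts--;
        return -EAGAIN;
    }
    return 0;
}

/* Returns the number of pages not migrated; *nr_succeeded gets the rest. */
int toy_migrate_pages(struct toy_mpage *pages, size_t n, int *nr_succeeded)
{
    int retry = 1, nr_failed = 0, pass;
    size_t i;

    *nr_succeeded = 0;
    for (pass = 0; pass < TOY_MAX_PASSES && retry; pass++) {
        retry = 0;
        for (i = 0; i < n; i++) {
            if (pages[i].done)
                continue;
            switch (toy_move_one(&pages[i])) {
            case -EAGAIN:
                retry++;            /* stays on the list for the next pass */
                break;
            case 0:
                (*nr_succeeded)++;
                pages[i].done = 1;
                break;
            default:
                nr_failed++;        /* permanent failure, not retried */
                pages[i].done = 1;
                break;
            }
        }
    }
    /* Pages still waiting for a retry after the last pass count as failed. */
    return nr_failed + retry;
}
```

A page that is busy for two attempts succeeds on the third pass, while a page that stays busy for more than 10 attempts is reported as not migrated together with the permanent failures.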
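Migrating a single page (unmap_and_move)
This article analyses only unmap_and_move, the migration function for normal pages; unmap_and_move_huge_page, the migration function for huge pages, is not covered.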
static ICE_noinline int unmap_and_move(new_page_t get_new_page,
free_page_t put_new_page,
unsigned long private, struct page *page,
int force, enum migrate_mode mode,
enum migrate_reason reason)
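unmap_and_move is declared as shown above. Using the data passed in private, it allocates a new free page newpage through the get_new_page callback, removes the mappings of page, and finally copies the data of page into the newly allocated newpage. See the source and comments below for the details.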
/*
* Obtain the lock on page, remove all ptes and migrate the page
* to the newly allocated page in newpage.
*/
static ICE_noinline int unmap_and_move(new_page_t get_new_page,
free_page_t put_new_page,
unsigned long private, struct page *page,
int force, enum migrate_mode mode,
enum migrate_reason reason)
{
int rc = MIGRATEPAGE_SUCCESS;
int *result = NULL;
struct page *newpage;
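//Use get_new_page with the private control data to allocate a free page as the migration target (see compaction_alloc() in the compaction discussion)
newpage = get_new_page(page, private, &result);
//Allocation of the new page failed, so page cannot be migrated; return the error code
if (!newpage)
return -ENOMEM;
//A reference count of 1 means the only reference left is the one taken when the page was isolated: nobody else uses or maps the page any more, so it can simply be released and needs no migration
if (page_count(page) == 1) {
/* page was freed from under us. So we are done. */
//Release the page: clear its active and unevictable flags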
ClearPageActive(page);
ClearPageUnevictable(page);
if (unlikely(__PageMovable(page))) {
lock_page(page);
if (!PageMovable(page))
//Clear the isolated flag
__ClearPageIsolated(page);
unlock_page(page);
}
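//page needs no migration, so give the unused new page back
if (put_new_page)//If put_new_page is provided, release newpage through it (see compaction_free)
put_new_page(newpage, private);
else//Otherwise drop the reference on the new page with the ordinary put_page (page->count--), which returns it to where it came from
put_page(newpage);
goto out;
}
//page is a transparent huge page: split it into normal pages first, and give up on this page if the split fails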
if (unlikely(PageTransHuge(page))) {
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
if (rc)
goto out;
}
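//Unmap page, then copy its contents to newpage
rc = __unmap_and_move(page, newpage, force, mode);
if (rc == MIGRATEPAGE_SUCCESS)
//On success, record in the new page the reason it was migrated
set_page_owner_migrate_reason(newpage, reason);
out://Below: once migration is over, restore or release page and newpage according to whether it succeeded.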
if (rc != -EAGAIN) {
/*
* A page that has been migrated has all references
* removed and will be freed. A page that has not been
* migrated will have kepts its references and be
* restored.
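*Once migration is over, whether it failed or succeeded, a page that will not be retried is taken out of isolation first. During migration
*the page sits on a private list (during compaction, for example, page is isolated on the cc->migratepages list) through page->lru,
*so leaving isolation means unlinking page->lru here.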
*/
list_del(&page->lru);
/*
* Compaction can migrate also non-LRU pages which are
* not accounted to NR_ISOLATED_*. They can be recognized
* as __PageMovable
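* Some migrated pages were not on an LRU list before they were isolated for migration. They are recognised as __PageMovable pages,
* and they leave isolation differently from ordinary pages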
*/
if (likely(!__PageMovable(page)))
dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
}
/*
* If migration is successful, releases reference grabbed during
* isolation. Otherwise, restore the page to right list unless
* we want to retry.
*/
if (rc == MIGRATEPAGE_SUCCESS) {//Migration succeeded
put_page(page);//page->count--
if (reason == MR_MEMORY_FAILURE) {
/*
* Set PG_HWPoison on just freed page
* intentionally. Although it's rather weird,
* it's how HWPoison flag works at the moment.
*/
if (!test_set_page_hwpoison(page))
num_poisoned_pages_inc();
}
} else {//Migration failed
if (rc != -EAGAIN) {//Migration failed and page will not be retried
if (likely(!__PageMovable(page))) {
putback_lru_page(page);//Take the page out of isolation and put it back on its LRU list
goto put_new;
}
lock_page(page);
if (PageMovable(page))
putback_movable_page(page);//The non-LRU movable pages mentioned above are put back where they belong
else
__ClearPageIsolated(page);//Clear the isolated flag
unlock_page(page);
put_page(page);//page->count--
}
put_new://Migration failed: release the newly allocated page
if (put_new_page)
put_new_page(newpage, private);
else
put_page(newpage);
}
if (result) {
if (rc)
*result = rc;
else
*result = page_to_nid(newpage);
}
return rc;
}
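If the migration succeeds, unmap_and_move drops the reference it holds on the old page, which releases the old page back to the buddy system.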
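The cleanup at the end of unmap_and_move comes down to a small decision table: what happens to the old page and to the new page for each return code. The sketch below summarises it in user space under stated assumptions: all `toy_` and `TOY_` names are hypothetical, `TOY_MIGRATEPAGE_SUCCESS` is assumed to be 0 as in the listing above, and the hwpoison and non-LRU movable page details are left out.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_MIGRATEPAGE_SUCCESS 0

enum toy_old_page_fate {
    TOY_OLD_REFERENCE_DROPPED, /* put_page(page): the old page gets freed */
    TOY_OLD_PUT_BACK,          /* returned to its LRU list, not retried */
    TOY_OLD_LEFT_FOR_RETRY     /* stays on the migration list */
};

enum toy_new_page_fate {
    TOY_NEW_KEPT,              /* now holds the migrated contents */
    TOY_NEW_GIVEN_BACK         /* released via put_new_page or put_page */
};

struct toy_fate {
    enum toy_old_page_fate old_page;
    enum toy_new_page_fate new_page;
};

/*
 * rc is the result of the migration attempt. freed_under_us models the
 * page_count(page) == 1 shortcut, where rc stays at success although
 * nothing was copied.
 */
struct toy_fate toy_unmap_and_move_fate(int rc, bool freed_under_us)
{
    struct toy_fate fate;

    if (rc == TOY_MIGRATEPAGE_SUCCESS) {
        fate.old_page = TOY_OLD_REFERENCE_DROPPED;
        fate.new_page = freed_under_us ? TOY_NEW_GIVEN_BACK : TOY_NEW_KEPT;
    } else if (rc == -EAGAIN) {
        fate.old_page = TOY_OLD_LEFT_FOR_RETRY;
        fate.new_page = TOY_NEW_GIVEN_BACK;
    } else {
        fate.old_page = TOY_OLD_PUT_BACK;
        fate.new_page = TOY_NEW_GIVEN_BACK;
    }
    return fate;
}
```

The table makes one point easy to see: the new page is kept only when data was actually copied into it; in every other case it goes back to the caller, which for compaction means back onto cc->freepages.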
- http://www.elecfans.com/d/1263966.html
- https://zhuanlan.zhihu.com/p/81983973
- https://www.cnblogs.com/tolimit/p/5286663.html