Contents
An important kernel patch behind adding anonymous pages to the inactive LRU
This article is based on the Linux 5.9 source code.
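How anonymous pages are created
- User space allocates memory with malloc/mmap (when not mapping a file). When the first access raises a page fault in the kernel, do_anonymous_page creates the anonymous page. This is the most common path.
- Copy-on-write. A page fault caused by a write-protection violation allocates a new page, and that page is anonymous. The main paths are do_wp_page and do_cow_fault.
- do_swap_page: reading data back from the swap area allocates an anonymous page.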
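State of an anonymous page at creation
migrate type: movable
page->_refcount: 2
page->_mapcount: 0
page->mapping: points to the anon_vma structure of the vma; used by rmap (reverse mapping)
page->index: the page offset of the virtual address within the vma (computed by linear_page_index)
LRU: inactive anon LRU
flags: [PG_swapbacked | PG_lru]. The page can be swapped out (on Android, for example, to zram compressed memory). Note that PG_referenced is not set.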
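- PG_swapbacked: do_anonymous_page sets this flag through page_add_new_anon_rmap. It means the page can be swapped out to the swap area (for example zram on Android). The kernel helper PageSwapBacked is true for two kinds of pages: anon pages, as here, and shmem pages.
- movable: the page table entry is filled in by do_anonymous_page at fault time, so page migration only has to copy the page and update the page table mapping. See alloc_zeroed_user_highpage_movable in do_anonymous_page.
- _refcount equals 2, meaning the kernel holds two references to the page.
  - First reference: a page that alloc_pages takes from the buddy allocator starts with _refcount = 1. Being allocated is like being "married off": the page is now bound to an owner, which counts as one reference. Once it is freed back to the buddy allocator it is unbound again and _refcount = 0.
  - Second reference: adding the page to the inactive LRU. A new anonymous page is added to the inactive anon LRU; see lru_cache_add_inactive_or_unevictable in do_anonymous_page.
  - Note: strictly speaking the page is first added to an LRU cache (a per-CPU pagevec that batches insertions for performance). Once it moves from the LRU cache to the real LRU list, the reference count drops by 1 and _refcount = 1.
- _mapcount: 0 means that when the anonymous page is created, exactly one process page table entry maps it. The field is set in page_add_new_anon_rmap, shown below.
- mapping: points to the anon_vma structure.
  - For an anonymous page, mapping points to the anon_vma structure of the anonymous mapping (for a file page it points to an address_space).
  - Because mapping points to different objects for different kinds of pages, the kernel can use it to tell whether a page is anonymous. That is what PageAnon does: if the lowest bit of the mapping pointer is set, the page is anonymous.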
#define PAGE_MAPPING_ANON       0x1
#define PAGE_MAPPING_MOVABLE    0x2
#define PAGE_MAPPING_KSM        (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

static __always_inline int PageAnon(struct page *page)
{
    page = compound_head(page);
    return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}
- Assignment of the mapping field: see page_add_new_anon_rmap, called from do_anonymous_page.
/**
 * page_add_new_anon_rmap - add pte mapping to a new anonymous page
 * @page: the page to add the mapping to
 * @vma: the vm area in which the mapping is added
 * @address: the user virtual address mapped
 * @compound: charge the page as compound or small page
 *
 * Same as page_add_anon_rmap but must only be called on *new* pages.
 * This means the inc-and-test can be bypassed.
 * Page does not have to be locked.
 */
void page_add_new_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address, bool compound)
{
    int nr = compound ? hpage_nr_pages(page) : 1;

    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
    __SetPageSwapBacked(page);
    if (compound) {
        VM_BUG_ON_PAGE(!PageTransHuge(page), page);
        /* increment count (starts at -1) */
        atomic_set(compound_mapcount_ptr(page), 0);
        __inc_node_page_state(page, NR_ANON_THPS);
    } else {
        /* Anon THP always mapped first with PMD */
        VM_BUG_ON_PAGE(PageTransCompound(page), page);
        /* increment count (starts at -1) */
        atomic_set(&page->_mapcount, 0);
    }
    __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
    __page_set_anon_rmap(page, vma, address, 1);
}

/**
 * __page_set_anon_rmap - set up new anonymous rmap
 * @page: Page to add to rmap
 * @vma: VM area to add page to.
 * @address: User virtual address of the mapping
 * @exclusive: the page is exclusively owned by the current process
 */
static void __page_set_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address, int exclusive)
{
    struct anon_vma *anon_vma = vma->anon_vma;

    BUG_ON(!anon_vma);

    if (PageAnon(page))
        return;

    /*
     * If the page isn't exclusively mapped into this vma,
     * we must use the _oldest_ possible anon_vma for the
     * page mapping!
     */
    if (!exclusive)
        anon_vma = anon_vma->root;

    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    page->mapping = (struct address_space *) anon_vma;
    page->index = linear_page_index(vma, address);
}
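The low-bit trick behind PageAnon can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: the flag values are the ones quoted above, while `model_anon_vma`, `model_page` and the `model_*` helpers are hypothetical stand-ins invented for this illustration. It relies on the assumption that the pointed-to object is at least 4-byte aligned, so the two low bits of its address are free to carry flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as quoted in the article. */
#define PAGE_MAPPING_ANON    0x1UL
#define PAGE_MAPPING_MOVABLE 0x2UL
#define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct model_anon_vma { int id; };
struct model_page { void *mapping; };

/* Store the anon_vma pointer with the ANON bit set, mirroring what
 * __page_set_anon_rmap does with "anon_vma + PAGE_MAPPING_ANON". */
static void model_set_anon_mapping(struct model_page *page,
                                   struct model_anon_vma *av)
{
    page->mapping = (void *)((uintptr_t)av | PAGE_MAPPING_ANON);
}

/* Same test as PageAnon(): is bit 0 of the mapping pointer set? */
static bool model_page_anon(const struct model_page *page)
{
    return ((uintptr_t)page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* Strip the flag bits to recover the real anon_vma pointer. */
static struct model_anon_vma *model_page_anon_vma(const struct model_page *page)
{
    uintptr_t raw = (uintptr_t)page->mapping;

    return (struct model_anon_vma *)(raw & ~PAGE_MAPPING_FLAGS);
}
```

The stored value is never dereferenced directly. Callers first mask off the flag bits, which is why a single pointer field can describe both file pages and anonymous pages.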
Source of the do_anonymous_page fault handler
/*
 * We enter with non-exclusive mmap_lock (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with mmap_lock still held, but pte unmapped and unlocked.
 */
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct page *page;
    vm_fault_t ret = 0;
    pte_t entry;
    ...
    // As the function name says, this ends up asking the buddy allocator for a zeroed, movable page.
    // A page fresh from the buddy allocator has _refcount = 1 and _mapcount = -1.
    page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
    if (!page)
        goto oom;
    ...
    /*
     * The memory barrier inside __SetPageUptodate makes sure that
     * preceding stores to the page contents become visible before
     * the set_pte_at() write.
     */
    __SetPageUptodate(page);
    ...
    inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
    // Core step: set the mapping and index fields of the anon page
    page_add_new_anon_rmap(page, vma, vmf->address, false);
    // Add the page to the appropriate LRU list
    lru_cache_add_inactive_or_unevictable(page, vma);
    ...
}
lru_cache_add_inactive_or_unevictable
Purpose: add the page to the appropriate LRU list. The source is as follows:
/**
 * lru_cache_add_inactive_or_unevictable
 * @page: the page to be added to LRU
 * @vma: vma in which page is mapped for determining reclaimability
 *
 * Place @page on the inactive or unevictable LRU list, depending on its
 * evictability. Note that if the page is not evictable, it goes
 * directly back onto it's zone's unevictable list, it does NOT use a
 * per cpu pagevec.
 */
void lru_cache_add_inactive_or_unevictable(struct page *page,
    struct vm_area_struct *vma)
{
    ...
    lru_cache_add(page);
}

void lru_cache_add(struct page *page)
{
    struct pagevec *pvec;

    VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
    VM_BUG_ON_PAGE(PageLRU(page), page);

    get_page(page);
    local_lock(&lru_pvecs.lock);
    pvec = this_cpu_ptr(&lru_pvecs.lru_add);
    // pagevec_add is a caching mechanism: inserting into the LRU list requires taking a lock,
    // and doing that for one page at a time is too slow. So the kernel first puts the page into
    // a cache and inserts the whole batch into the LRU once the cache is full.
    if (!pagevec_add(pvec, page) || PageCompound(page))
        __pagevec_lru_add(pvec);
    local_unlock(&lru_pvecs.lock);
}
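The interplay between the pagevec batch and _refcount described earlier can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel implementation: `lru_page`, `lru_pagevec` and the `model_*` functions are hypothetical names, locking and the compound-page shortcut are left out, and the batch size of 15 matches PAGEVEC_SIZE in kernels of this era.

```c
#include <stddef.h>

#define MODEL_PAGEVEC_SIZE 15   /* same value as the kernel's PAGEVEC_SIZE */

/* Hypothetical, simplified page: only the fields this sketch needs. */
struct lru_page {
    int refcount;   /* models page->_refcount */
    int on_lru;     /* 1 once the page sits on the real LRU list */
};

/* Hypothetical, simplified per-CPU batch. */
struct lru_pagevec {
    unsigned int nr;
    struct lru_page *pages[MODEL_PAGEVEC_SIZE];
};

/* Add a page to the batch; returns the space left, like pagevec_add(). */
static unsigned int model_pagevec_add(struct lru_pagevec *pvec,
                                      struct lru_page *page)
{
    pvec->pages[pvec->nr++] = page;
    return MODEL_PAGEVEC_SIZE - pvec->nr;
}

/* Move every batched page to the "real" LRU and drop the batch's reference. */
static void model_pagevec_drain(struct lru_pagevec *pvec)
{
    for (unsigned int i = 0; i < pvec->nr; i++) {
        pvec->pages[i]->on_lru = 1;
        pvec->pages[i]->refcount--;
    }
    pvec->nr = 0;
}

/* Model of lru_cache_add(): take a reference, batch, drain when full. */
static void model_lru_cache_add(struct lru_pagevec *pvec, struct lru_page *page)
{
    page->refcount++;                      /* get_page() */
    if (!model_pagevec_add(pvec, page))    /* no space left after this add */
        model_pagevec_drain(pvec);
}
```

Starting from refcount 1 (fresh from the allocator), a page sits at 2 while it waits in the batch and returns to 1 once the batch is drained, which matches the _refcount values discussed above.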
page_add_new_anon_rmap
/**
 * page_add_new_anon_rmap - add pte mapping to a new anonymous page
 * @page: the page to add the mapping to
 * @vma: the vm area in which the mapping is added
 * @address: the user virtual address mapped
 * @compound: charge the page as compound or small page
 *
 * Same as page_add_anon_rmap but must only be called on *new* pages.
 * This means the inc-and-test can be bypassed.
 * Page does not have to be locked.
 */
void page_add_new_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address, bool compound)
{
    int nr = compound ? thp_nr_pages(page) : 1;

    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
    __SetPageSwapBacked(page);
    if (compound) {
        VM_BUG_ON_PAGE(!PageTransHuge(page), page);
        /* increment count (starts at -1) */
        atomic_set(compound_mapcount_ptr(page), 0);
        if (hpage_pincount_available(page))
            atomic_set(compound_pincount_ptr(page), 0);
        __inc_lruvec_page_state(page, NR_ANON_THPS);
    } else {
        /* Anon THP always mapped first with PMD */
        VM_BUG_ON_PAGE(PageTransCompound(page), page);
        /* increment count (starts at -1) */
        atomic_set(&page->_mapcount, 0);
    }
    __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
    __page_set_anon_rmap(page, vma, address, 1);
}

/**
 * __page_set_anon_rmap - set up new anonymous rmap
 * @page: Page or Hugepage to add to rmap
 * @vma: VM area to add page to.
 * @address: User virtual address of the mapping
 * @exclusive: the page is exclusively owned by the current process
 */
static void __page_set_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address, int exclusive)
{
    struct anon_vma *anon_vma = vma->anon_vma;

    BUG_ON(!anon_vma);

    if (PageAnon(page))
        return;

    /*
     * If the page isn't exclusively mapped into this vma,
     * we must use the _oldest_ possible anon_vma for the
     * page mapping!
     */
    if (!exclusive)
        anon_vma = anon_vma->root;

    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    page->mapping = (struct address_space *) anon_vma;
    page->index = linear_page_index(vma, address);
}
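Function description:
- Only the non-compound page case is considered here: page->_mapcount is set to 0, meaning that exactly one process PTE maps the page.
- __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr) increases the NR_ANON_MAPPED statistic (a counter, not a reference count). NR_ANON_MAPPED backs the AnonPages field of /proc/meminfo, that is, the amount of mapped anonymous memory.
- __page_set_anon_rmap sets the page's mapping field (the anon_vma pointer with the PAGE_MAPPING_ANON bit set) and its index field.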
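Two of the values set here are easy to misread, so a small worked sketch may help. It is a user-space illustration only: `model_vma` and both helpers are hypothetical, 4 KiB pages are assumed, and the index formula follows the usual non-hugetlb form of linear_page_index (page offset inside the vma plus vma->vm_pgoff). For private anonymous mappings the kernel normally derives vm_pgoff from the start address, so the test values below use vm_pgoff = 0 purely to keep the arithmetic readable.

```c
#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical, simplified vma: only the fields this sketch needs. */
struct model_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff;
};

/* _mapcount starts at -1, so the number of mapping PTEs is _mapcount + 1. */
static int model_page_mapcount(int raw_mapcount)
{
    return raw_mapcount + 1;
}

/* Page offset of address inside the vma, plus the vma's own page offset. */
static unsigned long model_linear_page_index(const struct model_vma *vma,
                                             unsigned long address)
{
    unsigned long pgoff = (address - vma->vm_start) >> MODEL_PAGE_SHIFT;

    return pgoff + vma->vm_pgoff;
}
```

With these definitions a raw _mapcount of -1 means "not mapped" and 0 means "mapped by one PTE", and every address inside the fourth page of the vma yields index 3 when vm_pgoff is 0.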
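An important kernel patch behind adding anonymous pages to the inactive LRU
A key point above: in the 5.9 source a newly created anon page is added to the inactive anon LRU list. In earlier kernels, such as 4.14, a new anon page was added to the active anon LRU. This difference deserves attention. The change was introduced by the following patch series:
[PATCH v7 0/6] workingset protection/detection on the anonymous LRU list
Explanation: the kernel changed this because a system can generate a large number of anon pages that are used only once. Adding them to the active list makes it grow excessively and upsets the active : inactive balance of the LRU lists. When the balance is off, page aging calls shrink_active_list, so these used-once anon pages push genuinely hot pages out of the active anon LRU and down to the inactive anon LRU. With this patch series, newly created anon pages are added to the inactive anon LRU instead.
Every change has a cost, and the patch series names its own: an anon page on the inactive anon LRU is easier to swap out and free. For example, a page whose re-access interval is larger than the inactive list but smaller than active + inactive is swapped out before it is used again. The kernel's workingset refault-distance algorithm exists to solve exactly this problem. Originally the kernel applied it only to file-backed pages, so only they were protected. In 5.9 anon pages are protected by it as well, which is what makes it safe to add newly created anon pages to the inactive anon LRU.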
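When anonymous pages are reclaimed
1. used-once
If an anonymous page is used only once and, as described above, sits on the inactive anon LRU, it takes two aging passes before it is freed. This is the "second chance" policy at work: a page is given a second chance rather than being freed at the first sign of pressure. In other words, a used-once anon page is freed only after two passes of shrink_page_list:
First shrink: page_referenced clears the accessed bit left in the PTE by the instantiating fault (referenced_ptes = 1), PG_referenced is set, and page_check_references returns PAGEREF_KEEP.
Second shrink: no PTE reference has appeared since the first pass, so PG_referenced is cleared, page_check_references returns PAGEREF_RECLAIM, and the page can be reclaimed.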
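2. Accessed multiple times
Case 1: short access interval - the page moves to the active anon LRU
The anon page is currently on the inactive anon LRU list. What drives it between inactive and active is also page aging (this point is very important): if memory stays plentiful and reclaim never runs, the anon page stays on the inactive list indefinitely. Only when memory gets tight and page reclaim is triggered does the kernel decide the page's fate: reclaim it, keep it on the inactive list, or move it to the active list.
Given that, when the page is re-accessed its PTE accessed bit is set again. On the next scan referenced_ptes is nonzero and PG_referenced is still set from the previous scan, so page_check_references returns PAGEREF_ACTIVATE and the anon page moves to the active anon LRU list.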
static enum page_references page_check_references(struct page *page,
                          struct scan_control *sc)
{
    int referenced_ptes, referenced_page;
    unsigned long vm_flags;

    referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                      &vm_flags);
    referenced_page = TestClearPageReferenced(page);

    if (referenced_ptes) {
        /*
         * All mapped pages start out with page table
         * references from the instantiating fault, so we need
         * to look twice if a mapped file page is used more
         * than once.
         *
         * Mark it and spare it for another trip around the
         * inactive list. Another page table reference will
         * lead to its activation.
         *
         * Note: the mark is set for activated pages as well
         * so that recently deactivated but used pages are
         * quickly recovered.
         */
        SetPageReferenced(page);

        // A re-accessed page takes this branch
        if (referenced_page || referenced_ptes > 1)
            return PAGEREF_ACTIVATE;

        /*
         * Activate file-backed executable pages after first usage.
         */
        if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
            return PAGEREF_ACTIVATE;

        return PAGEREF_KEEP;
    }

    /* Reclaim if clean, defer dirty pages to writeback */
    if (referenced_page && !PageSwapBacked(page))
        return PAGEREF_RECLAIM_CLEAN;

    return PAGEREF_RECLAIM;
}
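The used-once and re-accessed cases can both be traced with a small user-space model of this decision function. It is a sketch under stated assumptions rather than kernel code: `model_anon_page`, `model_touch` and `model_check_references` are hypothetical names, the page is assumed to be an anonymous (swap-backed) page mapped by a single PTE, and page_referenced is reduced to reading and clearing one counter.

```c
#include <stdbool.h>

enum model_page_references {
    MODEL_PAGEREF_RECLAIM,
    MODEL_PAGEREF_RECLAIM_CLEAN,   /* unused here: never applies to anon pages */
    MODEL_PAGEREF_KEEP,
    MODEL_PAGEREF_ACTIVATE,
};

/* Hypothetical, simplified anonymous page. */
struct model_anon_page {
    int young_ptes;       /* number of PTEs whose accessed bit is set */
    bool pg_referenced;   /* models the PG_referenced page flag */
};

/* A user-space access sets the accessed bit in the single mapping PTE. */
static void model_touch(struct model_anon_page *page)
{
    page->young_ptes = 1;
}

/* One reclaim scan over the page, following the branches shown above. */
static enum model_page_references
model_check_references(struct model_anon_page *page)
{
    /* page_referenced(): count the young PTEs and clear their accessed bits */
    int referenced_ptes = page->young_ptes;
    /* TestClearPageReferenced(): read the flag, then clear it */
    bool referenced_page = page->pg_referenced;

    page->young_ptes = 0;
    page->pg_referenced = false;

    if (referenced_ptes) {
        page->pg_referenced = true;   /* SetPageReferenced() */
        if (referenced_page || referenced_ptes > 1)
            return MODEL_PAGEREF_ACTIVATE;
        return MODEL_PAGEREF_KEEP;
    }

    /* Anonymous pages are swap-backed, so the RECLAIM_CLEAN branch is skipped. */
    return MODEL_PAGEREF_RECLAIM;
}
```

A page touched only by its instantiating fault yields KEEP on the first scan and RECLAIM on the second. A page touched again between the two scans yields KEEP and then ACTIVATE.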
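Case 2: long access interval - the refault distance algorithm decides whether the page goes to the inactive or the active list
If the access interval is long, two aging passes reclaim the anon page (on Android, reclaiming an anon page means putting it into swap, that is, compressing it into zram). A page fault on a page that has already been reclaimed is called a refault. On a refault the kernel works out how many pages were aged between the moment the page was reclaimed and the moment it was accessed again (the refault distance). Call this number num; the page's full access interval is then roughly the inactive anon list size + num:
- num > active anon list size (access interval > inactive anon LRU + active anon LRU): the page would not have stayed in memory even with both lists available, so it is added to the inactive anon LRU.
- num <= active anon list size (inactive anon LRU < access interval <= inactive anon LRU + active anon LRU): the page is moved to the active anon LRU, which makes it less likely to be reclaimed again.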
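The refault decision can be sketched in a few lines of user-space C. This is a simplified model, not the code in mm/workingset.c: `model_lruvec`, `model_evict` and `model_refault_should_activate` are hypothetical names, a single age counter stands in for the kernel's aging bookkeeping, and the comparison follows the rule stated in the kernel's workingset comments, namely that a refaulting page is activated when its refault distance does not exceed the size of the active list.

```c
#include <stdbool.h>

/* Hypothetical, simplified LRU state. */
struct model_lruvec {
    unsigned long age;         /* incremented on every eviction */
    unsigned long nr_active;   /* pages currently on the active list */
};

/* Evict one page: advance the age and return the snapshot that the
 * kernel would remember in the page's shadow entry. */
static unsigned long model_evict(struct model_lruvec *lruvec)
{
    return ++lruvec->age;
}

/* On refault: how many evictions happened since this page was evicted,
 * and would the page have survived if the active list's space had been
 * available to it? */
static bool model_refault_should_activate(const struct model_lruvec *lruvec,
                                          unsigned long shadow)
{
    unsigned long refault_distance = lruvec->age - shadow;

    return refault_distance <= lruvec->nr_active;
}
```

With an active list of 100 pages, a page that refaults after 50 further evictions is activated, while one that refaults after 150 further evictions goes back to the inactive list.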
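How anonymous pages are swapped out to disk or zram
When memory runs low and reclaim is triggered, shrink_page_list in vmscan.c reclaims the pages that are suitable for reclaim. Anonymous pages among them are written to swap (a disk partition, or zram compressed memory) through pageout.
pageout is a jack of all trades: it can swap anonymous pages out to a swap area on disk or in zram, write file-backed page cache pages back to disk, and write the physical pages of shmem shared memory to swap. It can do all this because of its struct address_space parameter, mapping. To swap a page out or write it back, pageout ends up calling mapping->a_ops->writepage: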
vmscan.c:static pageout_t pageout(struct page *page, struct address_space *mapping)
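The type of mapping, address_space, is usually translated literally as "address space", which is obscure and hard to grasp. The source code's own description of the structure is more helpful: an address_space (the mapping parameter) is essentially a collection of cached objects.
So how do we get the address_space of a page? Through the page_mapping function, which handles the following cases: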
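1. A slab page: returns NULL.
2. A PageSwapCache page: returns swap_address_space. What matters here is how the a_ops of swap_address_space is set, which happens in init_swap_address_space in swap_state.c. Note: PageSwapCache means the page is in the swap cache, that is, add_to_swap_cache has been called on it.
The address_space_operations for swap is swap_aops, and its writepage function is swap_writepage. So for an anonymous page in the swap cache, the mapping->a_ops->writepage call that writes it to disk or zram is swap_writepage.
3. A file-backed page cache page or a shmem page: returns page->mapping directly.
file-backed page cache: assuming the file lives on ext4, mapping->a_ops->writepage ends up in ext4_writepage.
shmem page: its mapping->a_ops is shmem_aops in shmem.c, set up in:
shmem.c: shmem_get_inode
After a shmem page is created it has to be added to the cache of its mapping, and page->mapping is set at the same time. That mapping is inode->i_mapping, so its a_ops is shmem_aops.
4. An anonymous page that has not yet been added to the swap cache satisfies mapping & PAGE_MAPPING_ANON; in this case page_mapping returns NULL.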
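The four cases, and the way pageout reaches different writeback routines through one function pointer, can be modelled in user space. The sketch below is an illustration only and is not the kernel's page_mapping or pageout: every `wb_*` name is hypothetical, the writepage callbacks just report which backend they stand for, and the page flags are reduced to two booleans. In the kernel, an anonymous page is first added to the swap cache by shrink_page_list before pageout is called, which is why case 4 resolves to no mapping here until `swapcache` is set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON    0x1UL
#define PAGE_MAPPING_MOVABLE 0x2UL
#define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Which backend a writepage call ended up in (illustration only). */
enum wb_target { WB_NONE, WB_SWAP, WB_FILE, WB_SHMEM };

struct wb_page;

/* Hypothetical stand-in for address_space_operations. */
struct wb_aops {
    enum wb_target (*writepage)(struct wb_page *page);
};

/* Hypothetical stand-in for struct address_space. */
struct wb_address_space {
    const struct wb_aops *a_ops;
};

/* Hypothetical, simplified page. */
struct wb_page {
    bool slab;        /* models PageSlab() */
    bool swapcache;   /* models PageSwapCache() */
    void *mapping;    /* models page->mapping, low bits may carry flags */
};

/* Stand-ins for swap_writepage, a filesystem writepage, and shmem's. */
static enum wb_target wb_swap_writepage(struct wb_page *page)  { (void)page; return WB_SWAP; }
static enum wb_target wb_file_writepage(struct wb_page *page)  { (void)page; return WB_FILE; }
static enum wb_target wb_shmem_writepage(struct wb_page *page) { (void)page; return WB_SHMEM; }

static const struct wb_aops wb_swap_aops  = { wb_swap_writepage };
static const struct wb_aops wb_file_aops  = { wb_file_writepage };
static const struct wb_aops wb_shmem_aops = { wb_shmem_writepage };

/* Single swap address space for the model. */
static struct wb_address_space wb_swapper_space = { &wb_swap_aops };

/* The four cases listed above, in the same order. */
static struct wb_address_space *wb_page_mapping(struct wb_page *page)
{
    uintptr_t raw = (uintptr_t)page->mapping;

    if (page->slab)
        return NULL;                       /* case 1 */
    if (page->swapcache)
        return &wb_swapper_space;          /* case 2 */
    if (raw & PAGE_MAPPING_ANON)
        return NULL;                       /* case 4 */
    return (struct wb_address_space *)(raw & ~PAGE_MAPPING_FLAGS); /* case 3 */
}

/* Dispatch through mapping->a_ops->writepage, as pageout does. */
static enum wb_target wb_pageout(struct wb_page *page)
{
    struct wb_address_space *mapping = wb_page_mapping(page);

    if (!mapping || !mapping->a_ops || !mapping->a_ops->writepage)
        return WB_NONE;
    return mapping->a_ops->writepage(page);
}
```

The caller never needs to know what kind of page it holds: the mapping it resolves to decides whether the write goes to swap, to a filesystem, or to shmem's own handler.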
[Figure: flow chart of an anonymous page being written to zram]