This analysis is based on Linux kernel 4.19.195.
Recently I ran into a question I had never thought about before:
how is copy-on-write (COW) handled for hugetlbfs huge pages?
My first guess was that, following the usual COW logic, the kernel would allocate a new huge page for whichever of the parent or child triggers the first write. But is that really what happens? And if it is, what happens when the value we wrote to /sys/kernel/mm/hugepages/hugepages-xxxkB/nr_hugepages beforehand is not large enough?
Let me state the conclusions first, then go through the analysis:
- For a mapping created with MAP_SHARED there is no COW: the page table entries are set up during fork(), so after fork() the child does not take a page fault when it accesses the range the parent has already populated.
- For a mapping created without MAP_SHARED (i.e. MAP_PRIVATE), the parent keeps using the huge pages it originally allocated, and right after fork() the child maps the same huge pages. As soon as either process writes to the range, COW is triggered, exactly as with normal pages: if the huge page allocation succeeds, one process ends up with a newly allocated huge page while the other keeps using the old one. (A minimal userspace sketch reproducing both cases follows this list.)
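The sketch below uses MAP_ANONYMOUS | MAP_HUGETLB for brevity, which goes through the same hugetlb fault paths as mapping a file on hugetlbfs. It assumes a 2 MB default huge page size and at least two pages in nr_hugepages; everything in it is illustrative, not taken from the kernel source.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024) /* assumes 2 MB default huge pages */

int main(int argc, char **argv)
{
	/* run as "./a.out shared" for case 1, with no argument for case 2 (MAP_PRIVATE) */
	int shared = (argc > 1 && strcmp(argv[1], "shared") == 0);
	int flags = MAP_ANONYMOUS | MAP_HUGETLB | (shared ? MAP_SHARED : MAP_PRIVATE);

	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap"); /* usually means nr_hugepages is too small */
		return 1;
	}

	p[0] = 'P'; /* parent faults the huge page in before fork() */

	pid_t pid = fork();
	if (pid == 0) {
		p[0] = 'C'; /* child write: triggers hugetlb COW only for MAP_PRIVATE */
		printf("child  sees '%c'\n", p[0]);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	/* MAP_SHARED: parent sees 'C' (same huge page);
	 * MAP_PRIVATE: parent still sees 'P' (the child got its own huge page via COW) */
	printf("parent sees '%c'\n", p[0]);
	return 0;
}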
Let's start with case 1, following the code.
fork() goes through the following call path:
_do_fork()->copy_process()->copy_mm()->dup_mm()->dup_mmap()->copy_page_range()->copy_hugetlb_page_range()
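The last hop of that chain is the hugetlb check near the top of copy_page_range() in mm/memory.c (condensed, only the relevant lines shown):

int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
		struct vm_area_struct *vma)
{
	...
	if (is_vm_hugetlb_page(vma))
		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
	...
}

And copy_hugetlb_page_range() itself: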
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma)
{
pte_t *src_pte, *dst_pte, entry, dst_entry;
struct page *ptepage;
unsigned long addr;
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
break;
}
/*
* If the pagetables are shared don't copy or take references.
* dst_pte == src_pte is the common case of src/dest sharing.
*
* However, src could have 'unshared' and dst shares with
* another vma. If dst_pte !none, this implies sharing.
* Check here before taking page table lock, and once again
* after taking the lock below.
*/
dst_entry = huge_ptep_get(dst_pte);
if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
continue;
dst_ptl = huge_pte_lock(h, dst, dst_pte);
src_ptl = huge_pte_lockptr(h, src, src_pte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
dst_entry = huge_ptep_get(dst_pte);
if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
/*
* Skip if src entry none. Also, skip in the
* unlikely case dst entry !none as this implies
* sharing with another vma.
*/
;
} else if (unlikely(is_hugetlb_entry_migration(entry) ||
is_hugetlb_entry_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);
if (is_write_migration_entry(swp_entry) && cow) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
*/
make_migration_entry_read(&swp_entry);
entry = swp_entry_to_pte(swp_entry);
set_huge_swap_pte_at(src, addr, src_pte,
entry, sz);
}
set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
} else {
if (cow) {
/*
* No need to notify as we are downgrading page
* table protection not changing it to point
* to a new page.
*
* See Documentation/vm/mmu_notifier.rst
*/
huge_ptep_set_wrprotect(src, addr, src_pte); // write-protect the parent's last-level page table entry that maps the huge page
}
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
get_page(ptepage);
page_dup_rmap(ptepage, true);
set_huge_pte_at(dst, addr, dst_pte, entry);
hugetlb_count_add(pages_per_huge_page(h), dst);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
}
if (cow)
mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
return ret;
}
As the code shows, at fork() time the kernel decides whether COW will be needed later based on whether the mapping was created with MAP_SHARED (reflected in vma->vm_flags as VM_SHARED). If COW is needed, huge_ptep_set_wrprotect() first marks the parent's page table entry write-protected, and that entry is then copied into the child's page table, so both parent and child end up with write-protected entries for this range. If COW is not needed, both parent and child keep entries that are readable and writable, which gives us the conclusion for case 1.
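To make the cow condition concrete, this is how it evaluates for the two mapping types (illustrative; the exact set of other bits in vm_flags depends on the mmap() arguments):

/*
 * cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 *
 * MAP_SHARED mapping:  vm_flags has VM_SHARED set (and normally VM_MAYWRITE too),
 *                      so the masked value != VM_MAYWRITE  -> cow = 0
 * MAP_PRIVATE mapping: vm_flags has VM_MAYWRITE set but VM_SHARED clear,
 *                      so the masked value == VM_MAYWRITE  -> cow = 1
 */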
Now for case 2.
From the analysis of case 1 we know that for a mapping created without MAP_SHARED, the page table entries of both parent and child are write-protected, so a write access triggers COW. In the page fault path, handle_mm_fault() dispatches hugetlb VMAs to hugetlb_fault(), which ends up calling hugetlb_cow().
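For reference, the check in hugetlb_fault() that routes the write fault into hugetlb_cow() looks roughly like this (condensed, not a line-for-line copy of 4.19):

/* in hugetlb_fault(): a write fault on a huge PTE that is not writable
 * (i.e. the write-protected entry set up at fork time) goes to hugetlb_cow() */
if (flags & FAULT_FLAG_WRITE) {
	if (!huge_pte_write(entry)) {
		ret = hugetlb_cow(mm, vma, address, ptep,
				pagecache_page, ptl);
		goto out_put_page;
	}
	entry = huge_pte_mkdirty(entry);
}

Here is hugetlb_cow() itself: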
/*
* Hugetlb_cow() should be called with page lock of the original hugepage held.
* Called with hugetlb_instantiation_mutex held and pte_page locked so we
* cannot race with other handlers or page migration.
* Keep the pte_same checks anyway to make transition from the mutex easier.
*/
static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
struct page *pagecache_page, spinlock_t *ptl)
{
pte_t pte;
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
int outside_reserve = 0;
vm_fault_t ret = 0;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
unsigned long haddr = address & huge_page_mask(h);
pte = huge_ptep_get(ptep);
old_page = pte_page(pte);
retry_avoidcopy:
/* If no-one else is actually using this page, avoid the copy
* and just make the page writable */
if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { // only one mapping left && anonymous page: just make the existing PTE writable, no copy needed
page_move_anon_rmap(old_page, vma);
set_huge_ptep_writable(vma, haddr, ptep);
return 0;
}
/*
* If the process that created a MAP_PRIVATE mapping is about to
* perform a COW due to a shared page count, attempt to satisfy
* the allocation without using the existing reserves. The pagecache
* page is used to determine if the reserve at this address was
* consumed or not. If reserves were used, a partial faulted mapping
* at the time of fork() could consume its reserves on COW instead
* of the full address range.
*/
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
old_page != pagecache_page)
outside_reserve = 1;
get_page(old_page);
/*
* Drop page table lock as buddy allocator may be called. It will
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(ptl);
new_page = alloc_huge_page(vma, haddr, outside_reserve); // allocate a new huge page
if (IS_ERR(new_page)) { // allocation failed
/*
* If a process owning a MAP_PRIVATE mapping fails to COW,
* it is due to references held by a child and an insufficient
* huge page pool. To guarantee the original mappers
* reliability, unmap the page from child processes. The child
* may get SIGKILLed if it later faults.
*/
if (outside_reserve) {
put_page(old_page);
BUG_ON(huge_pte_none(pte));
unmap_ref_private(mm, vma, old_page, haddr);
BUG_ON(huge_pte_none(pte));
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
/*
* race occurs while re-acquiring page table
* lock, and our job is done.
*/
return 0;
}
ret = vmf_error(PTR_ERR(new_page));
goto out_release_old;
}
/*
* When the original hugepage is shared one, it does not have
* anon_vma prepared.
*/
if (unlikely(anon_vma_prepare(vma))) {
ret = VM_FAULT_OOM;
goto out_release_all;
}
copy_user_huge_page(new_page, old_page, address, vma, // copy the old huge page's contents into the new one
pages_per_huge_page(h));
__SetPageUptodate(new_page);
mmun_start = haddr;
mmun_end = mmun_start + huge_page_size(h);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
*/
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
ClearPagePrivate(new_page);
/* Break COW */
huge_ptep_clear_flush(vma, haddr, ptep);
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
set_huge_pte_at(mm, haddr, ptep, // map the faulting address to the new page
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_page_huge_active(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out_release_all:
restore_reserve_on_error(h, vma, haddr, new_page);
put_page(new_page);
out_release_old:
put_page(old_page);
spin_lock(ptl); /* Caller expects lock to be held */
return ret;
}
The first process that writes to the page after fork() takes a page fault. hugetlb_cow() allocates a new huge page; once the allocation succeeds it calls copy_user_huge_page() to copy the contents of the old huge page into the new one, then sets up the new page (rmap, remapping the faulting process's PTE to it as writable) and returns.
The second process that writes to the page after fork() also takes a page fault, but note the code at the retry_avoidcopy label: the mapcount is checked there, and if only one process still maps the page, all that is needed is to remove the write protection (i.e. make the entry writable) and the function can return without copying anything.
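A quick walkthrough of the mapcount values involved, assuming the parent writes first (the roles are symmetric):

/*
 * MAP_PRIVATE huge page old_page, parent and child after fork():
 *
 *   after fork():       page_mapcount(old_page) == 2, both PTEs write-protected
 *   parent writes:      mapcount == 2 -> alloc_huge_page() + copy_user_huge_page();
 *                       the parent now maps new_page, and page_remove_rmap()
 *                       drops old_page's mapcount to 1
 *   child writes later: page_mapcount(old_page) == 1 && PageAnon(old_page)
 *                       -> no copy, just set_huge_ptep_writable()
 */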
If the huge page allocation fails, there are two cases:
- If the process that took the page fault is the one that created the private mapping, the page is unmapped from all child processes and the HPAGE_RESV_UNMAPPED flag is set in the vm_private_data of the child's virtual memory area, so a child will be killed if it later takes a page fault on that range.
- If the faulting process is not the one that created the private mapping, an error is returned.
So, coming back to the question at the beginning: if nr_hugepages is set too low, the COW allocation for a MAP_PRIVATE mapping can fail, and the child process may end up being killed when it touches that memory.