x86 Kernel Page Fault Handling Flow

https://www.cnblogs.com/arnoldlu/p/8335475.html

https://book.aikaiyuan.com/kernel/6.5.3.htm

A few related pages, recorded here for reference.

The IDT entry for Page Fault

idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1

idtentry is an assembly macro.

/**
 * idtentry - Generate an IDT entry stub
 * @sym:		Name of the generated entry point
 * @do_sym:		C function to be called
 * @has_error_code:	True if this IDT vector has an error code on the stack
 * @paranoid:		non-zero means that this vector may be invoked from
 *			kernel mode with user GSBASE and/or user CR3.
 *			2 is special -- see below.
 * @shift_ist:		Set to an IST index if entries from kernel mode should
 *			decrement the IST stack so that nested entries get a
 *			fresh stack.  (This is for #DB, which has a nasty habit
 *			of recursing.)
 * @create_gap:		create a 6-word stack gap when coming from kernel mode.
 * @read_cr2:		load CR2 into the 3rd argument; done before calling any C code
 *
 * idtentry generates an IDT stub that sets up a usable kernel context,
 * creates struct pt_regs, and calls @do_sym.  The stub has the following
 * special behaviors:
 *
 * On an entry from user mode, the stub switches from the trampoline or
 * IST stack to the normal thread stack.  On an exit to user mode, the
 * normal exit-to-usermode path is invoked.
 *
 * On an exit to kernel mode, if @paranoid == 0, we check for preemption,
 * whereas we omit the preemption check if @paranoid != 0.  This is purely
 * because the implementation is simpler this way.  The kernel only needs
 * to check for asynchronous kernel preemption when IRQ handlers return.
 *
 * If @paranoid == 0, then the stub will handle IRET faults by pretending
 * that the fault came from user mode.  It will handle gs_change faults by
 * pretending that the fault happened with kernel GSBASE.  Since this handling
 * is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have
 * @paranoid == 0.  This special handling will do the wrong thing for
 * espfix-induced #DF on IRET, so #DF must not use @paranoid == 0.
 *
 * @paranoid == 2 is special: the stub will never switch stacks.  This is for
 * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
 */
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 read_cr2=0
...
.endm

Execution eventually reaches 【do_page_fault】.

do_page_fault

dotraplinkage void
do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)

This function performs context tracking (CONFIG_CONTEXT_TRACKING) and page-fault tracing (CONFIG_TRACING), then calls 【__do_page_fault】.

__do_page_fault

/*
 * Explicitly marked noinline such that the function tracer sees this as the
 * page_fault entry point.
 */
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
		unsigned long address)

【__do_page_fault】 checks whether the faulting linear address lies in the kernel-controlled portion of the address space. Faults on kernel addresses are uncommon, but when one occurs, 【do_kern_addr_fault】 handles it.

I am more interested in faults on the non-kernel portion of the address space, which are handled by 【do_user_addr_fault】.

do_user_addr_fault

/* Handle faults in the user portion of the address space */
static inline
void do_user_addr_fault(struct pt_regs *regs,
			unsigned long hw_error_code,
			unsigned long address)
{
	...
	/* kprobes don't want to hook the spurious faults: */
	...
	/*
	 * Reserved bits are never expected to be set on
	 * entries in the user portion of the page tables.
	 */
	...
	/*
	 * If SMAP is on, check for invalid kernel (supervisor) access to user
	 * pages in the user address space.  The odd case here is WRUSS,
	 * which, according to the preliminary documentation, does not respect
	 * SMAP and will have the USER bit set so, in all cases, SMAP
	 * enforcement appears to be consistent with the USER bit.
	 */
	...
	/*
	 * If we're in an interrupt, have no user context or are running
	 * in a region with pagefaults disabled then we must not take the fault
	 */
	...
	/*
	 * It's safe to allow irq's after cr2 has been saved and the
	 * vmalloc fault has been handled.
	 *
	 * User-mode registers count as a user access even for any
	 * potential system fault or CPU buglet:
	 */
	...
#ifdef CONFIG_X86_64
	/*
	 * Faults in the vsyscall page might need emulation.  The
	 * vsyscall page is at a high address (>PAGE_OFFSET), but is
	 * considered to be part of the user address space.
	 *
	 * The vsyscall page does not have a "real" VMA, so do this
	 * emulation before we go searching for VMAs.
	 *
	 * PKRU never rejects instruction fetches, so we don't need
	 * to consider the PF_PK bit.
	 */
	...
#endif

	/*
	 * Kernel-mode access to the user address space should only occur
	 * on well-defined single instructions listed in the exception
	 * tables.  But, an erroneous kernel fault occurring outside one of
	 * those areas which also holds mmap_sem might deadlock attempting
	 * to validate the fault against the address space.
	 *
	 * Only do the expensive exception table search when we might be at
	 * risk of a deadlock.  This happens if we
	 * 1. Failed to acquire mmap_sem, and
	 * 2. The access did not originate in userspace.
	 */
	...
	/*
	 * Ok, we have a good vm_area for this memory access, so
	 * we can handle it..
	 */
good_area:
	...
	/*
	 * If for any reason at all we couldn't handle the fault,
	 * make sure we exit gracefully rather than endlessly redo
	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
	 *
	 * Note that handle_userfault() may also release and reacquire mmap_sem
	 * (and not return with VM_FAULT_RETRY), when returning to userland to
	 * repeat the page fault later with a VM_FAULT_NOPAGE retval
	 * (potentially after handling any pending signal during the return to
	 * userland). The return to userland is identified whenever
	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
	 */
	...
	/*
	 * If we need to retry the mmap_sem has already been released,
	 * and if there is a fatal signal pending there is no guarantee
	 * that we made any progress. Handle this case first.
	 */
	...
	/*
	 * Major/minor page fault accounting. If any of the events
	 * returned VM_FAULT_MAJOR, we account it as a major fault.
	 */
	...
}
NOKPROBE_SYMBOL(do_user_addr_fault);

First, check whether the page fault occurred inside a kprobe; if so, try the fault_handler registered with the kprobe, and fall through to the regular page-fault path if that handler does not resolve it.

Reserved bits must never be set in the error code for faults on entries in the user portion of the page tables.

If SMAP is enabled and kernel-mode code is caught accessing a user address, go to 【bad_area_nosemaphore】.

If there is no usable user context, or page-fault handling is currently disabled (e.g. we are in an interrupt or a pagefault_disable() region), go to 【bad_area_nosemaphore】.

If the fault came from user mode (determined from the CS register saved in the trapped context), enable local interrupts (ultimately via the STI instruction) and set FAULT_FLAG_USER in the local flags variable (which starts out with FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_KILLABLE set).

/*
 * Page fault error code bits:
 *
 *   bit 0 ==	 0: no page found	1: protection fault
 *   bit 1 ==	 0: read access		1: write access
 *   bit 2 ==	 0: kernel-mode access	1: user-mode access
 *   bit 3 ==				1: use of reserved bit detected
 *   bit 4 ==				1: fault was an instruction fetch
 *   bit 5 ==				1: protection keys block access
 */
enum x86_pf_error_code {
	X86_PF_PROT	=		1 << 0,
	X86_PF_WRITE	=		1 << 1,
	X86_PF_USER	=		1 << 2,
	X86_PF_RSVD	=		1 << 3,
	X86_PF_INSTR	=		1 << 4,
	X86_PF_PK	=		1 << 5,
};
#define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
#define FAULT_FLAG_MKWRITE	0x02	/* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED	0x20	/* Second try */
#define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
#define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
/**
 * enum vm_fault_reason - Page fault handlers return a bitmask of
 * these values to tell the core VM what happened when handling the
 * fault. Used to decide whether a process gets delivered SIGBUS or
 * just gets major/minor fault counters bumped up.
 *
 * @VM_FAULT_OOM:		Out Of Memory
 * @VM_FAULT_SIGBUS:		Bad access
 * @VM_FAULT_MAJOR:		Page read from storage
 * @VM_FAULT_WRITE:		Special case for get_user_pages
 * @VM_FAULT_HWPOISON:		Hit poisoned small page
 * @VM_FAULT_HWPOISON_LARGE:	Hit poisoned large page. Index encoded
 *				in upper bits
 * @VM_FAULT_SIGSEGV:		segmentation fault
 * @VM_FAULT_NOPAGE:		->fault installed the pte, not return page
 * @VM_FAULT_LOCKED:		->fault locked the returned page
 * @VM_FAULT_RETRY:		->fault blocked, must retry
 * @VM_FAULT_FALLBACK:		huge page fault failed, fall back to small
 * @VM_FAULT_DONE_COW:		->fault has fully handled COW
 * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
 *				fsync() to complete (for synchronous page faults
 *				in DAX)
 * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
 *
 */

If the fault came from kernel mode, local interrupts are likewise enabled (STI), provided they were enabled in the interrupted context.

Based on the hardware page-fault error code, set FAULT_FLAG_WRITE and FAULT_FLAG_INSTRUCTION.

On x86-64, faults in the vsyscall page (a legacy syscall implementation mapped into user space) need emulation, because that page has no real VMA behind it.

Take mmap_sem for reading, so the current context can read the mmap (the VMA list).

Find the VMA for the faulting address; check that it is non-NULL and that the address falls within it.

If the address is below the VMA's start but the VMA is the stack (VM_GROWSDOWN), simply grow the VMA downward (i.e. extend the stack's VMA).

【access_error】 checks whether the access is permitted by the VMA (e.g. whether it is writable or executable).

【handle_mm_fault】 handles the fault.

If the attempt failed but a retry might succeed and retrying is allowed, retry once. If retrying is not allowed, either return to user space (for a user-mode fault) or let the kernel attempt a kernel-mode fixup in 【no_context】 (and OOPS/SIGBUS if that also fails). For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.

If it fails completely, report the error.

handle_mm_fault

/*
 * By the time we get here, we already hold the mm semaphore
 *
 * The mmap_sem may have been released depending on flags and our
 * return value.  See filemap_fault() and __lock_page_or_retry().
 */
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
		unsigned int flags)

Mark the current task as TASK_RUNNING.

Count a PGFAULT event.

Sync the task's cached RSS counters into the mm.

If the faulting page is a huge TLB page, 【hugetlb_fault】 handles it; otherwise 【__handle_mm_fault】 does.

__handle_mm_fault

/*
 * By the time we get here, we already hold the mm semaphore
 *
 * The mmap_sem may have been released depending on flags and our
 * return value.  See filemap_fault() and __lock_page_or_retry().
 */
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
		unsigned long address, unsigned int flags)

Create the page-table entries at each level for the faulting address (allocating pages for the tables via 【alloc_pages】): pgd->p4d->pud->pmd. The huge-page path differs slightly.

Faults at the PTE level are handled in 【handle_pte_fault】.

handle_pte_fault

/*
 * These routines also need to handle stuff like marking pages dirty
 * and/or accessed for architectures that don't do it in hardware (most
 * RISC architectures).  The early dirtying is also good on the i386.
 *
 * There is also a hook called "update_mmu_cache()" that architectures
 * with external mmu caches can use to update those (ie the Sparc or
 * PowerPC hashed page tables that act as extended TLBs).
 *
 * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
 * concurrent faults).
 *
 * The mmap_sem may have been released depending on flags and our return value.
 * See filemap_fault() and __lock_page_or_retry().
 */
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)

Build the PTE and resolve the PTE-level fault.

The table below comes from the post "Linux内存管理 (10)缺页中断处理".

Scenario                          Fault type              Handler
Page not present (PTE empty):
  VMA has vm_ops                  file-backed fault       do_fault
  VMA has no vm_ops               anonymous-page fault    do_anonymous_page
PTE non-empty:
  page swapped out to swap        swap-in fault           do_swap_page
  page present in memory          copy-on-write           do_wp_page

The code structure changed slightly in kernel 5.4, as follows.

Condition                                      Fault type            Handler
PTE empty:
  VMA is anonymous (anonymous mapping)         anonymous-page fault  do_anonymous_page
  VMA is not anonymous (normal file mapping)   file-backed fault     do_fault
PTE non-empty:
  PTE not present (P bit 0, PROTNONE bit 0)    page was swapped out  do_swap_page (fails if the page
                                                                     is not in swap either)
  PTE present (P bit 1 or PROTNONE bit 1):
    PROTNONE set and VMA is accessible         NUMA hinting fault    do_numa_page (look on other
                                                                     NUMA nodes)
    write access with RW bit 0                 copy-on-write         do_wp_page (sets the Dirty bit)

Faults that handle_pte_fault cannot resolve, or where handling fails, return to the 【do_user_addr_fault】 function mentioned earlier. The kernel then attempts a kernel-mode fixup, and OOPSes if that also fails. For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.

On the relationship between segmentation faults and page faults, see https://www.zhihu.com/question/358909046/answer/920302851

 
