Related reading:
https://www.cnblogs.com/arnoldlu/p/8335475.html
https://book.aikaiyuan.com/kernel/6.5.3.htm
The IDT entry for Page Fault
idtentry page_fault do_page_fault has_error_code=1 read_cr2=1
idtentry is an assembly macro.
/**
* idtentry - Generate an IDT entry stub
* @sym: Name of the generated entry point
* @do_sym: C function to be called
* @has_error_code: True if this IDT vector has an error code on the stack
* @paranoid: non-zero means that this vector may be invoked from
* kernel mode with user GSBASE and/or user CR3.
* 2 is special -- see below.
* @shift_ist: Set to an IST index if entries from kernel mode should
* decrement the IST stack so that nested entries get a
* fresh stack. (This is for #DB, which has a nasty habit
* of recursing.)
* @create_gap: create a 6-word stack gap when coming from kernel mode.
* @read_cr2: load CR2 into the 3rd argument; done before calling any C code
*
* idtentry generates an IDT stub that sets up a usable kernel context,
* creates struct pt_regs, and calls @do_sym. The stub has the following
* special behaviors:
*
* On an entry from user mode, the stub switches from the trampoline or
* IST stack to the normal thread stack. On an exit to user mode, the
* normal exit-to-usermode path is invoked.
*
* On an exit to kernel mode, if @paranoid == 0, we check for preemption,
* whereas we omit the preemption check if @paranoid != 0. This is purely
* because the implementation is simpler this way. The kernel only needs
* to check for asynchronous kernel preemption when IRQ handlers return.
*
* If @paranoid == 0, then the stub will handle IRET faults by pretending
* that the fault came from user mode. It will handle gs_change faults by
* pretending that the fault happened with kernel GSBASE. Since this handling
* is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have
* @paranoid == 0. This special handling will do the wrong thing for
* espfix-induced #DF on IRET, so #DF must not use @paranoid == 0.
*
* @paranoid == 2 is special: the stub will never switch stacks. This is for
* #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
*/
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 read_cr2=0
...
.endm
Execution eventually reaches 【do_page_fault】.
do_page_fault
dotraplinkage void
do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
This function performs context tracking (CONFIG_CONTEXT_TRACKING) and page-fault tracing (CONFIG_TRACING), then calls 【__do_page_fault】.
__do_page_fault
/*
* Explicitly marked noinline such that the function tracer sees this as the
* page_fault entry point.
*/
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
【__do_page_fault】 checks whether the faulting linear address lies in the kernel-controlled portion of the address space. Faults on kernel addresses are uncommon, but when one occurs it is handled by 【do_kern_addr_fault】.
I am more interested in faults on the user portion of the address space, which are handled by 【do_user_addr_fault】.
do_user_addr_fault
/* Handle faults in the user portion of the address space */
static inline
void do_user_addr_fault(struct pt_regs *regs,
unsigned long hw_error_code,
unsigned long address)
{
...
/* kprobes don't want to hook the spurious faults: */
...
/*
* Reserved bits are never expected to be set on
* entries in the user portion of the page tables.
*/
...
/*
* If SMAP is on, check for invalid kernel (supervisor) access to user
* pages in the user address space. The odd case here is WRUSS,
* which, according to the preliminary documentation, does not respect
* SMAP and will have the USER bit set so, in all cases, SMAP
* enforcement appears to be consistent with the USER bit.
*/
...
/*
* If we're in an interrupt, have no user context or are running
* in a region with pagefaults disabled then we must not take the fault
*/
...
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
*
* User-mode registers count as a user access even for any
* potential system fault or CPU buglet:
*/
...
#ifdef CONFIG_X86_64
/*
* Faults in the vsyscall page might need emulation. The
* vsyscall page is at a high address (>PAGE_OFFSET), but is
* considered to be part of the user address space.
*
* The vsyscall page does not have a "real" VMA, so do this
* emulation before we go searching for VMAs.
*
* PKRU never rejects instruction fetches, so we don't need
* to consider the PF_PK bit.
*/
...
#endif
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
* tables. But, an erroneous kernel fault occurring outside one of
* those areas which also holds mmap_sem might deadlock attempting
* to validate the fault against the address space.
*
* Only do the expensive exception table search when we might be at
* risk of a deadlock. This happens if we
* 1. Failed to acquire mmap_sem, and
* 2. The access did not originate in userspace.
*/
...
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
good_area:
...
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*
* Note that handle_userfault() may also release and reacquire mmap_sem
* (and not return with VM_FAULT_RETRY), when returning to userland to
* repeat the page fault later with a VM_FAULT_NOPAGE retval
* (potentially after handling any pending signal during the return to
* userland). The return to userland is identified whenever
* FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
*/
...
/*
* If we need to retry the mmap_sem has already been released,
* and if there is a fatal signal pending there is no guarantee
* that we made any progress. Handle this case first.
*/
...
/*
* Major/minor page fault accounting. If any of the events
* returned VM_FAULT_MAJOR, we account it as a major fault.
*/
...
}
NOKPROBE_SYMBOL(do_user_addr_fault);
First, check whether the page fault happened inside a kprobe; if so, try the fault_handler registered with the kprobe, and only fall through to the normal page-fault path if that does not resolve it.
Reserved bits should never be set in entries in the user portion of the page tables, so a fault error code with the reserved bit set indicates corruption.
If SMAP is enabled and kernel mode made a disallowed access to a user address, go to 【bad_area_nosemaphore】.
If we are in an interrupt, have no usable user context, or page-fault handling is currently disabled, go to 【bad_area_nosemaphore】.
If the fault came from user mode (determined from the CS register saved in the trapped context), enable local interrupts (ultimately via the STI instruction) and set FAULT_FLAG_USER in the local flags variable (which starts out with FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_KILLABLE set).
/*
* Page fault error code bits:
*
* bit 0 == 0: no page found 1: protection fault
* bit 1 == 0: read access 1: write access
* bit 2 == 0: kernel-mode access 1: user-mode access
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
X86_PF_WRITE = 1 << 1,
X86_PF_USER = 1 << 2,
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
};
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x20 /* Second try */
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
/**
* enum vm_fault_reason - Page fault handlers return a bitmask of
* these values to tell the core VM what happened when handling the
* fault. Used to decide whether a process gets delivered SIGBUS or
* just gets major/minor fault counters bumped up.
*
* @VM_FAULT_OOM: Out Of Memory
* @VM_FAULT_SIGBUS: Bad access
* @VM_FAULT_MAJOR: Page read from storage
* @VM_FAULT_WRITE: Special case for get_user_pages
* @VM_FAULT_HWPOISON: Hit poisoned small page
* @VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded
* in upper bits
* @VM_FAULT_SIGSEGV: segmentation fault
* @VM_FAULT_NOPAGE: ->fault installed the pte, not return page
* @VM_FAULT_LOCKED: ->fault locked the returned page
* @VM_FAULT_RETRY: ->fault blocked, must retry
* @VM_FAULT_FALLBACK: huge page fault failed, fall back to small
* @VM_FAULT_DONE_COW: ->fault has fully handled COW
* @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
If the fault came from kernel mode, re-enable local interrupts (STI) if they were enabled in the faulting context.
Based on the hardware error code, set FAULT_FLAG_WRITE and FAULT_FLAG_INSTRUCTION in the flags variable.
On x86_64, a fault in the vsyscall page (a syscall implementation mapped into user space) may need emulation; the vsyscall page has no real VMA, so this emulation happens before any VMA search.
Take mmap_sem so that the current context can read the mmap (the VMA list) safely.
Find the VMA covering the faulting address; check that it is non-NULL and that the address falls within it.
If the address is below the VMA's start but the VMA is a stack (VM_GROWSDOWN), simply grow the VMA downward (i.e. extend the stack's VMA) and continue.
【access_error】 checks whether the current VMA permits the access (e.g. whether it is writable or executable).
【handle_mm_fault】 handles the fault proper.
If the attempt failed but a retry might succeed and retries are allowed, retry once. If retrying is not allowed, either return to user mode (for a user-mode fault) or let the kernel try to handle the exception via 【no_context】 (and OOPS/SIGBUS if that also fails). For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.
If the fault cannot be handled at all, report the error.
handle_mm_fault
/*
* By the time we get here, we already hold the mm semaphore
*
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
Mark the current task TASK_RUNNING.
Bump the PGFAULT event counter.
Flush the RSS counters into the mm.
If the fault is on a huge TLB page, handle it with 【hugetlb_fault】; otherwise use 【__handle_mm_fault】.
__handle_mm_fault
/*
* By the time we get here, we already hold the mm semaphore
*
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
Create the page-table entries at each level for the faulting linear address (allocating pages for the tables via 【alloc_pages】 as needed): pgd -> p4d -> pud -> pmd. The huge-page paths differ slightly.
Faults at the PTE level are handled in 【handle_pte_fault】.
handle_pte_fault
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
*
* There is also a hook called "update_mmu_cache()" that architectures
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
* We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
* concurrent faults).
*
* The mmap_sem may have been released depending on flags and our return value.
* See filemap_fault() and __lock_page_or_retry().
*/
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
Build the PTE and handle PTE-level faults.
The following table is taken from the blog post Linux内存管理 (10)缺页中断处理.
| Scenario | | | Fault type | Handler |
| --- | --- | --- | --- | --- |
| Page not in memory (Not Present) | PTE entry is empty | VMA has vm_ops | file-mapped fault | do_fault |
| | | VMA has no vm_ops | anonymous-page fault | do_anonymous_page |
| | PTE entry is non-empty | page was swapped out to the swap area | swap-in fault | do_swap_page |
| Page in memory | | | copy-on-write | do_wp_page |
The code structure changed slightly in kernel 5.4, as follows. (Note: in handle_pte_fault an empty PTE with an anonymous VMA goes to 【do_anonymous_page】, and a non-anonymous, i.e. file-backed, VMA goes to 【do_fault】.)
| Check | | Fault type | Handler |
| --- | --- | --- | --- |
| PTE entry is empty | VMA is anonymous (an anonymous mapping; each of the code, data, stack and heap segments is its own VMA) | anonymous-page fault | 【do_anonymous_page】 |
| | VMA is not anonymous (an ordinary file-backed mapping) | file-mapped fault | 【do_fault】 |
| PTE entry is non-empty | PTE not present (neither the Present bit nor the PROTNONE bit set) | page assumed to have been swapped out earlier | 【do_swap_page】; errors out if the page is not in swap either |
| | PTE present (Present bit or PROTNONE bit set), PTE is protnone, and the VMA is accessible | NUMA hinting fault; look on other NUMA nodes | 【do_numa_page】 |
| | write access faulted and the PTE RW bit is 0 | copy-on-write | 【do_wp_page】, which also marks the Dirty bit |
Faults that handle_pte_fault cannot handle, or fails while handling, return to the 【do_user_addr_fault】 function mentioned earlier. The kernel then tries to handle the exception in kernel mode, and OOPSes if that also fails. For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.
On the relationship between segmentation faults and page faults, see https://www.zhihu.com/question/358909046/answer/920302851