Related reading:
https://www.cnblogs.com/arnoldlu/p/8335475.html
https://book.aikaiyuan.com/kernel/6.5.3.htm
The IDT entry for Page Fault
idtentry page_fault do_page_fault has_error_code=1 read_cr2=1
idtentry is an assembly macro.
/**
* idtentry - Generate an IDT entry stub
* @sym: Name of the generated entry point
* @do_sym: C function to be called
* @has_error_code: True if this IDT vector has an error code on the stack
* @paranoid: non-zero means that this vector may be invoked from
* kernel mode with user GSBASE and/or user CR3.
* 2 is special -- see below.
* @shift_ist: Set to an IST index if entries from kernel mode should
* decrement the IST stack so that nested entries get a
* fresh stack. (This is for #DB, which has a nasty habit
* of recursing.)
* @create_gap: create a 6-word stack gap when coming from kernel mode.
* @read_cr2: load CR2 into the 3rd argument; done before calling any C code
*
* idtentry generates an IDT stub that sets up a usable kernel context,
* creates struct pt_regs, and calls @do_sym. The stub has the following
* special behaviors:
*
* On an entry from user mode, the stub switches from the trampoline or
* IST stack to the normal thread stack. On an exit to user mode, the
* normal exit-to-usermode path is invoked.
*
* On an exit to kernel mode, if @paranoid == 0, we check for preemption,
* whereas we omit the preemption check if @paranoid != 0. This is purely
* because the implementation is simpler this way. The kernel only needs
* to check for asynchronous kernel preemption when IRQ handlers return.
*
* If @paranoid == 0, then the stub will handle IRET faults by pretending
* that the fault came from user mode. It will handle gs_change faults by
* pretending that the fault happened with kernel GSBASE. Since this handling
* is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have
* @paranoid == 0. This special handling will do the wrong thing for
* espfix-induced #DF on IRET, so #DF must not use @paranoid == 0.
*
* @paranoid == 2 is special: the stub will never switch stacks. This is for
* #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
*/
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0 read_cr2=0
...
.endm
Execution eventually reaches 【do_page_fault】.
do_page_fault
dotraplinkage void
do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
This function performs context tracking (CONFIG_CONTEXT_TRACKING) and page-fault tracing (CONFIG_TRACING), then calls 【__do_page_fault】.
__do_page_fault
/*
* Explicitly marked noinline such that the function tracer sees this as the
* page_fault entry point.
*/
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
【__do_page_fault】 checks whether the faulting linear address lies in the kernel-controlled portion of the address space. Faults on kernel addresses are uncommon, but when one occurs it is handled by 【do_kern_addr_fault】.
I am more interested in faults on the user portion of the address space, which are handled by 【do_user_addr_fault】.
do_user_addr_fault
/* Handle faults in the user portion of the address space */
static inline
void do_user_addr_fault(struct pt_regs *regs,
unsigned long hw_error_code,
unsigned long address)
{
...
/* kprobes don't want to hook the spurious faults: */
...
/*
* Reserved bits are never expected to be set on
* entries in the user portion of the page tables.
*/
...
/*
* If SMAP is on, check for invalid kernel (supervisor) access to user
* pages in the user address space. The odd case here is WRUSS,
* which, according to the preliminary documentation, does not respect
* SMAP and will have the USER bit set so, in all cases, SMAP
* enforcement appears to be consistent with the USER bit.
*/
...
/*
* If we're in an interrupt, have no user context or are running
* in a region with pagefaults disabled then we must not take the fault
*/
...
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
*
* User-mode registers count as a user access even for any
* potential system fault or CPU buglet:
*/
...
#ifdef CONFIG_X86_64
/*
* Faults in the vsyscall page might need emulation. The
* vsyscall page is at a high address (>PAGE_OFFSET), but is
* considered to be part of the user address space.
*
* The vsyscall page does not have a "real" VMA, so do this
* emulation before we go searching for VMAs.
*
* PKRU never rejects instruction fetches, so we don't need
* to consider the PF_PK bit.
*/
...
#endif
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
* tables. But, an erroneous kernel fault occurring outside one of
* those areas which also holds mmap_sem might deadlock attempting
* to validate the fault against the address space.
*
* Only do the expensive exception table search when we might be at
* risk of a deadlock. This happens if we
* 1. Failed to acquire mmap_sem, and
* 2. The access did not originate in userspace.
*/
...
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
good_area:
...
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*
* Note that handle_userfault() may also release and reacquire mmap_sem
* (and not return with VM_FAULT_RETRY), when returning to userland to
* repeat the page fault later with a VM_FAULT_NOPAGE retval
* (potentially after handling any pending signal during the return to
* userland). The return to userland is identified whenever
* FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
*/
...
/*
* If we need to retry the mmap_sem has already been released,
* and if there is a fatal signal pending there is no guarantee
* that we made any progress. Handle this case first.
*/
...
/*
* Major/minor page fault accounting. If any of the events
* returned VM_FAULT_MAJOR, we account it as a major fault.
*/
...
}
NOKPROBE_SYMBOL(do_user_addr_fault);
First, check whether the page fault happened inside a kprobe; if so, try the fault_handler registered with the kprobe, and only fall through to the normal page-fault path if that does not resolve it.
Reserved bits should never be set in entries in the user portion of the page tables, so a fault error code with the reserved bit set indicates corruption.
If SMAP is enabled and kernel mode made a disallowed access to a user address, go to 【bad_area_nosemaphore】.
If we are in an interrupt, have no usable user context, or page-fault handling is currently disabled, go to 【bad_area_nosemaphore】.
If the fault came from user mode (determined from the CS register saved in the trapped context), enable local interrupts (ultimately via the STI instruction) and set FAULT_FLAG_USER in the local flags variable (which starts out with FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_KILLABLE set).
/*
* Page fault error code bits:
*
* bit 0 == 0: no page found 1: protection fault
* bit 1 == 0: read access 1: write access
* bit 2 == 0: kernel-mode access 1: user-mode access
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
X86_PF_WRITE = 1 << 1,
X86_PF_USER = 1 << 2,
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
};
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x20 /* Second try */
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
/**
* enum vm_fault_reason - Page fault handlers return a bitmask of
* these values to tell the core VM what happened when handling the
* fault. Used to decide whether a process gets delivered SIGBUS or
* just gets major/minor fault counters bumped up.
*
* @VM_FAULT_OOM: Out Of Memory
* @VM_FAULT_SIGBUS: Bad access
* @VM_FAULT_MAJOR: Page read from storage
* @VM_FAULT_WRITE: Special case for get_user_pages
* @VM_FAULT_HWPOISON: Hit poisoned small page
* @VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded
* in upper bits
* @VM_FAULT_SIGSEGV: segmentation fault
* @VM_FAULT_NOPAGE: ->fault installed the pte, not return page
* @VM_FAULT_LOCKED: ->fault locked the returned page
* @VM_FAULT_RETRY: ->fault blocked, must retry
* @VM_FAULT_FALLBACK: huge page fault failed, fall back to small
* @VM_FAULT_DONE_COW: ->fault has fully handled COW
* @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
If the fault came from kernel mode, re-enable local interrupts (STI) if they were enabled in the faulting context.
Based on the hardware error code, set FAULT_FLAG_WRITE and FAULT_FLAG_INSTRUCTION in the flags variable.
On x86_64, a fault in the vsyscall page (a syscall implementation mapped into user space) may need emulation; the vsyscall page has no real VMA, so this emulation happens before any VMA search.
Take mmap_sem so that the current context can read the mmap (the VMA list) safely.
Find the VMA covering the faulting address; check that it is non-NULL and that the address falls within it.
If the address is below the VMA's start but the VMA is a stack (VM_GROWSDOWN), simply grow the VMA downward (i.e. extend the stack's VMA) and continue.
【access_error】 checks whether the current VMA permits the access (e.g. whether it is writable or executable).
【handle_mm_fault】 handles the fault proper.
If the attempt failed but a retry might succeed and retries are allowed, retry once. If retrying is not allowed, either return to user mode (for a user-mode fault) or let the kernel try to handle the exception via 【no_context】 (and OOPS/SIGBUS if that also fails). For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.
If the fault cannot be handled at all, report the error.
handle_mm_fault
/*
* By the time we get here, we already hold the mm semaphore
*
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
Mark the current task TASK_RUNNING.
Bump the PGFAULT event counter.
Flush the RSS counters into the mm.
If the fault is on a huge TLB page, handle it with 【hugetlb_fault】; otherwise use 【__handle_mm_fault】.
__handle_mm_fault
/*
* By the time we get here, we already hold the mm semaphore
*
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
Create the page-table entries at each level for the faulting linear address (allocating pages for the tables via 【alloc_pages】 as needed): pgd -> p4d -> pud -> pmd. The huge-page paths differ slightly.
Faults at the PTE level are handled in 【handle_pte_fault】.
handle_pte_fault
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
*
* There is also a hook called "update_mmu_cache()" that architectures
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
* We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
* concurrent faults).
*
* The mmap_sem may have been released depending on flags and our return value.
* See filemap_fault() and __lock_page_or_retry().
*/
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
Build the PTE and handle PTE-level faults.
The following table is taken from the blog post Linux内存管理 (10)缺页中断处理.
| Scenario | | | Fault type | Handler |
| --- | --- | --- | --- | --- |
| Page not in memory (Not Present) | PTE entry is empty | VMA has vm_ops | file-mapped fault | do_fault |
| | | VMA has no vm_ops | anonymous-page fault | do_anonymous_page |
| | PTE entry is non-empty | page was swapped out to the swap area | swap-in fault | do_swap_page |
| Page in memory | | | copy-on-write | do_wp_page |
The code structure changed slightly in kernel 5.4, as follows. (Note: in handle_pte_fault an empty PTE with an anonymous VMA goes to 【do_anonymous_page】, and a non-anonymous, i.e. file-backed, VMA goes to 【do_fault】.)
| Check | | Fault type | Handler |
| --- | --- | --- | --- |
| PTE entry is empty | VMA is anonymous (an anonymous mapping; each of the code, data, stack and heap segments is its own VMA) | anonymous-page fault | 【do_anonymous_page】 |
| | VMA is not anonymous (an ordinary file-backed mapping) | file-mapped fault | 【do_fault】 |
| PTE entry is non-empty | PTE not present (neither the Present bit nor the PROTNONE bit set) | page assumed to have been swapped out earlier | 【do_swap_page】; errors out if the page is not in swap either |
| | PTE present (Present bit or PROTNONE bit set), PTE is protnone, and the VMA is accessible | NUMA hinting fault; look on other NUMA nodes | 【do_numa_page】 |
| | write access faulted and the PTE RW bit is 0 | copy-on-write | 【do_wp_page】, which also marks the Dirty bit |
Faults that handle_pte_fault cannot handle, or fails while handling, return to the 【do_user_addr_fault】 function mentioned earlier. The kernel then tries to handle the exception in kernel mode, and OOPSes if that also fails. For segmentation faults, including unresolvable page faults, the kernel sends SIGSEGV; if user space has registered a SIGSEGV handler, control is transferred to that handler.
On the relationship between segmentation faults and page faults, see https://www.zhihu.com/question/358909046/answer/920302851