# mmap Source Code Analysis

## Setup

Kernel version: 4.20.1

In the previous post, *How to Saturate Disk I/O Bandwidth When Writing Files on Linux*, we used mmap to write a file at full, steady disk I/O bandwidth. This post examines `mmap()` in detail and walks through its source. Although we only used `mmap()` to map a file into memory, its implementation reaches deep into the kernel: the virtual address space, memory mapping, and the machinery around them.
## Function Prototype

```c
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
```

This is the prototype of `mmap`; the corresponding system-call entry point lives in mm/mmap.c:

```c
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
			      unsigned long prot, unsigned long flags,
			      unsigned long fd, unsigned long pgoff);
```
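Before going under the hood, here is a minimal userspace sketch of the call itself (illustrative only: the helper name is made up and error handling is trimmed to the essentials):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map an existing file read-only and return its first byte.
 * Purely illustrative; real code should check every return value. */
int first_byte_of(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);
    if (size <= 0) {
        close(fd);
        return -1;
    }

    /* addr = NULL lets the kernel pick the address (get_unmapped_area). */
    char *p = mmap(NULL, (size_t)size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping keeps its own reference */
    if (p == MAP_FAILED)
        return -1;

    int b = p[0];
    munmap(p, (size_t)size);
    return b;
}
```

Closing `fd` right after `mmap()` is safe: as we will see in `mmap_region()`, the kernel takes its own reference on the file with `get_file()`.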
## Virtual Memory Area Management

Let's first introduce two data structures concerned with virtual memory. Material on the concept of virtual memory is plentiful online, so here we analyze it from the kernel's point of view. Virtual address space is managed per process: each process has its own user virtual address space, while the kernel virtual address space is shared by all processes. A process's virtual address space is described mainly by two data structures: `mm_struct` (the memory descriptor) and `vm_area_struct` (the virtual memory area descriptor).
### The Memory Descriptor (`mm_struct`)

`mm_struct` holds all the information about a process's virtual address space; it is defined in include/linux/mm_types.h:

```c
struct mm_struct {
	struct {
		struct vm_area_struct *mmap;	/* list of vm_area_structs */
		pgd_t *pgd;			/* pointer to the process's page directory */
		/* ... */
		int map_count;			/* number of vm_area_structs */
		/* ... */
		unsigned long total_vm;		/* total pages mapped */
		/* ... */
		unsigned long start_code, end_code, start_data, end_data;
			/* start/end of the code segment and the data segment */
		unsigned long start_brk, brk, start_stack;
			/* start/end of the heap; the stack, by its nature, only has a start */
		unsigned long arg_start, arg_end, env_start, env_end;
			/* start/end of the argument and environment areas */
		/* ... */
	}
}
```

Reading `mm_struct` together with the typical 32-bit virtual address space layout below makes it much more concrete (figure from *Computer Systems: A Programmer's Perspective*):
### Virtual Memory Area (`vm_area_struct`)

`vm_area_struct` describes one interval of the virtual address space; a process's address space usually contains several such intervals. It is likewise defined in include/linux/mm_types.h:

```c
/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	/* The first cache line has the info for VMA tree walking. */

	unsigned long vm_start;		/* start address within the virtual address space */
	unsigned long vm_end;		/* end address within the virtual address space */

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next, *vm_prev;

	struct rb_node vm_rb;

	/*
	 * Largest free memory gap in bytes to the left of this VMA.
	 * Either between this VMA and vma->vm_prev, or between one of the
	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
	 * get_unmapped_area find a free area of the right size.
	 */
	unsigned long rb_subtree_gap;

	/* Second cache line starts here. */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;	/* operations on this area */

	struct mm_struct *vm_mm;	/* the address space this vma belongs to */
	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
	unsigned long vm_flags;		/* Flags, see mm.h. */
	unsigned long vm_pgoff;		/* offset within the file, in pages */
	struct file *vm_file;		/* the mapped file; NULL for an anonymous mapping */
	/* ... */
};
```

The figure below shows a simplified virtual memory layout of a process and how these data structures relate to one another:
## The mmap Mapping Flow

1. Check the arguments and set the `vma` flags according to the requested mapping type.
2. Search the process's virtual address space for a free interval that satisfies the request.
3. Initialize a `vma` for the interval found.
4. Set `vma->vm_file`.
5. Set `vma->vm_ops` to the filesystem-specific operations table (for ext4, `ext4_file_vm_ops`).
6. Insert the `vma` into the `mm`'s VMA list.
## Source Walkthrough

Now let's dig into the mmap code.

### do_mmap()

`do_mmap()` is where the real work of `mmap()` happens; we skip the system-call glue and look straight at the implementation:

```c
unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{
	struct mm_struct *mm = current->mm;	/* the process's memory descriptor */
	int pkey = 0;

	*populate = 0;

	/* A series of sanity checks; any bad argument returns an errno. */
	if (!len)
		return -EINVAL;

	/*
	 * Does the application expect PROT_READ to imply PROT_EXEC?
	 *
	 * (the exception is when the underlying filesystem is noexec
	 *  mounted, in which case we dont add PROT_EXEC.)
	 */
	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
		if (!(file && path_noexec(&file->f_path)))
			prot |= PROT_EXEC;

	/* force arch specific MAP_FIXED handling in get_unmapped_area */
	if (flags & MAP_FIXED_NOREPLACE)
		flags |= MAP_FIXED;

	/*
	 * Without MAP_FIXED, addr is only a hint and may be moved, so if it
	 * lies below mmap_min_addr, round it up to the page-aligned minimum.
	 */
	if (!(flags & MAP_FIXED))
		addr = round_hint_to_min(addr);

	/* Careful about overflows.. */
	len = PAGE_ALIGN(len);		/* round len up to a page boundary */
	if (!len)
		return -ENOMEM;

	/* offset overflow? */
	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
		return -EOVERFLOW;

	/* Too many mappings? */
	if (mm->map_count > sysctl_max_map_count)
		return -ENOMEM;

	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
	/* get_unmapped_area() finds an unmapped interval in the process's user space */
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	if (offset_in_page(addr))	/* is addr valid? */
		return addr;

	/*
	 * With MAP_FIXED_NOREPLACE the address space must be searched:
	 * if an overlapping vma already exists, return -EEXIST,
	 * as the flag demands.
	 */
	if (flags & MAP_FIXED_NOREPLACE) {
		struct vm_area_struct *vma = find_vma(mm, addr);

		if (vma && vma->vm_start < addr + len)
			return -EEXIST;
	}

	if (prot == PROT_EXEC) {
		pkey = execute_only_pkey(mm);
		if (pkey < 0)
			pkey = 0;
	}

	/* Do simple checking here so the lower-level routines won't have
	 * to. we assume access permissions have been handled by the open
	 * of the memory object, so we don't do any here.
	 */
	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

	/*
	 * MAP_LOCKED pins the mapped range in memory much like mlock();
	 * check whether locking is permitted.
	 */
	if (flags & MAP_LOCKED)
		if (!can_do_mlock())
			return -EPERM;

	if (mlock_future_check(mm, vm_flags, len))
		return -EAGAIN;

	if (file) {	/* file != NULL: a file-backed mapping */
		struct inode *inode = file_inode(file);	/* the file's inode */
		unsigned long flags_mask;

		if (!file_mmap_ok(file, inode, pgoff, len))
			return -EOVERFLOW;

		flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;

		/*
		 * ... The file's access permissions are folded into the
		 * requested mapping type: for a shared writable mapping the
		 * file must have been opened for writing, not in append mode,
		 * and must not carry a mandatory lock; for any kind of
		 * mapping the file must have been opened for reading. ...
		 */
	} else {
		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
				return -EINVAL;
			/*
			 * Ignore pgoff.
			 */
			pgoff = 0;
			vm_flags |= VM_SHARED | VM_MAYSHARE;
			break;
		case MAP_PRIVATE:
			/*
			 * Set pgoff according to addr for anon_vma.
			 */
			pgoff = addr >> PAGE_SHIFT;
			break;
		default:
			return -EINVAL;
		}
	}

	/*
	 * Set 'VM_NORESERVE' if we should not account for the
	 * memory use of this mapping.
	 */
	if (flags & MAP_NORESERVE) {
		/* We honor MAP_NORESERVE if allowed to overcommit */
		if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
			vm_flags |= VM_NORESERVE;

		/* hugetlb applies strict overcommit unless MAP_NORESERVE */
		if (file && is_file_hugepages(file))
			vm_flags |= VM_NORESERVE;
	}

	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
	if (!IS_ERR_VALUE(addr) &&
	    ((vm_flags & VM_LOCKED) ||
	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
		*populate = len;
	return addr;
}
```
### mmap_region()

`do_mmap()` runs a series of checks on the user-supplied arguments and derives the `vm_area_struct` flags `vm_flags` from them; `vma->vm_file = get_file(file)` ties the file to the `vma`. `mmap_region()` is what actually creates the virtual memory area:

```c
unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
	struct mm_struct *mm = current->mm;	/* the process's memory descriptor */
	struct vm_area_struct *vma, *prev;
	int error;
	struct rb_node **rb_link, *rb_parent;
	unsigned long charged = 0;

	/* Check against address space limit. */
	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
		unsigned long nr_pages;

		/*
		 * MAP_FIXED may remove pages of mappings that intersects with
		 * requested mapping. Account for the pages it would unmap.
		 */
		nr_pages = count_vma_pages_range(mm, addr, addr + len);

		if (!may_expand_vm(mm, vm_flags,
					(len >> PAGE_SHIFT) - nr_pages))
			return -ENOMEM;
	}

	/*
	 * If [addr, addr+len) overlaps an existing mapping, the
	 * overlapping part has to be munmap()ed first.
	 */
	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
			      &rb_parent)) {
		if (do_munmap(mm, addr, len, uf))
			return -ENOMEM;
	}

	/*
	 * Private writable mapping: check memory availability
	 */
	if (accountable_mapping(file, vm_flags)) {
		charged = len >> PAGE_SHIFT;
		if (security_vm_enough_memory_mm(mm, charged))
			return -ENOMEM;
		vm_flags |= VM_ACCOUNT;
	}

	/* Can [addr, addr+len) be merged into an adjacent vma? */
	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff,
			NULL, NULL_VM_UFFD_CTX);
	if (vma)	/* merged: use the merged vma and jump to out */
		goto out;

	/*
	 * Determine the object being mapped and call the appropriate
	 * specific mapper. the address has already been validated, but
	 * not unmapped, but the maps are removed from the list.
	 */
	/* No merge possible: allocate a new vma through the memory descriptor. */
	vma = vm_area_alloc(mm);
	if (!vma) {
		error = -ENOMEM;
		goto unacct_error;
	}

	/* initialize the vma */
	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_flags = vm_flags;
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
	vma->vm_pgoff = pgoff;

	if (file) {	/* a file-backed mapping was requested */
		if (vm_flags & VM_DENYWRITE) {
			/*
			 * The mapped file must not be written to:
			 * deny_write_access() shuts out regular file writes.
			 */
			error = deny_write_access(file);
			if (error)
				goto free_vma;
		}
		if (vm_flags & VM_SHARED) {
			/* The mapping is visible to other processes: mark the file writable. */
			error = mapping_map_writable(file->f_mapping);
			if (error)
				goto allow_write_and_free_vma;
		}

		/* ->mmap() can change vma->vm_file, but must guarantee that
		 * vma_link() below can deny write-access if VM_DENYWRITE is set
		 * and map writably if VM_SHARED is set. This usually means the
		 * new file must not have been exposed to user-space, yet.
		 */
		vma->vm_file = get_file(file);	/* bump the file's refcount and attach it to the vma */
		error = call_mmap(file, vma);	/* call the filesystem-specific mmap, covered below */
		if (error)
			goto unmap_and_free_vma;

		/* Can addr have changed??
		 *
		 * Answer: Yes, several device drivers can do it in their
		 *         f_op->mmap method. -DaveM
		 * Bug: If addr is changed, prev, rb_link, rb_parent should
		 *      be updated for vma_link()
		 */
		WARN_ON_ONCE(addr != vma->vm_start);

		addr = vma->vm_start;
		vm_flags = vma->vm_flags;
	} else if (vm_flags & VM_SHARED) {
		/*
		 * VM_SHARED without a backing file: shmem_zero_setup()
		 * in effect maps /dev/zero.
		 */
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	} else {
		/* neither a file nor VM_SHARED: an anonymous mapping */
		vma_set_anonymous(vma);
	}

	/* link the new vma into mm's vma list */
	vma_link(mm, vma, prev, rb_link, rb_parent);

	/* Once vma denies write, undo our temporary denial count */
	if (file) {
		if (vm_flags & VM_SHARED)
			mapping_unmap_writable(file->f_mapping);
		if (vm_flags & VM_DENYWRITE)
			allow_write_access(file);
	}
	file = vma->vm_file;
out:
	perf_event_mmap(vma);

	/* update the process's address-space accounting in mm */
	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
	if (vm_flags & VM_LOCKED) {
		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
					is_vm_hugetlb_page(vma) ||
					vma == get_gate_vma(current->mm))
			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
		else
			mm->locked_vm += (len >> PAGE_SHIFT);
	}

	if (file)
		uprobe_mmap(vma);

	/*
	 * New (or expanded) vma always get soft dirty status.
	 * Otherwise user-space soft-dirty page tracker won't
	 * be able to distinguish situation when vma area unmapped,
	 * then new mapped in-place (which must be aimed as
	 * a completely new data area).
	 */
	vma->vm_flags |= VM_SOFTDIRTY;

	vma_set_page_prot(vma);

	return addr;

unmap_and_free_vma:
	vma->vm_file = NULL;
	fput(file);

	/* Undo any partial mapping done by a device driver. */
	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
	charged = 0;
	if (vm_flags & VM_SHARED)
		mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
	if (vm_flags & VM_DENYWRITE)
		allow_write_access(file);
free_vma:
	vm_area_free(vma);
unacct_error:
	if (charged)
		vm_unacct_memory(charged);
	return error;
}
```
`mmap_region()` calls `call_mmap(file, vma)`, which dispatches to the `mmap()` implementation of the underlying filesystem; here we pick the widely used ext4.

`ext4_file_mmap()` is ext4's `mmap` implementation. It is very simple: it updates the file's access time (`file_accessed(file)`) and installs the matching operations table in `vma->vm_ops`.

What the three operations do:

- `.fault`: handles a page fault in the area
- `.map_pages`: maps file pages into the page cache
- `.page_mkwrite`: notifies the filesystem that a page is about to become writable

```c
static const struct vm_operations_struct ext4_file_vm_ops = {
	.fault		= ext4_filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= ext4_page_mkwrite,
};

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct inode *inode = file->f_mapping->host;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return -EIO;

	/*
	 * We don't support synchronous mappings for non-DAX files. At least
	 * until someone comes with a sensible use case.
	 */
	if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
		return -EOPNOTSUPP;

	file_accessed(file);
	if (IS_DAX(file_inode(file))) {
		vma->vm_ops = &ext4_dax_vm_ops;
		vma->vm_flags |= VM_HUGEPAGE;
	} else {
		vma->vm_ops = &ext4_file_vm_ops;
	}
	return 0;
}
```
From this walk through the source we can see that calling `mmap()` merely allocates a `vm_area_struct` to tie the file to a range of virtual memory; no mapping from virtual to physical memory is established. Unless `MAP_POPULATE` is set, Linux allocates no physical memory at `mmap()` time. Only when the range is actually accessed and the data is found to be absent from physical memory does a page fault fire, and only then does Linux bring the missing page in. A later post will cover Linux page-fault handling and demand paging.
## Anonymous Mapping

Passing `MAP_ANONYMOUS` to `mmap()` requests an anonymous mapping. An anonymous mapping is backed by no file or device; for the shared case, as we saw in `mmap_region()`, the file actually mapped is in effect /dev/zero. Anonymous pages are typically what back the virtual memory of a process's stack and heap.
## Summary

A conventional `read()` first brings file pages into the kernel page cache and then copies the data from kernel memory into user memory. `mmap` instead maps the file straight into the virtual address space, so the process reads the data through the MMU's virtual-to-physical translation, saving the kernel-to-user copy. And because modifications made through `mmap` land directly in physical memory (the page cache), a `kill -9` of the process does not lose the data.
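That last claim can be sanity-checked with a sketch like the following: the child stores through a `MAP_SHARED` mapping and is killed without ever calling `msync()`, yet the parent reads the data back, because the store already landed in the page cache (helper name and test path are made up; whether the data later reaches the disk still depends on writeback):

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if data written via mmap by a SIGKILLed child is still
 * visible when the file is read afterwards. */
int killed_writer_keeps_data(const char *path)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            _exit(1);
        memcpy(p, "persisted", 9);
        raise(SIGKILL);         /* die without msync()/munmap() */
        _exit(1);
    }
    waitpid(pid, NULL, 0);

    char buf[10] = {0};
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, buf, 9);
    close(fd);
    return (n == 9 && memcmp(buf, "persisted", 9) == 0) ? 1 : 0;
}
```

The `read()` in the parent and the child's mapped stores go through the same page cache, which is why the data survives the `SIGKILL`.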
## Q&A

- How does a `vm_area_struct` find the corresponding physical pages? `vm_area_struct` has no member that points directly at `struct page`, but it does record the starting and ending virtual addresses `vm_start` and `vm_end`; translating a virtual address through the page tables leads straight to the backing page.
- How do you handle files of variable length? RocksDB writes files with `mmap`: it first `fallocate`s a file of fixed length `len`, establishes the mapping with `mmap`, and advances a `base` pointer as the write position. Once `len` bytes have been written, it calls `munmap`. If the file is closed before `len` bytes are written, it `munmap`s the portion actually written and then uses `ftruncate()` to cut the surplus mapped region off the file.
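That pattern can be sketched like this (a simplified illustration of the idea, not RocksDB's actual code; names are ours):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preallocate `len` bytes, write `n` bytes through a mapping, then
 * truncate the file down to what was actually written. */
int mmap_write_trimmed(const char *path, size_t len,
                       const char *data, size_t n)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, (off_t)len) != 0)
        goto fail;

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        goto fail;

    memcpy(base, data, n);      /* `base` plays the sliding write position */
    munmap(base, len);

    if (ftruncate(fd, (off_t)n) != 0)   /* trim the unused tail */
        goto fail;
    close(fd);
    return 0;
fail:
    close(fd);
    return -1;
}
```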
- `memcpy()` into an `mmap()`ed region raises `SIGBUS`: the `SIGBUS` comes out of page-fault handling, i.e. out of the `ext4_filemap_fault()` registered in the `ext4_file_vm_ops` we saw above. `do_mmap()` contains the line `len = PAGE_ALIGN(len)`, so the mapping covers the page-aligned `len` that was passed in, with no regard for the file size. The fault handler, however, does check the file size: if the faulting offset lies beyond the page-aligned file size, it returns `SIGBUS`.

```c
/*
 * DIV_ROUND_UP() rounds up; i_size_read(inode) returns the file
 * length (inode->i_size). For a 7000-byte file, DIV_ROUND_UP()
 * makes max_off 2 pages, i.e. the 8192-byte boundary.
 */
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
/*
 * offset is the page offset of the address the memcpy() touched;
 * at or beyond max_off, the fault returns SIGBUS.
 */
if (unlikely(offset >= max_off))
	return VM_FAULT_SIGBUS;
```
- `memcpy()` into an `mmap()`ed region raises `SIGSEGV` (mm/memory.c: `handle_mm_fault()`):

```c
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
			       flags & FAULT_FLAG_INSTRUCTION,
			       flags & FAULT_FLAG_REMOTE))
	/*
	 * The process tried to access the virtual address in a way
	 * the vma does not permit: return SIGSEGV.
	 */
	return VM_FAULT_SIGSEGV;
```
- Is `mmap` a silver bullet? No. Under frequent random writes, the page faults and dirty-page writeback it triggers eat into mmap's advantage of avoiding the copy between kernel and user space. Below is the flame graph of the sequential `mmap` write from scheme 3 of *How to Saturate Disk I/O Bandwidth When Writing Files on Linux*; it shows quite directly where `mmap`'s bottleneck lies:
- With `MAP_SHARED` set, does the mapped memory count toward RSS? Yes. RSS (resident set size) is generally understood as the physical memory actually in use; once pages of the `MAP_SHARED` mapping take a page fault and the OS backs them with physical memory, they show up in the RSS figure.