Anatomy of a Program in Memory

Latest recommended article published on 2023-05-11 09:30:00

痒痒挠963


Original link: https://manybutfinite.com/post/page-cache-the-affair-between-memory-and-files/


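Foreword: the original text is Anatomy of a Program in Memory | Many But Finite. This post only translates it and redraws the figures from the original. The translator's notes are my own additions, meant to explain the original or to raise questions. My knowledge is limited, so corrections to any errors in the translation or the notes are very welcome.
The content below follows the original paragraph by paragraph, with the translator's notes after each paragraph: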

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.

Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:

Translator's note: "32-bit mode" here presumably means 32-bit protected mode. The figure at this point shows the virtual memory layout that every process shares, with the kernel occupying part of the address space.
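To make the split concrete, here is a minimal sketch that classifies a 32-bit virtual address as user or kernel. It assumes the default 3GB/1GB split of 32-bit x86 Linux; the names KERNEL_BASE and region_of are made up for this illustration and are not kernel identifiers.

```python
# Assumed default split on 32-bit x86 Linux: user space below 3GB, kernel above.
KERNEL_BASE = 0xC0000000
ADDRESS_SPACE_SIZE = 1 << 32  # 4GB of virtual addresses in 32-bit mode


def region_of(addr: int) -> str:
    """Return 'user' or 'kernel' for a 32-bit virtual address."""
    if not 0 <= addr < ADDRESS_SPACE_SIZE:
        raise ValueError("not a 32-bit address")
    return "kernel" if addr >= KERNEL_BASE else "user"


if __name__ == "__main__":
    for a in (0x08048000, 0xBFFFFFFF, 0xC0000000):
        print(hex(a), region_of(a))
```

The classification is the same in every process, which is the point of the paragraph above: the kernel range sits at the same virtual addresses no matter which process is running.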

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
Translator's note: in the default Linux split the kernel's portion is 1GB of the address space. Handling an interrupt or a system call requires the processor to switch into kernel space to reach the kernel's code and data structures, which is why they must always be addressable.

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
Translator's note: Intel-style segments are the code, data, and stack segments of x86 segmentation. Their base addresses are held in segment descriptors in the Global Descriptor Table (GDT), a kernel-level data structure in memory. The base address of the GDT itself is loaded into the GDTR register, and the processor looks up the descriptors there to perform segmented addressing.

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.

Translator's note: exploiting a vulnerability "remotely" presumably means running malicious code from a remote machine, and choosing a location "blindly" means guessing it. The guess succeeds when a stack address or a library function address never changes between runs. Address space randomization is the technique known as ASLR (address space layout randomization).

The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the cpu caches, speeding up access. Each thread in a process gets its own stack.
Translator's note: on x86 the pointer to the top of the stack is the ESP register. Pushing a frame means moving ESP to a lower address, so the region between the old and new values becomes the new frame. When the function returns, ESP moves back up to its pre-call value, which discards the callee's frame. (To make the author's point easier to picture, I drew a diagram of how the stack changes during a function call, for reference.)

It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
Translator's note: by contrast, the individual stack frames that make up the stack grow and shrink on demand as ESP moves. Destroying a frame does not shrink the mapped stack area, and pushing a frame does not expand it unless the currently mapped area is full. The cap on that expansion is RLIMIT_STACK, usually 8MB.
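The grow-but-never-shrink behavior can be sketched with a toy model. This is only an illustration of the rule described above, not how expand_stack() is implemented; the StackArea class and its fields are hypothetical, and the 8MB limit is the usual default mentioned in the text.

```python
PAGE = 4096
RLIMIT_STACK = 8 * 1024 * 1024  # the usual 8MB limit mentioned above


class StackArea:
    """Toy model: the mapped stack area grows on demand and never shrinks."""

    def __init__(self, mapped=PAGE, limit=RLIMIT_STACK):
        self.mapped = mapped   # bytes currently mapped for the stack
        self.used = 0          # bytes in use by live stack frames
        self.limit = limit

    def push(self, nbytes):
        needed = self.used + nbytes
        if needed > self.limit:
            raise MemoryError("stack overflow: Segmentation Fault")
        if needed > self.mapped:
            # Round up to whole pages, since mappings grow page by page.
            self.mapped = -(-needed // PAGE) * PAGE
        self.used = needed

    def pop(self, nbytes):
        # Frames go away, but the mapping stays as large as it ever was.
        self.used = max(0, self.used - nbytes)
```

Pushing 10000 bytes onto a fresh StackArea grows the mapping to three pages; popping them again leaves `mapped` unchanged, which is the "federal budget" behavior.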

Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.

Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. 'Large' means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().

Translator's note: so a malloc() request smaller than MMAP_THRESHOLD is served from the heap?
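An anonymous mapping is easy to try from user space. The sketch below uses Python's standard mmap module rather than the C mmap() call the text describes, so it shows the idea and not the C library's malloc() path; the variable names are made up for the example.

```python
import mmap

# Ask the OS for one page of anonymous memory: fileno -1 means "no file".
length = mmap.PAGESIZE
anon = mmap.mmap(-1, length)

anon[:5] = b"gonzo"          # the mapping behaves like a writable byte array
first_bytes = anon[:5]
untouched = anon[5:8]        # anonymous pages start out zero-filled
anon.close()
```

No file backs this memory, so nothing is ever written to disk on its behalf except, possibly, to swap.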

Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.

If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs' chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:

Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say `static int cntActiveUsers`, the contents of cntActiveUsers live in the BSS.

Translator's note: the linker records a fixed size for the bss section, and the loader creates the corresponding zero-filled segment when the program starts.

The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program's binary image that contains the initial static values given in source code. So if you say `static int cntWorkerBees = 10`, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!

The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:
Translator's note: on 32-bit x86 a pointer occupies 4 bytes. In the diagram, gonzo holds 0x080484f0, the address of the first character, G, of the string "God's own prototype". The string is a literal, so it sits in read-only memory; had it been declared as a writable char array instead, its bytes would live in the data segment. Writing through a stray pointer into the text segment produces a segfault, which is how this protection brings pointer bugs to light.

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
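Each line of /proc/pid_of_process/maps describes one area: address range, permissions, file offset, device, inode, and an optional path. A small parser makes the format easier to read. The sample line below is made up for illustration, and parse_maps_line is a hypothetical helper, not part of any library.

```python
def parse_maps_line(line: str) -> dict:
    """Split one /proc/<pid>/maps line into its fields."""
    fields = line.split(None, 5)       # the path is optional and may contain spaces
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],            # e.g. r-xp: read, execute, private
        "offset": int(fields[2], 16),  # offset into the mapped file
        "dev": fields[3],
        "inode": int(fields[4]),
        "path": fields[5].strip() if len(fields) > 5 else "",  # empty when no path is listed
    }


# Hypothetical sample line, made up for illustration.
SAMPLE = "08048000-08049000 r-xp 00000000 08:01 1234567    /home/user/gonzo"
area = parse_maps_line(SAMPLE)
```

On a Linux machine the same function can be applied to every line of `open("/proc/self/maps")` to list the areas of the running Python process.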

You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts back to the “classic” layout shown below:

That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.

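(That is the end of this translation. When I have time I will continue with the two follow-up articles the author mentions; you can also read the originals at the address above.)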

Anatomy of a Program in Memory | Many But Finite

How The Kernel Manages Your Memory | Many But Finite

Page Cache, the Affair Between Memory and Files | Many But Finite

How The Kernel Manages Your Memory

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:


Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two work horses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:


Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.


A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.

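The lookup the red-black tree serves, finding the area that covers a given address, can be sketched with a binary search over a list sorted by start address. This is a stand-in for the tree, not the kernel's own lookup code; the VMA tuple, the find_containing_vma helper, and the addresses are all made up for illustration.

```python
import bisect
from typing import NamedTuple, Optional


class VMA(NamedTuple):
    start: int   # first address in the area
    end: int     # one past the last address
    name: str


def find_containing_vma(vmas, addr) -> Optional[VMA]:
    """vmas must be sorted by start address and must not overlap."""
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1   # last area starting at or below addr
    if i >= 0 and addr < vmas[i].end:
        return vmas[i]
    return None


# Hypothetical layout, addresses made up for illustration.
AREAS = [
    VMA(0x08048000, 0x08049000, "text"),
    VMA(0x08049000, 0x0804A000, "data"),
    VMA(0x0804A000, 0x0806B000, "heap"),
    VMA(0xBFFDF000, 0xC0000000, "stack"),
]
```

Because areas never overlap, sorting by start address is enough to make the search unambiguous; walking the same list in order gives the output of the maps file.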

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.


The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3GB of user space in 4KB pages:

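The page arithmetic above is just shifting and masking. A minimal sketch, assuming 4KB pages; the helper names are invented for the example.

```python
PAGE_SHIFT = 12                 # 4KB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_SHIFT


def page_number(addr: int) -> int:
    """Index of the page that contains addr."""
    return addr >> PAGE_SHIFT


def page_offset(addr: int) -> int:
    """Position of addr inside its page."""
    return addr & (PAGE_SIZE - 1)


USER_SPACE = 3 * 1024 ** 3      # 3GB of user space
USER_PAGES = USER_SPACE // PAGE_SIZE
```

With these definitions, 3GB of user space comes to 786432 pages of 4KB each.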

4KB Pages Virtual User Space

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process' page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:


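(Added note, 2020-05-21: how does the MMU turn a virtual address into a physical one? It looks it up in the page tables.

A few MMU terms:
PTE: page table entry, one specific translation rule, for example mapping the page at virtual address 0x84050000 to the physical page at 0x8000.
TLB: translation lookaside buffer, a cache inside the MMU that holds recently used PTEs.

Every process has its own page tables (the mappings for the lower 3GB differ, those for the upper 1GB are the same), so on a process switch the address of the new process's page tables is written into the MMU, which replaces the whole set of translation rules.)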

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.


Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them, they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4 GB. The other PTE fields are for another day, as is Physical Address Extension.

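The flags discussed so far can be decoded from a 32-bit PTE with a few masks. The bit positions below follow the classic x86 layout for a 4KB page (P is bit 0, R/W bit 1, U/S bit 2, A bit 5, D bit 6, frame address in the upper 20 bits); the constant names, the decode_pte helper, and the example value are made up for illustration and are not the kernel's own accessors.

```python
PTE_PRESENT = 1 << 0    # P
PTE_WRITABLE = 1 << 1   # R/W
PTE_USER = 1 << 2       # U/S
PTE_ACCESSED = 1 << 5   # A
PTE_DIRTY = 1 << 6      # D
FRAME_MASK = 0xFFFFF000  # upper 20 bits: 4KB-aligned physical address


def decode_pte(pte: int) -> dict:
    """Decode the flags discussed above from a 32-bit PTE for a 4KB page."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
        "frame_address": pte & FRAME_MASK,
    }


# Hypothetical PTE value, made up for illustration.
example = decode_pte(0x12345067)
```

Only 20 bits are left for the frame address once the low 12 bits are used for flags, which is where the 4 GB limit on addressable physical memory comes from: 2^20 frames of 4KB each.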

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.


Virtual memory doesn't store anything, it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:


Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.


Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:


Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.


A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says "sure", and it creates or updates the appropriate VMA. But it does not actually honor the request right away, it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:


Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.


When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.


Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.

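The contract between VMAs and PTEs can be modeled in a few lines. This is a toy sketch of the anonymous-memory case only: brk() just moves the end of the heap area, and a frame is handed out on the first access to each page. The ToyProcess class and its fields are hypothetical and are not modeled on the kernel's data structures; swapping and permission checks are left out.

```python
PAGE_SIZE = 4096


class SegmentationFault(Exception):
    pass


class ToyProcess:
    """Toy model of one heap VMA plus the page table entries behind it."""

    def __init__(self, heap_start):
        self.heap_start = heap_start
        self.heap_end = heap_start      # empty heap VMA
        self.page_table = {}            # virtual page number -> frame number
        self.next_free_frame = 0
        self.faults = 0

    def brk(self, new_end):
        # Only the VMA changes; no page frame is allocated here.
        self.heap_end = new_end

    def access(self, addr):
        vpn = addr // PAGE_SIZE
        if vpn not in self.page_table:          # PTE not present: page fault
            self.faults += 1
            if not (self.heap_start <= addr < self.heap_end):
                raise SegmentationFault(hex(addr))   # no VMA covers the address
            # Allocate a frame and fill in the PTE.
            self.page_table[vpn] = self.next_free_frame
            self.next_free_frame += 1
        return self.page_table[vpn]
```

After `brk()` the page table is still empty; the first access to a page faults once and installs the mapping, and later accesses to the same page do not fault at all.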

This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.


Page Cache, the Affair Between Memory and Files

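How the render process reads the file scene.dat (normal read and mmap)

Suppose a process named render needs to read the file scene.dat. The steps that actually take place are:

1. The render process asks the kernel to read scene.dat.

2. The kernel uses the inode of scene.dat to find the corresponding address_space (struct inode->struct address_space *i_mapping;) and looks there for the cached page. If the page is not found, the kernel allocates a memory page and adds it to the page cache.

3. The relevant part of scene.dat is read from disk to fill that page in the page cache. This is the first copy.

4. The contents are copied from the page in the page cache into memory in the render process's heap. This is the second copy.

Physical memory ends up holding two copies of the same scene.dat contents: one in the page cache and one in the physical memory that backs the user process's heap.

Now consider how a memory-mapped file (mmap) gets by with a single copy. With mmap the only copy is the one from the disk file into the page cache.

mmap creates a virtual memory area, a vm_area_struct; the process's task_struct keeps track of all of the process's virtual memory areas. The process's page table entries (PTEs) for that area are then made to point directly at the physical pages that hold the page cache. The area created by mmap is not the same virtual memory area as the process heap, so mmap memory lives outside the heap.

Finally, a few concepts to make explicit:

1. A user process can reach memory only through its page tables; the kernel can reach physical memory through its own virtual addresses.

2. A user process cannot access the kernel's address space, meaning the kernel's virtual address space. The user and kernel ranges of the virtual address space do not overlap, and the kernel range requires privileged access.

3. The page structure represents a physical page frame. The same physical memory can be accessed by both the kernel and a user process, as long as the user process's page table entries also point at it. That is how mmap works.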

References:

"Page Cache, the Affair Between Memory and Files"

"Linux Kernel: What is the major difference between the buffer cache and the page cache?"

《深入Linux内核架构》 (the Chinese edition of "Professional Linux Kernel Architecture")

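A brief introduction to the OS page cache

In a modern computer the CPU, RAM, and disk run at very different speeds; the gaps between CPU and RAM and between RAM and disk are often orders of magnitude. As a compromise between speed and capacity, a CPU cache sits between the CPU and RAM to speed up memory access, and the operating system uses the page cache between RAM and disk to speed up file access.

When handling files, the operating system has to deal with two problems:

1. Hard drives are slow compared with memory, and disk seeks are especially costly.

2. File contents should be loaded into physical memory once and shared among programs.

Fortunately, the page cache solves both problems. The page cache is where the kernel stores file contents in page-sized chunks. Suppose a program named render needs to read 512 bytes of scene.dat. The flow is:

1. render requests 512 bytes of scene.dat with the system call read(scene.dat, to_heap_buf, 512, offset=0).

2. The kernel searches the page cache for the 4KB chunk of scene.dat that satisfies the request. If the data is not cached yet, it goes on to the next step.

3. The kernel allocates a page frame and starts the I/O, requesting 4KB of data starting at offset 0 and copying it into the page frame.

4. The kernel copies 512 bytes from the page cache into render's buffer, and the read() system call returns.

All regular file I/O goes through the page cache. To the operating system a disk file is a sequence of data blocks whose size varies between systems; on x86 Linux it is 4KB, one standard page. When handling a file I/O request the kernel first looks in the page cache, where each chunk is tagged with its file and offset. On a miss it starts disk I/O, loads the block from the file into a free page in the page cache, and then copies it to the user buffer.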
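The read path in the steps above can be sketched as a toy page cache keyed by file and page index. This is an illustrative model only: a dictionary stands in for the disk, and the ToyPageCache class and its fields are made up for the example.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    def __init__(self, disk):
        self.disk = disk            # file name -> bytes, standing in for the disk
        self.pages = {}             # (file name, page index) -> bytes
        self.disk_reads = 0

    def _get_page(self, name, index):
        key = (name, index)
        if key not in self.pages:   # miss: fetch one 4KB chunk from "disk"
            start = index * PAGE_SIZE
            self.pages[key] = self.disk[name][start:start + PAGE_SIZE]
            self.disk_reads += 1
        return self.pages[key]

    def read(self, name, size, offset=0):
        out = bytearray()
        while size > 0:
            page = self._get_page(name, offset // PAGE_SIZE)
            within = offset % PAGE_SIZE
            chunk = page[within:within + size]
            if not chunk:           # past end of file
                break
            out += chunk            # copy from the cache into the caller's buffer
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)
```

Reading 512 bytes costs one disk read of a whole page; a second read from the same page is served from the cache, and the final `out += chunk` is the second copy that mmap avoids.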

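Clearly the same file data is held in memory twice. This wastes memory, costs an extra copy, and lowers CPU cache efficiency. To address this, operating systems provide memory mapping (mmap in Linux, file mapping in Windows).

When mmap is called, the system does not allocate memory for it right away; it only adds a VMA (Virtual Memory Area) to the process. When the program touches the mapped range, a page fault occurs. The fault handler looks for the needed file chunk in the page cache; on a miss it starts disk I/O to load the chunk into the page cache, and then maps the physical page that holds it into the process's mmap address range.

When the program exits or closes the file, does the system immediately discard the corresponding pages from the page cache? No. Another process may access the file, or the same process may come back to it later, so as long as there is enough physical memory the pages stay in the page cache, which improves overall system performance. The kernel reclaims page cache pages only when physical memory runs short.

When a process modifies a file with write, the page cache means the change does not reach the disk immediately. It is applied to the page cache and the target page is marked dirty. The update is written to disk later, when the kernel writes the page back, for instance before reclaiming it, or when sync is called explicitly.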

///

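Page cache and buffer cache in Linux

Memory as shown by the free command

First, let us look at how memory usage is reported.

Mem: physical memory statistics.
total: total physical memory (total = used + free).
used: the total amount allocated, including memory given to buffers and cache, though part of that cache may not actually be in use.
free: memory that has not been allocated.
shared: shared memory.
buffers: memory the system has allocated to buffers that is not in use.
cached: memory the system has allocated to cache that is not in use.
-/+ buffers/cache: physical memory statistics adjusted for the caches.
used2: used - buffers - cached from the first row; this is the memory actually in use. (used2 is the value in the second row.)
free2 = buffers1 + cached1 + free1 (free2 is the value in the second row; buffers1 and the others are from the first row.)
free2: the sum of the unused buffers, the unused cache, and the unallocated memory; this is the memory actually available to the system right now.
Swap: usage of the swap partition on disk.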
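The two formulas for the second row are simple enough to check by hand. A minimal sketch; the adjust_for_caches helper is hypothetical and the numbers are invented, not real free output.

```python
def adjust_for_caches(total, used, free, buffers, cached):
    """Derive the "-/+ buffers/cache" row from the Mem row (same unit throughout)."""
    if total != used + free:
        raise ValueError("expected total = used + free")
    used2 = used - buffers - cached      # memory actually in use by programs
    free2 = free + buffers + cached      # memory actually available
    return used2, free2


# Hypothetical Mem row in MB, made up for illustration.
used2, free2 = adjust_for_caches(total=8000, used=7500, free=500,
                                 buffers=300, cached=4200)
```

The adjustment only moves the buffers and cache from one side to the other, so used2 + free2 still adds up to total.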

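The buffers and cache shown by free both occupy memory:

buffer: memory used as the buffer cache. It is the read/write buffer for block devices and sits closer to the storage device, or is simply the disk's buffer.

cache: memory used as the page cache, the file system's cache held in memory.

A large cache value means many files are cached. If the frequently accessed files all fit in the cache, read I/O to the disk will be very low.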

Page cache（页面缓存）

Page cache 也叫页缓冲或文件缓冲，是由好几个磁盘块构成，大小通常为4k，在64位系统上为8k，构成的几个磁盘块在物理磁盘上不一定连续，文件的组织单位为一页，也就是一个page cache大小，文件读取是由外存上不连续的几个磁盘块，到buffer cache，然后组成page cache，然后供给应用程序。

Page cache在linux读写文件时，它用于缓存文件的逻辑内容，从而加快对磁盘上映像和数据的访问。具体说是加速对文件内容的访问，buffer cache缓存文件的具体内容——物理磁盘上的磁盘块，这是加速对磁盘的访问。

Buffer cache（块缓存）

Buffer cache 也叫块缓冲，是对物理磁盘上的一个磁盘块进行的缓冲，其大小为通常为1k，磁盘块也是磁盘的组织单位。设立buffer cache的目的是为在程序多次访问同一磁盘块时，减少访问时间。系统将磁盘块首先读入buffer cache，如果cache空间不够时，会通过一定的策略将一些过时或多次未被访问的buffer cache清空。程序在下一次访问磁盘时首先查看是否在buffer cache找到所需块，命中可减少访问磁盘时间。不命中时需重新读入buffer cache。对buffer cache的写分为两种，一是直接写，这是程序在写buffer cache后也写磁盘，要读时从buffer cache上读，二是后台写，程序在写完buffer cache后并不立即写磁盘，因为有可能程序在很短时间内又需要写文件，如果直接写，就需多次写磁盘了。这样效率很低，而是过一段时间后由后台写，减少了多次访磁盘的时间。

Buffer cache是由物理内存分配，Linux系统为提高内存使用率，会将空闲内存全分给buffer cache ，当其他程序需要更多内存时，系统会减少cache大小。

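Buffer page

When the kernel needs to access a single block on its own, a buffer page is involved, and the kernel checks the corresponding buffer head.

Swap space

Swap space is one of the ways virtual memory shows up in practice. To cope with applications that need large amounts of memory, the system uses disk space as an extension of memory: when physical memory runs short, data that is not needed for the moment is moved out to swap space, also called a swap file or page file. The benefit is that each process can behave as if it had all of the system's physical memory to itself, because while one process is accessing its data, the data of other processes can be swapped out.

Swap cache

The swap cache figure gives the size of the swap cache. The page cache is the in-memory cache of disk file data, whereas the swap cache is a temporary in-memory cache of pages belonging to the swap area.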

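Memory mapping

The kernel has two types of memory mapping: shared and private. A private mapping is the more efficient choice when a process only reads a file and does not write it. Any write to a privately mapped page, however, makes the kernel stop mapping that page from the file (the process gets its own copy), so the write neither changes the file on disk nor becomes visible to other processes accessing the file.

Pages of a shared mapping normally live in the page cache, and so do pages of a private mapping as long as they have not been modified. When a process tries to modify a privately mapped page, the kernel copies the page and replaces the original with the copy in the page table. Because the page table now points elsewhere, the original page stays in the page cache but no longer belongs to this mapping. The new copy is not inserted into the page cache; it is added to the anonymous-page reverse-mapping data structures instead.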
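The difference between the two mapping types can be observed from user space. The sketch below assumes a Unix-like system (the `flags` and `prot` arguments of Python's `mmap.mmap` are Unix-only); the helper `poke_first_byte` and the scratch file are hypothetical.

```python
import mmap
import os
import tempfile

def poke_first_byte(path, flags):
    """Overwrite byte 0 through a mapping; return (bytes seen in the mapping, bytes in the file)."""
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), 0, flags=flags,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[0:1] = b"X"          # on a private mapping this triggers copy-on-write
            in_mapping = m[0:5]
            if flags == mmap.MAP_SHARED:
                m.flush()          # msync: push the shared change towards the file
        finally:
            m.close()
    with open(path, "rb") as f:
        return in_mapping, f.read(5)

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "scene.dat")
    with open(path, "wb") as f:
        f.write(b"hello")
    print(poke_first_byte(path, mmap.MAP_PRIVATE))  # (b'Xello', b'hello')
    print(poke_first_byte(path, mmap.MAP_SHARED))   # (b'Xello', b'Xello')
```

In both cases the process sees its own write through the mapping; only the shared mapping lets the change reach the file.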

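Differences between the page cache and the buffer cache

Disk operations happen at a logical level (the filesystem) and a physical level (disk blocks), and the two caches hold data at these two levels respectively.

If we operate on a file through the filesystem, the file is cached in the page cache. When the file has to be flushed, the page cache hands the work to the buffer cache, since the buffer cache is what caches disk blocks.

In other words, operating on files goes through the page cache, while operating on disk blocks directly with tools such as dd goes through the buffer cache.

The page cache belongs to the filesystem: it caches files, and data at the file level is cached there. The logical layout of a file has to be mapped onto the physical disk, and the filesystem maintains that mapping. When page cache data needs to be flushed it is handed to the buffer cache, but from kernel 2.6 on this step is trivial and no separate caching operation is really involved.

The buffer cache caches disk blocks. Data read or written directly on the disk, without going through the filesystem, is cached in the buffer cache; filesystem metadata, for example, is cached there.

In short, the page cache caches file data and the buffer cache caches disk data. With a filesystem in place, file operations are cached in the page cache; reading or writing the disk directly with tools such as dd is cached in the buffer cache.

Buffer (buffer cache) buffers block-device operations in units of blocks and is synced to disk periodically or on request. Its purpose is to absorb writes and then commit many changes to disk in one go, avoiding frequent disk writes and improving write efficiency.

Cache (page cache) caches filesystem files in units of pages for the programs that need them. Its purpose is to serve reads from memory, avoiding frequent disk reads and improving read efficiency.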

///

Page Cache, the Affair Between Memory and Files.页面缓存-内存与文件的那些事 - Michael_Tong唐唐 - 博客园

Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.

上次我们考察了内核如何为一个用户进程管理虚拟内存，但是没有涉及文件及I/O。这次我们的讨论将涵盖非常重要且常被误解的文件与内存间关系的问题，以及它对系统性能的影响。

Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5 GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.

提到文件，操作系统必须解决两个重要的问题。首先是硬盘驱动器的存取速度缓慢得令人头疼（相对于内存而言），尤其是磁盘的寻道性能。第二个是要满足'一次性加载文件内容到物理内存并在程序间共享'的需求。如果你使用进程浏览器翻看Windows进程，就会发现大约15MB的共享DLL被加载进了每一个进程。我目前的Windows系统就运行了100个进程，如果没有共享机制，那将消耗大约1.5GB的物理内存仅仅用于存放公用DLL。这可不怎么好。同样的，几乎所有的Linux程序都需要ld.so和libc，以及其它的公用函数库。

Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. The first read goes like this:

幸运的是，这两个问题都可以一次性解决：页面缓存，内核在其中存储页面大小倍数的文件块（上文说过，页面大小是以4KB为一个基本单元）。为了演示页面缓存，我将调用一个名为render的Linux程序，来打开文件。每次读取512字节，将文件内容存储到堆分配的块中。第一个解读是这样的:

After 12KB have been read, render's heap and the relevant page frames look thus:

在读取了12KB以后，render的堆以及相关的页帧情况如下：

This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4KB page frames are now in the page cache storing part of scene.dat. People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4KB chunks. If you read a single byte from a file, the whole 4KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256KB views analogous to pages in the Linux page cache.

这看起来很简单，但还有很多事情会发生。首先，即使这个程序只调用了常规的read函数，此时也会有三个 4KB的页帧存储在页面缓存当中，它们持有scene.dat的一部分数据。尽管有时这令人惊讶，但的确所有的常规文件I/O都是通过页面缓存来进行的。在x86 Linux里，内核将文件看作是4KB大小的数据块的序列。即使你只从文件读取一个字节，包含此字节的整个4KB数据块都会被读取，并放入到页面缓存当中。这样做是有道理的，因为磁盘的持续性数据吞吐量很不错，而且一般说来，程序对于文件中某区域的读取都不止几个字节。页面缓存知道每一个4KB数据块在文件中的对应位置，如上图所示的#0, #1等等。与Linux的页面缓存类似，Windows使用256KB的views。
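The mapping from file offsets to 4KB chunks is plain integer arithmetic. As a rough sketch (the function name `pages_touched` is made up for illustration, and the 4KB page size is the x86 value quoted above), the 24 reads of 512 bytes that render issues to cover 12KB land in chunks #0, #1 and #2:

```python
PAGE_SIZE = 4096  # 4KB chunks, as on x86 Linux in the text above

def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page-cache chunk indices covered by a read of length bytes at offset."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)

touched = set()
for i in range(24):                      # 24 reads x 512 bytes = 12KB
    touched.update(pages_touched(i * 512, 512))
print(sorted(touched))                   # [0, 1, 2]
print(list(pages_touched(5000, 1)))      # [1]: one byte still pulls in a whole chunk
```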

Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes cpu time and hurts the cpu caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:

不幸的是，在一个普通的文件读取操作中，内核必须复制页面缓存的内容到一个用户缓冲区中，这不仅消耗CPU时间，伤害了CPU cache的性能，还因为存储了重复信息而浪费物理内存。如上面每张图所示，scene.dat的内容被保存了两遍，而且程序的每个实例都会保存一份。至此，我们缓和了磁盘延迟的问题，但却在其余的每个问题上惨败。内存映射文件（memory-mapped files）将引领我们走出混乱：

When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the Unix Environment. You might also save large amounts of physical memory, depending on the nature of your application.

当你使用文件映射的时候，内核将你的程序的虚拟内存页直接映射到页面缓存上。这将导致一个显著的性能提升：《Windows系统编程》指出常规的文件读取操作运行时性能改善30%以上；《Unix环境高级编程》指出类似的情况也发生在Linux和Solaris系统上。你还可能因此而节省下大量的物理内存，这依赖于你的程序的具体情况。

As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too: it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.

对于效率来说，如何衡量是关键，但是，memory mapping还是值得所有程序猿们关注。相关的API相当不错，它允许你访问文件就像是在内存中访问字节并且不需要你花费过多精力来关注这些。牢记你所处的体系结构的地址空间结构，Unix类似的系统中通过mmap系统调用而windows中使用CreateFileMapping，或者是在其它高级语言中的一些包装函数。当你尝试映射一个文件，它的内容不会立即全部映射到内存，而是当内核检测到page faults后，才会触发内核读入文件内容的动作。页错误的处理程序首先会按照需要的文件内容分配页frame，然后把映射你所需要的虚拟页到对应的页cache。在所需要的内容不在页cache中的时候才会去进行必要的磁盘I/O操作。
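As an illustration of the "file as bytes in memory" API, here is a minimal sketch using Python's `mmap` wrapper. The helper `read_via_mapping` and the three-page stand-in for scene.dat are assumptions made for the example.

```python
import mmap
import os
import tempfile

def read_via_mapping(path, offset, size):
    """Return size bytes at offset by indexing a read-only mapping of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; creating the mapping does not by itself
        # require the file contents to be read
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing touches only the pages that cover this byte range
            return m[offset:offset + size]

# Hypothetical stand-in for scene.dat: three pages, each filled with its own letter
with tempfile.TemporaryDirectory() as scratch:
    scene = os.path.join(scratch, "scene.dat")
    with open(scene, "wb") as f:
        for letter in (b"A", b"B", b"C"):
            f.write(letter * mmap.PAGESIZE)
    print(read_via_mapping(scene, 2 * mmap.PAGESIZE, 4))  # b'CCCC'
```

Only the pages backing the slice need to be resident; whether neighbouring pages are also loaded depends on the kernel's read-ahead.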

Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM, it's actually good behavior because in a way free physical memory is a waste. Better use as much of the stuff for caching as possible.

现在来测试下。假设我们上面提到的最后一个render程序的实例退出了。是否保存有scene.dat的数据的页cache会立即被释放？人们可能认为会被立即释放，但是这是个坏主意。你可以想一下，通常我们会在一个程序中创建一个文件，然后在另外一个程序中使用它。页cache必须要能处理这种情况。如果你能够想的更多些，为什么内核要丢掉页cache中的内容呢？请记住，磁盘比RAM慢至少5个数量级，因此页cache hit应该尽可能得到保证。一旦有足够的物理内存，cache应该被全部填充。它不是局限于某一个特定程序，而是一个全系统范围内的资源。如果你持续运行render一周，那么scene.dat会一直保存在cache中，从中你得到最大的效率！这就是为什么内核的cache大小会持续上升直到它的一个临界点。这不是OS在浪费你的RAM，而是一个深思熟虑的优化的行为，因为频繁的释放物理内存可能是一种浪费。最大程度的被cache的数据。

Due to the page cache architecture, when a program calls write() bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes your writes will never make it, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read ahead where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise(), readahead(), Windows cache hints). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or NO_BUFFERING in Windows, something database software often does.

鉴于页cache的体系结构，当一个程序调用write()系统调用，字节被简单的被拷贝到页cache中，并且该页被标示为dirty。磁盘的I/O通常不会立即发生，因此你的程序不会因为等待磁盘操作而被阻塞。不好的一方面是，如果你电脑当机了，那么你之前的写操作就再也不会被完成，因此像数据库的存储过程这样的重要文件必须要执行fsync()。（当然，你也可能对drive controller caches不太放心！！）另一方面，读操作通常会阻塞你的程序直到数据真正被获得。内核通过预读之类的eager laoding的方法来减轻这种阻碍，预读就是内核读取比你要求的还多的page进入页cache中。你可以通过内核对外提供的结构来影响内核的eager loading策略，比如你计划是顺序读取还是随机读取（参见madvise(),readahead(),Windows cache hints）。Linux对memory-mapped files进行预读，但是我不确信windows也这么做。当然，你也可以在Linux中使用O_DIRECT或者在windows中使用NO_BUFFERING来跳过页cache，通常一些数据库软件会采用这种做法。
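Access-pattern hints can be given from user code as well. The sketch below uses `os.posix_fadvise`, which the text above does not name; it is the file-descriptor counterpart of `madvise()` and is an assumption of this example, as is the helper name `read_sequentially`. The hint is advisory, so the code carries on if it is unavailable.

```python
import os
import tempfile

def read_sequentially(path, chunk=512):
    """Read the whole file in fixed-size chunks after hinting sequential access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset 0 with length 0 means "the whole file"
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # the hint is advisory; continue without it
        total = 0
        while True:
            block = os.read(fd, chunk)   # blocks until the data is available
            if not block:
                break
            total += len(block)
        return total
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "scene.dat")
    with open(path, "wb") as f:
        f.write(b"\0" * 12288)           # 12KB, as in the render example
    print(read_sequentially(path))       # 12288
```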

A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:

一个文件的mapping可以是私有的或者共享的。这只会影响对内存中的内容做updates的操作：在私有模式的mapping中，updates操作不会把结果提交到磁盘，本次的updates结果对其它程序也不可见，而在共享的mapping模式下，确是相反的情况。内核使用copy on write机制，由page table（PTE控制私有还是共享）项中的选项域来启用，从而实现私有mapping。下面的例子中，render和render3d都会采用私有模式来map scene.dat。Render会对它的映射了文件内容的虚拟空间进行写操作：

The read-only page table entries shown above do not mean the mapping is read only, they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up in the disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.

上图所示的只读的页表项并不意味着映射是只读的，它们只是内核耍的小把戏，用于共享物理内存直到可能的最后一刻。你会发现'私有'一词是多么的不恰当，你只需记住它只在数据发生更改时起作用。此设计所带来的一个结果就是，一个以私有方式映射文件的虚拟内存页可以观察到其他进程对此文件的改动，只要之前对这个内存页进行的都是读取操作。一旦发生过写时拷贝，就不会再观察到其他进程对此文件的改动了。此行为不是内核提供的，而是在x86系统上就会如此。另外，从API的角度来说，这也是合理的。与此相反，共享映射只是简单的映射到页面缓存，仅此而已。对页面的所有更改操作对其他进程都可见，而且最终会执行磁盘操作。最后，如果此共享映射是只读的，那么页故障将触发段错误（segmentation fault）而不是写时拷贝。

Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it, it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.

动态加载库通过file mapping来到你的程序空间。这里没有任何神秘的东西，它只是通过通常的APIs提供给你的一个私有模式的file mapping。下图是2个采用file mapping实现的render程序实例的地址空间，包括物理内存及之前我们提到的一些概念：

Note that each instance's stack and heap map to their own page frames, whereas the shared libraries and the file-mapped scene.dat share the same page frames.

This concludes our 3-part series on memory fundamentals. I hope the series was useful and provided you with a good mental model of these OS topics.

至此我们完成了内存基础知识的三部曲系列。我希望这个系列对您有用，并在您头脑中建立一个好的操作系统模型。

*************************************************************************************************************************************************

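Creating and updating Linux page tables

Put simply, discussing Linux page tables means discussing the page tables of Linux processes: creating and updating page tables is part of creating and updating processes. The current Linux kernel uses copy-on-write. When a process is created, the parent's page tables are copied in full and the entries in both parent and child are write-protected (so a write to such an address raises a page fault). Whichever of the two then writes to its address space takes a page fault; a new page is allocated for the writer, each process ends up with its own writable page, and in this way the parent's and child's address spaces gradually diverge.
Given that, only three things need discussing: initialization of the first process's page tables, copying page tables at process creation, and updating page tables on page faults.
1. Initializing the page tables of init_task
The address space of init_task is init_mm, which is assigned to current->active_mm during kernel initialization. The initial page table of init_mm is swapper_pg_dir. On the MIPS architecture swapper_pg_dir is initialized in the function pagetable_init, with the following relationship:
swapper_pg_dir -> invalid_pmd_table -> invalid_pte_table, or
swapper_pg_dir -> invalid_pte_table.
That is, in init_mm every page table entry ultimately points to invalid_pte_table.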

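2. Copying page tables when a process is created
Process creation normally goes through do_fork, along this call chain:
do_fork->copy_process->copy_mm->dup_mm->dup_mmap->copy_page_range
copy_page_range is the function responsible for copying the page tables. Its core loop is:
    do {
        next = pgd_addr_end(addr, end);
        /* skip top-level entries that are empty or corrupt */
        if (pgd_none_or_clear_bad(src_pgd))
            continue;
        if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
                                    vma, addr, next))) {
            ret = -ENOMEM;
            break;
        }
    } while (dst_pgd++, src_pgd++, addr = next, addr != end);
copy_pud_range copies the pud table; it calls copy_pmd_range, which calls copy_pte_range, and together they copy the three levels of page tables. Note the following code in copy_one_pte, which copy_pte_range calls:
    if (is_cow_mapping(vm_flags)) {
        /* write-protect the parent's entry and the copy given to the child */
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }
This checks whether the mapping is copy-on-write and, if so, write-protects the page in both parent and child, which leads to the page fault described next.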

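3. Updating page tables on a page fault
As the initialization above shows, the page tables of init_mm all point to invalid tables. An ordinary process cannot have all of its entries invalid, so there must be a mechanism that grows the page tables over time, and that mechanism is the page fault.
Taking MIPS as an example, a page fault ends up in do_page_fault, which calls handle_mm_fault. handle_mm_fault is architecture-independent code that virtually every page fault passes through. Its core is:
    pud = pud_alloc(mm, pgd, address);
    if (!pud)
        return VM_FAULT_OOM;
    pmd = pmd_alloc(mm, pud, address);
    if (!pmd)
        return VM_FAULT_OOM;
    pte = pte_alloc_map(mm, pmd, address);
    if (!pte)
        return VM_FAULT_OOM;
pud_alloc is defined as:
    static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
    {
        return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address)) ?
            NULL : pud_offset(pgd, address);
    }
pgd_none checks whether the pgd entry is invalid. If it is, __pud_alloc is called to allocate a pud table, and NULL is returned only if that allocation fails; in every other case pud_offset yields the pud entry for the address and the walk continues.
pmd_alloc and pte_alloc_map work the same way.

So during a page fault the three tables are walked in turn by address. Wherever an entry is invalid, for example pointing at invalid_pmd_table or invalid_pte_table, a new table is allocated to replace the invalid entry. This is how the page tables grow.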
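The allocate-on-fault idea can be modelled with nested dictionaries. This is a toy model only, not the kernel's data structure: the class `ToyPageTable`, its three-level shape (pgd entry, pmd table, pte table) and the frame numbers are all assumptions made for illustration.

```python
class ToyPageTable:
    """Three-level table whose lower-level tables are allocated lazily on fault."""

    def __init__(self):
        self.pgd = {}              # top level; a missing key stands for an invalid entry
        self.tables_allocated = 0

    def _alloc(self, parent, index):
        # Mirrors the pud_alloc pattern: allocate only if the entry is invalid
        if index not in parent:
            parent[index] = {}
            self.tables_allocated += 1
        return parent[index]

    def handle_fault(self, pgd_i, pmd_i, pte_i, frame):
        pmd_table = self._alloc(self.pgd, pgd_i)
        pte_table = self._alloc(pmd_table, pmd_i)
        pte_table[pte_i] = frame   # install the mapping for the faulting page

    def lookup(self, pgd_i, pmd_i, pte_i):
        try:
            return self.pgd[pgd_i][pmd_i][pte_i]
        except KeyError:
            return None            # would raise a page fault on real hardware

pt = ToyPageTable()
pt.handle_fault(0, 0, 0, frame=100)   # allocates a pmd table and a pte table
pt.handle_fault(0, 0, 1, frame=101)   # reuses both tables
pt.handle_fault(0, 1, 0, frame=102)   # needs one more pte table
print(pt.tables_allocated)            # 3
```

Neighbouring pages share their lower-level tables, so the table structure grows much more slowly than the number of mapped pages.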

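Note that handle_mm_fault ultimately calls handle_pte_fault, which contains the following code:
    if (flags & FAULT_FLAG_WRITE) {
        /* write to a write-protected entry: take the copy-on-write path */
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address,
                              pte, pmd, ptl, entry);
        entry = pte_mkdirty(entry);
    }
So when a page fault is a write to a write-protected entry, do_wp_page is called; this is where the parent/child separation of copy-on-write described above is handled.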
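The observable effect of this mechanism is that a write in the child never shows up in the parent. The sketch below demonstrates only that effect on a Unix-like system; it cannot show the page-level copy itself, and the helper `fork_and_modify` is hypothetical.

```python
import os

def fork_and_modify():
    """Fork, let the child overwrite a buffer, and return (parent's view, child's view)."""
    data = bytearray(b"parent")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: shares the parent's pages until it writes
        try:
            os.close(r)
            data[:] = b"child!"       # the write lands in the child's private copy
            os.write(w, bytes(data))
        finally:
            os._exit(0)               # leave without running the parent's cleanup
    os.close(w)
    child_view = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return bytes(data), child_view

print(fork_and_modify())              # (b'parent', b'child!')
```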

These three parts make up the overall framework of Linux page table handling.

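Three-level page table diagram [reposted]

Figure 3.3: Linux's three-level page table structure

Linux, as described here, always assumes the processor has three levels of page tables. Each page table holds the page frame numbers of the tables one level below it. Figure 3.3 shows how a virtual address is split into fields, each giving an offset into a particular page table. To translate a virtual address into a physical one, the processor takes the value of each field in turn, uses it as an offset into the current page table, and reads the page frame number of the next-level table. This is repeated three times until the page frame number of the physical page corresponding to the virtual address is found. Finally, the last field of the virtual address is used as the offset of the data within that page.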
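Splitting an address into fields is a matter of shifts and masks. The field widths below (2, 9 and 9 index bits plus a 12-bit page offset, as on 32-bit x86 with PAE) are an assumption chosen for illustration and are not taken from Figure 3.3; the function name `split_vaddr` is likewise hypothetical.

```python
def split_vaddr(addr, widths=(2, 9, 9), offset_bits=12):
    """Split a virtual address into one index per table level plus the page offset."""
    offset = addr & ((1 << offset_bits) - 1)
    rest = addr >> offset_bits
    indices = []
    for width in reversed(widths):        # peel fields off from the lowest level up
        indices.append(rest & ((1 << width) - 1))
        rest >>= width
    return tuple(reversed(indices)) + (offset,)

# Top-level index 3, all lower fields zero
print(split_vaddr(0xC0000000))            # (3, 0, 0, 0)
# An address built from known field values
addr = (1 << 30) | (2 << 21) | (3 << 12) | 0x45
print(split_vaddr(addr))                  # (1, 2, 3, 69)
```

Passing different `widths` models other layouts, for example `(10, 10)` for the classic two-level 32-bit x86 split.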

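To run across platforms, Linux provides a set of translation macros through which the kernel accesses the page tables of a given process. The kernel therefore does not need to know the layout of page table entries or how they are arranged.

This strategy has worked well: Linux uses the same page table manipulation code on the Alpha AXP, which has three levels of page tables, and on Intel x86 processors, which have two.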
