A Detailed Look at the Virtual Memory Layout of a Process
Preface: the original article comes from http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/
What follows is a translation of that article, with its figures redrawn. The translator's notes are my own additions, meant to explain the original or raise questions; my knowledge is limited, so I would be grateful if readers could point out any mistakes in the translation or the notes.
The content is presented below paragraph by paragraph, pairing the original text with annotations:
Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.
Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:
Translator's note: "32-bit mode" here presumably means 32-bit protected mode. The figure below shows the virtual memory layout shared by every process on a modern OS, with the kernel owning a fixed portion of each address space.
This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
Translator's note: on Linux with the usual 3/1 split, the kernel's reserved portion is the top 1 GB of each address space. Keeping kernel code and data always addressable matters because handling an interrupt or system call requires the processor to enter kernel space and reach its code and data structures. The figure below illustrates how only the user-mode half of the mapping changes across a task switch.
Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
Translator's note: "Intel-style segments" refers to segmented addressing, where the base addresses of the code, data, stack, and other segments are held in segment descriptors in the Global Descriptor Table (GDT), a kernel-level data structure whose base address is loaded into the GDTR register; the processor consults these descriptors to translate segmented addresses. The memory "segments" discussed below are unrelated: each is simply a range of virtual addresses. The figure below shows the standard layout of a Linux process's address space.
When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
Translator's note: this technique is what Windows (and Linux) call ASLR, Address Space Layout Randomization. It randomizes the starting addresses of the stack, the memory mapping segment, and the heap, so that a remote exploit can no longer count on absolute addresses (a stack address, a library function's entry point) being the same in every process.
The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the cpu caches, speeding up access. Each thread in a process gets its own stack.
Translator's note: on x86, pushing a frame amounts to moving the stack pointer (ESP) to a lower address, carving a new frame out of the stack; when the function returns, ESP moves back up to its pre-call value, destroying the frame. To make this concrete, I drew a diagram of how the stack changes during a function call:
It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
Translator's note: in contrast to the mapped stack area, the individual frames that make up the stack grow and shrink freely under ESP's control. Destroying frames never shrinks the mapping, and pushing frames does not extend it until the current mapping is exhausted, at which point it grows on demand up to the RLIMIT_STACK ceiling (usually 8 MB).
Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.
Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call (implementation) or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
Translator's note: conversely, a malloc() request smaller than MMAP_THRESHOLD is satisfied from the heap rather than from an anonymous mapping.
Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.
If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call (implementation) to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:
Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.
Translator's note: at link time the linker records only a size for the .bss section; no bytes for it are stored in the binary, and at load time the corresponding memory is created and zero-filled.
The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:
Translator's note: on 32-bit x86 the pointer gonzo itself occupies 4 consecutive bytes in the data segment, while it points at 0x080484f0, the address of the string's first character, 'G' (see the figure). Note that the literal's characters are read-only regardless of how the pointer is declared; only copying them into a writable array (e.g. char gonzo[] = "God's own prototype") would place the characters themselves in writable data.
You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts back to the “classic” layout shown below:
That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.
Reposted from: https://blog.51cto.com/shayi1983/1694924