From Virtual to Physical
In the previous article, we introduced a way to classify the memory a process claims. We used 4 quadrants built on two axes: private/shared and anonymous/file-backed. We also mentioned the complexity of the sharing mechanisms and the fact that all memory is basically claimed from the kernel.
Everything we talked about so far was virtual. It was all about the reservation of memory addresses, but a reserved address is not always immediately mapped to physical memory by the kernel. Most of the time, the kernel delays the actual allocation of physical memory until the first access (or the first write in some cases)… and even then, this is done at the granularity of a page (commonly 4KiB). Moreover, some pages may be swapped out after being allocated, meaning they get written to disk in order to let other pages be put in RAM.
As a consequence, knowing the actual amount of physical memory used by a process (known as the resident memory of the process) is genuinely hard… and the sole component of the system that actually knows it is the kernel (it is even one of its jobs). Fortunately, the kernel exposes some interfaces that let you retrieve statistics about the system or about a specific process. This article enters into the depths of the tools provided by the Linux ecosystem to analyze the memory pattern of processes.
On Linux, those data are exposed through the /proc file-system, and more specifically by the content of /proc/[pid]/. These directories (one per process) contain pseudo-files that are API entry points to retrieve information directly from the kernel. The content of the /proc directory is detailed in the proc(5) manual page (and changes from one Linux version to another).
The human front-ends for those API calls are tools such as procps (ps, top, pmap…). These tools display the data retrieved from the kernel with little to no modification. As a consequence, they are good entry points to understand how the kernel classifies memory. In this article we will analyze the memory-related outputs of top and pmap.
top: process statistics
top is a widely known (and used) tool for monitoring the system. It displays one line per process, with various columns that may contain CPU-related, memory-related, or more general information.
When running top, you can switch to the memory view by pressing G3. In that view, you will find, among others, the following columns: %MEM, VIRT, SWAP, RES, CODE, DATA, SHR. With the exception of SWAP, all these data are extracted from the file /proc/[pid]/statm that exposes some memory-related statistics. That file contains 7 numerical fields: size (mapped to VIRT), resident (mapped to RES), share (mapped to SHR), text (mapped to CODE), lib (always 0 on Linux 2.6+), data (mapped to DATA) and dt (always 0 on Linux 2.6+, mapped to nDrt).
Trivial columns
As you may have guessed, some of these columns are trivial to understand. VIRT is the total size of the virtual address space that has been reserved by the process so far. CODE is the size of the executable code of the binary executed by the process. RES is the resident set size, that is the amount of physical memory the kernel considers assigned to the process. As a direct consequence, %MEM is strictly proportional to RES.
The resident set size is computed by the kernel as the sum of two counters. The first one contains the number of anonymous resident pages (MM_ANONPAGES), the second one the number of file-backed resident pages (MM_FILEPAGES). Some pages may be considered resident for more than one process at once, so the sum of the RES column may be larger than the amount of RAM effectively used, or even larger than the amount of RAM available on the system.
Shared memory
SHR is the amount of resident shareable memory of the process. If you remember the classification we made in the previous article, you may suppose this includes all the resident memory of the right column. However, as already discussed, some private memory may be shared too. So, in order to understand the actual meaning of that column, we must dig a bit deeper into the kernel.
The SHR column is filled with the share field of /proc/[pid]/statm, which itself is the value of the MM_FILEPAGES counter of the kernel, one of the two components of the resident size. This means that this column contains the amount of file-backed resident memory (thus including quadrants 3 and 4).
That’s pretty cool… however, remember quadrant 2: shared anonymous memory does exist… the previous definition only includes file-backed memory… and yet running the following test program shows that shared anonymous memory is taken into account in the SHR column:
```c
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

int main(void)
{
    /* mmap 50MiB of shared anonymous memory */
    char *p = mmap(NULL, 50 << 20, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);

    /* Touch every single page to make them resident */
    for (int i = 0; i < (50 << 20) / 4096; i++) {
        p[i * 4096] = 1;
    }

    /* Let us see the process in top */
    sleep(1000000);
    return 0;
}
```
top indicates 50m in both the RES and the SHR columns1…
This is due to a subtlety of the Linux kernel. On Linux, a shared anonymous map is actually file-based: the kernel creates a file in a tmpfs (an instance of /dev/zero). The file is immediately unlinked, so it cannot be accessed by any other process unless it inherited the map (by forking). This is quite clever since the sharing is done through the file layer, the same way it is done for shared file-backed mappings (quadrant 4).
A last point: since private file-backed pages that are modified don’t get synced back to disk, they are not file-backed anymore (they are transferred from the MM_FILEPAGES counter to MM_ANONPAGES). As a consequence, they no longer count in SHR.
Note that the man page of top is wrong since it states that SHR may contain non-resident memory: “the amount of shared memory available to a task, not all of which is typically resident. It simply reflects memory that could be potentially shared with other processes.”
Data
The meaning of the DATA column is quite opaque. The documentation of top states “Data + Stack”… which does not really help since it does not define “Data”. Thus we need to dig once again into the kernel.
That field is computed by the kernel as the difference between two variables: total_vm, which is the same as VIRT, and shared_vm. shared_vm is somehow similar to SHR in that it shares the definition of shareable memory, but instead of only accounting for the resident part, it contains the sum of all addressed file-backed memory. Moreover, the count is done at the mapping level, not the page level, thus shared_vm does not have the same subtlety as SHR for modified private file-backed memory. As a consequence, shared_vm is the sum of quadrants 2, 3 and 4. This means that the difference between total_vm and shared_vm is exactly the content of quadrant 1.
The DATA column contains the amount of reserved private anonymous memory. By definition, private anonymous memory is the memory that is specific to the program and that holds its data. It can only be shared by forking, in a copy-on-write fashion. It includes (but is not limited to) the stacks and the heap2. This column does not contain any information about how much memory is actually used by the program; it just tells us that the program reserved some amount of memory, which may be left untouched for a long time.
A typical example of a meaningless DATA value is what happens when an x86_64 program compiled with Address Sanitizer is launched. ASan works by reserving 16TiB of memory, but only uses 1 byte of those terabytes per 8-byte word of memory actually allocated by the process. As a consequence, the output of top looks like this:
```
  PID %MEM    VIRT SWAP   RES  CODE    DATA  SHR COMMAND
16190 0.687 16.000t    0 56056 13784 16.000t 2912 zchk-asan
```
Note that the man page of top is once again wrong since it states that DATA is “the amount of physical memory devoted to other than executable code, also known as the ‘data resident set’ size or DRS”; we just saw that DATA has no link at all with resident memory.
Swap
SWAP is somehow different from the other columns. It is supposed to contain the amount of memory of the process that got swapped out by the kernel. First of all, the content of that column totally depends on both the version of Linux and the version of top you are running. Prior to Linux 2.6.34, the kernel didn’t expose any per-process statistics about the number of pages that were swapped out. Prior to top 3.3.0, top displayed totally meaningless information here (but that was in accordance with its man page). However, if you use Linux 2.6.34 or later with top 3.3.0 or later, that count is actually the number of pages that were swapped out.
If your top is too old, the SWAP column is filled with the difference between the VIRT and the RES columns. This is totally meaningless: that difference effectively contains the amount of memory that has been swapped out, but it also includes the file-backed pages that got unloaded and the pages that are reserved but untouched (and thus have not been actually allocated yet). Some old Linux distributions still ship a top with that buggy SWAP value, among them the still widely used RHEL5.
If your top is up-to-date but your kernel is too old, the column will always contain 0, which is not really helpful.
If both your kernel and your top are up-to-date, then the column will contain the value of the VmSwap field of the file /proc/[pid]/status. That value is maintained by the kernel as a counter that gets incremented each time a page is swapped out and decremented each time a page gets swapped in. As a consequence, it is accurate and provides an important piece of information: if that value is non-zero, your system is under memory pressure and the memory of your process cannot fit in RAM.
The man page describes SWAP as “the non-resident portion of a task’s address space”, which is what was implemented prior to top 3.3.0 but has nothing to do with the actual amount of memory that has been swapped out. On earlier versions of top, the man page properly explains what is displayed; however, the SWAP naming is not appropriate.
pmap: detailed mapping
pmap is another kind of tool. It goes deeper than top by displaying information about each separate mapping of the process. A mapping, in that view, is a range of contiguous pages having the same backend (anonymous or file) and the same access modes.
For each mapping, pmap shows the previously listed properties as well as the size of the mapping, the amount of resident pages and the amount of dirty pages. Dirty pages are pages that have been written to but have not been synced back to the underlying file yet. As a consequence, the amount of dirty pages is only meaningful for mappings with write-back, that is shared file-backed mappings (quadrant 4).
The source of pmap’s data can be found in two human-readable files: /proc/[pid]/maps and /proc/[pid]/smaps. While the first file is a simple list of mappings, the second one is a more detailed version with a paragraph per mapping. smaps is available since Linux 2.6.14, which is old enough for it to be present on all popular distributions.
pmap usage is simple:

- pmap [pid]: displays the content of /proc/[pid]/maps, but removes the inode and device columns.
- pmap -x [pid]: enriches the output by adding some pieces of information from /proc/[pid]/smaps (RSS and Dirty).
- since pmap 3.3.4, the -X and -XX flags display even more data, but they are Linux-specific (moreover, they seem to be a bit buggy with recent kernel versions).
Basic content
The pmap utility finds its inspiration in a similar command on Solaris and mimics its behavior. Here is the output of pmap and the content of /proc/[pid]/maps for the small program given as an example for shared anonymous memory testing:
```
3009: ./blah
0000000000400000 4K r-x-- /home/fruneau/blah
0000000000401000 4K rw--- /home/fruneau/blah
00007fbb5da87000 51200K rw-s- /dev/zero (deleted)
00007fbb60c87000 1536K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb60e07000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb61007000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100b000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100c000 20K rw--- [ anon ]
00007fbb61011000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61221000 12K rw--- [ anon ]
00007fbb6122e000 8K rw--- [ anon ]
00007fbb61230000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61231000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61232000 4K rw--- [ anon ]
00007fff9350f000 132K rw--- [ stack ]
00007fff9356e000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 55132K
```
```
00400000-00401000 r-xp 00000000 08:01 3507636 /home/fruneau/blah
00401000-00402000 rw-p 00000000 08:01 3507636 /home/fruneau/blah
7fbb5da87000-7fbb60c87000 rw-s 00000000 00:04 8467 /dev/zero (deleted)
7fbb60c87000-7fbb60e07000 r-xp 00000000 08:01 3334313 /lib/x86_64-linux-gnu/libc-2.13.so
7fbb60e07000-7fbb61007000 ---p 00180000 08:01 3334313 /lib/x86_64-linux-gnu/libc-2.13.so
7fbb61007000-7fbb6100b000 r--p 00180000 08:01 3334313 /lib/x86_64-linux-gnu/libc-2.13.so
7fbb6100b000-7fbb6100c000 rw-p 00184000 08:01 3334313 /lib/x86_64-linux-gnu/libc-2.13.so
7fbb6100c000-7fbb61011000 rw-p 00000000 00:00 0
7fbb61011000-7fbb61031000 r-xp 00000000 08:01 3334316 /lib/x86_64-linux-gnu/ld-2.13.so
7fbb61221000-7fbb61224000 rw-p 00000000 00:00 0
7fbb6122e000-7fbb61230000 rw-p 00000000 00:00 0
7fbb61230000-7fbb61231000 r--p 0001f000 08:01 3334316 /lib/x86_64-linux-gnu/ld-2.13.so
7fbb61231000-7fbb61232000 rw-p 00020000 08:01 3334316 /lib/x86_64-linux-gnu/ld-2.13.so
7fbb61232000-7fbb61233000 rw-p 00000000 00:00 0
7fff9350f000-7fff93530000 rw-p 00000000 00:00 0 [stack]
7fff9356e000-7fff9356f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
```
There are a few interesting points in that output. First of all, pmap chooses to display the size of each mapping instead of the range of addresses, and to add the sum of those sizes at the end. That sum is the VIRT size of top: the sum of all the reserved ranges of addresses in the virtual address space.
Each map is associated with a set of modes:

- r: if set, the map is readable
- w: if set, the map is writable
- x: if set, the map contains executable code
- s: if set, the map is shared (right column in our previous classification). You can notice that pmap only has the s flag, while the kernel exposes two different flags for shared (s) and private (p) memory.
- R: if set, the map has no swap space reserved (MAP_NORESERVE flag of mmap); this means that we can get a segmentation fault by accessing that memory if it has not already been mapped to physical memory and the system is out of physical memory.
The first three flags can be manipulated using the mprotect(2) system call, and can be set directly in the mmap call.
The last column is the source of the data. In our example, we can notice that pmap does not keep the kernel-specific details. It has three categories of memory: anon, stack and file-backed (with the path of the file, and the (deleted) tag if the mapped file has been unlinked). In addition to these categories, the kernel has vdso, vsyscall and heap. It is quite a shame that pmap didn’t keep the heap mark since it is important for programmers (but that is probably in order to stay compatible with its Solaris counterpart).
Concerning that last column, we also see that executable files and shared libraries are mapped privately (but this was already spoiled by the previous article) and that different parts of the same file are mapped differently (some parts are even mapped more than once). This is because executable files contain different sections: text, data, rodata, bss… each has a different meaning and is mapped differently. We will cover those sections in the next post.
Last (but not least), we can see that our shared anonymous memory is actually implemented as a shared mapping of an unlinked copy of /dev/zero.
Extended content
The output of pmap -x contains two additional columns:
```
Address           Kbytes    RSS  Dirty Mode  Mapping
0000000000400000       4      4      4 r-x-- blah
0000000000401000       4      4      4 rw--- blah
00007fc3b50df000   51200  51200  51200 rw-s- zero (deleted)
00007fc3b82df000    1536    188      0 r-x-- libc-2.13.so
00007fc3b845f000    2048      0      0 ----- libc-2.13.so
00007fc3b865f000      16     16     16 r---- libc-2.13.so
00007fc3b8663000       4      4      4 rw--- libc-2.13.so
00007fc3b8664000      20     12     12 rw--- [ anon ]
00007fc3b8669000     128    108      0 r-x-- ld-2.13.so
00007fc3b8879000      12     12     12 rw--- [ anon ]
00007fc3b8886000       8      8      8 rw--- [ anon ]
00007fc3b8888000       4      4      4 r---- ld-2.13.so
00007fc3b8889000       4      4      4 rw--- ld-2.13.so
00007fc3b888a000       4      4      4 rw--- [ anon ]
00007fff7e6ef000     132     12     12 rw--- [ stack ]
00007fff7e773000       4      4      0 r-x-- [ anon ]
ffffffffff600000       4      0      0 r-x-- [ anon ]
----------------  ------ ------ ------
total kB           55132  51584  51284
```
The first one is RSS: it tells us how much the mapping contributes to the resident set size (and ultimately provides a sum that is the total memory consumption of the process). As we can see, some mappings are only partially mapped in physical memory. The biggest one (our manual mmap) is fully resident because we touched every single page.
The second new column is Dirty and contains the number of pages that have diverged from their source. For shared file-backed mappings, dirty pages can be written back to the underlying file when the kernel feels it has to make some room in RAM or that there are too many dirty pages; in that case, the page is marked clean again. For all the remaining quadrants, since the backend is either anonymous (so there is no disk-based backend) or private (so changes must not be made visible to other processes), unloading the dirty pages requires writing them to the swap file3.
This is only a subset of what the kernel actually exposes. A lot more information is present in the smaps file (which makes it a bit too verbose to be readable), as you can see in the following snippet:
```
00400000-00401000 r-xp 00000000 08:01 3507636 /home/fruneau/blah
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         4 kB
Private_Dirty:         0 kB
Referenced:            4 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
00401000-00402000 rw-p 00000000 08:01 3507636 /home/fruneau/blah
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
7f55c4dd2000-7f55c7fd2000 rw-s 00000000 00:04 8716 /dev/zero (deleted)
Size:              51200 kB
Rss:               51200 kB
Pss:               51200 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:     51200 kB
Referenced:        51200 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
```
Adding pmap’s output to a bug report is often a good idea.
More?
As you can see, understanding the output of top and other tools requires some knowledge of the operating system you are running. Even if top is available on various systems, each version is specific to the system it runs on. For example, on OS X you will not find the RES, DATA, SHR… columns, but instead columns named RPRVT, RSHRD, RSIZE, VPRVT, VSIZE (note that those names are somehow a bit less opaque than the Linux ones). If you want to dive a bit deeper into Linux memory management, you can read the mm/ directory of the source tree, or go through Understanding the Linux Kernel.
Because this post is quite long, here is a short summary:

- top’s man page cannot be trusted.
- top’s RES gives you the actual physical RAM used by your process.
- top’s VIRT gives you the actual virtual memory reserved by your process.
- top’s DATA gives you the amount of virtual private anonymous memory reserved by your process. That memory may or may not be mapped to physical RAM. It corresponds to the amount of memory intended to store process-specific data (not shared).
- top’s SHR gives you the subset of resident memory that is file-backed (including shared anonymous memory). It represents the amount of resident memory that may be used by other processes.
- top’s SWAP column can only be trusted with recent versions of top (3.3.0) and Linux (2.6.34) and is meaningless in all other cases.
- if you want details about the memory usage of your process, use pmap.
If you are a fan of htop, note that its content is the exact same as top’s. Its man page is also wrong, or at least sometimes unclear.
Next: Allocating memory
In these two articles we covered subjects that are (we do hope) of interest to both developers and system administrators. In the following articles we will start going deeper into how memory is managed by the developer.
1. You can try running the same program with no loop (or with a shorter loop) and see how the kernel delays the loading of the pages in physical memory. ↩
2. But we will see later that it only partially contains the data segment of the loaded executables. ↩
3. Clean private file-backed pages can also be unloaded; therefore, the next time they get loaded, their content may change depending on whether some writes occurred in the file. This is documented in the mmap man page through the sentence: “It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.” ↩