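The scenario: in a high-performance computing competition we needed to cap the total amount of memory each entry could use. The programs use MPI with multiple processes, so the cap applies to the sum over all processes, which must not exceed 1 GB or 3 GB (there were two problems). We therefore needed a way to both measure and limit memory usage.
Peak memory usage cannot be measured precisely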
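After some investigation it became clear that Linux's native tools alone cannot give an exact figure for peak memory usage. We even found that an interface such as /proc/self/status did not promptly reflect a 1 GB allocation made with calloc + memset. Digging further, we found a very detailed blog post whose author discovered that the kernel "lies" about memory usage, and who tracked down the commit responsible: the RSS counter (Resident Set Size, the physical memory actually in use by the process) is only updated once every 64 page faults.
In 2020 this behavior was also discussed in depth on the Linux mailing list. In the commit in question, the author claimed that it reduces cache misses from 4.5 to 4 per page fault, avoiding the performance cost of updating the RSS counter too often. Participants in the discussion found this conclusion unconvincing, because once a page fault has entered the kernel, the kernel's own work costs far more than 0.5 cache misses.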
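For reference, this is roughly how one would read the figure that turned out to lag. The sketch below parses a field such as VmRSS (current resident set) or VmHWM (peak resident set) from text in /proc/<pid>/status format. The function names read_status_field_kb and current_rss_kb are made up for this illustration, and the values it returns are subject to the batching described above.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the value in kB of a field such as "VmRSS" or "VmHWM" from text in
// /proc/<pid>/status format, or -1 if the field is missing or malformed.
// (Hypothetical helper, written for this illustration.)
long read_status_field_kb(std::istream &in, const std::string &key) {
    const std::string prefix = key + ":";
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) != 0) continue;
        std::istringstream fields(line.substr(prefix.size()));
        long kb = -1;
        if (fields >> kb) return kb;  // the trailing "kB" token is ignored
        return -1;
    }
    return -1;
}

// Convenience wrapper for the current process. Linux only: on systems
// without /proc the stream fails to open and -1 is returned.
long current_rss_kb() {
    std::ifstream status("/proc/self/status");
    return read_status_field_kb(status, "VmRSS");
}
```

Calling current_rss_kb() right after touching a large buffer is the kind of check that did not show the 1 GB we expected.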
In short, whatever the current state of Linux is, if we want to analyze memory usage we should bring our own memory allocator and preload it as a shared library so that it replaces C's malloc and C++'s new.
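To make the idea concrete, here is a minimal sketch of what "the allocator does the counting" means. It wraps malloc and free, stores the requested size in a small header in front of each block, and keeps current and peak byte counters. All names (tracked_malloc, tracked_free, BlockHeader, the counters) are hypothetical, it is not thread-safe, and it is not how mimalloc or any other real allocator is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical global counters. A real allocator would use atomics or
// per-thread statistics; this sketch is single-threaded only.
static std::size_t g_current_bytes = 0;
static std::size_t g_peak_bytes = 0;

// Header stored in front of every block so the size is known at free time.
// The max_align_t member keeps the user pointer suitably aligned.
union BlockHeader {
    std::size_t size;
    std::max_align_t align;
};

void *tracked_malloc(std::size_t size) {
    BlockHeader *h = static_cast<BlockHeader *>(std::malloc(sizeof(BlockHeader) + size));
    if (h == nullptr) return nullptr;
    h->size = size;
    g_current_bytes += size;
    if (g_current_bytes > g_peak_bytes) g_peak_bytes = g_current_bytes;
    return h + 1;  // user memory starts right after the header
}

void tracked_free(void *ptr) {
    if (ptr == nullptr) return;
    BlockHeader *h = static_cast<BlockHeader *>(ptr) - 1;
    assert(g_current_bytes >= h->size);
    g_current_bytes -= h->size;
    std::free(h);
}

std::size_t tracked_current_bytes() { return g_current_bytes; }
std::size_t tracked_peak_bytes() { return g_peak_bytes; }
```

Because the counters are updated on every request rather than on page faults, the peak value does not depend on the kernel's RSS accounting. The price is that it counts requested bytes, not resident pages.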
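Finding a suitable memory allocator
Many custom memory allocators provide hooks on allocation and free, as well as statistics output. We found the following allocators:
- Facebook's jemalloc (used in Firefox and FreeBSD)
- Google's tcmalloc (used in Chrome and many internal Google projects; see Google's blog post about it)
- Microsoft's mimalloc and snmalloc
- Intel's TBB (Intel Threading Building Blocks)
- GNU's ptmalloc2 (used in glibc)
- Doug Lea's dlmalloc (an old, elegant classic that inspired many later allocators such as ptmalloc)
- Mattias Jansson's rpmalloc
- Emery Berger's Hoard
All of these allocators can report statistics. However, using jemalloc or tcmalloc to replace malloc under MPI with multiple processes seemed to cause problems (it crashed). I am no expert in this area, so in the end I chose Microsoft's mimalloc.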
Measuring memory usage with mimalloc
The following is the test program:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */
#include <vector>
#include <mpi.h>

static ssize_t parse_readable_size(const char *text);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int myrank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (argc < 2) {
        if (myrank == 0) printf("Usage: %s size\n", argv[0]);
        if (myrank == 0) printf("Example: %s 16MB\n", argv[0]);
        MPI_Finalize();
        return 0;
    }
#ifdef USE_CALLOC
    if (myrank == 0) printf("Using calloc\n");
#else
    if (myrank == 0) printf("Using std::vector\n");
#endif
    for (int ai = 1; ai < argc; ai++) {
        ssize_t parsed = parse_readable_size(argv[ai]);
        if (parsed < 0) { /* do not cast the -1 error value to a huge size_t */
            if (myrank == 0) printf("Invalid size: %s\n", argv[ai]);
            continue;
        }
        size_t size = (size_t)parsed;
        if (myrank == 0) printf("size is %zu\n", size);
#ifdef USE_CALLOC
        uint8_t *arr = (uint8_t *)calloc(size, sizeof(uint8_t));
        if (arr == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
#else
        std::vector<uint8_t> arr(size, 0);
#endif
        uint64_t x = 0; // Checksum, so the compiler cannot eliminate the array.
        for (size_t i = 0; i < size; i++) {
            arr[i] = (uint8_t)i; // touch every byte so the pages become resident
            x += (uint64_t)arr[i];
        }
        volatile uint64_t sink;
        sink = x;
        (void)sink;
#ifdef USE_CALLOC
        free(arr);
#endif
    }
    MPI_Finalize();
    return 0;
}

/* Parses a size such as "16MB" or "0.9 GB" into bytes (1024-based units).
 * Returns -1 if the text cannot be parsed. */
static ssize_t parse_readable_size(const char *text)
{
    static const double Bbase = 1.0;
    static const double Kbase = 1024.0 * Bbase;
    static const double Mbase = 1024.0 * Kbase;
    static const double Gbase = 1024.0 * Mbase;
    static const double Tbase = 1024.0 * Gbase;
    static const double Pbase = 1024.0 * Tbase;
    static const double Ebase = 1024.0 * Pbase;
    static const double Zbase = 1024.0 * Ebase;
    struct unit_t {
        const char *suffix;
        double base;
    };
    struct unit_t units[] = {
        { .suffix = "Bytes", .base = Bbase },
        { .suffix = "Byte", .base = Bbase },
        { .suffix = "B", .base = Bbase },
        { .suffix = "KBytes", .base = Kbase },
        { .suffix = "KB", .base = Kbase },
        { .suffix = "K", .base = Kbase },
        { .suffix = "MBytes", .base = Mbase },
        { .suffix = "MB", .base = Mbase },
        { .suffix = "M", .base = Mbase },
        { .suffix = "GBytes", .base = Gbase },
        { .suffix = "GB", .base = Gbase },
        { .suffix = "G", .base = Gbase },
        { .suffix = "TBytes", .base = Tbase },
        { .suffix = "TB", .base = Tbase },
        { .suffix = "T", .base = Tbase },
        { .suffix = "PBytes", .base = Pbase },
        { .suffix = "PB", .base = Pbase },
        { .suffix = "P", .base = Pbase },
        { .suffix = "EBytes", .base = Ebase },
        { .suffix = "EB", .base = Ebase },
        { .suffix = "E", .base = Ebase },
        { .suffix = "ZBytes", .base = Zbase },
        { .suffix = "ZB", .base = Zbase },
        { .suffix = "Z", .base = Zbase },
    };
    char *endptr = NULL;
    errno = 0;
    double dnum = strtod(text, &endptr);
    int error = errno;
    errno = 0;
    if (endptr == text)
        return -1; /* does not start with a number */
    if (error == ERANGE)
        return -1; /* number out of range for double */
    if (dnum != dnum)
        return -1; /* not a number */
    while (*endptr == ' ')
        endptr++;
    if (*endptr == '\0')
        return (ssize_t)dnum; /* no suffix */
    for (size_t i = 0; i < sizeof(units) / sizeof(struct unit_t); i++) {
        struct unit_t *unit = &units[i];
        int matched = strncmp(endptr, unit->suffix, 32) == 0;
        if (matched)
            return (ssize_t)(dnum * unit->base);
    }
    return -1; /* unknown suffix */
}
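As a sanity check on the unit arithmetic, the sketch below redoes the conversion that parse_readable_size performs for the sizes used later in this post: multiply by a power of 1024 and truncate. The helper name to_bytes and its power parameter are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Converts a value given in units of 1024^power bytes into whole bytes,
// truncating the fractional part like the (ssize_t) cast in the test program.
// power: 0 = B, 1 = KB, 2 = MB, 3 = GB, ... (hypothetical helper)
std::int64_t to_bytes(double value, int power) {
    assert(power >= 0 && power <= 5);
    double bytes = value;
    for (int i = 0; i < power; i++) bytes *= 1024.0;
    return static_cast<std::int64_t>(bytes);
}
```

With this, to_bytes(0.9, 3) gives 966367641 and to_bytes(512.9, 2) gives 537814630, which match the "size is ..." lines in the program output shown further down.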
Build and run environment:
### Machine: MacBook Pro; Chip: Apple M1; Memory: 16 GB; OS: Ventura 13.2.1
$ mpirun --version
mpirun (Open MPI) 4.1.5
$ mpicc --version
gcc-12 (Homebrew GCC 12.2.0) 12.2.0
$ gcc --version
gcc-12 (Homebrew GCC 12.2.0) 12.2.0
Build commands (note that gcc-4.8.5 cannot build mimalloc because it lacks stdatomic.h, which was only added in gcc-4.9; on Linux I tried gcc-9.3.1 and it built fine):
### Compile mimalloc
cd ~/mimalloc-2.1.2
mkdir -p out/release
cd out/release
cmake ../.. -DCMAKE_C_COMPILER=$(which gcc)
make
### Compile test program
$ mpic++ -o mimalloc-test mimalloc-test.cpp
Run command:
mpirun -n 1 -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB -x MIMALLOC_LIMIT_OS_ALLOC=1 -x MIMALLOC_VERBOSE=1 -x MIMALLOC_SHOW_STATS=1 -x DYLD_INSERT_LIBRARIES=$(realpath ~/mimalloc-2.1.2/out/release/libmimalloc.dylib) ./mimalloc-test 0.9GB 1MB 1.2MB 512.9MB 512MB
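Explanation of the arguments. I like setting environment variables through mpirun's -x option, so I do not use export ... to configure mimalloc.
- -n 1: start one process. I also tried more MPI processes and it ran fine; the example just uses one.
- -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB: at process start, have mimalloc reserve 1400 MiB (about 1.4 GiB) from the OS up front.
- -x MIMALLOC_LIMIT_OS_ALLOC=1: do not allow requesting additional OS memory beyond the reservation.
- -x MIMALLOC_VERBOSE=1: print mimalloc's configuration at process start.
- -x MIMALLOC_SHOW_STATS=1: print mimalloc's run statistics at process exit.
- -x DYLD_INSERT_LIBRARIES: preload the shared library that overrides malloc and new. This is macOS; on Linux, use -x LD_PRELOAD=.../libmimalloc.so instead.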
Output analysis on macOS
mimalloc: option 'show_errors': 0
mimalloc: option 'show_stats': 1
mimalloc: option 'verbose': 1
mimalloc: option 'eager_commit': 1
mimalloc: option 'arena_eager_commit': 2
mimalloc: option 'purge_decommits': 1
mimalloc: option 'allow_large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 1433600
mimalloc: option 'deprecated_segment_cache': 0
mimalloc: option 'deprecated_page_reset': 0
mimalloc: option 'abandoned_page_purge': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'purge_delay': 10
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 1
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'max_segment_reclaim': 8
mimalloc: option 'destroy_on_exit': 0
mimalloc: option 'arena_reserve': 1048576
mimalloc: option 'arena_purge_mult': 10
mimalloc: option 'purge_extend_delay': 1
Using std::vector
size is 966367641
size is 1048576
size is 1258291
size is 537814630
size is 536870912
heap stats: peak total freed current unit count
reserved: 1.4 GiB 2.7 GiB 1.4 GiB 1.3 GiB
committed: 1.4 GiB 2.7 GiB 2.3 GiB 428.8 MiB
reset: 0
purged: 979.1 MiB
touched: 250.9 KiB 385.5 KiB 1.9 GiB -1.9 GiB ok
segments: 4 7 6 1 not all freed!
-abandoned: 2 2 1 1 not all freed!
-cached: 0 0 0 0 ok
pages: 0 0 29 -29 ok
-abandoned: 13 13 10 3 not all freed!
-extended: 0
-noretire: 0
mmaps: 0
commits: 0
resets: 0
purges: 4
threads: 2 2 2 0 ok
searches: 0.0 avg
numa nodes: 1
elapsed: 5.407 s
process: user: 5.277 s, system: 0.154 s, faults: 13, rss: 942.2 MiB, commit: 1.4 GiB
mimalloc: process done: 0x1f7038140
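Analysis:
- mimalloc: option 'reserve_os_memory': 1433600 shows that the reservation was requested as intended (1400 MiB = 1433600 KiB).
- Look at the peak column at program exit: reserved is the 1.4 GiB we asked for. I do not know why total is 2.7 GiB, but since the program did not crash, we stayed within the 1.4 GiB limit.
- rss is 942.2 MiB, so the largest allocation, the 0.9 GiB one, is now reflected correctly.
So the rss figure reported by mimalloc meets our need for analyzing memory usage.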
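If you want a second opinion on the peak rss figure from inside the program, one option is the POSIX getrusage call. The sketch below is only a cross-check under stated assumptions: the function name peak_rss_bytes is made up, ru_maxrss is reported in bytes on macOS and in kilobytes on Linux, and I have not verified how this value relates to the counter batching discussed at the top of this post.

```cpp
#include <sys/resource.h>
#include <cstddef>

// Returns the peak resident set size of the current process in bytes,
// or 0 if getrusage fails. (Hypothetical helper, written for this illustration.)
std::size_t peak_rss_bytes() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) return 0;
#if defined(__APPLE__)
    return static_cast<std::size_t>(usage.ru_maxrss);         // bytes on macOS
#else
    return static_cast<std::size_t>(usage.ru_maxrss) * 1024;  // kilobytes on Linux
#endif
}
```

Printing peak_rss_bytes() just before MPI_Finalize in the test program would give a number to compare against the rss field in the mimalloc statistics.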
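Of course, if the program is run as ./mimalloc-test 1.5GB it dies because it exceeds the limit, with the error mimalloc: error: unable to allocate memory (1610612736 bytes). Removing the -x MIMALLOC_LIMIT_OS_ALLOC=1 option makes the error go away, but since our goal is to limit memory usage, this option is exactly what we want.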
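As an aside, starting with OS X 10.11 (El Capitan) in 2015, macOS added System Integrity Protection, which protects programs under /System, /usr, /bin, /sbin and /var. As a result, DYLD_INSERT_LIBRARIES can no longer be used to inject a library into programs such as ls or echo, so you have to write your own program to test with.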
Output analysis on Linux
mpirun -n 1 -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB -x MIMALLOC_LIMIT_OS_ALLOC=1 -x MIMALLOC_VERBOSE=1 -x MIMALLOC_SHOW_STATS=1 -x LD_PRELOAD=$(realpath ~/mimalloc-2.1.2/out/release/libmimalloc.so) ./mimalloc-test 0.9GB 1MB 1.2MB 512.9MB 512MB
Note that the only difference is that macOS's DYLD_INSERT_LIBRARIES and the .dylib suffix are replaced with Linux's LD_PRELOAD and the .so suffix.
mimalloc: process init: 0xffffabf9c5b0
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: mem tracking: none
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (size: 0x58000000 bytes, address: 0xffff53480000, alignment: 0x2000000, commit: 1)
mimalloc: reserved 1441792 KiB memory
mimalloc: using 4 numa regions
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 1
mimalloc: option 'verbose': 1
mimalloc: option 'eager_commit': 1
mimalloc: option 'arena_eager_commit': 2
mimalloc: option 'purge_decommits': 1
mimalloc: option 'allow_large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 1433600
mimalloc: option 'deprecated_segment_cache': 0
mimalloc: option 'deprecated_page_reset': 0
mimalloc: option 'abandoned_page_purge': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'purge_delay': 10
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 1
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'max_segment_reclaim': 8
mimalloc: option 'destroy_on_exit': 0
mimalloc: option 'arena_reserve': 1048576
mimalloc: option 'arena_purge_mult': 10
mimalloc: option 'purge_extend_delay': 1
Using std::vector
size is 966367641
size is 1048576
size is 1258291
size is 537814630
size is 536870912
heap stats: peak total freed current unit count
normal 1: 17.2 KiB 22.7 KiB 22.5 KiB 200 B 8 B 2.8 K not all freed
normal 4: 79.6 KiB 139.3 KiB 138.5 KiB 864 B 32 B 4.4 K not all freed
normal 6: 73.4 KiB 174.6 KiB 172.7 KiB 1.9 KiB 48 B 3.7 K not all freed
normal 8: 30.5 KiB 101.4 KiB 101.0 KiB 384 B 64 B 1.6 K not all freed
normal 9: 41.0 KiB 52.9 KiB 51.2 KiB 1.6 KiB 80 B 675 not all freed
normal 10: 32.5 KiB 61.1 KiB 60.1 KiB 1.0 KiB 96 B 650 not all freed
normal 11: 17.1 KiB 406.9 KiB 406.3 KiB 560 B 112 B 3.7 K not all freed
normal 12: 9.1 KiB 18.4 KiB 18.0 KiB 384 B 128 B 147 not all freed
normal 13: 20.5 KiB 30.7 KiB 29.3 KiB 1.4 KiB 160 B 196 not all freed
normal 14: 182.5 KiB 185.0 KiB 184.8 KiB 192 B 192 B 983 not all freed
normal 15: 8.1 KiB 16.0 KiB 14.7 KiB 1.3 KiB 224 B 73 not all freed
normal 16: 7.7 KiB 17.0 KiB 16.5 KiB 512 B 256 B 68 not all freed
normal 17: 56.7 KiB 94.1 KiB 92.8 KiB 1.2 KiB 320 B 300 not all freed
normal 18: 11.6 KiB 21.0 KiB 20.3 KiB 768 B 384 B 56 not all freed
normal 19: 7.9 KiB 16.2 KiB 15.3 KiB 896 B 448 B 37 not all freed
normal 20: 14.0 KiB 23.0 KiB 22.5 KiB 512 B 512 B 46 not all freed
normal 21: 39.5 KiB 165.6 KiB 160.6 KiB 5.0 KiB 640 B 264 not all freed
normal 22: 24.8 KiB 42.1 KiB 38.4 KiB 3.7 KiB 768 B 56 not all freed
normal 23: 3.5 KiB 23.7 KiB 23.7 KiB 0 896 B 27 ok
normal 24: 24.0 KiB 30.1 KiB 29.1 KiB 1.0 KiB 1.0 KiB 30 not all freed
normal 25: 96.6 KiB 151.8 KiB 134.2 KiB 17.5 KiB 1.2 KiB 121 not all freed
normal 26: 16.5 KiB 16.5 KiB 15.0 KiB 1.5 KiB 1.5 KiB 11 not all freed
normal 27: 1.7 KiB 3.5 KiB 3.5 KiB 0 1.7 KiB 2 ok
normal 28: 4.0 KiB 4.0 KiB 4.0 KiB 0 2.0 KiB 2 ok
normal 29: 12.5 KiB 20.0 KiB 20.0 KiB 0 2.5 KiB 8 ok
normal 30: 12.0 KiB 18.0 KiB 15.0 KiB 3.0 KiB 3.0 KiB 6 not all freed
normal 31: 3.5 KiB 3.5 KiB 3.5 KiB 0 3.5 KiB 1 ok
normal 32: 4.0 KiB 4.0 KiB 4.0 KiB 0 4.0 KiB 1 ok
normal 33: 145.5 KiB 150.5 KiB 145.5 KiB 5.0 KiB 5.0 KiB 30 not all freed
normal 34: 6.0 KiB 6.0 KiB 6.0 KiB 0 6.0 KiB 1 ok
normal 37: 190.7 KiB 210.8 KiB 200.7 KiB 10.0 KiB 10.0 KiB 21 not all freed
normal 40: 16.0 KiB 48.1 KiB 48.1 KiB 0 16.0 KiB 3 ok
normal 41: 261.0 KiB 361.4 KiB 361.4 KiB 0 20.0 KiB 18 ok
normal 43: 28.1 KiB 28.1 KiB 28.1 KiB 0 28.1 KiB 1 ok
normal 44: 32.1 KiB 32.1 KiB 32.1 KiB 0 32.1 KiB 1 ok
normal 45: 321.2 KiB 642.5 KiB 642.5 KiB 0 40.1 KiB 16 ok
normal 46: 48.1 KiB 48.1 KiB 48.1 KiB 0 48.1 KiB 1 ok
normal 48: 64.2 KiB 64.2 KiB 64.2 KiB 0 64.2 KiB 1 ok
normal 49: 722.8 KiB 1.8 MiB 1.7 MiB 80.3 KiB 80.3 KiB 24 not all freed
heap stats: peak total freed current unit count
normal: 2.4 Mi 5.1 Mi 4.9 Mi 139.6 Ki 264 B 20.2 K not all freed!
large: 15.5 Mi 22.4 Mi 22.4 Mi 0 435.2 KiB 53 ok
huge: 924.0 Mi 1.9 Gi 1.9 Gi 0 652.0 MiB 3 ok
total: 942.0 MiB 1.9 GiB 1.9 GiB 139.6 KiB not all freed
malloc req: 933.2 MiB 1.9 GiB 1.9 GiB 123.2 KiB not all freed
reserved: 1.4 GiB 2.7 GiB 1.4 GiB 1.3 GiB
committed: 1.4 GiB 3.3 GiB 2.9 GiB 390.4 MiB
reset: 0
purged: 1.5 GiB
touched: 941.6 MiB 1.9 GiB 1.9 GiB 1.4 MiB not all freed
segments: 6 10 8 2 not all freed!
-abandoned: 3 3 3 0 ok
-cached: 0 0 0 0 ok
pages: 108 132 103 29 not all freed!
-abandoned: 15 15 15 0 ok
-extended: 326
-noretire: 118
mmaps: 6
commits: 45
resets: 0
purges: 12
threads: 3 4 4 0 ok
searches: 0.7 avg
numa nodes: 4
elapsed: 12.239 s
process: user: 11.961 s, system: 0.140 s, faults: 0, rss: 1.2 GiB, commit: 1.4 GiB
mimalloc: process done: 0xffffabf9c5b0
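The Linux output is much richer (the log reports debug level 2, so this appears to be a debug build, which probably explains the extra detail). It shows the internal bins (the normal 1 lines and so on); each bin serves allocations of one unit size, such as 8 B, 32 B, 48 B, and so on. If a bin's memory has not all been freed by program exit, it is marked not all freed; otherwise it is marked ok.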
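The bin numbers and unit sizes in the log follow a regular pattern: four bins per power of two. The sketch below reproduces that mapping. It is based on my understanding of mimalloc's size-class scheme rather than copied from its source, so treat the function name bin_for_size and the handling of sizes up to 64 bytes (bin index equals the size in 8-byte words) as assumptions. I only checked it against the (bin, unit) pairs visible in the log above, and it does not model the large and huge classes.

```cpp
#include <cassert>
#include <cstddef>

// Index of the highest set bit of x; x must be non-zero.
static unsigned highest_bit(std::size_t x) {
    assert(x != 0);
    unsigned b = 0;
    while (x >>= 1) b++;
    return b;
}

// Maps a request size in bytes to a small-object bin index, assuming 8-byte
// words. Sketch only: covers the "normal" bins seen in the log.
unsigned bin_for_size(std::size_t size) {
    std::size_t wsize = (size + 7) / 8;   // size in 8-byte words, rounded up
    if (wsize <= 1) return 1;
    if (wsize <= 8) return static_cast<unsigned>(wsize);
    wsize--;
    unsigned b = highest_bit(wsize);
    // The top bit selects the power of two, the next two bits select
    // one of four sub-classes within it.
    return (b << 2) + static_cast<unsigned>((wsize >> (b - 2)) & 0x03) - 3;
}
```

For example, bin_for_size(80) is 9 and bin_for_size(160) is 13, matching the normal 9 (80 B) and normal 13 (160 B) rows.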
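The heap stats section shows the normal, large, huge, total and malloc req rows. normal aggregates the smaller bins listed above, the unit for large is 435.2 KiB, and the unit for huge is 652.0 MiB. Our first array is big (0.9 GB, about 921.6 MiB), so the 924.0 Mi in the huge row is reasonable. malloc req shows 933.2 MiB, which is also reasonable.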
However, the rss on Linux is 1.2 GiB, which looks off. So on Linux, malloc req is the figure to use as the measure of how much memory was requested.
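One more detail in the Linux log is worth decoding. We asked for 1400 MiB, the option is printed as 1433600, and the log then says "reserved 1441792 KiB" with size 0x58000000 and alignment 0x2000000. These numbers are consistent with the request being expressed in KiB and then rounded up to a multiple of the 32 MiB alignment. That is my reading of the log, not something I confirmed in mimalloc's source, and align_up below is a generic helper written for this illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

// Rounds size up to the next multiple of alignment, which must be a power of two.
std::uint64_t align_up(std::uint64_t size, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

Here 1400 * MiB / KiB is 1433600, and align_up(1400 * MiB, 0x2000000) is 0x58000000 bytes, that is 1408 MiB or 1441792 KiB. So the effective cap in this run was slightly above the 1400 MiB requested.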
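Closing notes
This post follows the advice in mimalloc issue #730 and uses the two options MIMALLOC_RESERVE_OS_MEMORY and MIMALLOC_LIMIT_OS_ALLOC to analyze memory usage. The issue also mentions that Linux's ulimit may be a better choice for enforcing a limit. However, ulimit is not very convenient to use (see this answer), so I stuck with mimalloc for the analysis, and it did the job well.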