Analyzing and Limiting Memory Usage

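Our scenario is a high-performance computing contest in which the total memory used by each submission has to be capped. The programs use MPI to run multiple processes, so the cap applies to the sum over all processes: no more than 1 GB or 3 GB, depending on which of the two problems is being solved. We therefore need a way to both measure and limit memory usage.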

Peak memory usage cannot be measured precisely

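After some investigation, it became clear that peak memory usage cannot be measured precisely with native Linux tools alone. We even found that interfaces such as /proc/self/status did not promptly reflect a 1 GB allocation made with calloc followed by memset. Digging further, we found a very detailed blog post whose author discovered that the kernel "lies" about memory usage, and tracked down the commit responsible: the RSS counter (Resident Set Size, the amount of physical memory actually in use by the program) is only updated once every 64 page faults.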
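For reference, reading the RSS figure discussed above from /proc/self/status can be sketched roughly as follows. This is a minimal, Linux-only sketch; parse_vmrss_kib and current_rss_kib are hypothetical helper names, and the value they return is subject to the delayed update described above.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Extract the VmRSS value (in KiB) from text in /proc/<pid>/status format.
// Returns -1 if no well-formed VmRSS line is found.
long parse_vmrss_kib(std::istream &status)
{
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kib = -1;
            if (!(fields >> kib)) // the number is followed by the unit "kB"
                return -1;
            return kib;
        }
    }
    return -1;
}

// Read the RSS of the current process; returns -1 where /proc is unavailable.
long current_rss_kib()
{
    std::ifstream status("/proc/self/status");
    if (!status)
        return -1;
    return parse_vmrss_kib(status);
}
```

Calling current_rss_kib() right after touching a large buffer is one way to observe whether the reported value keeps up with the allocation.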

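In 2020 the same behavior was discussed in depth on the Linux mailing list. In the commit, the submitter claims that it reduces cache misses from 4.5 to 4, which avoids losing too much performance to RSS counter updates. Participants in the discussion did not find this conclusion convincing, because once a page fault has entered the kernel, the cost of the kernel's own work is larger than 0.5 cache misses.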
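To make the batching idea concrete, here is a small user-space illustration. It is not kernel code: BatchedCounter and its members are hypothetical names, and the threshold of 64 is simply the figure quoted above. The point is only that an observer reading the shared value can lag behind the true value until the next flush.

```cpp
#include <cassert>
#include <cstdint>

// User-space illustration of a batched counter; this is NOT kernel code.
struct BatchedCounter {
    static constexpr int kFlushThreshold = 64; // events between flushes
    int64_t shared = 0;  // what an outside observer would read
    int64_t pending = 0; // local delta that has not been published yet
    int events = 0;      // events recorded since the last flush

    // Record one newly faulted-in page.
    void add_page()
    {
        pending += 1;
        if (++events >= kFlushThreshold)
            flush();
    }

    // Publish the local delta to the shared value.
    void flush()
    {
        shared += pending;
        pending = 0;
        events = 0;
    }

    // The value an exact counter would report.
    int64_t true_value() const { return shared + pending; }
};
```

After 63 calls to add_page() the shared value is still 0 while true_value() is 63; the 64th call publishes everything at once.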

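In short, whatever the current state of Linux is, if we want to analyze memory usage we should bring our own memory allocator and use it to replace C's malloc and C++'s new by overriding them with a dynamic library.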
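The accounting that such an allocator performs can be sketched as below. This is only an illustration of the bookkeeping, not how any of the allocators in this article are implemented: tracked_malloc, tracked_free, and the two counters are hypothetical names, and a real replacement would be injected under the standard malloc/free symbols instead.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical counters: bytes currently allocated and the high-water mark.
static std::atomic<std::size_t> g_current_bytes{0};
static std::atomic<std::size_t> g_peak_bytes{0};

// Header stored in front of each block so tracked_free knows its size.
// The max_align_t member keeps the user pointer suitably aligned.
union Header {
    std::size_t size;
    std::max_align_t align;
};

void *tracked_malloc(std::size_t size)
{
    Header *h = static_cast<Header *>(std::malloc(sizeof(Header) + size));
    if (h == nullptr)
        return nullptr;
    h->size = size;
    std::size_t now = g_current_bytes.fetch_add(size) + size;
    // Raise the recorded peak if this allocation set a new high-water mark.
    std::size_t peak = g_peak_bytes.load();
    while (now > peak && !g_peak_bytes.compare_exchange_weak(peak, now)) {
    }
    return h + 1;
}

void tracked_free(void *ptr)
{
    if (ptr == nullptr)
        return;
    Header *h = static_cast<Header *>(ptr) - 1;
    assert(g_current_bytes.load() >= h->size); // accounting must never go negative
    g_current_bytes.fetch_sub(h->size);
    std::free(h);
}
```

Because every allocation passes through the wrapper, g_peak_bytes is exact with respect to requested sizes, which is the property the kernel's RSS counter could not give us.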

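Finding a suitable memory allocator

Many custom memory allocators provide hooks on allocation and free, as well as statistics output. We looked at jemalloc, tcmalloc, and mimalloc.

All of these allocators can report statistics. However, replacing malloc with jemalloc or tcmalloc in an MPI multi-process program seemed to cause problems (crashes). I am no expert in this area, so in the end I chose Microsoft's mimalloc.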

Determining memory usage with mimalloc

The test program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h> /* ssize_t */
#include <errno.h>
#include <string.h>
#include <vector>

#include <mpi.h>

static ssize_t parse_readable_size(const char *text);

int main(int argc, const char **argv)
{
    MPI_Init(NULL, NULL);
    int myrank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (argc < 2) {
        if (myrank == 0) printf("Usage: %s size\n", argv[0]);
        if (myrank == 0) printf("Example: %s 16MB\n", argv[0]);
        MPI_Finalize();
        return 0;
    }

#ifdef USE_CALLOC
    if (myrank == 0) printf("Using calloc\n");
#else
    if (myrank == 0) printf("Using std::vector\n");
#endif

    for (int ai = 1; ai < argc; ai++) {
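        ssize_t parsed = parse_readable_size(argv[ai]);
        if (parsed < 0) { /* -1 would otherwise wrap around to a huge size_t */
            if (myrank == 0) printf("Invalid size: %s\n", argv[ai]);
            continue;
        }
        size_t size = (size_t)parsed;
        if (myrank == 0) printf("size is %zu\n", size);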

#ifdef USE_CALLOC
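        /* calloc takes (element count, element size). */
        uint8_t *arr = (uint8_t *)calloc(size, sizeof(uint8_t));
        if (arr == NULL) {
            fprintf(stderr, "rank %d: calloc of %llu bytes failed\n", myrank, (unsigned long long)size);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }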
#else
        std::vector<uint8_t> arr(size, 0);
#endif

        uint64_t x = 0; // Prevent compiler from eliminating this unused array.
        for (size_t i = 0; i < size; i++) {
            arr[i] = (uint8_t)i;
            x += (uint64_t)arr[i];
        }
        volatile uint64_t sink;
        sink = x;

#ifdef USE_CALLOC
        free(arr);
#endif
    }

    MPI_Finalize();
}


static ssize_t parse_readable_size(const char *text)
{
    static const double Bbase = 1.0;
    static const double Kbase = 1024.0 * Bbase;
    static const double Mbase = 1024.0 * Kbase;
    static const double Gbase = 1024.0 * Mbase;
    static const double Tbase = 1024.0 * Gbase;
    static const double Pbase = 1024.0 * Tbase;
    static const double Ebase = 1024.0 * Pbase;
    static const double Zbase = 1024.0 * Ebase;
    struct unit_t {
        const char *suffix;
        double base;
    };
    struct unit_t units[] = {
        { .suffix = "Bytes",  .base = Bbase },
        { .suffix = "Byte",   .base = Bbase },
        { .suffix = "B",      .base = Bbase },
        { .suffix = "KBytes", .base = Kbase },
        { .suffix = "KB",     .base = Kbase },
        { .suffix = "K",      .base = Kbase },
        { .suffix = "MBytes", .base = Mbase },
        { .suffix = "MB",     .base = Mbase },
        { .suffix = "M",      .base = Mbase },
        { .suffix = "GBytes", .base = Gbase },
        { .suffix = "GB",     .base = Gbase },
        { .suffix = "G",      .base = Gbase },
        { .suffix = "TBytes", .base = Tbase },
        { .suffix = "TB",     .base = Tbase },
        { .suffix = "T",      .base = Tbase },
        { .suffix = "PBytes", .base = Pbase },
        { .suffix = "PB",     .base = Pbase },
        { .suffix = "P",      .base = Pbase },
        { .suffix = "EBytes", .base = Ebase },
        { .suffix = "EB",     .base = Ebase },
        { .suffix = "E",      .base = Ebase },
        { .suffix = "ZBytes", .base = Zbase },
        { .suffix = "ZB",     .base = Zbase },
        { .suffix = "Z",      .base = Zbase },
    };

    char *endptr = NULL;
    errno = 0;
    double dnum = strtod(text, &endptr);
    int error = errno;
    errno = 0;

    if (endptr == text)
        return -1; /* text does not start with a number */
    if (error == ERANGE)
        return -1; /* number out of range for double */
    if (dnum != dnum)
        return -1; /* not a number */

    while (*endptr == ' ')
        endptr++;
    if (*endptr == '\0')
        return (ssize_t)dnum; /* no suffix */

    for (size_t i = 0; i < sizeof(units) / sizeof(struct unit_t); i++) {
        struct unit_t *unit = &units[i];
        int matched = strncmp(endptr, unit->suffix, 32) == 0;
        if (matched)
            return (ssize_t)(dnum * unit->base);
    }

    return -1;
}

Build and run environment:

### Machine: MacBook Pro; Chip: Apple M1; Memory: 16 GB; OS: Ventura 13.2.1
$ mpirun --version
mpirun (Open MPI) 4.1.5
$ mpicc --version
gcc-12 (Homebrew GCC 12.2.0) 12.2.0
$ gcc --version
gcc-12 (Homebrew GCC 12.2.0) 12.2.0

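Build commands (note that gcc-4.8.5 cannot compile mimalloc because it lacks stdatomic.h, which was only added in gcc-4.9; on Linux I tried gcc-9.3.1, which compiled it successfully):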

### Compile mimalloc
cd ~/mimalloc-2.1.2
mkdir -p out/release
cd out/release
cmake ../.. -DCMAKE_C_COMPILER=$(which gcc)
make
### Compile test program
$ mpic++ -o mimalloc-test mimalloc-test.cpp

Run command:

mpirun -n 1 -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB -x MIMALLOC_LIMIT_OS_ALLOC=1 -x MIMALLOC_VERBOSE=1 -x MIMALLOC_SHOW_STATS=1 -x DYLD_INSERT_LIBRARIES=$(realpath ~/mimalloc-2.1.2/out/release/libmimalloc.dylib) ./mimalloc-test 0.9GB 1MB 1.2MB 512.9MB 512MB 

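Explanation of the run parameters. I like passing environment variables through mpirun's -x option, so I do not use export ... to set mimalloc's configuration.

  • -n 1: start 1 process. I also tried more MPI processes and they ran fine; this example just uses one.
  • -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB: at process startup, have mimalloc reserve 1400 MiB (about 1.4 GB) from the OS up front.
  • -x MIMALLOC_LIMIT_OS_ALLOC=1: do not allow additional memory to be requested from the OS beyond that reservation.
  • -x MIMALLOC_VERBOSE=1: print mimalloc's configuration at process startup.
  • -x MIMALLOC_SHOW_STATS=1: print mimalloc's runtime statistics at process exit.
  • -x DYLD_INSERT_LIBRARIES: override malloc and new with the dynamic library. I am on macOS here; on Linux, use -x LD_PRELOAD=.../libmimalloc.so instead.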
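The contest limit applies to the sum over all ranks, while the environment variables above are read by each process on its own, so with more than one rank a per-process value has to be derived from the total. A minimal sketch, assuming an even split across ranks (per_rank_reserve is a hypothetical helper, and an even split is only one possible policy):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: split a job-wide budget evenly across MPI ranks and
// format it in the same style as the 1400MiB value used above.
std::string per_rank_reserve(uint64_t total_bytes, int nranks)
{
    assert(nranks > 0);
    const uint64_t mib = 1024ull * 1024ull;
    // Integer division rounds down, so the ranks never exceed the total.
    uint64_t per_rank_mib = total_bytes / static_cast<uint64_t>(nranks) / mib;
    return std::to_string(per_rank_mib) + "MiB";
}
```

For example, a 1 GiB budget over 4 ranks gives 256MiB per rank, which could then be passed as the value of MIMALLOC_RESERVE_OS_MEMORY.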

Output analysis on macOS

mimalloc: option 'show_errors': 0
mimalloc: option 'show_stats': 1
mimalloc: option 'verbose': 1
mimalloc: option 'eager_commit': 1
mimalloc: option 'arena_eager_commit': 2
mimalloc: option 'purge_decommits': 1
mimalloc: option 'allow_large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 1433600
mimalloc: option 'deprecated_segment_cache': 0
mimalloc: option 'deprecated_page_reset': 0
mimalloc: option 'abandoned_page_purge': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'purge_delay': 10
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 1
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'max_segment_reclaim': 8
mimalloc: option 'destroy_on_exit': 0
mimalloc: option 'arena_reserve': 1048576
mimalloc: option 'arena_purge_mult': 10
mimalloc: option 'purge_extend_delay': 1
Using std::vector
size is 966367641
size is 1048576
size is 1258291
size is 537814630
size is 536870912
heap stats:     peak       total       freed     current        unit       count   
  reserved:     1.4 GiB     2.7 GiB     1.4 GiB     1.3 GiB                          
 committed:     1.4 GiB     2.7 GiB     2.3 GiB   428.8 MiB                          
     reset:     0      
    purged:   979.1 MiB
   touched:   250.9 KiB   385.5 KiB     1.9 GiB    -1.9 GiB                          ok
  segments:     4           7           6           1                                not all freed!
-abandoned:     2           2           1           1                                not all freed!
   -cached:     0           0           0           0                                ok
     pages:     0           0          29         -29                                ok
-abandoned:    13          13          10           3                                not all freed!
 -extended:     0      
 -noretire:     0      
     mmaps:     0      
   commits:     0      
    resets:     0      
    purges:     4      
   threads:     2           2           2           0                                ok
  searches:     0.0 avg
numa nodes:     1
   elapsed:     5.407 s
   process: user: 5.277 s, system: 0.154 s, faults: 13, rss: 942.2 MiB, commit: 1.4 GiB
mimalloc: process done: 0x1f7038140

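Analysis:

  • mimalloc: option 'reserve_os_memory': 1433600 shows that the memory was reserved from the OS as requested (the value is in KiB, i.e. 1400 MiB).
  • In the peak column of the exit statistics, reserved is the 1.4 GiB we asked for. I do not know why total is 2.7 GiB, but as long as the program did not crash, we stayed within the 1.4 GiB limit.
  • rss is 942.2 MiB, which shows that the largest allocation, the 0.9 GB array, is now properly reflected.

So the rss reported by mimalloc meets our needs for analyzing memory usage.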
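The unit conversions behind these numbers are easy to get wrong, so here is a small sketch of them (bytes_to_mib and kib_to_mib are hypothetical helper names). The largest request in the log is 966367641 bytes, which is about 921.6 MiB, so the reported rss of 942.2 MiB sits roughly 20 MiB above the array itself; presumably that remainder is the runtime and MPI library, though I have not verified this.

```cpp
#include <cassert>
#include <cstdint>

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
double bytes_to_mib(uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Convert a KiB value, such as the one printed for 'reserve_os_memory', to MiB.
double kib_to_mib(uint64_t kib)
{
    return static_cast<double>(kib) / 1024.0;
}
```

With these, kib_to_mib(1433600) gives 1400, matching the 1400MiB passed on the command line.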

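Of course, if the program is run as ./mimalloc-test 1.5GB instead, it dies because it exceeds the memory budget. The error is mimalloc: error: unable to allocate memory (1610612736 bytes). Removing the -x MIMALLOC_LIMIT_OS_ALLOC=1 parameter makes the error go away, but since our goal is to limit memory usage, this parameter is actually quite handy.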

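As an aside, starting with OS X 10.11 (El Capitan) in 2015, macOS added System Integrity Protection, which protects programs under /System, /usr, /bin, /sbin, and /var. As a result, DYLD_INSERT_LIBRARIES can no longer be used to inject a library into programs such as ls and echo, so you have to write your own program to test with.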

Output analysis on Linux

mpirun -n 1 -x MIMALLOC_RESERVE_OS_MEMORY=1400MiB -x MIMALLOC_LIMIT_OS_ALLOC=1 -x MIMALLOC_VERBOSE=1 -x MIMALLOC_SHOW_STATS=1 -x LD_PRELOAD=$(realpath ~/mimalloc-2.1.2/out/release/libmimalloc.so) ./mimalloc-test 0.9GB 1MB 1.2MB 512.9MB 512MB 

Note that the only difference is that macOS's DYLD_INSERT_LIBRARIES and .dylib suffix are replaced with Linux's LD_PRELOAD and .so suffix.

mimalloc: process init: 0xffffabf9c5b0
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: mem tracking: none
mimalloc: warning: unable to allocate aligned OS memory directly, fall back to over-allocation (size: 0x58000000 bytes, address: 0xffff53480000, alignment: 0x2000000, commit: 1)
mimalloc: reserved 1441792 KiB memory
mimalloc: using 4 numa regions
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 1
mimalloc: option 'verbose': 1
mimalloc: option 'eager_commit': 1
mimalloc: option 'arena_eager_commit': 2
mimalloc: option 'purge_decommits': 1
mimalloc: option 'allow_large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 1433600
mimalloc: option 'deprecated_segment_cache': 0
mimalloc: option 'deprecated_page_reset': 0
mimalloc: option 'abandoned_page_purge': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'purge_delay': 10
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 1
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'max_segment_reclaim': 8
mimalloc: option 'destroy_on_exit': 0
mimalloc: option 'arena_reserve': 1048576
mimalloc: option 'arena_purge_mult': 10
mimalloc: option 'purge_extend_delay': 1
Using std::vector
size is 966367641
size is 1048576
size is 1258291
size is 537814630
size is 536870912
heap stats:     peak       total       freed     current        unit       count   
normal   1:    17.2 KiB    22.7 KiB    22.5 KiB   200   B       8   B       2.8 K    not all freed
normal   4:    79.6 KiB   139.3 KiB   138.5 KiB   864   B      32   B       4.4 K    not all freed
normal   6:    73.4 KiB   174.6 KiB   172.7 KiB     1.9 KiB    48   B       3.7 K    not all freed
normal   8:    30.5 KiB   101.4 KiB   101.0 KiB   384   B      64   B       1.6 K    not all freed
normal   9:    41.0 KiB    52.9 KiB    51.2 KiB     1.6 KiB    80   B     675        not all freed
normal  10:    32.5 KiB    61.1 KiB    60.1 KiB     1.0 KiB    96   B     650        not all freed
normal  11:    17.1 KiB   406.9 KiB   406.3 KiB   560   B     112   B       3.7 K    not all freed
normal  12:     9.1 KiB    18.4 KiB    18.0 KiB   384   B     128   B     147        not all freed
normal  13:    20.5 KiB    30.7 KiB    29.3 KiB     1.4 KiB   160   B     196        not all freed
normal  14:   182.5 KiB   185.0 KiB   184.8 KiB   192   B     192   B     983        not all freed
normal  15:     8.1 KiB    16.0 KiB    14.7 KiB     1.3 KiB   224   B      73        not all freed
normal  16:     7.7 KiB    17.0 KiB    16.5 KiB   512   B     256   B      68        not all freed
normal  17:    56.7 KiB    94.1 KiB    92.8 KiB     1.2 KiB   320   B     300        not all freed
normal  18:    11.6 KiB    21.0 KiB    20.3 KiB   768   B     384   B      56        not all freed
normal  19:     7.9 KiB    16.2 KiB    15.3 KiB   896   B     448   B      37        not all freed
normal  20:    14.0 KiB    23.0 KiB    22.5 KiB   512   B     512   B      46        not all freed
normal  21:    39.5 KiB   165.6 KiB   160.6 KiB     5.0 KiB   640   B     264        not all freed
normal  22:    24.8 KiB    42.1 KiB    38.4 KiB     3.7 KiB   768   B      56        not all freed
normal  23:     3.5 KiB    23.7 KiB    23.7 KiB     0         896   B      27        ok
normal  24:    24.0 KiB    30.1 KiB    29.1 KiB     1.0 KiB     1.0 KiB    30        not all freed
normal  25:    96.6 KiB   151.8 KiB   134.2 KiB    17.5 KiB     1.2 KiB   121        not all freed
normal  26:    16.5 KiB    16.5 KiB    15.0 KiB     1.5 KiB     1.5 KiB    11        not all freed
normal  27:     1.7 KiB     3.5 KiB     3.5 KiB     0           1.7 KiB     2        ok
normal  28:     4.0 KiB     4.0 KiB     4.0 KiB     0           2.0 KiB     2        ok
normal  29:    12.5 KiB    20.0 KiB    20.0 KiB     0           2.5 KiB     8        ok
normal  30:    12.0 KiB    18.0 KiB    15.0 KiB     3.0 KiB     3.0 KiB     6        not all freed
normal  31:     3.5 KiB     3.5 KiB     3.5 KiB     0           3.5 KiB     1        ok
normal  32:     4.0 KiB     4.0 KiB     4.0 KiB     0           4.0 KiB     1        ok
normal  33:   145.5 KiB   150.5 KiB   145.5 KiB     5.0 KiB     5.0 KiB    30        not all freed
normal  34:     6.0 KiB     6.0 KiB     6.0 KiB     0           6.0 KiB     1        ok
normal  37:   190.7 KiB   210.8 KiB   200.7 KiB    10.0 KiB    10.0 KiB    21        not all freed
normal  40:    16.0 KiB    48.1 KiB    48.1 KiB     0          16.0 KiB     3        ok
normal  41:   261.0 KiB   361.4 KiB   361.4 KiB     0          20.0 KiB    18        ok
normal  43:    28.1 KiB    28.1 KiB    28.1 KiB     0          28.1 KiB     1        ok
normal  44:    32.1 KiB    32.1 KiB    32.1 KiB     0          32.1 KiB     1        ok
normal  45:   321.2 KiB   642.5 KiB   642.5 KiB     0          40.1 KiB    16        ok
normal  46:    48.1 KiB    48.1 KiB    48.1 KiB     0          48.1 KiB     1        ok
normal  48:    64.2 KiB    64.2 KiB    64.2 KiB     0          64.2 KiB     1        ok
normal  49:   722.8 KiB     1.8 MiB     1.7 MiB    80.3 KiB    80.3 KiB    24        not all freed

heap stats:     peak       total       freed     current        unit       count   
    normal:     2.4 Mi      5.1 Mi      4.9 Mi    139.6 Ki    264   B      20.2 K    not all freed!
     large:    15.5 Mi     22.4 Mi     22.4 Mi      0         435.2 KiB    53        ok
      huge:   924.0 Mi      1.9 Gi      1.9 Gi      0         652.0 MiB     3        ok
     total:   942.0 MiB     1.9 GiB     1.9 GiB   139.6 KiB                          not all freed
malloc req:   933.2 MiB     1.9 GiB     1.9 GiB   123.2 KiB                          not all freed

  reserved:     1.4 GiB     2.7 GiB     1.4 GiB     1.3 GiB                          
 committed:     1.4 GiB     3.3 GiB     2.9 GiB   390.4 MiB                          
     reset:     0      
    purged:     1.5 GiB
   touched:   941.6 MiB     1.9 GiB     1.9 GiB     1.4 MiB                          not all freed
  segments:     6          10           8           2                                not all freed!
-abandoned:     3           3           3           0                                ok
   -cached:     0           0           0           0                                ok
     pages:   108         132         103          29                                not all freed!
-abandoned:    15          15          15           0                                ok
 -extended:   326      
 -noretire:   118      
     mmaps:     6      
   commits:    45      
    resets:     0      
    purges:    12      
   threads:     3           4           4           0                                ok
  searches:     0.7 avg
numa nodes:     4
   elapsed:    12.239 s
   process: user: 11.961 s, system: 0.140 s, faults: 0, rss: 1.2 GiB, commit: 1.4 GiB
mimalloc: process done: 0xffffabf9c5b0

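The Linux output is much richer. It shows the internal bins (the normal 1 and similar rows), each of which serves allocations of one unit size, such as 8 B, 32 B, 48 B, and so on. If a bin's memory has not been fully released when the program exits, the row says not all freed; if it was released correctly, it says ok. (The log above reports debug level : 2, so the extra detail may come from this being a debug build of mimalloc rather than from Linux itself.)

The heap stats section shows rows for normal, large, huge, total, and malloc req. normal summarizes the small bins listed above, large has a unit of 435.2 KiB, and huge has a unit of 652.0 MiB. Our first array asks for a lot of memory (0.9 GB), so the 924.0 Mi in the huge row is reasonable. malloc req reports 933.2 MiB, which is also reasonable.

However, rss on Linux is 1.2 GiB, which looks somewhat off. On Linux, therefore, malloc req should be taken as the measure of how much memory was requested.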
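To check the malloc req figure against a limit automatically, the peak column can be pulled out of the statistics text. The sketch below assumes the line layout seen in the log above (label, colon, number, unit); parse_peak_bytes is a hypothetical helper, and the layout may differ in other mimalloc versions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Parse the first (peak) column of a mimalloc stats line such as
//   "malloc req:   933.2 MiB     1.9 GiB ..."
// and return it in bytes. Returns -1 if the line has no colon, no unit
// (e.g. a bare "0"), or an unrecognized unit.
double parse_peak_bytes(const std::string &line)
{
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos)
        return -1;
    std::istringstream fields(line.substr(colon + 1));
    double value = 0;
    std::string unit;
    if (!(fields >> value >> unit))
        return -1;
    // The log prints both "MiB" and the shorter "Mi" style, so accept both.
    if (unit == "B")
        return value;
    if (unit == "KiB" || unit == "Ki")
        return value * 1024.0;
    if (unit == "MiB" || unit == "Mi")
        return value * 1024.0 * 1024.0;
    if (unit == "GiB" || unit == "Gi")
        return value * 1024.0 * 1024.0 * 1024.0;
    return -1;
}
```

Since the statistics are rounded to one decimal place, the result is only accurate to about 0.1 of the printed unit, which is worth keeping in mind when the usage is close to the limit.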

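Closing remarks

This article follows the advice in mimalloc issue #730 and uses the two options MIMALLOC_RESERVE_OS_MEMORY and MIMALLOC_LIMIT_OS_ALLOC to analyze memory usage. The issue also mentions that Linux's ulimit may be a better way to enforce the limit. However, ulimit is not very convenient to use (see this answer), so I went with mimalloc for the analysis, and it did the job well.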
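For completeness, the ulimit route can also be taken from inside a program through the POSIX getrlimit/setrlimit calls. The sketch below lowers the soft RLIMIT_AS limit, which as far as I know is what ulimit -v adjusts on Linux; set_soft_address_space_limit is a hypothetical helper name. Note that this caps the virtual address space of a single process, not its resident memory and not the total across MPI ranks, which is part of why it is awkward for our purpose.

```cpp
#include <cassert>
#include <sys/resource.h>

// Set the soft RLIMIT_AS (virtual address space) limit of this process.
// Returns false if the request exceeds the hard limit or a call fails.
bool set_soft_address_space_limit(rlim_t bytes)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return false;
    if (lim.rlim_max != RLIM_INFINITY && bytes > lim.rlim_max)
        return false;
    lim.rlim_cur = bytes; // leave the hard limit untouched
    return setrlimit(RLIMIT_AS, &lim) == 0;
}
```

Once the limit is in place, allocations that would push the address space past it fail, so the program must be prepared for malloc to return NULL or for new to throw std::bad_alloc.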
