jemalloc In-Depth Analysis: The jemalloc Memory Allocation and Deallocation Process


EversChen5


Article link: https://blog.csdn.net/ip5108/article/details/86751122


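For a better reading experience, downloading the PDF version is recommended.
For the full article, see "jemalloc 深入分析" (jemalloc In-Depth Analysis):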
https://github.com/everschen/tools/blob/master/DOC/Jemalloc.pdf
https://download.csdn.net/download/ip5108/10941278

The jemalloc memory allocation and deallocation process

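7.1. Allocating small-size memory from an arena
The small region size range is 8 <= size <= 14336 = SMALL_MAXCLASS.
The limit used in arena_malloc is defined as:
#define SMALL_MAXCLASS ((((size_t)1) << 13) + (((size_t)3) << 11))
That is, SMALL_MAXCLASS = 8192 + 6144 = 14336 = 14 KiB.
For a small allocation (size <= SMALL_MAXCLASS), the requested size must first be mapped to a bin. This is done by size2index(usize), which determines the bin index either by table lookup or by computation.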
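The mapping from a request size to a bin index can be sketched as follows. This is only an illustration, not jemalloc's size2index: the helper names are hypothetical, the lookup is a plain linear search, and the class layout (an 8-byte tiny class, a 16-byte quantum, then 4 classes per doubling) is an assumption that happens to agree with the values quoted in this article (bin 5 has reg_size 80, and the last small class is 14336).

```c
#include <assert.h>
#include <stddef.h>

/* Same expression as in the article: 8192 + 6144 = 14336. */
#define SKETCH_SMALL_MAXCLASS ((((size_t)1) << 13) + (((size_t)3) << 11))

/* Size of the class with the given index (hypothetical helper). */
static size_t sketch_index2size(unsigned index)
{
    if (index == 0)
        return 8;                      /* assumed tiny class */
    if (index <= 4)
        return (size_t)16 * index;     /* 16, 32, 48, 64 */
    unsigned group = (index - 5) / 4;  /* group 0 covers (64, 128] */
    unsigned pos = (index - 5) % 4;    /* position inside the group */
    size_t base = (size_t)64 << group;
    size_t delta = (size_t)16 << group;
    return base + delta * (pos + 1);
}

/* Smallest class index whose size is >= size (linear search for clarity). */
static unsigned sketch_size2index(size_t size)
{
    unsigned index = 0;
    assert(size > 0 && size <= SKETCH_SMALL_MAXCLASS);
    while (sketch_index2size(index) < size)
        index++;
    return index;
}
```

Under these assumptions a 65-byte request lands in bin 5 (80-byte regions), and 14336 maps to the last small bin, index 35.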

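7.1.1. Allocation from the tcache
When size <= SMALL_MAXCLASS and the tcache is enabled, tcache_alloc_small performs the allocation. It calls tcache_alloc_easy for the fast path:
tbin = &tcache->tbins[binind];
*tcache_success = true;
ret = *(tbin->avail - tbin->ncached);
tbin->ncached--;
As the code shows, the element on top of the stack is returned directly. If the tcache_bin is empty, arena_tcache_fill_small is called to allocate memory from the arena and refill the tcache_bin, after which tcache_alloc_easy is tried again. arena_tcache_fill_small obtains its memory the same way as the non-tcache allocation path.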
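The stack discipline of a tcache bin can be modelled in a few lines. The sketch below is a simplified stand-in, not jemalloc's code: the type and function names are hypothetical, the capacity is a made-up constant, and the fill step just carves objects out of a caller-supplied buffer instead of going to the arena. It keeps the one detail the excerpt relies on: avail points one past the slot array, and the top of the stack is at avail - ncached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NCACHED_MAX 8   /* hypothetical capacity */

typedef struct {
    void *slots[SKETCH_NCACHED_MAX];
    void **avail;      /* points one past the last slot */
    unsigned ncached;  /* number of cached pointers */
} sketch_tbin_t;

static void sketch_tbin_init(sketch_tbin_t *tbin)
{
    tbin->avail = tbin->slots + SKETCH_NCACHED_MAX;
    tbin->ncached = 0;
}

/* Fast path: pop the top of the stack, or report failure if the bin is empty. */
static void *sketch_alloc_easy(sketch_tbin_t *tbin, bool *success)
{
    if (tbin->ncached == 0) {
        *success = false;
        return NULL;
    }
    *success = true;
    void *ret = *(tbin->avail - tbin->ncached);
    tbin->ncached--;
    return ret;
}

/* Stand-in for the refill step: cache nfill objects cut from backing. */
static void sketch_fill(sketch_tbin_t *tbin, char *backing, size_t reg_size,
    unsigned nfill)
{
    assert(tbin->ncached == 0 && nfill <= SKETCH_NCACHED_MAX);
    for (unsigned i = 0; i < nfill; i++)
        *(tbin->avail - nfill + i) = backing + (size_t)i * reg_size;
    tbin->ncached = nfill;
}
```

A caller would try sketch_alloc_easy first and, on failure, call sketch_fill and retry. With this fill order the lowest address ends up on top of the stack and is handed out first.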

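7.1.2. Allocation from the bin's runcur
If the tcache path does not apply, arena_malloc_hard is used, which in turn calls arena_malloc_small. In arena_malloc_small, if the bin currently has a runcur, arena_run_reg_alloc is called directly to allocate from it: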
void *
arena_run_reg_alloc(arena_run_t *run, arena_bin_info_t *bin_info)
{
    void *ret;
    size_t regind;
    arena_chunk_map_misc_t *miscelm;
    void *rpages;

    assert(run->nfree > 0);
    assert(!bitmap_full(run->bitmap, &bin_info->bitmap_info));

    /* Find the first free region in the bitmap and mark it as allocated. */
    regind = (unsigned)bitmap_sfu(run->bitmap, &bin_info->bitmap_info);
    /* Locate the first page backing this run. */
    miscelm = arena_run_to_miscelm(run);
    rpages = arena_miscelm_to_rpages(miscelm);
    /* Region address = run pages + offset of region 0 + regind regions. */
    ret = (void *)((uintptr_t)rpages + (uintptr_t)bin_info->reg0_offset +
        (uintptr_t)(bin_info->reg_interval * regind));
    run->nfree--;
    return (ret);
}
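Here bitmap_sfu() returns the position of the first 1 bit in the bitmap and clears that bit to 0.
Next, the miscelm that corresponds to the run is located, and from the miscelm the start address rpages of the pages backing the run. Getting from the run to its miscelm only requires subtracting the offset of the run field within the miscelm. The page index is then computed as:
size_t pageind = ((uintptr_t)miscelm - ((uintptr_t)chunk +
    map_misc_offset)) / sizeof(arena_chunk_map_misc_t) + map_bias;
Subtracting the start address of map_misc (chunk + map_misc_offset) from the address of miscelm gives a byte offset. Dividing it by sizeof(arena_chunk_map_misc_t) gives the index of the entry within the map, and adding map_bias gives the absolute page number within the current chunk. From that page number the start address of the page, rpages, follows directly.
Adding reg0_offset + regind * reg_interval to rpages gives the actual address of the free region (reg_interval is the same as reg_size here).
Finally, the run's nfree is decremented, and the allocation is complete.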
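The address arithmetic above can be tried out in isolation. The sketch below uses hypothetical helper names and a flat array of 64-bit words as the bitmap, which is a simplification of the real bitmap structure. Following the article's description, a 1 bit marks a free region. The 4 KiB page size is assumed, and the map offset, entry size, and map_bias values used in the test are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the lowest 1 bit and clear it (1 = free region). */
static size_t sketch_bitmap_sfu(uint64_t *bitmap, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++) {
        if (bitmap[w] != 0) {
            size_t bit = 0;
            while (((bitmap[w] >> bit) & 1) == 0)
                bit++;
            bitmap[w] &= ~((uint64_t)1 << bit);
            return w * 64 + bit;
        }
    }
    assert(0 && "no free region left");
    return 0;
}

/* miscelm address -> absolute page index -> address of the run's first page. */
static uintptr_t sketch_miscelm_to_rpages(uintptr_t chunk, uintptr_t miscelm,
    size_t map_misc_offset, size_t misc_size, size_t map_bias)
{
    size_t pageind = (miscelm - (chunk + map_misc_offset)) / misc_size +
        map_bias;
    return chunk + ((uintptr_t)pageind << 12);  /* assumes 4 KiB pages */
}

/* rpages + offset of region 0 + regind regions. */
static uintptr_t sketch_region_addr(uintptr_t rpages, size_t reg0_offset,
    size_t reg_interval, size_t regind)
{
    return rpages + reg0_offset + reg_interval * regind;
}
```

With 80-byte regions and a first free bit at position 36, the region starts 2880 bytes into the run.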

7.1.3. Allocation from the bin's runs, arena->runs_avail[i], or a new chunk
If runcur is exhausted, arena_bin_malloc_hard selects a new run as runcur and then calls arena_run_reg_alloc to allocate from it. Runs are searched in this order: the bin's runs, then arena->runs_avail[i], then a new chunk.

The first choice is the run with the lowest address in bin->runs, obtained by arena_bin_nonfull_run_tryget. It uses a pairing heap (ph) to find the lowest run. The comparison first orders runs by the en_sn of the chunk that contains them and, if two runs are in the same chunk, by run address.
If bin->runs is empty, space is taken from arena->runs_avail and a new run is allocated. The call path is arena_run_alloc_small_helper / arena_run_first_best_fit.
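pind2sz_tab stores the page-multiple size classes, starting at 4096. As the values listed here show, the classes are spaced 4096 bytes apart up to 32768, after which the spacing grows to 8192: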
index   size = pind2sz(index)   psz2ind(size - 4096 + 1)
0       4096                    0
1       8192                    1
2       12288                   2
3       16384                   3
4       20480                   4
5       24576                   5
6       28672                   6
7       32768                   7
8       40960                   8
9       49152                   9
10      57344                   10
...     ...                     ...
For both small run sizes and large sizes, run_quantize_ceil returns its input unchanged.
psz2ind finds the pind of this run size, and the search for a run starts from arena->runs_avail[pind].
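The table can be reproduced with a small sketch. The helper names are hypothetical and the lookup is a linear search rather than jemalloc's table or bit arithmetic. The values beyond index 10 follow an assumed rule (4 classes per doubling) that the article's table only partially confirms.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE ((size_t)4096)

/* Page size class for a given pind, matching the table for pind 0..10. */
static size_t sketch_pind2sz(unsigned pind)
{
    if (pind < 8)
        return SKETCH_PAGE * (pind + 1);   /* 4096 ... 32768 */
    unsigned group = (pind - 8) / 4;       /* assumed: 4 classes per doubling */
    unsigned pos = (pind - 8) % 4;
    size_t base = (size_t)32768 << group;
    size_t delta = (size_t)8192 << group;
    return base + delta * (pos + 1);
}

/* Smallest pind whose class size is >= size. */
static unsigned sketch_psz2ind(size_t size)
{
    unsigned pind = 0;
    while (sketch_pind2sz(pind) < size)
        pind++;
    return pind;
}
```

The third column of the table is the round-up property: any size in (pind2sz(i) - 4096, pind2sz(i)] maps to index i, for example 36865 maps to index 8.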
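If the run that is found is larger than needed, arena_run_split_small splits off the surplus trailing pages and reinserts them into arena->runs_avail, under the pind that corresponds to the size of the remaining pages. arena->runs_avail[] therefore always stays ordered by run size.
If arena->runs_avail does not have enough space either, a new chunk has to be allocated. The required space is carved out of it and the remainder is placed in arena->runs_avail. For chunk allocation, see the chunk allocation process.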
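The "find a fitting run, then split off the tail" step can be sketched on a flat array of free runs. This is a deliberately simplified model: jemalloc keeps the free runs in per-pind heaps, whereas the hypothetical sketch_run_alloc below scans an array and picks the smallest run that fits, preferring the lowest page on ties.

```c
#include <assert.h>
#include <stddef.h>

/* A free run described by its first page and its length in pages. */
typedef struct {
    size_t first_page;
    size_t npages;
} sketch_run_t;

/*
 * Pick the smallest free run with at least need pages, carve need pages
 * off its front, and keep the tail as a smaller free run.
 * Returns the first page of the allocation, or (size_t)-1 if nothing fits.
 */
static size_t sketch_run_alloc(sketch_run_t *runs, size_t nruns, size_t need)
{
    size_t best = nruns;
    for (size_t i = 0; i < nruns; i++) {
        if (runs[i].npages < need)
            continue;
        if (best == nruns || runs[i].npages < runs[best].npages ||
            (runs[i].npages == runs[best].npages &&
            runs[i].first_page < runs[best].first_page))
            best = i;
    }
    if (best == nruns)
        return (size_t)-1;          /* caller would need a new chunk */
    size_t ret = runs[best].first_page;
    runs[best].first_page += need;  /* split: the trailing pages stay free */
    runs[best].npages -= need;
    return ret;
}
```

In the real allocator the leftover tail would be reinserted under the pind matching its new size. Here it simply stays in the array with a smaller npages.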

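7.1.4. Debugging allocation from an arena bin
The arena is jemalloc's top-level management structure, and a process can have several arenas. The maximum number of arenas can be read from the static variable narenas_auto: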
(gdb) p je_narenas_auto
$52 = 2
Pointers to all arenas of the process are available through the static array arenas:
(gdb) p *je_arenas@3
$53 = {0x7ce5c00140, 0x7ce5c8fc00, 0x0}
/* bins is used to store trees of free regions. */
(gdb) p (*je_arenas[0])->bins[5]
$30 = {lock = {lock = {__private = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}},
witness = {name = 0x0, rank = 0, comp = 0x0, link = {qre_next = 0x0, qre_prev = 0x0}}}, runcur = 0x7dc4606788, runs = {ph_root = 0x0},
stats = {nmalloc = 552, ndalloc = 223, nrequests = 841, curregs = 329,
nfills = 138, nflushes = 47, nruns = 2, reruns = 6, curruns = 2}}
Further information about the bin is available through the variable arena_bin_info:
(gdb) p je_arena_bin_info[5]
$32 = {reg_size = 80, redzone_size = 0, reg_interval = 80, run_size = 20480, nregs = 256, bitmap_info = {nbits = 256, ngroups = 4}, reg0_offset = 0}
/*
 * Current run being used to service allocations of this bin's size
 * class.
 */
(gdb) p /x *(*je_arenas[0])->bins[5].runcur
$31 = {binind = 0x5, nfree = 0xb7, bitmap = {0x8300001000000000,
0xffffffffffffb000, 0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0x0,
0x0, 0x0}}
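Here nfree is the number of free regions in the current run: 0xb7 = 183, which equals the number of 1 bits in the bitmap words shown. ngroups = 4 means that only the first 4 words of the bitmap array are in use, since nbits = 256 and each word holds 64 bits.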
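The relationship between nfree and the bitmap in the gdb output can be checked by counting the 1 bits in the first ngroups words. The helper below is a hypothetical sketch for that check only and treats the bitmap as a flat array of 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits (free regions) in the first ngroups bitmap words. */
static unsigned sketch_count_free(const uint64_t *bitmap, size_t ngroups)
{
    unsigned nfree = 0;
    for (size_t g = 0; g < ngroups; g++) {
        /* word &= word - 1 clears the lowest set bit on each step. */
        for (uint64_t word = bitmap[g]; word != 0; word &= word - 1)
            nfree++;
    }
    return nfree;
}
```

For the four words printed above (4 + 51 + 64 + 64 set bits) the count is 183, which is the nfree value 0xb7 shown by gdb.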

7.2. Freeing small-size memory

7.2.1. Freeing memory to the tcache
If the tcache is enabled and tbin->ncached < tbin_info->ncached_max, that is, the tcache bin is not full, the pointer is simply stored in the tcache:
tbin->ncached++;
*(tbin->avail - tbin->ncached) = ptr;
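A minimal model of this push, using hypothetical names and a made-up capacity, shows where the freed pointer ends up: the new entry always becomes the top of the stack at avail - ncached, while the oldest entry stays at avail - 1.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_DALLOC_MAX 4   /* hypothetical ncached_max */

typedef struct {
    void *slots[SKETCH_DALLOC_MAX];
    void **avail;      /* points one past the last slot */
    unsigned ncached;
} sketch_dalloc_bin_t;

/* Cache ptr in the bin; returns false if the bin is full and must be flushed. */
static bool sketch_tbin_push(sketch_dalloc_bin_t *tbin, void *ptr)
{
    if (tbin->ncached >= SKETCH_DALLOC_MAX)
        return false;
    tbin->ncached++;
    *(tbin->avail - tbin->ncached) = ptr;
    return true;
}
```

A caller that gets false back would first flush part of the bin and then push again.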

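7.2.2. If the tcache bin is full, flush half of it back to the owning runs
When the number of cached objects reaches ncached_max, a flush is required. Half of the tcache bin is released, starting from the bottom of the stack. In each pass, the regions that belong to the same arena as the region at the bottom of the stack are released first. Regions that belong to a different arena are kept at the bottom of the stack and released in a later pass.
When a run that was completely allocated gets a free region again, it is inserted back into its bin so that it can serve allocations. If the run compares lower than runcur, runcur has to be adjusted; otherwise the run is simply inserted into bin->runs.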
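The multi-pass flush can be sketched as follows. This is a model under stated assumptions, not the library's code: objects carry an explicit arena tag, "releasing" an object only records its id, and all names are hypothetical. What it preserves is the control flow described above: each pass handles the arena of the bottom object, defers the others to the bottom of the stack, and finally slides the surviving newest entries down.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_FLUSH_MAX 8   /* hypothetical ncached_max */

typedef struct {
    int arena;   /* arena that owns the object */
    int id;
} sketch_obj_t;

typedef struct {
    sketch_obj_t *slots[SKETCH_FLUSH_MAX];
    sketch_obj_t **avail;   /* one past the last slot; bottom is avail[-1] */
    unsigned ncached;
} sketch_flush_bin_t;

/* Flush the oldest entries until rem remain. Returns the number of passes. */
static unsigned sketch_flush(sketch_flush_bin_t *tbin, unsigned rem,
    int *freed_ids, unsigned *nfreed)
{
    unsigned passes = 0, nflush, ndeferred;

    assert(rem <= tbin->ncached);
    *nfreed = 0;
    for (nflush = tbin->ncached - rem; nflush > 0; nflush = ndeferred) {
        int arena = (*(tbin->avail - 1))->arena;  /* arena of bottom object */
        ndeferred = 0;
        passes++;
        for (unsigned i = 0; i < nflush; i++) {
            sketch_obj_t *obj = *(tbin->avail - 1 - i);
            if (obj->arena == arena) {
                freed_ids[(*nfreed)++] = obj->id;  /* "return to its run" */
            } else {
                /* Other arena: keep at the bottom for a later pass. */
                *(tbin->avail - 1 - ndeferred) = obj;
                ndeferred++;
            }
        }
    }
    /* Slide the surviving newest entries down to the bottom of the stack. */
    memmove(tbin->avail - rem, tbin->avail - tbin->ncached,
        rem * sizeof(*tbin->slots));
    tbin->ncached = rem;
    return passes;
}
```

One pass per arena keeps the number of arena lock acquisitions at the number of distinct arenas in the flushed half, rather than one per object.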

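7.2.3. If the run is entirely free, release it to arena->runs_avail[i]
If bin_info->nregs != 1, the run is first removed from bin->runs and then released.
Why is it not removed when bin_info->nregs == 1? Because a run that contains only one region is never inserted into the non-full runs tree:
/*
 * The following block's conditional is necessary because if the
 * run only contains one region, then it never gets inserted
 * into the non-full runs tree.
 */
When a run is released, it is first coalesced with any unallocated runs directly before and after it. The merged run is then reinserted into arena->runs_avail[].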
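Coalescing with neighbouring free runs can be illustrated on a simple per-page free map. This is an assumed model with hypothetical names: jemalloc reads the sizes of adjacent runs from the chunk map bits instead of walking page by page, but the outcome, one merged span covering the freed run and its free neighbours, is the same idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t first_page;
    size_t npages;
} sketch_span_t;

/*
 * Mark [first_page, first_page + npages) as free and return the merged
 * span after coalescing with free pages directly after and before it.
 */
static sketch_span_t sketch_run_dalloc(bool *page_free, size_t npages_total,
    size_t first_page, size_t npages)
{
    sketch_span_t span = { first_page, npages };

    for (size_t i = 0; i < npages; i++)
        page_free[first_page + i] = true;
    /* Try to coalesce forward. */
    while (span.first_page + span.npages < npages_total &&
        page_free[span.first_page + span.npages])
        span.npages++;
    /* Try to coalesce backward. */
    while (span.first_page > 0 && page_free[span.first_page - 1]) {
        span.first_page--;
        span.npages++;
    }
    return span;
}
```

The returned span is what would be reinserted into the free-run index as a single entry.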

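7.2.4. If the chunk is entirely free, release it to arena->spare
To release a chunk, its run is first removed from arena->runs_avail[], the chunk is then removed from arena->achunks, and finally the chunk is stored in arena->spare.

7.2.5. If arena->spare already holds a previously released chunk, release the spare chunk
The chunk and its node are first deregistered, that is, their index entry is deleted from chunks_rtree.
Related structures and functions: arena->chunks_szsnad_cached, arena->chunks_ad_cached, arena_maybe_purge.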
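The interplay of 7.2.4 and 7.2.5 amounts to a one-slot cache. The sketch below uses hypothetical types and names to show just that hand-over: the newly emptied chunk becomes the spare, and whatever was the spare before is handed back to the caller to be deregistered and discarded.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;   /* placeholder payload */
} sketch_chunk_t;

typedef struct {
    sketch_chunk_t *spare;   /* most recently emptied chunk, kept for reuse */
} sketch_arena_t;

/*
 * Park chunk as the arena's spare. Returns the previous spare, which the
 * caller must then really release, or NULL if the slot was empty.
 */
static sketch_chunk_t *sketch_chunk_dalloc(sketch_arena_t *arena,
    sketch_chunk_t *chunk)
{
    sketch_chunk_t *old = arena->spare;
    arena->spare = chunk;
    return old;
}
```

Keeping one spare avoids releasing and immediately re-acquiring a chunk when a workload hovers around a chunk boundary.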

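7.3. Allocating large-size memory from an arena
The large region size range is LARGE_MINCLASS = 16384 (4 pages) <= size <= 1835008 (448 pages) = large_maxclass.
For large allocations, the run and map_misc serve only as addresses. arena->runs_avail[] stores map_misc elements; unlike the small case there is no real arena_run_s, because that structure is used only for small regions.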

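7.3.1. Allocation from the tcache
When LARGE_MINCLASS = 16384 <= size <= tcache_maxclass = 65536 and the tcache is enabled, tcache_alloc_large performs the allocation.
The request is first served from the tcache, via arena_malloc / tcache_alloc_large. Two points are worth noting:
1. The tcache covers only part of the large allocation requests (size <= tcache_maxclass = 65536).
2. When the corresponding tcache_bin is empty, it is not refilled; the request falls through to the non-tcache path. This differs from the small case: large blocks are big, so fetching blocks in advance that may never be used would be wasteful. Large blocks enter the tcache only when they are freed.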

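7.3.2. Allocation without the tcache
If the tcache is not enabled, or 65536 < size <= 1835008, arena_malloc_hard / arena_malloc_large performs the allocation. These bounds show that the largest large class is smaller than the maximum usable space of a chunk, 2043904 bytes = 499 pages = 499 * 4 KiB: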
large_maxclass = index2size(size2index(chunksize)-1);
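The numbers in this section can be cross-checked with a little arithmetic. The sketch assumes a 2 MiB chunk and 4 KiB pages, as used throughout the article, and assumes 4 size classes per doubling, so that the class just below a power of two P is P - P/8. The helper name is hypothetical and the rule is only meant for powers of two of 128 or more.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE      ((size_t)4096)
#define SKETCH_CHUNKSIZE ((size_t)1 << 21)   /* assumed 2 MiB chunk */

/*
 * Size class immediately below a power-of-two size, assuming 4 classes
 * per doubling: the group (pow2/2, pow2] has spacing pow2/8.
 */
static size_t sketch_class_below_pow2(size_t pow2)
{
    size_t delta = pow2 / 8;
    return pow2 - delta;
}
```

Applied to the chunk size this gives 1835008 bytes, which is 448 pages and matches large_maxclass. Applied to 16384 it gives 14336, the SMALL_MAXCLASS value from section 7.1.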

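Pseudo-random offset generation when cache-oblivious mode is enabled
/*
 * Compute a uniformly distributed offset within the first page
 * that is a multiple of the cacheline size, e.g. [0 .. 63) * 64
 * for 4 KiB pages and 64-byte cachelines.
 */
r = prng_lg_range_zu(&arena->offset_state, LG_PAGE -
    LG_CACHELINE, false);
random_offset = ((uintptr_t)r) << LG_CACHELINE;
Here r consists of the high 6 bits (LG_PAGE - LG_CACHELINE = 12 - 6) of the pseudo-random state. The high bits are used because they have a longer period, which is a property of the PRNG implementation. r is then shifted left by 6 bits, so random_offset is a pseudo-random multiple of 64 that is smaller than 4 KiB.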
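The shape of this computation can be reproduced with any 64-bit linear congruential generator. The sketch below is an assumption-laden illustration: the function names are hypothetical and the multiplier and increment are the widely used constants from Knuth's MMIX generator, not values taken from this article. It takes the top lg_range bits of the new state and shifts them by the cacheline exponent.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LG_PAGE      12   /* 4 KiB pages */
#define SKETCH_LG_CACHELINE 6    /* 64-byte cachelines */

/* Advance a 64-bit LCG and return its top lg_range bits (1 <= lg_range <= 64). */
static uint64_t sketch_prng_lg_range(uint64_t *state, unsigned lg_range)
{
    /* Assumed constants (Knuth's MMIX LCG). */
    *state = *state * UINT64_C(6364136223846793005) +
        UINT64_C(1442695040888963407);
    return *state >> (64 - lg_range);
}

/* Cacheline-aligned pseudo-random offset inside the first page. */
static uintptr_t sketch_random_offset(uint64_t *state)
{
    uint64_t r = sketch_prng_lg_range(state,
        SKETCH_LG_PAGE - SKETCH_LG_CACHELINE);
    return (uintptr_t)r << SKETCH_LG_CACHELINE;
}
```

Every value returned by sketch_random_offset is a multiple of 64 in the range 0 to 4032, so the shifted allocation still starts inside the first page.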
/*
 * If defined, explicitly attempt to more uniformly distribute large allocation
 * pointer alignments across all cache indices.
 */
#define JEMALLOC_CACHE_OBLIVIOUS
In computing, a cache-oblivious algorithm (or cache-transcendent algorithm) is an algorithm designed to take advantage of a CPU cache without having the size of the cache (or the length of the cache lines, etc.) as an explicit parameter.
In other words, a cache-oblivious algorithm is an optimization that does not depend on explicit CPU cache parameters. Because of this offset, every large allocation needs an additional large_pad of memory, whose size is one page (4 KiB).

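arena_run_alloc_large is called to find space in arena->runs_avail and allocate a new run. The call path is arena_run_alloc_large_helper / arena_run_first_best_fit. This works like the small case: the same arena_run_first_best_fit function is used, only with a different size argument.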
The pind2sz_tab table and the run_quantize_ceil and psz2ind steps are the same as in section 7.1.3: run_quantize_ceil returns its input unchanged, psz2ind finds the pind of the run size, and the search for a run starts from arena->runs_avail[pind].
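If the run that is found is larger than needed, arena_run_split_large splits off the surplus trailing pages and reinserts them into arena->runs_avail under the pind that corresponds to the size of the remaining pages, so arena->runs_avail[] always stays ordered by run size.
If arena->runs_avail does not have enough space either, a new chunk is allocated (see the chunk allocation process). Then arena_run_split_large / arena_run_split_large_helper carves out the required space: arena_run_split_remove first deletes the original run from arena->runs_avail, then writes the map_bits of the first and last page of the remainder, and calls arena_avail_insert to put the remainder into arena->runs_avail[]. Finally, the map_bits of the run being handed out are set, because its size has shrunk.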

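7.4. Freeing large-size memory
Large memory differs from small memory in that a small run may contain many regions, whereas a large run always holds exactly one block. There is effectively no run concept here; the run is only used to locate the address.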

7.4.1. Freeing large memory to the tcache
If the tcache is enabled and tbin->ncached < tbin_info->ncached_max, that is, the tcache bin is not full, the pointer is simply stored in the tcache:
tbin->ncached++;
*(tbin->avail - tbin->ncached) = ptr;

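7.4.2. If the tcache bin is full, flush half of it back to arena->runs_avail[i]
When the number of cached objects reaches ncached_max, a flush is required. Half of the tcache bin is released, starting from the bottom of the stack. In each pass, the blocks that belong to the same arena as the large block at the bottom of the stack are released first. Runs that belong to a different arena are kept at the bottom of the stack and released in a later pass.
When a run is released, it is first coalesced with any unallocated runs directly before and after it. The merged run is then reinserted into arena->runs_avail[].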

7.4.3. If the chunk is entirely free, release it to arena->spare
To release a chunk, its run is first removed from arena->runs_avail[], the chunk is then removed from arena->achunks, and finally the chunk is stored in arena->spare.

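7.4.4. If arena->spare already holds a previously released chunk, release the spare chunk
The chunk and its node are first deregistered, that is, their index entry is deleted from chunks_rtree.
Related structures and functions: arena->chunks_szsnad_cached, arena->chunks_ad_cached, arena_maybe_purge.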

7.5. Allocating huge-size memory from an arena
The huge size range is large_maxclass = 1835008 (448 pages) < size <= HUGE_MAXCLASS.
#define HUGE_MAXCLASS ((((size_t)1) << 62) + (((size_t)3) << 60))
huge_malloc / arena_chunk_alloc_huge first tries to obtain a chunk address from the cache. If there is no reusable chunk address, arena_chunk_alloc_huge_hard allocates a new one.
chunk_alloc_cache / chunk_recycle reuses previously allocated chunk space from chunks_szsnad_cached and chunks_ad_cached.
/*
 * Trees of chunks that were previously allocated (trees differ only in
 * node ordering). These are used when allocating chunks, in an attempt
 * to re-use address space. Depending on function, different tree
 * orderings are needed, which is why there are two trees with the same
 * contents.
 */
extent_tree_t chunks_szsnad_cached;
extent_tree_t chunks_ad_cached;
extent_tree_t chunks_szsnad_retained;
extent_tree_t chunks_ad_retained;
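The extent_node_s of a huge chunk is stored in arena->huge.
Is the extent_node_s stored together with the chunk? No, and it cannot be. Chunks are 2 MiB aligned, and a huge allocation is 2 MiB or more, with its size rounded up to a multiple of 2 MiB, so it occupies whole chunks and leaves no room for the extent_node_s. Instead the two are associated through registration, so that the extent_node_s address can later be found from the chunk address. The lookup itself goes through a radix tree; see the radix tree section.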
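The two ideas in this paragraph, rounding up to whole chunks and looking the node up by chunk address, can be sketched as follows. The names are hypothetical, the 2 MiB chunk size is the one assumed in this article, and the registry is a tiny linear table standing in for the radix tree, so it only illustrates the register and lookup contract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CHUNKSIZE      ((size_t)1 << 21)   /* assumed 2 MiB chunk */
#define SKETCH_CHUNKSIZE_MASK (SKETCH_CHUNKSIZE - 1)
#define SKETCH_MAX_ENTRIES    16

/* Round a size up to a whole number of chunks. */
static size_t sketch_chunk_ceiling(size_t size)
{
    return (size + SKETCH_CHUNKSIZE_MASK) & ~SKETCH_CHUNKSIZE_MASK;
}

typedef struct {
    uintptr_t chunk;   /* chunk base address (key) */
    void *node;        /* associated metadata node (value) */
} sketch_entry_t;

typedef struct {
    sketch_entry_t entries[SKETCH_MAX_ENTRIES];
    size_t n;
} sketch_registry_t;

/* Associate a chunk address with its node; returns -1 if the table is full. */
static int sketch_register(sketch_registry_t *reg, uintptr_t chunk, void *node)
{
    if (reg->n == SKETCH_MAX_ENTRIES)
        return -1;
    reg->entries[reg->n].chunk = chunk;
    reg->entries[reg->n].node = node;
    reg->n++;
    return 0;
}

/* Find the node registered for a chunk address, or NULL. */
static void *sketch_lookup(const sketch_registry_t *reg, uintptr_t chunk)
{
    for (size_t i = 0; i < reg->n; i++) {
        if (reg->entries[i].chunk == chunk)
            return reg->entries[i].node;
    }
    return NULL;
}
```

Because the payload fills the chunks completely, the metadata has to live elsewhere, and a lookup keyed by the chunk address is the only way back to it.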

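7.6. Freeing huge-size memory
huge_dalloc(tsdn, ptr);
The chunk and its node are first deregistered, that is, their index entry is deleted from chunks_rtree. The node is then removed from arena->huge.
Related structures and functions: arena->chunks_szsnad_cached, arena->chunks_ad_cached, arena_maybe_purge.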
