TCMalloc : Thread-Caching Malloc (Translation)

TCMalloc : Thread-Caching Malloc

Sanjay Ghemawat, Paul Menage <opensource@google.com>

TCMalloc is a memory allocator developed by Google and optimized for concurrent environments. It is also the conceptual foundation of Go's memory management: many of the concepts in this article correspond directly to concepts in Go's memory manager, and the overall approach is similar, so understanding how TCMalloc works makes it easier to understand Go's memory management. The original is an English introduction, translated here for easier reading.

Contents

TCMalloc : Thread-Caching Malloc

Motivation

Usage

Overview

Small Object Allocation (<= 32K)

Large Object Allocation (> 32K)

Spans

Deallocation

Central Free Lists for Small Objects

Garbage Collection of Thread Caches

Performance Notes

PTMalloc2 unittest

Caveats


Motivation

TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.

Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.

Usage

To use TCMalloc, just link TCMalloc into your application via the "-ltcmalloc" linker flag.
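
As a concrete illustration, the short program below is a hedged sketch: nothing in it refers to TCMalloc, and the build command in the comment assumes a typical g++ toolchain and library path. Linking it with "-ltcmalloc" is enough to route all of its malloc/free calls through TCMalloc, with no source changes.

   // alloc_demo.cc -- nothing here refers to TCMalloc; it is picked up at link time:
   //   g++ -O2 alloc_demo.cc -o alloc_demo -ltcmalloc     (assumed toolchain and paths)
   #include <cstdlib>
   #include <vector>

   int main() {
     std::vector<void*> blocks;
     blocks.reserve(100000);
     for (int i = 0; i < 100000; ++i) {
       blocks.push_back(std::malloc(64));   // small allocations, served from the thread cache
     }
     for (void* p : blocks) {
       std::free(p);
     }
     return 0;
   }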

You can use tcmalloc in applications you didn't compile yourself, by using LD_PRELOAD:

   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>

LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.

TCMalloc includes a heap checker and heap profiler as well. If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in libtcmalloc_minimal instead.

Overview

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

TCMalloc treats objects with size <= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K aligned region of memory). I.e., a large object is always page-aligned and occupies an integral number of pages.

A run of pages can be carved up into a sequence of small objects, each equally sized. For example a run of one page (4K) can be carved up into 32 objects of size 128 bytes each.

Small Object Allocation (<= 32K)

Each small object size maps to one of approximately 170 allocatable size-classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024. The size-classes are spaced so that small sizes are separated by 8 bytes, larger sizes by 16 bytes, even larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes >= ~2K) is 256 bytes.
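
As an illustrative sketch of this rounding (the breakpoints and step sizes below are assumptions chosen to mirror the description above, not TCMalloc's actual size-class table):

   #include <cstddef>

   // Round a request up with spacing that grows with the size:
   // 8-byte steps for the smallest classes, up to 256-byte steps near 32K.
   size_t RoundToSizeClass(size_t n) {
     size_t step;
     if (n <= 64)        step = 8;
     else if (n <= 256)  step = 16;
     else if (n <= 512)  step = 32;
     else if (n <= 1024) step = 64;    // e.g. 961..1024 all round up to 1024
     else if (n <= 2048) step = 128;
     else                step = 256;   // maximal spacing for sizes >= ~2K
     return (n + step - 1) / step * step;
   }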

A thread cache contains a singly linked list of free objects per size-class.

When allocating a small object: (1) We map its size to the corresponding size-class. (2) Look in the corresponding free list in the thread cache for the current thread. (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed-up allocation significantly because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.
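
A minimal sketch of this fast path is shown below; the type and field names are assumptions for illustration, not the real TCMalloc definitions.

   #include <cstddef>

   constexpr int kNumClasses = 170;          // "approximately 170" size-classes

   struct FreeObject { FreeObject* next; };  // the link is stored inside the free object itself

   struct ThreadCache {
     FreeObject* free_list[kNumClasses] = {nullptr};  // one singly linked list per size-class

     // Fast path: pop the first object of the per-class list; no locks are taken.
     void* Allocate(int size_class) {
       FreeObject* obj = free_list[size_class];
       if (obj == nullptr) return nullptr;   // empty: refill from the central free list
       free_list[size_class] = obj->next;
       return obj;
     }
   };

   thread_local ThreadCache tls_cache;       // every thread owns its own cache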

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

If the central free list is also empty: (1) We allocate a run of pages from the central page allocator. (2) Split the run into a set of objects of this size-class. (3) Place the new objects on the central free list. (4) As before, move some of these objects to the thread-local free list.
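
The sketch below illustrates step (2) of this path: carving a fetched run of pages into equal-sized objects threaded onto a free list (for example, one 4K page yields 4096 / 128 = 32 objects of 128 bytes). The function name and the raw-byte view of the run are assumptions for illustration.

   #include <cstddef>
   #include <cstdint>

   struct FreeObject { FreeObject* next; };

   // Carve a run of pages into objects of one size-class and return the
   // resulting singly linked free list (illustrative sketch, not TCMalloc code).
   FreeObject* CarveRun(void* run_start, size_t run_bytes, size_t object_size) {
     FreeObject* head = nullptr;
     uint8_t* p = static_cast<uint8_t*>(run_start);
     for (size_t off = 0; off + object_size <= run_bytes; off += object_size) {
       FreeObject* obj = reinterpret_cast<FreeObject*>(p + off);
       obj->next = head;              // push onto the new free list
       head = obj;
     }
     return head;                     // e.g. 32 objects for a 4K run of 128-byte objects
   }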

Large Object Allocation (> 32K)

A large object size (> 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists. For k < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs that have length >= 256 pages.

An allocation for k pages is satisfied by looking in the kth free list. If that free list is empty, we look in the next free list, and so forth. Eventually, we look in the last free list if necessary. If that fails, we fetch memory from the system (using sbrk, mmap, or by mapping in portions of /dev/mem).

If an allocation for k pages is satisfied by a run of pages of length > k, the remainder of the run is re-inserted back into the appropriate free list in the page heap.
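
The sketch below puts these two paragraphs together: an array of free lists indexed by run length, a search that starts at the kth list and walks upward, and re-insertion of the split-off remainder. Span here is just a minimal placeholder (spans are described in the next section), and all names are assumptions rather than TCMalloc internals.

   #include <cstddef>

   struct Span {
     size_t num_pages;   // length of this run of pages
     Span*  next;        // link in a page-heap free list
   };

   constexpr size_t kMaxPages = 256;

   struct PageHeap {
     // Index k (k < 256): runs of exactly k pages; index 256: runs of >= 256 pages.
     Span* free_lists[kMaxPages + 1] = {nullptr};

     void Insert(Span* span) {
       size_t idx = span->num_pages < kMaxPages ? span->num_pages : kMaxPages;
       span->next = free_lists[idx];
       free_lists[idx] = span;
     }

     Span* AllocatePages(size_t k) {
       for (size_t i = k; i <= kMaxPages; ++i) {            // kth list first, then larger runs
         if (free_lists[i] == nullptr) continue;
         Span* span = free_lists[i];
         free_lists[i] = span->next;
         if (span->num_pages > k) {                         // split off the remainder ...
           Insert(new Span{span->num_pages - k, nullptr});  // ... and put it back on its list
           span->num_pages = k;
         }
         return span;
       }
       return nullptr;   // in the real allocator: grow the heap via sbrk/mmap and retry
     }
   };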

Spans

The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a Span object. A span can either be allocated, or free. If free, the span is one of the entries in a page heap linked-list. If allocated, it is either a large object that has been handed off to the application, or a run of pages that have been split up into a sequence of small objects. If split into small objects, the size-class of the objects is recorded in the span.
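
A compact sketch of the state a span records, following this description (the field and enum names are assumptions, not TCMalloc's actual definitions):

   #include <cstddef>

   struct Span {
     size_t start_page;   // first page number of this contiguous run
     size_t num_pages;    // number of pages in the run
     enum State { ON_PAGE_HEAP_FREELIST, LARGE_OBJECT, SMALL_OBJECTS } state;
     int    size_class;   // meaningful only when state == SMALL_OBJECTS
     Span*  next;         // link used while the span sits on a free list
   };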

A central array indexed by page number can be used to find the span to which a page belongs. For example, span a below occupies 2 pages, span b occupies 1 page, span c occupies 5 pages and span d occupies 3 pages.

A 32-bit address space can fit 2^20 4K pages, so this central array takes 4MB of space, which seems acceptable. On 64-bit machines, we use a 3-level radix tree instead of an array to map from a page number to the corresponding span pointer.
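
The sketch below shows the 32-bit flat-array form of this mapping (with 4K pages, the page number is simply the address shifted right by 12 bits). The names are assumptions, and the 64-bit three-level radix tree is omitted.

   #include <cstddef>
   #include <cstdint>

   struct Span;                                    // the span type sketched above

   constexpr int    kPageShift = 12;               // 4K pages
   constexpr size_t kNumPages  = size_t{1} << 20;  // 2^20 pages cover a 32-bit address space

   // One pointer per page: 2^20 * 4 bytes = 4MB on a 32-bit machine, as noted above.
   static Span* page_map[kNumPages];

   inline Span* SpanForAddress(uintptr_t addr) {
     return page_map[addr >> kPageShift];          // page number -> owning span
   }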

Deallocation

When an object is deallocated, we compute its page number and look it up in the central array to find the corresponding span object. The span tells us whether or not the object is small, and its size-class if it is small. If the object is small, we insert it into the appropriate free list in the current thread's thread cache. If the thread cache now exceeds a predetermined size (2MB by default), we run a garbage collector that moves unused objects from the thread cache into central free lists.

If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p,q]. We also lookup the spans for pages p-1 and q+1. If either of these neighboring spans are free, we coalesce them with the [p,q] span. The resulting span is inserted into the appropriate free list in the page heap.
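
A sketch of this large-object path, coalescing with the neighboring spans at pages p-1 and q+1 when they are free, is shown below. SpanOfPage, RemoveFromFreeList and InsertIntoFreeList are hypothetical helpers standing in for the page-map lookup and page-heap list operations described earlier; they are declared but not implemented in this sketch.

   #include <cstddef>

   struct Span {
     size_t start_page;   // first page of the run
     size_t num_pages;    // run length in pages
     bool   free;         // currently sitting on a page-heap free list?
   };

   // Hypothetical helpers (not implemented here):
   Span* SpanOfPage(size_t page_number);   // central-array / radix-tree lookup
   void  RemoveFromFreeList(Span* span);
   void  InsertIntoFreeList(Span* span);

   // Return a large object's span [p, q] to the page heap, merging free neighbors.
   void FreeLargeSpan(Span* span) {
     size_t p = span->start_page;
     size_t q = span->start_page + span->num_pages - 1;

     Span* before = SpanOfPage(p - 1);
     if (before != nullptr && before->free) {
       RemoveFromFreeList(before);
       span->start_page = before->start_page;   // grow the span to the left
       span->num_pages += before->num_pages;
     }
     Span* after = SpanOfPage(q + 1);
     if (after != nullptr && after->free) {
       RemoveFromFreeList(after);
       span->num_pages += after->num_pages;      // grow the span to the right
     }
     span->free = true;
     InsertIntoFreeList(span);
   }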

Central Free Lists for Small Objects

As mentioned before, we keep a central free list for each size-class. Each central free list is organized as a two-level data structure: a set of spans, and a linked list of free objects per span.

An object is allocated from a central free list by removing the first entry from the linked list of some span. (If all spans have empty linked lists, a suitably sized span is first allocated from the central page heap.)

An object is returned to a central free list by adding it to the linked list of its containing span. If the linked list length now equals the total number of small objects in the span, this span is now completely free and is returned to the page heap.
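
A sketch of one size-class's central free list as this two-level structure is shown below; all of the names are illustrative assumptions.

   #include <cstddef>

   struct FreeObject { FreeObject* next; };

   struct CentralSpan {
     FreeObject*  free_objects  = nullptr;   // free objects carved from this span
     size_t       free_count    = 0;         // current length of that list
     size_t       total_objects = 0;         // how many objects the span was carved into
     CentralSpan* next          = nullptr;   // next span belonging to this size-class
   };

   struct CentralFreeList {
     CentralSpan* spans = nullptr;

     void* Allocate() {
       for (CentralSpan* s = spans; s != nullptr; s = s->next) {
         if (s->free_objects == nullptr) continue;
         FreeObject* obj = s->free_objects;   // remove the first entry of some span's list
         s->free_objects = obj->next;
         --s->free_count;
         return obj;
       }
       return nullptr;   // all spans empty: fetch a suitably sized span from the page heap
     }

     void Deallocate(CentralSpan* s, FreeObject* obj) {
       obj->next = s->free_objects;           // add the object back to its span's list
       s->free_objects = obj;
       if (++s->free_count == s->total_objects) {
         // The span is now completely free and would be returned to the page heap.
       }
     }
   };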

Garbage Collection of Thread Caches

A thread cache is garbage collected when the combined size of all objects in the cache exceeds 2MB. The garbage collection threshold is automatically decreased as the number of threads increases so that we don't waste an inordinate amount of memory in a program with lots of threads.

We walk over all free lists in the cache and move some number of objects from the free list to the corresponding central list.

The number of objects to be moved from a free list is determined using a per-list low-water-mark L. L records the minimum length of the list since the last garbage collection. Note that we could have shortened the list by L objects at the last garbage collection without requiring any extra accesses to the central list. We use this past history as a predictor of future accesses and move L/2 objects from the thread cache free list to the corresponding central free list. This algorithm has the nice property that if a thread stops using a particular size, all objects of that size will quickly move from the thread cache to the central free list where they can be used by other threads.
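
The sketch below captures that heuristic with assumed names: each list records the minimum length seen since the last garbage collection, and at collection time half of that slack is released to the corresponding central free list.

   #include <cstddef>
   #include <algorithm>

   // Illustrative per-list bookkeeping for the low-water-mark heuristic.
   struct FreeListStats {
     size_t length    = 0;   // current number of objects on the thread-local list
     size_t low_water = 0;   // minimum length observed since the last GC

     void OnPop()  { --length; low_water = std::min(low_water, length); }
     void OnPush() { ++length; }

     // At GC time: the list could have been L = low_water objects shorter the
     // whole time, so move L/2 of them back to the central free list.
     size_t ObjectsToRelease() const { return low_water / 2; }
     void   AfterGC()                { low_water = length; }   // reset the mark
   };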

Performance Notes

PTMalloc2 unittest

The PTMalloc2 package (now part of glibc) contains a unittest program t-test1.c. This forks a number of threads and performs a series of allocations and deallocations in each thread; the threads do not communicate other than by synchronization in the memory allocator.

t-test1 (included in google-perftools/tests/tcmalloc, and compiled as ptmalloc_unittest1) was run with a varying numbers of threads (1-20) and maximum allocation sizes (64 bytes - 32Kbytes). These tests were run on a 2.4GHz dual Xeon system with hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with one million operations per thread in each test. In each case, the test was run once normally, and once with LD_PRELOAD=libtcmalloc.so.

The graphs below show the performance of TCMalloc vs PTMalloc2 for several different metrics. Firstly, total operations (millions) per elapsed second vs max allocation size, for varying numbers of threads. The raw data used to generate these graphs (the output of the "time" utility) is available in t-test1.times.txt. 

  • TCMalloc is much more consistently scalable than PTMalloc2 - for all thread counts >1 it achieves ~7-9 million ops/sec for small allocations, falling to ~2 million ops/sec for larger allocations. The single-thread case is an obvious outlier, since it is only able to keep a single processor busy and hence can achieve fewer ops/sec. PTMalloc2 has a much higher variance on operations/sec - peaking somewhere around 4 million ops/sec for small allocations and falling to <1 million ops/sec for larger allocations.
  • TCMalloc is faster than PTMalloc2 in the vast majority of cases, and particularly for small allocations. Contention between threads is less of a problem in TCMalloc.
  • TCMalloc's performance drops off as the allocation size increases. This is because the per-thread cache is garbage-collected when it hits a threshold (defaulting to 2MB). With larger allocation sizes, fewer objects can be stored in the cache before it is garbage-collected.
  • There is a noticeable drop in the TCMalloc performance at ~32K maximum allocation size; at larger sizes performance drops less quickly. This is due to the 32K maximum size of objects in the per-thread caches; for objects larger than this tcmalloc allocates from the central page heap.

Next, operations (millions) per second of CPU time vs number of threads, for max allocation size 64 bytes - 128 Kbytes

Here we see again that TCMalloc is both more consistent and more efficient than PTMalloc2. For max allocation sizes <32K, TCMalloc typically achieves ~2-2.5 million ops per second of CPU time with a large number of threads, whereas PTMalloc achieves generally 0.5-1 million ops per second of CPU time, with a lot of cases achieving much less than this figure. Above 32K max allocation size, TCMalloc drops to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU time is being burned spinning waiting for locks in the heavily multi-threaded case).

Caveats

For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.

TCMalloc may be somewhat more memory hungry than other mallocs, though it tends not to have the huge blowups that can happen with other mallocs. In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades a little bit of speed for more space efficiency.

TCMalloc currently does not return any memory to the system.

Don't try to load TCMalloc into a running binary (e.g., using JNI in Java programs). The binary will have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able to handle such objects.
