TCMalloc: Thread-Caching Malloc (Translation)

Motivation

TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.

Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.


Usage

To use TCMalloc, just link tcmalloc into your application via the "-ltcmalloc" linker flag.

You can use tcmalloc in applications you didn't compile yourself, by using LD_PRELOAD:

   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" 

LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.

TCMalloc includes a heap checker and heap profiler as well.

If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in libtcmalloc_minimal instead.


Overview

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

TCMalloc treats objects with size <= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K aligned region of memory). I.e., a large object is always page-aligned and occupies an integral number of pages.

A run of pages can be carved up into a sequence of small objects, each equally sized. For example a run of one page (4K) can be carved up into 32 objects of size 128 bytes each.


Small Object Allocation

Each small object size maps to one of approximately 170 allocatable size-classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024. The size-classes are spaced so that small sizes are separated by 8 bytes, larger sizes by 16 bytes, even larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes >= ~2K) is 256 bytes.

A thread cache contains a singly linked list of free objects per size-class.

When allocating a small object: (1) We map its size to the corresponding size-class. (2) Look in the corresponding free list in the thread cache for the current thread. (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed-up allocation significantly because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

If the central free list is also empty: (1) We allocate a run of pages from the central page allocator. (2) Split the run into a set of objects of this size-class. (3) Place the new objects on the central free list. (4) As before, move some of these objects to the thread-local free list.


Large Object Allocation

A large object size (> 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists. For k < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs that have length >= 256 pages.

An allocation for k pages is satisfied by looking in the kth free list. If that free list is empty, we look in the next free list, and so forth. Eventually, we look in the last free list if necessary. If that fails, we fetch memory from the system (using sbrk, mmap, or by mapping in portions of /dev/mem).

If an allocation for k pages is satisfied by a run of pages of length > k, the remainder of the run is re-inserted back into the appropriate free list in the page heap.


Spans

The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a Span object. A span can either be allocated, or free. If free, the span is one of the entries in a page heap linked-list. If allocated, it is either a large object that has been handed off to the application, or a run of pages that have been split up into a sequence of small objects. If split into small objects, the size-class of the objects is recorded in the span.

A central array indexed by page number can be used to find the span to which a page belongs. For example, span a below occupies 2 pages, span b occupies 1 page, span c occupies 5 pages, and span d occupies 3 pages.

A 32-bit address space can fit 2^20 4K pages, so this central array takes 4MB of space, which seems acceptable. On 64-bit machines, we use a 3-level radix tree instead of an array to map from a page number to the corresponding span pointer.

 


Deallocation

When an object is deallocated, we compute its page number and look it up in the central array to find the corresponding span object. The span tells us whether or not the object is small, and its size-class if it is small. If the object is small, we insert it into the appropriate free list in the current thread's thread cache. If the thread cache now exceeds a predetermined size (2MB by default), we run a garbage collector that moves unused objects from the thread cache into central free lists.

If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p,q]. We also lookup the spans for pages p-1 and q+1. If either of these neighboring spans are free, we coalesce them with the [p,q] span. The resulting span is inserted into the appropriate free list in the page heap.

 


Central Free Lists for Small Objects

As mentioned before, we keep a central free list for each size-class. Each central free list is organized as a two-level data structure: a set of spans, and a linked list of free objects per span.

An object is allocated from a central free list by removing the first entry from the linked list of some span. (If all spans have empty linked lists, a suitably sized span is first allocated from the central page heap.)

An object is returned to a central free list by adding it to the linked list of its containing span. If the linked list length now equals the total number of small objects in the span, this span is now completely free and is returned to the page heap.

 


Garbage Collection of Thread Caches

A thread cache is garbage collected when the combined size of all objects in the cache exceeds 2MB. The garbage collection threshold is automatically decreased as the number of threads increases so that we don't waste an inordinate amount of memory in a program with lots of threads.

We walk over all free lists in the cache and move some number of objects from the free list to the corresponding central list.

The number of objects to be moved from a free list is determined using a per-list low-water-mark L. L records the minimum length of the list since the last garbage collection. Note that we could have shortened the list by L objects at the last garbage collection without requiring any extra accesses to the central list. We use this past history as a predictor of future accesses and move L/2 objects from the thread cache free list to the corresponding central free list. This algorithm has the nice property that if a thread stops using a particular size, all objects of that size will quickly move from the thread cache to the central free list where they can be used by other threads.

 


Caveats

TCMalloc may be somewhat more memory hungry than other mallocs (but tends not to have the huge blowups that can happen with other mallocs). In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades off a little bit of speed for more space efficiency.

TCMalloc currently does not return any memory to the system.

Don't try to load TCMalloc into a running binary (e.g., using JNI in Java programs). The binary will have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able to handle such objects.


Performance Notes

Here is a log of some of the performance improvements seen by switching to tcmalloc:

 
