TCMalloc Memory Management Code Analysis

1. Introduction

The basic algorithm flow and data structures were analyzed in the previous parts; there turned out to be far more material there than expected. With the algorithm flow and the allocation/reclamation source code covered, the overall pattern is clear. In practice, though, allocation and management are not as simple as a `new` or a `delete` in application code. In particular, the relationships among the per-thread cache, the global cache, Span, and PageHeap mentioned earlier all need a set of management flows, and in some places something close to a dedicated algorithm.
Beyond the low-level primitives and the top-level control, this flow management is the key piece, so this article analyzes that code. The primitive linked-list routines and similar low-level code are not analyzed again; that is standard data-structure material. If interested, compare this implementation with the versions in data-structure textbooks.

2. Applying TCMallocGuard

tcmalloc.cc defines a static variable of the class TCMallocGuard, whose purpose is to complete a series of initialization steps and flow setup before main() is called:

// The constructor allocates an object to ensure that initialization
// runs before main(), and therefore we do not have a chance to become
// multi-threaded before initialization.  We also create the TSD key
// here.  Presumably by the time this constructor runs, glibc is in
// good enough shape to handle pthread_key_create().
//
// The destructor prints stats when the program exits.
class TCMallocGuard {
 public:
  TCMallocGuard() {
    TCMallocInternalFree(TCMallocInternalMalloc(1));
    ThreadCache::InitTSD();
    TCMallocInternalFree(TCMallocInternalMalloc(1));
  }
};

static TCMallocGuard module_enter_exit_hook;

The constructor is interesting: the creation of a Thread Specific Data (TSD) key is sandwiched between allocating and freeing a single byte. As analyzed earlier, InitTSD just creates a key. There must be a reason for this ordering, and the comment is not entirely explicit; judging from it, the first malloc/free pair forces tcmalloc's own initialization to run while the program is still single-threaded, so that by the time pthread_key_create() is called glibc is in good enough shape, which keeps the setup sequence safe.

As mentioned before, the basic unit of managed memory here is the Span (or the Object), while from the OS's point of view the basic unit is the Page. As a general rule, the smaller the page, the finer the granularity: management precision goes up but so does management cost, meaning allocation and reclamation get slower. Conversely, a larger page means coarser granularity and lower precision, but cheaper management, which shows up as faster allocation and reclamation. That is easy to accept: managing one person and managing a hundred are not the same.

3. ThreadCache Flow Control

Recall the function CreateCacheIfNecessary analyzed earlier; it is called once, on a thread's first use after startup, to allocate the thread's cache. It is invoked by the functions below, whose names make their purpose obvious:

template <typename Policy, typename CapacityPtr>
inline void* ABSL_ATTRIBUTE_ALWAYS_INLINE AllocSmall(Policy policy,
                                                     size_t size_class,
                                                     size_t size,
                                                     CapacityPtr capacity) {
  ASSERT(size_class != 0);
  void* result;

  if (UsePerCpuCache()) {
    result = tc_globals.cpu_cache().Allocate<Policy::handle_oom>(size_class);
  } else {
    result = ThreadCache::GetCache()->Allocate<Policy::handle_oom>(size_class);
  }

  if (!Policy::can_return_nullptr()) {
    ASSUME(result != nullptr);
  }

  if (ABSL_PREDICT_FALSE(result == nullptr)) {
    SetCapacity(0, capacity);
    return nullptr;
  }
  size_t weight;
  if (ABSL_PREDICT_FALSE(weight = ShouldSampleAllocation(size))) {
    return SampleifyAllocation(policy, size, weight, size_class, result,
                               nullptr, capacity);
  }
  SetClassCapacity(size_class, capacity);
  return result;
}
inline ThreadCache* ABSL_ATTRIBUTE_ALWAYS_INLINE
ThreadCache::GetCacheIfPresent() {
#ifdef ABSL_HAVE_TLS
  // __thread is faster
  return thread_local_data_;
#else
  return tsd_inited_
             ? reinterpret_cast<ThreadCache*>(pthread_getspecific(heap_key_))
             : nullptr;
#endif
}

inline ThreadCache* ThreadCache::GetCache() {
  ThreadCache* tc = GetCacheIfPresent();
  return (ABSL_PREDICT_TRUE(tc != nullptr)) ? tc : CreateCacheIfNecessary();
}

Likewise, destroying a ThreadCache was analyzed earlier: the DestroyThreadCache function.
One question remains: how the ThreadCache's size is controlled and when the list-management actions fire. tcmalloc uses a slow-start algorithm to control each FreeList's capacity. What is slow start here? It is simply a way of controlling the FreeList's length, and that length is effectively the size of the memory cache. Why does the length need controlling at all?
First, if the list is too long, space is wasted: only ten Objects are needed, yet a thousand sit cached. Too short, and the thread must keep going back to the CentralFreeList, which requires a lock and wastes time. Second, the same trade-off bites equally on the reclamation side.
In tcmalloc, slow start grows a FreeList based on how frequently it is used: heavy, repeated use enlarges the allowed length. If a list is mainly used for returning memory, i.e. objects are frequently removed from it in batches, its maximum length is instead clamped toward a specified bound. Here is the relevant code:

void* ThreadCache::FetchFromCentralCache(size_t size_class, size_t byte_size) {
  FreeList* list = &list_[size_class];
  ASSERT(list->empty());
  const int batch_size = tc_globals.sizemap().num_objects_to_move(size_class);

  const int num_to_move = std::min<int>(list->max_length(), batch_size);
  void* batch[kMaxObjectsToMove];
  int fetch_count =
      tc_globals.transfer_cache().RemoveRange(size_class, batch, num_to_move);
  if (fetch_count == 0) {
    return nullptr;
  }

  if (--fetch_count > 0) {
    size_ += byte_size * fetch_count;
    list->PushBatch(fetch_count, batch + 1);
  }

  // Increase max length slowly up to batch_size.  After that,
  // increase by batch_size in one shot so that the length is a
  // multiple of batch_size.
  if (list->max_length() < batch_size) {
    list->set_max_length(list->max_length() + 1);
  } else {
    // Don't let the list get too long.  In 32 bit builds, the length
    // is represented by a 16 bit int, so we need to watch out for
    // integer overflow.
    int new_length = std::min<int>(list->max_length() + batch_size,
                                   kMaxDynamicFreeListLength);
    // The list's max_length must always be a multiple of batch_size,
    // and kMaxDynamicFreeListLength is not necessarily a multiple
    // of batch_size.
    new_length -= new_length % batch_size;
    ASSERT(new_length % batch_size == 0);
    list->set_max_length(new_length);
  }
  return batch[0];
}

Note the max_length() accesses above, together with the English comment: grow the maximum slowly up to batch_size, then by a whole batch at a time. batch_size comes from num_objects_to_move and differs per size class. While max_length is still below batch_size, every call into FetchFromCentralCache bumps it by 1; once it has reached batch_size, each call grows it by batch_size (kept a multiple of batch_size), up to the ceiling kMaxDynamicFreeListLength (8192). The ceiling exists to avoid holding the lock too long when contending with other threads for objects from the CentralFreeList. Reclamation works similarly:

void ThreadCache::DeallocateSlow(void* ptr, FreeList* list, size_t size_class) {
  if (ABSL_PREDICT_FALSE(list->length() > list->max_length())) {
    ListTooLong(list, size_class);
  }
  if (size_ >= max_size_) {
    Scavenge();
  }
}

void ThreadCache::ListTooLong(FreeList* list, size_t size_class) {
  const int batch_size = tc_globals.sizemap().num_objects_to_move(size_class);
  ReleaseToCentralCache(list, size_class, batch_size);

  // If the list is too long, we need to transfer some number of
  // objects to the central cache.  Ideally, we would transfer
  // num_objects_to_move, so the code below tries to make max_length
  // converge on num_objects_to_move.

  if (list->max_length() < batch_size) {
    // Slow start the max_length so we don't overreserve.
    list->set_max_length(list->max_length() + 1);
  } else if (list->max_length() > batch_size) {
    // If we consistently go over max_length, shrink max_length.  If we don't
    // shrink it, some amount of memory will always stay in this freelist.
    list->set_length_overages(list->length_overages() + 1);
    if (list->length_overages() > kMaxOverages) {
      ASSERT(list->max_length() > batch_size);
      list->set_max_length(list->max_length() - batch_size);
      list->set_length_overages(0);
    }
  }
}

As noted above, this mirrors allocation: below batch_size, max_length grows by 1 each time. Once the list keeps exceeding max_length and the overage count passes kMaxOverages (3), max_length is reduced by one batch_size and length_overages is reset to 0. Now look at the garbage-collection side:

// Release idle memory to the central cache
void ThreadCache::Scavenge() {
  // If the low-water mark for the free list is L, it means we would
  // not have had to allocate anything from the central cache even if
  // we had reduced the free list size by L.  We aim to get closer to
  // that situation by dropping L/2 nodes from the free list.  This
  // may not release much memory, but if so we will call scavenge again
  // pretty soon and the low-water marks will be high on that call.
  for (int size_class = 0; size_class < kNumClasses; size_class++) {
    FreeList* list = &list_[size_class];
    const int lowmark = list->lowwatermark();
    if (lowmark > 0) {
      const int drop = (lowmark > 1) ? lowmark / 2 : 1;
      ReleaseToCentralCache(list, size_class, drop);

      // Shrink the max length if it isn't used.  Only shrink down to
      // batch_size -- if the thread was active enough to get the max_length
      // above batch_size, it will likely be that active again.  If
      // max_length shinks below batch_size, the thread will have to
      // go through the slow-start behavior again.  The slow-start is useful
      // mainly for threads that stay relatively idle for their entire
      // lifetime.
      const int batch_size =
          tc_globals.sizemap().num_objects_to_move(size_class);
      if (list->max_length() > batch_size) {
        list->set_max_length(
            std::max<int>(list->max_length() - batch_size, batch_size));
      }
    }
    list->clear_lowwatermark();
  }

  IncreaseCacheLimit();
}

When a FreeList goes over its limit (or the cache as a whole exceeds max_size_), reclamation kicks in, moving objects from the ThreadCache back to the CentralCache; the two caches' free lists grow and shrink against each other. The English comment in Scavenge explains the heuristic clearly: if the list's low-water mark is L, the thread could have gotten by with L fewer cached objects since the last scavenge, so dropping L/2 nodes moves toward that point; if little memory is released, Scavenge will run again soon and the low-water marks will be higher on that pass.
As mentioned earlier, the ThreadCache manages individual Objects, while the CentralCache manages Spans, each of which carries a linked list of Objects. So although both are called FreeLists, they must be kept distinct. In the CentralFreeList, spans are tracked by whether their object lists still hold free objects: allocation fetches objects from spans on the nonempty lists, and a span whose object list is exhausted is taken off those lists until objects are returned to it.

4. PageHeap Flow Control

tcmalloc abstracts the dynamic memory it manages as a whole into the PageHeap; when the CentralCache runs short, memory is obtained from here. The PageHeap has two strategies: spans of up to 128 pages are cached on linked lists, one list per span length in pages, while spans larger than 128 pages are kept in an ordered set. In other words, a span of fewer than 128 pages is a small span; otherwise it is a large span.
GrowHeap was covered earlier; here is the code path by which the CentralCache returns spans to the PageHeap:

template <class Forwarder>
inline Span* CentralFreeList<Forwarder>::ReleaseToSpans(void* object,
                                                        Span* span,
                                                        size_t object_size) {
  if (ABSL_PREDICT_FALSE(span->FreelistEmpty(object_size))) {
#ifdef TCMALLOC_SMALL_BUT_SLOW
    nonempty_.prepend(span);
#else
    const uint8_t index = GetFirstNonEmptyIndex();
    nonempty_.Add(span, index);
    span->set_nonempty_index(index);
#endif
  }

#ifdef TCMALLOC_SMALL_BUT_SLOW
  // We maintain a single nonempty list for small-but-slow. Also, we do not
  // collect histogram stats due to performance issues.
  if (ABSL_PREDICT_TRUE(span->FreelistPush(object, object_size))) {
    return nullptr;
  }
  nonempty_.remove(span);
  return span;
#else
  const uint8_t prev_index = span->nonempty_index();
  const uint8_t prev_bitwidth = absl::bit_width(span->Allocated());
  if (ABSL_PREDICT_FALSE(!span->FreelistPush(object, object_size))) {
    // Update the histogram as the span is full and will be removed from the
    // nonempty_ list.
    RecordSpanUtil(prev_bitwidth, /*increase=*/false);
    nonempty_.Remove(span, prev_index);
    return span;
  }
  // As the objects are being added to the span, its utilization might change.
  // We remove the stale utilization from the histogram and add the new
  // utilization to the histogram after we release objects to the span.
  const uint8_t cur_bitwidth = absl::bit_width(span->Allocated());
  if (cur_bitwidth != prev_bitwidth) {
    RecordSpanUtil(prev_bitwidth, /*increase=*/false);
    RecordSpanUtil(cur_bitwidth, /*increase=*/true);
    // If span allocation changes so that it moved to a different nonempty_
    // list, we remove it from the previous list and add it to the desired
    // list indexed by cur_index.
    const uint8_t cur_index = IndexFor(cur_bitwidth);
    if (cur_index != prev_index) {
      nonempty_.Remove(span, prev_index);
      nonempty_.Add(span, cur_index);
      span->set_nonempty_index(cur_index);
    }
  }
  return nullptr;
#endif
}
static void ReturnSpansToPageHeap(MemoryTag tag, absl::Span<Span*> free_spans,
                                  size_t objects_per_span)
    ABSL_LOCKS_EXCLUDED(pageheap_lock) {
  absl::base_internal::SpinLockHolder h(&pageheap_lock);
  for (Span* const free_span : free_spans) {
    ASSERT(tag == GetMemoryTag(free_span->start_address()));
    tc_globals.page_allocator().Delete(free_span, objects_per_span, tag);
  }
}

For the page-to-span mapping, tcmalloc uses a radix tree (two- and three-level variants are supported, selected by a build option; two levels is the default). A radix tree is a search structure, and the name gives away its job here: quickly locating a Span's descriptor so that Spans and Pages can be managed fast, which ties together the PageMap and PageId discussed earlier. By default only the first level is allocated up front (about 2KB); second-level nodes are allocated on demand, simply to reduce the map's own memory footprint.

5. Span Flow Control

Span management runs through the entire tcmalloc flow: whether in the ThreadCache, the CentralCache, or the PageHeap, Spans must be managed. The complication is that some paths must split spans while others must merge them, and the timing of splits and merges is decided by the algorithms above. From the earlier analysis, 1 to N pages form a Span (with 128 pages as the small/large boundary). A span has three states: IN_USE, ON_NORMAL_FREELIST, and ON_RETURNED_FREELIST; these names should look familiar, since the free lists use them to distinguish spans and objects in different states.
A span carries a linked list of objects, and the object size is determined by the size class. tcmalloc coalesces adjacent free spans in the same state into one larger span; conversely, spans can be split down to what is actually requested, as mentioned repeatedly before. Here is the code:

void PageHeap::MergeIntoFreeList(Span* span) {
  ASSERT(span->location() != Span::IN_USE);
  span->set_freelist_added_time(absl::base_internal::CycleClock::Now());

  // Coalesce -- we guarantee that "p" != 0, so no bounds checking
  // necessary.  We do not bother resetting the stale pagemap
  // entries for the pieces we are merging together because we only
  // care about the pagemap entries for the boundaries.
  //
  // Note that only similar spans are merged together.  For example,
  // we do not coalesce "returned" spans with "normal" spans.
  const PageId p = span->first_page();
  const Length n = span->num_pages();
  Span* prev = pagemap_->GetDescriptor(p - Length(1));
  if (prev != nullptr && prev->location() == span->location()) {
    // Merge preceding span into this span
    ASSERT(prev->last_page() + Length(1) == p);
    const Length len = prev->num_pages();
    span->AverageFreelistAddedTime(prev);
    RemoveFromFreeList(prev);
    Span::Delete(prev);
    span->set_first_page(span->first_page() - len);
    span->set_num_pages(span->num_pages() + len);
    pagemap_->Set(span->first_page(), span);
  }
  Span* next = pagemap_->GetDescriptor(p + n);
  if (next != nullptr && next->location() == span->location()) {
    // Merge next span into this span
    ASSERT(next->first_page() == p + n);
    const Length len = next->num_pages();
    span->AverageFreelistAddedTime(next);
    RemoveFromFreeList(next);
    Span::Delete(next);
    span->set_num_pages(span->num_pages() + len);
    pagemap_->Set(span->last_page(), span);
  }

  PrependToFreeList(span);
}

Many details are still not fully covered here, including the handling of huge objects, but the overall pattern is as shown. Read the source with these flows in mind, and consult the official documentation or other people's analyses online when something is unclear.

6. Summary

Working through this series of GC and memory-management sources, the takeaway is that programming really is the most basic data structures plus algorithms. With ideas, action, and reflection, iterated over and over, a framework or a codebase can be truly digested. Do not try to digest every codebase in existence; even with the ability, there is no time. As the Records of the Three Kingdoms says of Zhuge Liang, he "grasped only the essentials": once one or two related subjects are understood thoroughly, for similar subjects it is enough to grasp the key points and analyze only the differences.
The goal of learning is not learning itself but absorbing, gaining, and finally understanding. As the saying goes, "Only with the six senses purified is the Way attained; stepping back turns out to be moving forward." The Chinese tradition distinguishes technique from principle, which is really the relationship between practice and theory, and computing happens to be a field where the two are tightly combined. Plough when it is sunny, read when it rains; keep working with both hands and mind, and the future will take care of itself.
