Preface
I've been tinkering with memory management lately. I first wrote a very simple allocator of my own, but benchmarking it against malloc showed a large performance gap. Then I stumbled upon TCMalloc online, downloaded it, and ran a quick comparison against malloc; to my surprise, tcmalloc really is, as it claims, many times faster than glibc's malloc. So I promptly dropped my simple allocator and switched to tcmalloc. I also became very curious about how tcmalloc is implemented, so I decided to read its source code and find out.
Function Entry Points
I suspect that for most people, the first thing they want when exploring a library's source is to find the entry functions, and I'm no exception. For code this impressive, I couldn't wait to see how it works; with no patience for searching the web or grepping through the code, I chose the simplest effective approach: the debugger.
tcmalloc's entry functions are defined in tcmalloc.cc. The entry code for malloc and free looks like this:
// CAVEAT: The code structure below ensures that MallocHook methods are always
// called from the stack frame of the invoked allocation function.
// heap-checker.cc depends on this to start a stack trace from
// the call to the (de)allocation function.
extern "C" PERFTOOLS_DLL_DECL void* tc_malloc(size_t size) __THROW {
  void* result = do_malloc_or_cpp_alloc(size);
  MallocHook::InvokeNewHook(result, size);
  return result;
}

extern "C" PERFTOOLS_DLL_DECL void tc_free(void* ptr) __THROW {
  MallocHook::InvokeDeleteHook(ptr);
  do_free(ptr);
}

// The default "do_free" that uses the default callback.
inline void do_free(void* ptr) {
  return do_free_with_callback(ptr, &InvalidFree);
}
Stepping through this in the debugger, the first question popped into my head:
1) My test program simply links against the tcmalloc library; how did a call to malloc end up here?
I guessed it probably replaces the entry address of glibc's malloc. But that isn't my focus right now; what I care about is how memory is allocated and why it is so efficient.
tc_malloc is very simple, just two lines. The first line clearly performs the allocation. What about the second? From the comments it appears to be for tracing and debugging, which again is not my current focus.
Next, the code of do_malloc_or_cpp_alloc:
// TODO(willchan): Investigate whether or not inlining this much is harmful to
// performance.
// This is equivalent to do_malloc() except when tc_new_mode is set to true.
// Otherwise, it will run the std::new_handler if set.
inline void* do_malloc_or_cpp_alloc(size_t size) {
  return tc_new_mode ? cpp_alloc(size, true) : do_malloc(size);
}
So this function is just a router that picks an allocation function based on tc_new_mode. What exactly is tc_new_mode, and what is it for?
Searching the tcmalloc source for tc_new_mode turns up its setter:
// This function behaves similarly to MSVC's _set_new_mode.
// If flag is 0 (default), calls to malloc will behave normally.
// If flag is 1, calls to malloc will behave like calls to new,
// and the std_new_handler will be invoked on failure.
// Returns the previous mode.
extern "C" PERFTOOLS_DLL_DECL int tc_set_new_mode(int flag) __THROW {
  int old_mode = tc_new_mode;
  tc_new_mode = flag;
  return old_mode;
}
As the comments make clear, this value decides whether allocation behaves like plain malloc or like C++'s new: if it is set to 1 and an allocation fails, std::new_handler is invoked, which can throw an exception to the application.
The function that actually performs the allocation is do_malloc:
inline void* do_malloc(size_t size) {
  void* ret = NULL;
  // The following call forces module initialization
  ThreadCache* heap = ThreadCache::GetCache();
  if (size <= kMaxSize) {
    size_t cl = Static::sizemap()->SizeClass(size);
    size = Static::sizemap()->class_to_size(cl);
    if ((FLAGS_tcmalloc_sample_parameter > 0) && heap->SampleAllocation(size)) {
      ret = DoSampledAllocation(size);
    } else {
      // The common case, and also the simplest.  This just pops the
      // size-appropriate freelist, after replenishing it if it's empty.
      ret = CheckedMallocResult(heap->Allocate(size, cl));
    }
  } else {
    ret = do_malloc_pages(heap, size);
  }
  if (ret == NULL) errno = ENOMEM;
  return ret;
}
kMaxSize is a constant, 256*1024. The logic is simple: when the requested size does not exceed kMaxSize, the allocation is served from the ThreadCache (setting aside for now the sampling path taken when FLAGS_tcmalloc_sample_parameter > 0). When the size exceeds kMaxSize, memory is allocated in whole pages. Recall tcmalloc's documentation: its memory pools are split into thread-local caches and a central shared cache; allocations under 32K come from the local cache, and anything larger goes to the central page heap. The code largely confirms this, except that the document says 32K while the code says 256K. Checking older releases shows kMaxSize used to be 32K and was raised to 256K in the current version.
Thread-Local Cache: ClassSize Analysis
TCMalloc's Small-Object Allocation
First, let's review the small-object allocation scheme from the TCMalloc documentation. Each thread keeps its own cache pool holding memory objects of various sizes. A small allocation proceeds as follows:
1. Map the requested size to the corresponding size class.
2. Look up that size class's free list in the current thread's local cache.
3. If the list is non-empty, remove its first node and return it to the caller.
Questions
1. How are the small-object size classes chosen?
2. How is an arbitrary size below kMaxSize mapped onto one of the cached object sizes?
This post digs through the source to answer these two questions.
SizeMap Analysis
The do_malloc function contains these two lines:
size_t cl = Static::sizemap()->SizeClass(size);
size = Static::sizemap()->class_to_size(cl);
It's not hard to see that these two lines map size onto the closest cached object size. Next we dig into SizeClass(size_t) and class_to_size(size_t), both defined in common.h:
class SizeMap {
 private:
  ...  // other members we don't care about yet

  //-------------------------------------------------------------------
  // Mapping from size to size_class and vice versa
  //-------------------------------------------------------------------

  // Sizes <= 1024 have an alignment >= 8.  So for such sizes we have an
  // array indexed by ceil(size/8).  Sizes > 1024 have an alignment >= 128.
  // So for these larger sizes we have an array indexed by ceil(size/128).
  //
  // We flatten both logical arrays into one physical array and use
  // arithmetic to compute an appropriate index.  The constants used by
  // ClassIndex() were selected to make the flattening work.
  //
  // Examples:
  //   Size       Expression                      Index
  //   -------------------------------------------------------
  //   0          (0 + 7) / 8                     0
  //   1          (1 + 7) / 8                     1
  //   ...
  //   1024       (1024 + 7) / 8                  128
  //   1025       (1025 + 127 + (120<<7)) / 128   129
  //   ...
  //   32768      (32768 + 127 + (120<<7)) / 128  376
  static const int kMaxSmallSize = 1024;
  static const size_t kClassArraySize =
      ((kMaxSize + 127 + (120 << 7)) >> 7) + 1;
  unsigned char class_array_[kClassArraySize];

  // Compute index of the class_array[] entry for a given size
  static inline int ClassIndex(int s) {
    ASSERT(0 <= s);
    ASSERT(s <= kMaxSize);
    const bool big = (s > kMaxSmallSize);
    const int add_amount = big ? (127 + (120<<7)) : 7;
    const int shift_amount = big ? 7 : 3;
    return (s + add_amount) >> shift_amount;
  }

  // Mapping from size class to max size storable in that class
  size_t class_to_size_[kNumClasses];

 public:
  inline int SizeClass(int size) {
    return class_array_[ClassIndex(size)];
  }

  ...  // other members we don't care about yet

  // Mapping from size class to max size storable in that class
  inline size_t class_to_size(size_t cl) {
    return class_to_size_[cl];
  }
};
So SizeClass simply returns an element of class_array_, whose index is computed by ClassIndex. The logic of ClassIndex is simple: sizes <= 1024 use 8-byte granularity and sizes > 1024 use 128-byte granularity, so [0,1,2,…,1024] maps to [0,1,…,128] while [1025,1026,…,kMaxSize] maps to [9,10,…,2048]. We want to merge these two index ranges into one array ordered by the original size. Since 1024 maps to 128, 1025 ought to map to 129; adding 120 to every index of the second range achieves exactly that, and the merged range becomes [0,1,…,128,129,…,2168].
In other words, for size <= 1024 the index is (size+7)/8, or (size+7)>>3 as a bit operation; for size > 1024 it is (size+127+120*128)/128, or (size+127+(120<<7))>>7.
That settles ClassIndex(size_t), but as the code shows, what ClassIndex yields is only an index into class_array_. What class_array_ stores is in turn an index into class_to_size_, whose length is kNumClasses, defined as follows:
#if defined(TCMALLOC_LARGE_PAGES)
static const size_t kPageShift = 15;
static const size_t kNumClasses = 78;
#else
static const size_t kPageShift = 13;
static const size_t kNumClasses = 86;
#endif
class_array_ and class_to_size_ are initialized as follows:
// Initialize the mapping arrays
void SizeMap::Init() {
  // Do some sanity checking on add_amount[]/shift_amount[]/class_array[]
  ...
  // Compute the size classes we want to use
  int sc = 1;   // Next size class to assign
  int alignment = kAlignment;
  CHECK_CONDITION(kAlignment <= 16);
  for (size_t size = kAlignment; size <= kMaxSize; size += alignment) {
    alignment = AlignmentForSize(size);
    ...
    if (sc > 1 && my_pages == class_to_pages_[sc-1]) {
      // See if we can merge this into the previous class without
      // increasing the fragmentation of the previous class.
      const size_t my_objects = (my_pages << kPageShift) / size;
      const size_t prev_objects = (class_to_pages_[sc-1] << kPageShift)
                                  / class_to_size_[sc-1];
      if (my_objects == prev_objects) {
        // Adjust last class to include this size
        class_to_size_[sc-1] = size;
        continue;
      }
    }
    // Add new class
    class_to_pages_[sc] = my_pages;
    class_to_size_[sc] = size;
    sc++;
  }
  ...
  // Initialize the mapping arrays
  int next_size = 0;
  for (int c = 1; c < kNumClasses; c++) {
    const int max_size_in_class = class_to_size_[c];
    for (int s = next_size; s <= max_size_in_class; s += kAlignment) {
      class_array_[ClassIndex(s)] = c;
    }
    next_size = max_size_in_class + kAlignment;
  }
  // Double-check sizes just to be safe
  ...
  // Initialize the num_objects_to_move array.
  ...
}
The function also initializes other members that don't concern us here; the irrelevant code has been elided above to make the initialization of class_array_ and class_to_size_ clearer. class_to_size_ stores the size of each aligned object class, and consecutive classes roughly satisfy class_to_size_[n] = class_to_size_[n-1] + AlignmentForSize(class_to_size_[n-1]). Some class sizes are further adjusted (merged into the previous class), which we set aside for now.
AlignmentForSize is implemented as follows:
static inline int LgFloor(size_t n) {
  int log = 0;
  for (int i = 4; i >= 0; --i) {
    int shift = (1 << i);
    size_t x = n >> shift;
    if (x != 0) {
      n = x;
      log += shift;
    }
  }
  ASSERT(n == 1);
  return log;
}

int AlignmentForSize(size_t size) {
  int alignment = kAlignment;
  if (size > kMaxSize) {
    // Cap alignment at kPageSize for large sizes.
    alignment = kPageSize;
  } else if (size >= 128) {
    // Space wasted due to alignment is at most 1/8, i.e., 12.5%.
    alignment = (1 << LgFloor(size)) / 8;
  } else if (size >= 16) {
    // We need an alignment of at least 16 bytes to satisfy
    // requirements for some SSE types.
    alignment = 16;
  }
  // Maximum alignment allowed is page size alignment.
  if (alignment > kPageSize) {
    alignment = kPageSize;
  }
  CHECK_CONDITION(size < 16 || alignment >= 16);
  CHECK_CONDITION((alignment & (alignment - 1)) == 0);
  return alignment;
}
First, LgFloor: it returns the position of the highest set bit of its argument.
For example, LgFloor(8)=3, LgFloor(9)=3, LgFloor(10)=3, …, LgFloor(16)=4, …
From AlignmentForSize we can derive class_to_size_'s rough classification rules:
sizes in [16,128] are aligned to 16 bytes; sizes in [129,256*1024] are aligned to (2^(n+1)-2^n)/8 = 2^n/8 for n from 7 to 18,
so [129,130,…,256*1024] is mapped to [128+16,128+2*16,…,128+8*16,256+32,256+2*32,…,256+8*32,…]; anything above 256*1024 is page-aligned.
Because the indices computed by ClassIndex are still too dense, class_array_ is needed to map each index to the actual size class it belongs to.
ThreadCache Analysis: Thread-Local Caching
Thread-local caches
tcmalloc uses thread-local storage to create one ThreadCache per thread; all ThreadCaches are chained together in a linked list.
There are two ways to implement a thread-local cache:
1. Static: define a static variable with the __thread keyword.
2. Dynamic: use pthread_key_create, pthread_setspecific, and pthread_getspecific.
The static form is much faster to set and read than the dynamic one, but it has two main drawbacks:
1. A static cache cannot be cleaned up when its thread exits.
2. Not all operating systems support it.
How ThreadCache implements its local cache
tcmalloc uses the dynamic form, while also detecting whether the system supports the static form; if so, it keeps an extra copy in a __thread variable for fast reads.
// If TLS is available, we also store a copy of the per-thread object
// in a __thread variable since __thread variables are faster to read
// than pthread_getspecific(). We still need pthread_setspecific()
// because __thread variables provide no way to run cleanup code when
// a thread is destroyed.
// We also give a hint to the compiler to use the "initial exec" TLS
// model. This is faster than the default TLS model, at the cost that
// you cannot dlopen this library. (To see the difference, look at
// the CPU use of __tls_get_addr with and without this attribute.)
// Since we don't really use dlopen in google code -- and using dlopen
// on a malloc replacement is asking for trouble in any case -- that's
// a good tradeoff for us.
#ifdef HAVE_TLS
static __thread ThreadCache* threadlocal_heap_
# ifdef HAVE___ATTRIBUTE__
__attribute__ ((tls_model ("initial-exec")))
# endif
;
#endif
// Thread-specific key. Initialization here is somewhat tricky
// because some Linux startup code invokes malloc() before it
// is in a good enough state to handle pthread_keycreate().
// Therefore, we use TSD keys only after tsd_inited is set to true.
// Until then, we use a slow path to get the heap object.
static bool tsd_inited_;
static pthread_key_t heap_key_;
Even though the compiler and linker may support TLS, the operating system may not, so a runtime check is required. Currently this is done by manually flagging a set of known-bad OSes:
thread_cache.h
// Even if we have support for thread-local storage in the compiler
// and linker, the OS may not support it. We need to check that at
// runtime. Right now, we have to keep a manual set of "bad" OSes.
#if defined(HAVE_TLS)
extern bool kernel_supports_tls; // defined in thread_cache.cc
void CheckIfKernelSupportsTLS();
inline bool KernelSupportsTLS() {
return kernel_supports_tls;
}
#endif // HAVE_TLS
thread_cache.cc
#if defined(HAVE_TLS)
bool kernel_supports_tls = false;      // be conservative
# if defined(_WIN32)    // windows has supported TLS since winnt, I think.
void CheckIfKernelSupportsTLS() {
  kernel_supports_tls = true;
}
# elif !HAVE_DECL_UNAME    // if too old for uname, probably too old for TLS
void CheckIfKernelSupportsTLS() {
  kernel_supports_tls = false;
}
# else
#  include <sys/utsname.h>    // DECL_UNAME checked for <sys/utsname.h> too
void CheckIfKernelSupportsTLS() {
  struct utsname buf;
  if (uname(&buf) != 0) {   // should be impossible
    Log(kLog, __FILE__, __LINE__,
        "uname failed assuming no TLS support (errno)", errno);
    kernel_supports_tls = false;
  } else if (strcasecmp(buf.sysname, "linux") == 0) {
    // The linux case: the first kernel to support TLS was 2.6.0
    if (buf.release[0] < '2' && buf.release[1] == '.')    // 0.x or 1.x
      kernel_supports_tls = false;
    else if (buf.release[0] == '2' && buf.release[1] == '.' &&
             buf.release[2] >= '0' && buf.release[2] < '6' &&
             buf.release[3] == '.')                       // 2.0 - 2.5
      kernel_supports_tls = false;
    else
      kernel_supports_tls = true;
  } else if (strcasecmp(buf.sysname, "CYGWIN_NT-6.1-WOW64") == 0) {
    // In my testing, this version of cygwin, at least, would hang
    // when using TLS.
    kernel_supports_tls = false;
  } else {    // some other kernel, we'll be optimistic
    kernel_supports_tls = true;
  }
  // TODO(csilvers): VLOG(1) the tls status once we support RAW_VLOG
}
# endif  // HAVE_DECL_UNAME
#endif  // HAVE_TLS
Initializing the Thread-Specific Key
Next, let's look at how each local cache is created. We start with the creation of heap_key_, which happens in InitTSD:
void ThreadCache::InitTSD() {
  ASSERT(!tsd_inited_);
  perftools_pthread_key_create(&heap_key_, DestroyThreadCache);
  tsd_inited_ = true;

#ifdef PTHREADS_CRASHES_IF_RUN_TOO_EARLY
  // We may have used a fake pthread_t for the main thread.  Fix it.
  pthread_t zero;
  memset(&zero, 0, sizeof(zero));
  SpinLockHolder h(Static::pageheap_lock());
  for (ThreadCache* h = thread_heaps_; h != NULL; h = h->next_) {
    if (h->tid_ == zero) {
      h->tid_ = pthread_self();
    }
  }
#endif
}
InitTSD is called from TCMallocGuard's constructor. TCMallocGuard is declared in tcmalloc_guard.h and defined in tcmalloc.cc.
class TCMallocGuard {
 public:
  TCMallocGuard();
  ~TCMallocGuard();
};

// The constructor allocates an object to ensure that initialization
// runs before main(), and therefore we do not have a chance to become
// multi-threaded before initialization.  We also create the TSD key
// here.  Presumably by the time this constructor runs, glibc is in
// good enough shape to handle pthread_key_create().
//
// The constructor also takes the opportunity to tell STL to use
// tcmalloc.  We want to do this early, before construct time, so
// all user STL allocations go through tcmalloc (which works really
// well for STL).
//
// The destructor prints stats when the program exits.
static int tcmallocguard_refcount = 0;  // no lock needed: runs before main()
TCMallocGuard::TCMallocGuard() {
  if (tcmallocguard_refcount++ == 0) {
#ifdef HAVE_TLS    // this is true if the cc/ld/libc combo support TLS
    // Check whether the kernel also supports TLS (needs to happen at runtime)
    tcmalloc::CheckIfKernelSupportsTLS();
#endif
    ReplaceSystemAlloc();    // defined in libc_override_*.h
    tc_free(tc_malloc(1));
    ThreadCache::InitTSD();
    tc_free(tc_malloc(1));
    // Either we, or debugallocation.cc, or valgrind will control memory
    // management.  We register our extension if we're the winner.
#ifdef TCMALLOC_USING_DEBUGALLOCATION
    // Let debugallocation register its extension.
#else
    if (RunningOnValgrind()) {
      // Let Valgrind use its own malloc (so don't register our extension).
    } else {
      MallocExtension::Register(new TCMallocImplementation);
    }
#endif
  }
}

TCMallocGuard::~TCMallocGuard() {
  if (--tcmallocguard_refcount == 0) {
    const char* env = getenv("MALLOCSTATS");
    if (env != NULL) {
      int level = atoi(env);
      if (level < 1) level = 1;
      PrintStats(level);
    }
  }
}

#ifndef WIN32_OVERRIDE_ALLOCATORS
static TCMallocGuard module_enter_exit_hook;
#endif
Creating the Thread-Local Cache and Binding It to the Thread
Next, how each thread's ThreadCache gets created. Look at GetCache, which is called from do_malloc:
inline ThreadCache* ThreadCache::GetCache() {
  ThreadCache* ptr = NULL;
  if (!tsd_inited_) {
    InitModule();
  } else {
    ptr = GetThreadHeap();
  }
  if (ptr == NULL) ptr = CreateCacheIfNecessary();
  return ptr;
}

void ThreadCache::InitModule() {
  SpinLockHolder h(Static::pageheap_lock());
  if (!phinited) {
    Static::InitStaticVars();
    threadcache_allocator.Init();
    phinited = 1;
  }
}
The function first checks tsd_inited_, the flag that InitTSD sets to true. On the very first call to GetCache, tsd_inited_ is certainly false, so InitModule is called; InitModule mainly initializes the allocator's global state. Once tsd_inited_ is true, thread-specific data is usable, and GetThreadHeap looks up the current thread's ThreadCache via heap_key_. If ptr is NULL, CreateCacheIfNecessary is called to create the ThreadCache.
ThreadCache* ThreadCache::CreateCacheIfNecessary() {
  // Initialize per-thread data if necessary
  ThreadCache* heap = NULL;
  {
    SpinLockHolder h(Static::pageheap_lock());
    // On some old glibc's, and on freebsd's libc (as of freebsd 8.1),
    // calling pthread routines (even pthread_self) too early could
    // cause a segfault.  Since we can call pthreads quite early, we
    // have to protect against that in such situations by making a
    // 'fake' pthread.  This is not ideal since it doesn't work well
    // when linking tcmalloc statically with apps that create threads
    // before main, so we only do it if we have to.
#ifdef PTHREADS_CRASHES_IF_RUN_TOO_EARLY
    pthread_t me;
    if (!tsd_inited_) {
      memset(&me, 0, sizeof(me));
    } else {
      me = pthread_self();
    }
#else
    const pthread_t me = pthread_self();
#endif
    // This may be a recursive malloc call from pthread_setspecific()
    // In that case, the heap for this thread has already been created
    // and added to the linked list.  So we search for that first.
    for (ThreadCache* h = thread_heaps_; h != NULL; h = h->next_) {
      if (h->tid_ == me) {
        heap = h;
        break;
      }
    }
    if (heap == NULL) heap = NewHeap(me);
  }
  // We call pthread_setspecific() outside the lock because it may
  // call malloc() recursively.  We check for the recursive call using
  // the "in_setspecific_" flag so that we can avoid calling
  // pthread_setspecific() if we are already inside pthread_setspecific().
  if (!heap->in_setspecific_ && tsd_inited_) {
    heap->in_setspecific_ = true;
    perftools_pthread_setspecific(heap_key_, heap);
#ifdef HAVE_TLS
    // Also keep a copy in __thread for faster retrieval
    threadlocal_heap_ = heap;
#endif
    heap->in_setspecific_ = false;
  }
  return heap;
}
ThreadCache* ThreadCache::NewHeap(pthread_t tid) {
  // Create the heap and add it to the linked list
  ThreadCache* heap = threadcache_allocator.New();
  heap->Init(tid);
  heap->next_ = thread_heaps_;
  heap->prev_ = NULL;
  if (thread_heaps_ != NULL) {
    thread_heaps_->prev_ = heap;
  } else {
    // This is the only thread heap at the moment.
    ASSERT(next_memory_steal_ == NULL);
    next_memory_steal_ = heap;
  }
  thread_heaps_ = heap;
  thread_heap_count_++;
  return heap;
}
CreateCacheIfNecessary creates a ThreadCache object, associates it with the current thread via heap_key_, and adds it to the head of the ThreadCache list. One special case deserves mention: the ThreadCache created by the very first malloc call is not associated with heap_key_; it is only added to the list. The program may call malloc several more times before tsd_inited_ becomes true, entering CreateCacheIfNecessary each time; in those calls the function walks the ThreadCache list and finds the ThreadCache already created for the current thread.
ThreadCache Analysis: Free Lists
Previous posts described how TCMalloc partitions memory, from small to large, into many fixed-size classes and manages the free blocks of each size with a linked list. This post analyzes the free-list implementation inside ThreadCache.
The code of ThreadCache::FreeList is as follows:
class FreeList {
 private:
  void* list_;       // Linked list of nodes

#ifdef _LP64
  // On 64-bit hardware, manipulating 16-bit values may be slightly slow.
  uint32_t length_;      // Current length.
  uint32_t lowater_;     // Low water mark for list length.
  uint32_t max_length_;  // Dynamic max list length based on usage.
  // Tracks the number of times a deallocation has caused
  // length_ > max_length_.  After the kMaxOverages'th time, max_length_
  // shrinks and length_overages_ is reset to zero.
  uint32_t length_overages_;
#else
  // If we aren't using 64-bit pointers then pack these into less space.
  uint16_t length_;
  uint16_t lowater_;
  uint16_t max_length_;
  uint16_t length_overages_;
#endif

 public:
  void Init() {
    list_ = NULL;
    length_ = 0;
    lowater_ = 0;
    max_length_ = 1;
    length_overages_ = 0;
  }

  // Return current length of list
  size_t length() const {
    return length_;
  }

  // Return the maximum length of the list.
  size_t max_length() const {
    return max_length_;
  }

  // Set the maximum length of the list.  If 'new_max' > length(), the
  // client is responsible for removing objects from the list.
  void set_max_length(size_t new_max) {
    max_length_ = new_max;
  }

  // Return the number of times that length() has gone over max_length().
  size_t length_overages() const {
    return length_overages_;
  }

  void set_length_overages(size_t new_count) {
    length_overages_ = new_count;
  }

  // Is list empty?
  bool empty() const {
    return list_ == NULL;
  }

  // Low-water mark management
  int lowwatermark() const { return lowater_; }
  void clear_lowwatermark() { lowater_ = length_; }

  void Push(void* ptr) {
    SLL_Push(&list_, ptr);
    length_++;
  }

  void* Pop() {
    ASSERT(list_ != NULL);
    length_--;
    if (length_ < lowater_) lowater_ = length_;
    return SLL_Pop(&list_);
  }

  void* Next() {
    return SLL_Next(&list_);
  }

  void PushRange(int N, void* start, void* end) {
    SLL_PushRange(&list_, start, end);
    length_ += N;
  }

  void PopRange(int N, void** start, void** end) {
    SLL_PopRange(&list_, N, start, end);
    ASSERT(length_ >= N);
    length_ -= N;
    if (length_ < lowater_) lowater_ = length_;
  }
};
FreeList's member list_ points to the first free block and serves as the head of the list. All list operations are implemented by the SLL_* functions, whose interface is:
void *SLL_Next(void *t);   // return the free block after t (t is itself a free block)
void SLL_SetNext(void *t, void *n);   // make free block n the successor of t
void SLL_Push(void **list, void *element);   // prepend free block element to the list
void *SLL_Pop(void **list);   // remove and return the first free block
void SLL_PopRange(void **head, int N, void **start, void **end);   // remove the first N free blocks
void SLL_PushRange(void **head, void *start, void *end);   // prepend the chain from start to end
size_t SLL_Size(void *head);   // count the blocks in the list
The SLL list stores the address of each block's successor inside the free memory itself: the first 4 bytes (32-bit) or 8 bytes (64-bit) of a free block hold the next pointer.
The implementations are in linked_list.h:
inline void *SLL_Next(void *t) {
  // Reinterpret the block itself as a void** pointing at a void* (the next
  // free block), then dereference it to obtain the next block's address.
  return *(reinterpret_cast<void**>(t));
}

inline void SLL_SetNext(void *t, void *n) {
  *(reinterpret_cast<void**>(t)) = n;
}

inline void SLL_Push(void **list, void *element) {
  SLL_SetNext(element, *list);
  *list = element;
}

inline void *SLL_Pop(void **list) {
  void *result = *list;
  *list = SLL_Next(*list);
  return result;
}

// Remove N elements from a linked list to which head points.  head will be
// modified to point to the new head.  start and end will point to the first
// and last nodes of the range.  Note that end will point to NULL after this
// function is called.
inline void SLL_PopRange(void **head, int N, void **start, void **end) {
  if (N == 0) {
    *start = NULL;
    *end = NULL;
    return;
  }
  void *tmp = *head;
  for (int i = 1; i < N; ++i) {
    tmp = SLL_Next(tmp);
  }
  *start = *head;
  *end = tmp;
  *head = SLL_Next(tmp);
  // Unlink range from list.
  SLL_SetNext(tmp, NULL);
}

inline void SLL_PushRange(void **head, void *start, void *end) {
  if (!start) return;
  SLL_SetNext(end, *head);
  *head = start;
}

inline size_t SLL_Size(void *head) {
  int count = 0;
  while (head) {
    count++;
    head = SLL_Next(head);
  }
  return count;
}
Summary 1:
malloc's entry point is tc_malloc.
free's entry point is tc_free.
The tc_new_mode setting decides whether a failed allocation throws an exception.
Memory up to kMaxSize is allocated from the thread-local cache; larger requests go to the central page heap.
Summary 2:
- The thread-local cache aligns sizes in [0, 256*1024] by these rules:
size < 16: 8-byte alignment, producing classes [8, 16]
size in [16, 128): 16-byte alignment, producing [32, 48, …, 128]
size in [128, 256*1024): alignment of (2^(n+1)-2^n)/8, producing [128+16, 128+2*16, …, 128+8*16, 256+32, 256+2*32, …, 256+8*32, …]
- class_to_size_ stores all the resulting aligned sizes.
- class_array_ stores the mapping from ClassIndex(size) to indices of class_to_size_.
- ClassIndex(size) computes an index into class_array_ with simple alignment arithmetic.
Summary 3:
- Thread-local data can be implemented either statically or dynamically.
- tcmalloc relies primarily on dynamic thread-local data, with the static form as a fast-path supplement.
- The Thread-Specific Key is created from the constructor of a global static object.
- A thread's first call to GetCache triggers creation of its own ThreadCache, which is associated with pthread_key_ and added to the ThreadCache list.
Reposted from:
TCMalloc Source Reading
TCMalloc Source Reading (2): Thread-Local Cache ClassSize Analysis
TCMalloc Source Reading (3): ThreadCache and Thread-Local Caching
TCMalloc Source Reading (4): ThreadCache Free Lists