Understanding Go in Depth: Memory Allocation

Article link: https://blog.csdn.net/cyq6239075/article/details/106182129

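Go ships with its own runtime, which abandons the traditional way of allocating memory and manages memory itself. The allocator was originally based on tcmalloc, although it has diverged considerably since then. Managing memory itself lets the runtime use better allocation patterns, such as memory pools and preallocation, and so avoid the performance cost of frequent system calls.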

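Overview of the memory architecture:

The core ideas behind Go's memory allocation are:

(1) The runtime requests memory from the operating system one large chunk at a time. The mheap structure manages this memory in units of spans and serves all other allocation requests from it, which reduces the number of system calls.

(2) Allocation follows the TCMalloc algorithm. Its core idea is to cut memory into very small pieces and manage them at several levels (mheap, span, mcentral, object, tiny, mcache) so that locking becomes finer-grained.

(3) When an object's memory is reclaimed, it is not actually released; it is put back into the preallocated chunk for reuse. Only when too much memory sits idle does the runtime try to return part of it to the operating system, which lowers the overall overhead.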
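Point (3) can be observed from user code. The following is a minimal sketch, not a benchmark: it uses the `HeapIdle` and `HeapReleased` fields of `runtime.MemStats` and `debug.FreeOSMemory` from the standard library. The helper name `heapStats` and the 64MB buffer size are assumptions made for this illustration, and the numbers printed will differ between Go versions and operating systems.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapStats returns the idle heap bytes and the bytes already returned
// to the OS, as reported by the runtime. (Helper name is illustrative.)
func heapStats() (idle, released uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapIdle, m.HeapReleased
}

func main() {
	// Allocate about 64MB, touch it, then drop the only reference.
	buf := make([]byte, 64<<20)
	buf[0] = 1
	buf = nil

	runtime.GC() // the freed spans go back to the heap, not necessarily to the OS
	idle, released := heapStats()
	fmt.Printf("after GC:           idle=%dMB released=%dMB\n", idle>>20, released>>20)

	debug.FreeOSMemory() // force a GC and return as much memory to the OS as possible
	idle, released = heapStats()
	fmt.Printf("after FreeOSMemory: idle=%dMB released=%dMB\n", idle>>20, released>>20)
}
```

The gap between idle and released memory is roughly the memory the runtime is holding on to for reuse.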

[Figure: overall structure of the allocator]

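When a Go program starts, it reserves one contiguous block of (virtual) memory. On a 64-bit system the total reservation is 512MB (spans) + 16GB (bitmap) + 512GB (arena).

[Figure: layout of the spans, bitmap, and arena regions]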

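Go 1.11 and later switched to a sparse index to manage the heap. The heap can exceed 512GB, and the address space no longer has to be contiguous as it grows. The global mheap struct has a two-level array named arenas. On linux/amd64 the first level has a single slot and the second level has 4M slots. Each slot points to a heapArena struct, and each heapArena manages 64MB of memory. Newer versions of Go can therefore manage 4M * 64MB = 256TB, which is the entire 48-bit address space available on current 64-bit machines.

[Figure: the two-level arenas index in mheap]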
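The arithmetic behind the sparse index can be sketched in a few lines. This is a simplified model that uses only the numbers quoted above (1 first-level slot, 4M second-level slots, 64MB per heapArena). The function names are chosen for this example, and the sketch ignores any base offset the real runtime may apply to addresses before indexing.

```go
package main

import "fmt"

const (
	heapArenaBytes = 64 << 20 // each heapArena covers 64MB
	arenaL1Entries = 1        // first level: a single slot on linux/amd64
	arenaL2Entries = 4 << 20  // second level: 4M slots
)

// arenaIndex returns the first-level and second-level slots that cover addr.
func arenaIndex(addr uint64) (l1, l2 uint64) {
	i := addr / heapArenaBytes
	return i / arenaL2Entries, i % arenaL2Entries
}

// addressable returns how many bytes the two-level index can describe.
func addressable() uint64 {
	return uint64(arenaL1Entries) * arenaL2Entries * heapArenaBytes
}

func main() {
	addr := uint64(0xc000010000) // an example address, picked for illustration
	l1, l2 := arenaIndex(addr)
	fmt.Printf("%#x -> L1 slot %d, L2 slot %d\n", addr, l1, l2)
	fmt.Printf("addressable: %dTB\n", addressable()>>40)
}
```

Because only the slots that are actually in use need a heapArena behind them, the heap can grow in 64MB steps anywhere in the address space.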

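[Figure: relationship between heapArena, heap, and mspan]

The rest of this article still describes Go 1.10. The structures have changed since then, but the basic flow has not.

Virtual memory is the logically contiguous memory that a program sees; it is not physical memory. A mapping table (the page table) translates between virtual and physical addresses. Physical memory is consumed only when the program actually writes to the memory.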

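The spans region stores pointers to mspan structs. Every page in the arena takes one pointer-sized slot in spans, and a page is 8KB. The spans region records which mspan each page of the arena belongs to. Because the arena is 512GB, the size of spans is (512GB / 8KB) * 8B = 512MB.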

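The mspan is the most basic unit of Go memory management. An mspan consists of one or more contiguous pages, and different page counts form different classes. There are 67 classes in total, and each class provides objects of one fixed size. Why are so many kinds of mspan needed? Go allocates memory on demand, and the objects a program requests vary in size. Choosing an mspan that fits the requested size reduces fragmentation and saves memory.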
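To make the idea of size classes concrete, here is a small sketch of rounding a request up to the nearest class. The table is only an excerpt covering the first few classes and should be treated as illustrative; the real runtime table has 67 entries. The name `sizeClassFor` is made up for this example.

```go
package main

import "fmt"

// classSizes is an illustrative excerpt of the size-class table.
// The index is the class ID; class 0 is not used for small objects.
var classSizes = []uintptr{0, 8, 16, 32, 48, 64, 80, 96, 112, 128}

// sizeClassFor returns the smallest class whose objects can hold size bytes,
// or ok=false when size is larger than anything in this excerpt.
func sizeClassFor(size uintptr) (class int, objSize uintptr, ok bool) {
	for c := 1; c < len(classSizes); c++ {
		if size <= classSizes[c] {
			return c, classSizes[c], true
		}
	}
	return 0, 0, false
}

func main() {
	for _, size := range []uintptr{1, 20, 33, 100, 200} {
		if c, obj, ok := sizeClassFor(size); ok {
			fmt.Printf("size %3d -> class %d (%d-byte objects, %d bytes unused)\n", size, c, obj, obj-size)
		} else {
			fmt.Printf("size %3d -> beyond this excerpt\n", size)
		}
	}
}
```

The unused bytes per object are the price paid for keeping every object in a span the same size; a finer table means less waste per object but more classes to manage.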

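The bitmap records, for every pointer-sized word in the arena, whether it is in use and whether the GC has scanned it, so each word needs two bits. The bitmap size is therefore (512GB / 8B) * 2 bits = 128Gbit = 16GB. The bitmap mainly serves the GC.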
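The two size calculations, for spans and for the bitmap, are easy to check in code. This is plain arithmetic under the figures used in the text (512GB arena, 8KB pages, 8-byte pointers, 2 bitmap bits per word); the constant and function names are chosen for this example.

```go
package main

import "fmt"

const (
	arenaBytes  = 512 << 30 // 512GB arena
	pageBytes   = 8 << 10   // 8KB per page
	ptrBytes    = 8         // pointer size on a 64-bit system
	bitsPerWord = 2         // bitmap bits per pointer-sized word
)

// spansBytes is one pointer for every page of the arena.
func spansBytes() uint64 { return arenaBytes / pageBytes * ptrBytes }

// bitmapBytes is two bits for every pointer-sized word, converted from bits to bytes.
func bitmapBytes() uint64 { return arenaBytes / ptrBytes * bitsPerWord / 8 }

func main() {
	fmt.Printf("spans:  %dMB\n", spansBytes()>>20)
	fmt.Printf("bitmap: %dGB\n", bitmapBytes()>>30)
}
```

Note the final division by 8 in the bitmap calculation: the two bits per word have to be converted to bytes before the result can be compared with the other region sizes.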

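The bitmap grows from its tail toward its head, while spans and the arena both grow from head to tail.

[Figure: what each bit in the bitmap represents]

The arena is made up of pages, each 8KB in size. It contains the basic management units and the objects created while the program runs; these two parts are described by spans and bitmap respectively, two regions that live outside the heap.

Both spans and bitmap are resized dynamically as the arena changes.

[Figure: spans and bitmap growing along with the arena]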

Main structs

mheap

type mheap struct {
    lock      mutex
    free      [_MaxMHeapList]mSpanList // free lists of given length up to _MaxMHeapList
    freelarge mTreap                   // free treap of length >= _MaxMHeapList
    busy      [_MaxMHeapList]mSpanList // busy lists of large spans of given length
    busylarge mSpanList                // busy lists of large spans length >= _MaxMHeapList
    sweepgen  uint32                   // sweep generation, see comment in mspan
    sweepdone uint32                   // all spans are swept
    sweepers  uint32                   // number of active sweepone calls
    allspans []*mspan // all spans out there
    spans []*mspan
   
    sweepSpans [2]gcSweepBuf

    _ uint32 // align uint64 fields on 32-bit for atomics
    pagesInUse         uint64  // pages of spans in stats _MSpanInUse; R/W with mheap.lock
    pagesSwept         uint64  // pages swept this cycle; updated atomically
    pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
    sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
    sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
    // TODO(austin): pagesInUse should be a uintptr, but the 386
    // compiler can't 8-byte align fields.

    // Malloc stats.
    largealloc  uint64                  // bytes allocated for large objects
    nlargealloc uint64                  // number of large object allocations
    largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
    nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
    nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)

    // range of addresses we might see in the heap
    bitmap        uintptr // Points to one byte past the end of the bitmap
    bitmap_mapped uintptr
    arena_start uintptr
    arena_used  uintptr // Set with setArenaUsed.
    arena_alloc uintptr
    arena_end   uintptr
    arena_reserved bool

    _ uint32 // ensure 64-bit alignment

    central [numSpanClasses]struct {
        mcentral mcentral
        pad      [sys.CacheLineSize - unsafe.Sizeof(mcentral{})%sys.CacheLineSize]byte
    }

    spanalloc             fixalloc // allocator for span*
    cachealloc            fixalloc // allocator for mcache*
    treapalloc            fixalloc // allocator for treapNodes* used by large objects
    specialfinalizeralloc fixalloc // allocator for specialfinalizer*
    specialprofilealloc   fixalloc // allocator for specialprofile*
    speciallock           mutex    // lock for special record allocators.

    unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
}

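mheap can be thought of as the entire heap held by the program. There is exactly one mheap, so it is effectively a global variable.
mheap owns 134 mcentrals as well as every allocated mspan.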

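As described above, an object is allocated from an mspan that fits its size. Go's memory manager divides mspans into 67 classes, each corresponding to one object size. Handing out mspans is the job of the mcentral structure, and each kind of mspan has its own mcentral. mheap contains 134 mcentrals because every class is further split in two, one for objects that contain pointers and one for objects that do not, which gives 67 * 2 = 134 in total. The split saves the GC a great deal of scanning time.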
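One way to picture the 67 * 2 = 134 split is to pack the size class and a "no pointers" flag into a single small integer, which can then index the array of mcentrals. The sketch below is a standalone re-creation modeled on the `spanClass` type that appears in the struct listings; it is meant as an illustration of the encoding, not a copy of the runtime source.

```go
package main

import "fmt"

const (
	numSizeClasses = 67
	numSpanClasses = numSizeClasses << 1 // 134: every size class with and without pointers
)

// spanClass packs a size class and a noscan flag into one byte.
type spanClass uint8

// makeSpanClass stores the size class in the upper bits and noscan in bit 0.
func makeSpanClass(sizeclass uint8, noscan bool) spanClass {
	sc := spanClass(sizeclass << 1)
	if noscan {
		sc |= 1
	}
	return sc
}

func (sc spanClass) sizeclass() uint8 { return uint8(sc >> 1) }
func (sc spanClass) noscan() bool     { return sc&1 != 0 }

func main() {
	fmt.Println("span classes:", numSpanClasses)
	for _, noscan := range []bool{false, true} {
		sc := makeSpanClass(2, noscan)
		fmt.Printf("sizeclass=%d noscan=%-5v -> spanClass %d\n", sc.sizeclass(), sc.noscan(), sc)
	}
}
```

Spans whose class has the noscan bit set hold only pointer-free objects, so the GC never needs to look inside them.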

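When allocating, the requested size is first aligned, a suitable mspan is found by best fit, and any memory left over is split into mspans of other sizes for further use. When a Go program creates an object with new (ignoring escape analysis), the allocation strategy depends on the object's size:

(1) Tiny objects (size < 16 bytes) are allocated directly from the tiny block cached in the current P's mcache;

(2) Small objects (16 bytes <= size <= 32KB) are allocated from the free list for the matching object size in the current P's mcache; if there is no free list, the mcache requests more from mcentral (and from mheap if mcentral has none either);

(3) Large objects (size > 32KB) are requested directly from mheap.

A request for a large object goes straight to mheap, but it is issued through an mcache, and the mcache belongs to a P. A running program usually has several Ps, so these requests arrive concurrently, and a global lock is required to keep them thread-safe.

If mheap has no memory left for a request either, it asks the operating system for a new run of pages (at least 1MB).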

mspan

type mspan struct {
    next *mspan     // next span in list, or nil if none
    prev *mspan     // previous span in list, or nil if none
    list *mSpanList // For debugging. TODO: Remove.

    startAddr uintptr // address of first byte of span aka s.base()
    npages    uintptr // number of pages in span

    manualFreeList gclinkptr // list of free objects in _MSpanManual spans
    freeindex uintptr
  
    allocCache uint64
    allocBits  *gcBits
    gcmarkBits *gcBits

    sweepgen    uint32
    divMul      uint16     // for divide by elemsize - divMagic.mul
    baseMask    uint16     // if non-0, elemsize is a power of 2, & this will get object allocation base
    allocCount  uint16     // number of allocated objects
    spanclass   spanClass  // size class and noscan (uint8)
    incache     bool       // being used by an mcache
    state       mSpanState // mspaninuse etc
    needzero    uint8      // needs to be zeroed before allocation
    divShift    uint8      // for divide by elemsize - divMagic.shift
    divShift2   uint8      // for divide by elemsize - divMagic.shift2
    elemsize    uintptr    // computed from sizeclass or from npages
    unusedsince int64      // first time spotted by gc in mspanfree state
    npreleased  uintptr    // number of pages released to the os
    limit       uintptr    // end of data in span
    speciallock mutex      // guards specials list
    specials    *special   // linked list of special records sorted by offset.
}

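An mspan records the start address of its pages in the arena, its class, its page count, its doubly linked list pointers, and so on. This information is enough to locate the span inside the arena.

[Figure: an mspan and the arena pages it describes]

The memory for the mspan struct itself is allocated from the system and is not part of Go's managed heap. As explained in the discussion of spans above, an mspan is a block of memory organized so that allocation by object size is convenient, and there are 67 kinds. Its main purpose is to reduce memory fragmentation and so improve memory utilization.

object: the memory that stores one variable's data. When a span is initialized, it is cut into a series of equally sized objects. If the object size is 16B and the span size is 8KB, the page in the span is initialized as 8KB / 16B = 512 objects. Allocating memory simply means handing out one object.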

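A span has a freeindex that marks where the search for the next free object should start. freeindex advances after each allocation.
Every element before freeindex is allocated; elements after freeindex may or may not be allocated.

Each GC cycle may reclaim some of a span's elements. allocBits marks which elements are allocated and which are free.
With freeindex + allocBits the allocator can skip the allocated elements and place the new object in a free one.
Reading allocBits on every allocation would be slow, however, so the span keeps an integer allocCache that caches the bitmap starting at freeindex and uses a de Bruijn sequence to find an unused object.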
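The interplay of freeindex, allocBits, and the cached bitmap can be sketched with a toy span of at most 64 objects. This is a simplified model, not the runtime's code: the type `toySpan` is invented for the example, and `bits.TrailingZeros64` from the standard library stands in for the runtime's own count-trailing-zeros routine.

```go
package main

import (
	"fmt"
	"math/bits"
)

// toySpan models a span that holds at most 64 objects.
type toySpan struct {
	nelems    uint   // number of objects in the span
	freeindex uint   // every object before this index is allocated
	allocBits uint64 // bit i set means object i is allocated
}

// nextFree finds the next free object at or after freeindex, marks it
// allocated, and returns its index. ok is false when the span is full.
func (s *toySpan) nextFree() (idx uint, ok bool) {
	if s.freeindex >= s.nelems {
		return 0, false
	}
	// Invert the bits so that 1 means "free", and shift so that bit 0
	// corresponds to freeindex. This plays the role of allocCache.
	cache := ^s.allocBits >> s.freeindex
	idx = s.freeindex + uint(bits.TrailingZeros64(cache))
	if idx >= s.nelems {
		s.freeindex = s.nelems
		return 0, false
	}
	s.allocBits |= 1 << idx
	s.freeindex = idx + 1
	return idx, true
}

func main() {
	// Objects 0, 1 and 3 survived the last GC; the rest are free.
	s := &toySpan{nelems: 8, allocBits: 0x0B}
	for {
		idx, ok := s.nextFree()
		if !ok {
			fmt.Println("span is full")
			return
		}
		fmt.Println("allocated object", idx)
	}
}
```

Counting trailing zeros on the inverted bitmap finds the next free slot in one step, which is why caching a 64-bit window of the bitmap pays off.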

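gcmarkBits marks which objects are alive during a GC cycle; after each GC, gcmarkBits becomes allocBits.
Note that the memory for the span struct itself is allocated from the system; the spans region and the bitmap region mentioned above are only indexes.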
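The hand-over from gcmarkBits to allocBits can be modeled the same way. In this sketch (again a toy with at most 64 objects, with type and method names invented for the example), marking sets a bit for each live object, and sweeping replaces the allocation bitmap with the mark bitmap, so every unmarked object becomes free without being touched individually.

```go
package main

import (
	"fmt"
	"math/bits"
)

// toySpan models the two bitmaps of a span with at most 64 objects.
type toySpan struct {
	freeindex  uint
	allocBits  uint64 // bit i set means object i is allocated
	gcmarkBits uint64 // bit i set means object i was found alive by the GC
}

// mark records that object idx is still reachable.
func (s *toySpan) mark(idx uint) { s.gcmarkBits |= 1 << idx }

// sweep turns the mark bits into the new alloc bits and returns how many
// objects were freed. The search for free objects restarts at index 0.
func (s *toySpan) sweep() (freed int) {
	freed = bits.OnesCount64(s.allocBits &^ s.gcmarkBits)
	s.allocBits = s.gcmarkBits
	s.gcmarkBits = 0
	s.freeindex = 0
	return freed
}

func main() {
	s := &toySpan{allocBits: 0x0F, freeindex: 4} // objects 0..3 are allocated
	s.mark(0)
	s.mark(3)
	fmt.Println("freed objects:", s.sweep())
	fmt.Printf("allocBits after sweep: %04b\n", s.allocBits)
}
```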

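The allocation logic depends on the size of the object being allocated:

(0, 16B) and no pointers: tiny allocation

(0, 16B) with pointers: normal allocation

[16B, 32KB]: normal allocation

(32KB, -): large-object allocation

Tiny allocation and large-object allocation are both optimizations within the memory manager; for now, the focus is on the normal allocation path only.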
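The size ranges above translate directly into a small decision function. This is a sketch of the rule as stated in the list, with names (`allocPath`, `maxTinySize`, `maxSmallSize`) chosen for the example.

```go
package main

import "fmt"

const (
	maxTinySize  = 16       // objects below this size may use the tiny path
	maxSmallSize = 32 << 10 // objects above this size take the large-object path
)

// allocPath names the allocation path for an object of the given size.
func allocPath(size uintptr, hasPointers bool) string {
	switch {
	case size < maxTinySize && !hasPointers:
		return "tiny"
	case size <= maxSmallSize:
		return "normal"
	default:
		return "large"
	}
}

func main() {
	fmt.Println(allocPath(8, false))     // below 16B, no pointers
	fmt.Println(allocPath(8, true))      // below 16B, with pointers
	fmt.Println(allocPath(4096, true))   // within [16B, 32KB]
	fmt.Println(allocPath(64<<10, true)) // above 32KB
}
```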

Tiny

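The spanclass field of the mspan struct marks which class the span belongs to. A span with sizeclass=1 serves objects of 8B or less, so common small values such as int32, byte, bool, and short strings would all use sizeclass=1 spans, yet most of the 8B given to each of them would go unused. Because these types are used so often, the result would be a large amount of internal fragmentation. Go therefore avoids sizeclass=1 spans where it can and treats every object smaller than 16B as a tiny object (tinysize). On allocation, it takes one 16B object from a sizeclass=2 span. If the stored object is smaller than 16B, the remaining space is kept for later (in the mcache.tiny field) and reused by the following allocations until the 16B object is used up.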
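The way several tiny objects share one 16B block can be sketched as a bump allocator with alignment. This is a simplified model: the type `tinyAllocator` is invented for the example, it tracks block numbers instead of addresses, and it always starts a fresh block when an object does not fit, which is an assumption made to keep the sketch short.

```go
package main

import "fmt"

const tinyBlockSize = 16

// tinyAllocator plays the role of the tiny/tinyoffset pair in mcache.
type tinyAllocator struct {
	blocks int     // number of 16-byte blocks taken so far
	offset uintptr // first unused byte in the current block
}

// alignUp rounds n up to a multiple of align (align must be a power of two).
func alignUp(n, align uintptr) uintptr { return (n + align - 1) &^ (align - 1) }

// alloc places an object of size bytes (1 to 15) and returns the block
// number and the offset inside that block.
func (a *tinyAllocator) alloc(size uintptr) (block int, off uintptr) {
	align := uintptr(1)
	switch {
	case size&7 == 0:
		align = 8
	case size&3 == 0:
		align = 4
	case size&1 == 0:
		align = 2
	}
	off = alignUp(a.offset, align)
	if a.blocks == 0 || off+size > tinyBlockSize {
		a.blocks++ // the object does not fit: take a new 16-byte block
		off = 0
	}
	a.offset = off + size
	return a.blocks, off
}

func main() {
	var a tinyAllocator
	for _, size := range []uintptr{1, 4, 8, 2, 8} {
		block, off := a.alloc(size)
		fmt.Printf("size %d -> block %d, offset %d\n", size, block, off)
	}
}
```

In this run the first three objects (1, 4, and 8 bytes) share a single 16B block, where separate 8B slots would have used 24B.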

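Taking a request for n bytes as an example, allocation proceeds as follows:

Get the current thread's private cache, mcache
Compute the ID of the class that fits the size
Look for a usable span in the mcache's alloc[class] list
If mcache has no usable span, request a new span from mcentral and add it to mcache
If mcentral has no usable span either, request a new span from mheap and add it to mcentral
Take the address of a free object from that span and return it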
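The steps above can be sketched as a chain of fallbacks. Everything here is a toy: the types `span`, `heap`, `central`, and `cache` only count free objects, locking is left out, and a span obtained from the heap is passed straight through to the cache instead of first being added to the central list.

```go
package main

import "fmt"

// span is a toy span that only counts its remaining free objects.
type span struct{ free int }

// heap stands in for mheap; in this sketch it never runs out of memory.
type heap struct{}

// allocSpan returns a fresh span with 2 objects, to keep the demo short.
func (h *heap) allocSpan() *span { return &span{free: 2} }

// central stands in for the mcentral of one class.
type central struct {
	nonempty []*span
	heap     *heap
}

// cacheSpan hands out a span with free objects and reports where it came from.
func (c *central) cacheSpan() (*span, string) {
	if len(c.nonempty) == 0 {
		return c.heap.allocSpan(), "mheap"
	}
	s := c.nonempty[0]
	c.nonempty = c.nonempty[1:]
	return s, "mcentral"
}

// cache stands in for mcache: one current span per class.
type cache struct {
	alloc    map[int]*span
	centrals map[int]*central
}

// allocate takes one object of the given class and returns the level that supplied it.
func (m *cache) allocate(class int) string {
	from := "mcache"
	s := m.alloc[class]
	if s == nil || s.free == 0 {
		s, from = m.centrals[class].cacheSpan()
		m.alloc[class] = s
	}
	s.free--
	return from
}

func main() {
	m := &cache{
		alloc:    map[int]*span{},
		centrals: map[int]*central{2: {nonempty: []*span{{free: 2}}, heap: &heap{}}},
	}
	for i := 1; i <= 5; i++ {
		fmt.Printf("allocation %d served via %s\n", i, m.allocate(2))
	}
}
```

Only the allocations that miss in the cache reach the shared levels, which is the point of the hierarchy: the common case touches no shared state.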

mcentral

type mcentral struct {
    lock      mutex
    spanclass spanClass
    nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
    empty     mSpanList // list of spans with no free objects (or cached in an mcache)

    // nmalloc is the cumulative count of objects allocated from
    // this mcentral, assuming all spans in mcaches are
    // fully-allocated. Written atomically, read under STW.
    nmalloc uint64
}

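With the span as the basic unit of memory management, another data structure is needed to manage the spans themselves. That structure is mcentral. Threads that need memory obtain it from the spans that mcentral manages. mcentral has a key method, cacheSpan(), which is the core algorithm of the whole allocation process. When a thread frees memory, the span is returned to mcentral.

Each mcentral contains two lists of mspans:

(1) the list of mspans that have no free objects or are already cached by an mcache (empty mSpanList)

(2) the list of mspans that have free objects (nonempty mSpanList)

Because these mspans are global and can be accessed by every mcache, concurrency becomes an issue, so each mcentral has a lock.

If mcentral has no mspan with free objects when memory is needed, it has to obtain one from mheap.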
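The two lists and the lock can be modeled with a mutex and two slices. This is a sketch under simplifying assumptions: the types are toys, the method names follow the cacheSpan() mentioned in the text but the bodies are written for this example, and a nil result stands for "ask mheap".

```go
package main

import (
	"fmt"
	"sync"
)

// span is a toy span with an ID and a count of free objects.
type span struct{ id, free int }

// central models one mcentral: a lock plus the two span lists.
type central struct {
	mu       sync.Mutex
	nonempty []*span // spans with at least one free object
	empty    []*span // spans with no free objects, or handed to a cache
}

// cacheSpan moves a span from nonempty to empty and hands it to the caller.
// It returns nil when a span would have to come from mheap.
func (c *central) cacheSpan() *span {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.nonempty) == 0 {
		return nil
	}
	s := c.nonempty[0]
	c.nonempty = c.nonempty[1:]
	c.empty = append(c.empty, s)
	return s
}

// uncacheSpan takes a span back from a cache; it returns to nonempty
// only if it still has free objects.
func (c *central) uncacheSpan(s *span) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s.free == 0 {
		return // stays on the empty list
	}
	for i, e := range c.empty {
		if e == s {
			c.empty = append(c.empty[:i], c.empty[i+1:]...)
			break
		}
	}
	c.nonempty = append(c.nonempty, s)
}

func main() {
	c := &central{nonempty: []*span{{id: 1, free: 4}, {id: 2, free: 4}}}
	got := make([]*span, 3)
	var wg sync.WaitGroup
	for i := range got {
		wg.Add(1)
		go func(i int) { // three caches ask for a span at the same time
			defer wg.Done()
			got[i] = c.cacheSpan()
		}(i)
	}
	wg.Wait()
	handed := 0
	for _, s := range got {
		if s != nil {
			handed++
		}
	}
	fmt.Printf("%d spans handed out, %d request(s) must go to mheap\n", handed, len(got)-handed)
}
```

Since each class has its own mcentral and therefore its own lock, allocations of different sizes do not contend with each other even when they miss in the mcache.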

[Figure: relationship between mheap, mspan, and mcentral]

mcache

type mcache struct {
    // The following members are accessed on every malloc,
    // so they are grouped here for better caching.
    next_sample int32   // trigger heap sample after allocating this many bytes
    local_scan  uintptr // bytes of scannable heap allocated

    // Allocator cache for tiny objects w/o pointers.
    // See "Tiny allocator" comment in malloc.go.

    // tiny points to the beginning of the current tiny block, or
    // nil if there is no current tiny block.
    //
    // tiny is a heap pointer. Since mcache is in non-GC'd memory,
    // we handle it by clearing it in releaseAll during mark
    // termination.
    tiny             uintptr
    tinyoffset       uintptr
    local_tinyallocs uintptr // number of tiny allocs not counted in other stats

    // The rest is not accessed on every malloc.

    alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass

    stackcache [_NumStackOrders]stackfreelist

    // Local allocator stats, flushed during GC.
    local_nlookup    uintptr                  // number of pointer lookups
    local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
    local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
    local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
}

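To avoid taking a lock every time multiple threads request memory, the Go runtime gives each P its own cache of spans, called mcache. A goroutine allocates through the mcache of the P it is running on, so goroutines on different Ps do not contend for a lock when they allocate.
A brief look at what G, P, and M mean in the G-P-M model: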

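G: a goroutine. Each goroutine corresponds to one G struct, which stores the goroutine's stack, its state, and its task function, and which can be reused. A G is not an execution unit by itself; it must be bound to a P before it can be scheduled.

P: Processor, a logical processor. From a G's point of view, a P is like a CPU core: a G can be scheduled only when it is bound to a P (that is, it sits in the P's local runq). From an M's point of view, a P provides the execution context, such as allocation state (mcache) and the task queue (of Gs). The number of Ps determines the maximum number of Gs that can run in parallel (provided the number of physical CPU cores >= the number of Ps). The number of Ps is set by the user through GOMAXPROCS. In older Go releases it was capped at 256 no matter how large GOMAXPROCS was set; the cap was raised to 1024 in Go 1.9 and removed in Go 1.10.

M: Machine, an abstraction of an OS thread and the resource that actually performs computation. Once bound to a valid P, an M enters the schedule loop. Roughly, the loop takes a G from the global queue, the P's local queue, or the wait queues, switches to the G's stack and runs its function, calls goexit to clean up, returns to the M, and repeats. An M does not keep any G state, which is what allows a G to be scheduled across Ms. The number of Ms is not fixed and is adjusted by the Go runtime. To keep the system from being overwhelmed by too many OS threads, the default limit is currently 10000.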
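Both limits can be inspected from a running program with standard library calls: `runtime.GOMAXPROCS(0)` reports the number of Ps without changing it, and `debug.SetMaxThreads` returns the previous cap on OS threads. The helper name `schedulerLimits` and the temporary value 20000 are assumptions made for this example; the helper restores the original thread limit before returning.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// schedulerLimits reports the current number of Ps and the current cap on Ms.
func schedulerLimits() (procs, maxThreads int) {
	procs = runtime.GOMAXPROCS(0) // an argument below 1 only queries the value

	// SetMaxThreads returns the previous limit, so set a temporary value
	// and then put the previous limit back.
	maxThreads = debug.SetMaxThreads(20000)
	debug.SetMaxThreads(maxThreads)
	return procs, maxThreads
}

func main() {
	procs, maxThreads := schedulerLimits()
	fmt.Println("logical CPUs:   ", runtime.NumCPU())
	fmt.Println("Ps (GOMAXPROCS):", procs)
	fmt.Println("limit on Ms:    ", maxThreads)
}
```

Since every P carries its own mcache, the number of Ps reported here is also the number of per-P allocation caches in the process.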