Linux Kernel Architecture in Depth: The Slab Allocator (worth bookmarking)

Linux加油站

Article link: https://blog.csdn.net/m0_74282605/article/details/127819805

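This article takes a close look at the slab allocator in the Linux kernel, the key mechanism for managing small blocks of kernel memory. Memory is allocated and released mainly through kmalloc and kfree, and dedicated caches are created with kmem_cache_create. A slab cache consists of a kmem_cache and its slabs, managed through a per-CPU array and per-node slab lists. Allocation involves the initialization of kmalloc_caches and kmem_cache_node as well as the creation, allocation and release of slabs. Slab colouring improves CPU cache utilization by offsetting object addresses so that they are spread more evenly across cache lines.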

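Introduction: malloc should be familiar to everyone. It is the function the system library provides for requesting a block of memory of a given size. The buddy system described earlier can only hand out memory in whole pages, so requests for small blocks (less than one page) have to be served by library code built on top of it. That is why user space has excellent allocators such as ptmalloc (glibc), tcmalloc (Google) and jemalloc (Facebook). The kernel cannot use these libraries, yet it also needs to allocate a great many small blocks, for example to manage kernel objects such as dentry, inode, fs_struct and task_struct. The kernel therefore provides the slab allocator to manage small allocations. The slab allocator is also designed to cooperate with the CPU cache, and a slab pool is itself often referred to as a cache.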

Memory management in the kernel

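From the kernel's point of view, the slab allocator is used mainly through two functions, kmalloc and kfree, which allocate and release small blocks of memory:
kmalloc(size, flags): allocates a memory area of size bytes and returns a void pointer to its start; if not enough memory is available, it returns NULL.

kfree(ptr): frees the memory area that ptr points to.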

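Kernel developers can also create a dedicated cache, a kmem_cache object, with kmem_cache_create,
and then allocate objects of that specific type with kmem_cache_alloc or kmem_cache_alloc_node. Both eventually reach slab_alloc, so most of the real slab work happens in that function.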

A slab cache has two parts: the cache object, which holds the management data, and the individual slabs, which hold the managed objects.

slab cache
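In the figure above, the cache is a kmem_cache, a slab is a page frame (struct page), and a cached object is a void pointer. One kmem_cache manages many page frames on different memory nodes, and on each node these page frames fall into three classes: partially free, entirely free, and fully allocated.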

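Each kmem_cache manages exactly one type of object. Object sizes differ from cache to cache and are fixed when the cache is created. The cache aligns the requested object size to the CPU cache line or to the size of a void * pointer, and then uses a simple formula to pick a suitable gfporder, which determines how many page frames are requested for each slab. It is computed as follows: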

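Increase gfporder step by step; a slab then spans 2^gfporder pages, that is, PAGE_SIZE << gfporder bytes;
PAGE_SIZE << gfporder = head + num * size + left_over, where head is the space taken by management data, num is the number of objects, size is the object size and left_over is the unused remainder;
the first gfporder for which left_over * 8 <= PAGE_SIZE << gfporder is chosen, because wasting at most one eighth of the slab is considered acceptable fragmentation.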

The function that performs this calculation is calculate_slab_order. The relevant code is shown below:

/**
 * calculate_slab_order - calculate size (page order) of slabs
 * @cachep: pointer to the cache that is being created
 * @size: size of objects to be created in this cache.
 * @flags: slab allocation flags
 *
 * Also calculates the number of objects per slab.
 *
 * This could be made much more intelligent.  For now, try to avoid using
 * high order pages for slabs.  When the gfp() functions are more friendly
 * towards high-order requests, this should be changed.
 */
static size_t calculate_slab_order(struct kmem_cache *cachep,
                size_t size, slab_flags_t flags)
{
    size_t left_over = 0;
    int gfporder;
    for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
        unsigned int num;
        size_t remainder;
        num = cache_estimate(gfporder, size, flags, &remainder);
        if (!num)
            continue;
        /* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
        if (num > SLAB_OBJ_MAX_NUM)
            break;
        if (flags & CFLGS_OFF_SLAB) {
            struct kmem_cache *freelist_cache;
            size_t freelist_size;
            freelist_size = num * sizeof(freelist_idx_t);
            freelist_cache = kmalloc_slab(freelist_size, 0u);
            if (!freelist_cache)
                continue;
            /*
             * Needed to avoid possible looping condition
             * in cache_grow_begin()
             */
            if (OFF_SLAB(freelist_cache))
                continue;
            /* check if off slab has enough benefit */
            if (freelist_cache->size > cachep->size / 2)
                continue;
        }
        /* Found something acceptable - save it away */
        cachep->num = num;
        cachep->gfporder = gfporder;
        left_over = remainder;
        /*
         * A VFS-reclaimable slab tends to have most allocations
         * as GFP_NOFS and we really don't want to have to be allocating
         * higher-order pages when we are unable to shrink dcache.
         */
        if (flags & SLAB_RECLAIM_ACCOUNT)
            break;
        /*
         * Large number of objects is good, but very large slabs are
         * currently bad for the gfp()s.
         */
        if (gfporder >= slab_max_order)
            break;
        /*
         * Acceptable internal fragmentation?
         */
        if (left_over * 8 <= (PAGE_SIZE << gfporder))
            break;
    }
    return left_over;
}
/*
 * Calculate the number of objects and left-over bytes for a given buffer size.
 */
static unsigned int cache_estimate(unsigned long gfporder, size_t buffer_size,
        slab_flags_t flags, size_t *left_over)
{
    unsigned int num;
    size_t slab_size = PAGE_SIZE << gfporder;
    /*
     * The slab management structure can be either off the slab or
     * on it. For the latter case, the memory allocated for a
     * slab is used for:
     *
     * - @buffer_size bytes for each object
     * - One freelist_idx_t for each object
     *
     * We don't need to consider alignment of freelist because
     * freelist will be at the end of slab page. The objects will be
     * at the correct alignment.
     *
     * If the slab management structure is off the slab, then the
     * alignment will already be calculated into the size. Because
     * the slabs are all pages aligned, the objects will be at the
     * correct alignment when allocated.
     */
    if (flags & (CFLGS_OBJFREELIST_SLAB | CFLGS_OFF_SLAB)) {
        num = slab_size / buffer_size;
        *left_over = slab_size % buffer_size;
    } else {
        num = slab_size / (buffer_size + sizeof(freelist_idx_t));
        *left_over = slab_size %
            (buffer_size + sizeof(freelist_idx_t));
    }
    return num;
}
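
To get a feel for the numbers this loop produces, here is a minimal user-space sketch of the same arithmetic. It is not the kernel code: it assumes 4 KiB pages, a 1-byte freelist_idx_t and an on-slab freelist, and it leaves out the off-slab checks, SLAB_RECLAIM_ACCOUNT and the slab_max_order cut-off. The names pick_order, estimate and the SKETCH_* constants are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this user-space sketch (not kernel definitions). */
#define SKETCH_PAGE_SIZE 4096u /* assumed 4 KiB pages */
#define SKETCH_MAX_ORDER 10    /* assumed upper bound on the order */
#define SKETCH_IDX_SIZE  1u    /* assumed sizeof(freelist_idx_t) */

struct slab_layout {
    int order;          /* chosen gfporder, -1 if nothing fits */
    unsigned int num;   /* objects per slab */
    size_t left_over;   /* unused bytes per slab */
};

/* On-slab freelist case: each object costs size + one index entry. */
static unsigned int estimate(int order, size_t size, size_t *left_over)
{
    size_t slab_size = (size_t)SKETCH_PAGE_SIZE << order;
    size_t per_obj = size + SKETCH_IDX_SIZE;

    *left_over = slab_size % per_obj;
    return (unsigned int)(slab_size / per_obj);
}

/*
 * Stop at the first order whose waste is at most 1/8 of the slab.
 * If no order qualifies, the last order that fits is returned.
 */
static struct slab_layout pick_order(size_t size)
{
    struct slab_layout out = { -1, 0, 0 };

    assert(size > 0);
    for (int order = 0; order <= SKETCH_MAX_ORDER; order++) {
        size_t left_over;
        unsigned int num = estimate(order, size, &left_over);

        if (num == 0)
            continue; /* object does not fit yet, try more pages */

        out.order = order;
        out.num = num;
        out.left_over = left_over;

        if (left_over * 8 <= ((size_t)SKETCH_PAGE_SIZE << order))
            break; /* fragmentation is acceptable */
    }
    return out;
}
```

Under these assumptions a 256-byte object fits 15 times into a single page with 241 bytes left over, so order 0 is accepted immediately. A 3000-byte object wastes too much at orders 0 and 1 and only passes the one-eighth test at order 2. The real kernel would usually stop earlier because of slab_max_order.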

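All caches in the system are kept on one doubly linked list, which lets the kernel walk every cache. This is used mainly to shrink the amount of memory held by caches. The typical case is reclaiming the dentry and inode slab caches: when the machine runs short of physical memory, the kernel trims this memory (it is reported as SReclaimable and can be seen with cat /proc/meminfo).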

Basic structures

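The kmem_cache structure represents one slab cache. It holds the cache's metadata (name, object size, number of page frames per slab, colouring information and so on), a __percpu array_cache that holds the free objects cached for each CPU, and the kmem_cache_node structures that manage slab allocation on each memory node.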

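array_cache is a per-CPU array, so it can be accessed without taking a lock, and it is the data structure that interacts most directly with the CPU cache. A free object is always taken with entry[--avail], so the most recently freed object is reused first. When avail == 0, batchcount free objects are first moved from kmem_cache_node into the array_cache.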
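
The fast path and the refill step can be modelled in user space as follows. This is only a sketch: struct ac_sketch and struct node_sketch are simplified, hypothetical stand-ins for array_cache and kmem_cache_node, AC_LIMIT is an arbitrary capacity, and the real kernel additionally disables interrupts and handles shared and alien caches.

```c
#include <assert.h>
#include <stddef.h>

#define AC_LIMIT 8 /* hypothetical capacity of the per-CPU stack */

/* Simplified stand-in for the kernel's struct array_cache. */
struct ac_sketch {
    unsigned int avail;      /* number of cached free objects */
    unsigned int batchcount; /* objects moved per refill */
    void *entry[AC_LIMIT];   /* LIFO stack of free object pointers */
};

/* Hypothetical backing store standing in for kmem_cache_node. */
struct node_sketch {
    void **objs;             /* free objects still held by the node */
    unsigned int nr_free;    /* how many of them are left */
};

/* Move up to batchcount objects from the node into the per-CPU stack. */
static unsigned int ac_refill(struct ac_sketch *ac, struct node_sketch *n)
{
    unsigned int moved = 0;

    while (moved < ac->batchcount && n->nr_free > 0 &&
           ac->avail < AC_LIMIT) {
        ac->entry[ac->avail++] = n->objs[--n->nr_free];
        moved++;
    }
    return moved;
}

/* Fast path: pop from the stack. Slow path: refill first, then pop. */
static void *ac_alloc(struct ac_sketch *ac, struct node_sketch *n)
{
    assert(ac->batchcount > 0);
    if (ac->avail == 0 && ac_refill(ac, n) == 0)
        return NULL; /* the node has nothing left either */
    return ac->entry[--ac->avail];
}

/* Free: push the object back. Returns 0 when the stack is full. */
static int ac_free(struct ac_sketch *ac, void *obj)
{
    if (ac->avail >= AC_LIMIT)
        return 0; /* the kernel would flush a batch to the node here */
    ac->entry[ac->avail++] = obj;
    return 1;
}
```

Because the array is used as a stack, an object that was just freed is the next one handed out, which makes it likely that its memory is still in the CPU cache.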

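kmem_cache_node manages the slabs, that is, the buddy-system page frames in which the objects are actually stored. It keeps three slab lists: partial (some objects free), free (all objects free, called empty in some texts) and full (all objects in use). When the array_cache asks for batchcount free objects, the node first tries to serve them from partial slabs; if that is not enough, it takes the rest from free slabs; and if there are still too few, the cache has to grow by allocating new slab page frames.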
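
The refill order (partial first, then free, then grow) can be illustrated with a small user-space model. It is a sketch under simplifying assumptions: slabs are represented only by a count of objects in use, the three lists are derived from that count instead of being real linked lists, and OBJS_PER_SLAB, MAX_SLABS, struct node_model and node_take are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define OBJS_PER_SLAB 4 /* hypothetical number of objects per slab */
#define MAX_SLABS     8 /* hypothetical cap on slabs in this model */

enum slab_state { SLAB_FREE, SLAB_PARTIAL, SLAB_FULL };

struct node_model {
    unsigned int inuse[MAX_SLABS]; /* allocated objects in each slab */
    unsigned int nr_slabs;         /* slabs obtained so far */
    unsigned int grown;            /* how often the cache had to grow */
};

/* Which of the three lists would slab i be on? */
static enum slab_state state_of(const struct node_model *n, unsigned int i)
{
    assert(i < n->nr_slabs);
    if (n->inuse[i] == 0)
        return SLAB_FREE;
    if (n->inuse[i] == OBJS_PER_SLAB)
        return SLAB_FULL;
    return SLAB_PARTIAL;
}

/* Index of the first slab in the wanted state, or -1 if there is none. */
static int find_slab(const struct node_model *n, enum slab_state want)
{
    for (unsigned int i = 0; i < n->nr_slabs; i++)
        if (state_of(n, i) == want)
            return (int)i;
    return -1;
}

/*
 * Take `batch` objects: partial slabs first, then free slabs, and only
 * then grow by adding a new slab. Returns the number of objects obtained.
 */
static unsigned int node_take(struct node_model *n, unsigned int batch)
{
    unsigned int got = 0;

    while (got < batch) {
        int i = find_slab(n, SLAB_PARTIAL);

        if (i < 0)
            i = find_slab(n, SLAB_FREE);
        if (i < 0) {
            if (n->nr_slabs == MAX_SLABS)
                break; /* this model is out of memory */
            i = (int)n->nr_slabs++;
            n->inuse[i] = 0;
            n->grown++;
        }
        n->inuse[i]++;
        got++;
    }
    return got;
}
```

Preferring partial slabs keeps entirely free slabs untouched for as long as possible, so they remain candidates for being handed back to the buddy system.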

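The page frame (struct page) needs little introduction: it describes one physical page and is built from unions, so its fields are reused for different purposes. When a page is used as a slab, the slab management fields are initialized: s_mem, the address of the first object; the lru list node that links the slab into the node lists; active, the number of objects currently allocated from the slab; the kmem_cache the slab belongs to; and freelist, the array of free object indices (declared as a void * pointer, but it really points to an array of char or short indices).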
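
How s_mem, active and the freelist index array work together can be shown with a short user-space sketch. It is loosely modelled on how the SLAB implementation hands out objects by index, but struct slab_sketch, idx_t, NUM_OBJS and the three functions are hypothetical names, and the real fields live inside struct page.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char idx_t; /* stand-in for freelist_idx_t (char or short) */

#define NUM_OBJS 4           /* hypothetical number of objects per slab */

/* The slab-related fields, gathered in one hypothetical struct. */
struct slab_sketch {
    char *s_mem;              /* address of the first object */
    size_t size;              /* object size in bytes */
    unsigned int active;      /* objects currently handed out */
    idx_t freelist[NUM_OBJS]; /* entries from `active` onward are free */
};

static void slab_init(struct slab_sketch *s, char *mem, size_t size)
{
    s->s_mem = mem;
    s->size = size;
    s->active = 0;
    for (unsigned int i = 0; i < NUM_OBJS; i++)
        s->freelist[i] = (idx_t)i;
}

/* Hand out the object whose index is stored at freelist[active]. */
static void *slab_get(struct slab_sketch *s)
{
    if (s->active == NUM_OBJS)
        return NULL; /* slab is full */
    return s->s_mem + (size_t)s->freelist[s->active++] * s->size;
}

/* Give an object back: store its index just below the in-use boundary. */
static void slab_put(struct slab_sketch *s, void *obj)
{
    size_t idx = (size_t)((char *)obj - s->s_mem) / s->size;

    assert(s->active > 0 && idx < NUM_OBJS);
    s->freelist[--s->active] = (idx_t)idx;
}
```

Storing small indices instead of full pointers is what keeps the freelist compact: with a 1-byte index, the bookkeeping costs one byte per object.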

The data structures are as follows:

struct kmem_cache {
/* 0) per-CPU data, touched by every alloc and free */
    struct array_cache __percpu *cpu_cache;  // per-CPU array of cached free objects

/* 1) Cache tunables. Protected by slab_mutex */
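    unsigned int batchcount; // number of objects fetched from the node's slabs when the per-CPU cpu_cache is empty; also the number of objects allocated when the cache grows. Starts at 1 and is tuned later.
    unsigned int limit; // upper bound on the number of objects in the per-CPU cpu_cache; when frees push it to limit, some objects are returned from the array_cache to the slabs in kmem_cache_node.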
    unsigned int shared;

    unsigned int size; // size of each object in a slab
    struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

    slab_flags_t flags;     /* constant flags */
    unsigned int num;       /* # of objs per slab */

/* 3) cache_grow/shrink */
    /* order of pgs per slab (2^n) */
    unsigned int gfporder; // page order of a slab: each slab spans 2^gfporder pages

    /* force GFP flags, e.g. GFP_DMA */
    gfp_t allocflags;  

    size_t colour;          /* cache colouring range */
    unsigned int colour_off;    /* colour offset */
    struct kmem_cache *freelist_cache; // cache from which an off-slab freelist is allocated
    unsigned int freelist_size; // size of the freelist in bytes

    /* constructor func */
    void (*ctor)(void *obj); // optional constructor, still supported; it was the destructor that was removed during 2.6

/* 4) cache creation/removal */
    const char *name;
    struct list_head list;
    int refcount;
    int object_size;
    int align;

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
    unsigned long num_active;