The Slab Mechanism in Detail (2): The Process of Creating a Slab

2.2 The process of creating a slab:

Trying to make sense of the individual member definitions of struct kmem_cache at this point is not productive; it is clearer to go straight to the function source:

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
        unsigned long flags, void (*ctor)(void *))
{
        size_t left_over, slab_size, ralign;
        struct kmem_cache *cachep = NULL, *pc;
        gfp_t gfp;

        /*
         * Sanity checks... these are all serious usage bugs.
         */
        /* Parameter checks: the name must not be NULL, and the function must
         * not be called from interrupt context (it may sleep). The requested
         * size must be at least 4 bytes (one CPU word) and must not exceed
         * the maximum (1 << 22 = 4 MB). */
        if (!name || in_interrupt() || (size < BYTES_PER_WORD) ||
            size > KMALLOC_MAX_SIZE) {
                printk(KERN_ERR "%s: Early error in slab %s\n", __func__,
                                name);
                BUG();
        }

        /*
         * We use cache_chain_mutex to ensure a consistent view of
         * cpu_online_mask as well.  Please see cpuup_callback
         */
        if (slab_is_available()) {
                get_online_cpus();
                mutex_lock(&cache_chain_mutex);
        }

        /* Consistency checks; not essential for understanding. */
        list_for_each_entry(pc, &cache_chain, next) {
                char tmp;
                int res;

                /*
                 * This happens when the module gets unloaded and doesn't
                 * destroy its slab cache and no-one else reuses the vmalloc
                 * area of the module.  Print a warning.
                 */
                res = probe_kernel_address(pc->name, tmp);
                if (res) {
                        printk(KERN_ERR
                               "SLAB: cache with size %d has lost its name\n",
                               pc->buffer_size);
                        continue;
                }

                if (!strcmp(pc->name, name)) {
                        printk(KERN_ERR
                               "kmem_cache_create: duplicate cache %s\n", name);
                        dump_stack();
                        goto oops;
                }
        }

#if DEBUG
        WARN_ON(strchr(name, ' '));     /* It confuses parsers */
#if FORCED_DEBUG
        /*
         * Enable redzoning and last user accounting, except for caches with
         * large objects, if the increased size would increase the object size
         * above the next power of two: caches with object sizes just above a
         * power of two have a significant amount of internal fragmentation.
         */
        if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
                                                2 * sizeof(unsigned long long)))
                flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
        if (!(flags & SLAB_DESTROY_BY_RCU))
                flags |= SLAB_POISON;
#endif
        if (flags & SLAB_DESTROY_BY_RCU)
                BUG_ON(flags & SLAB_POISON);
#endif
        /*
         * Always checks flags, a caller might be expecting debug support which
         * isn't available.
         */
        BUG_ON(flags & ~CREATE_MASK);

        /*
         * Check that size is in terms of words.  This is needed to avoid
         * unaligned accesses for some archs when redzoning is used, and makes
         * sure any on-slab bufctl's are also correctly aligned.
         */
        /* What follows is all about alignment. */
        if (size & (BYTES_PER_WORD - 1)) {
                size += (BYTES_PER_WORD - 1);
                size &= ~(BYTES_PER_WORD - 1);
        }

        /* calculate the final buffer alignment: */

        /* 1) arch recommendation: can be overridden for debug */
        if (flags & SLAB_HWCACHE_ALIGN) {
                /*
                 * Default alignment: as specified by the arch code.  Except if
                 * an object is really small, then squeeze multiple objects into
                 * one cacheline.
                 */
                ralign = cache_line_size();
                while (size <= ralign / 2)
                        ralign /= 2;
        } else {
                ralign = BYTES_PER_WORD;
        }

        /*
         * Redzoning and user store require word alignment or possibly larger.
         * Note this will be overridden by architecture or caller mandated
         * alignment if either is greater than BYTES_PER_WORD.
         */
        if (flags & SLAB_STORE_USER)
                ralign = BYTES_PER_WORD;

        if (flags & SLAB_RED_ZONE) {
                ralign = REDZONE_ALIGN;
                /* If redzoning, ensure that the second redzone is suitably
                 * aligned, by adjusting the object size accordingly. */
                size += REDZONE_ALIGN - 1;
                size &= ~(REDZONE_ALIGN - 1);
        }

        /* 2) arch mandated alignment */
        if (ralign < ARCH_SLAB_MINALIGN) {
                ralign = ARCH_SLAB_MINALIGN;
        }
        /* 3) caller mandated alignment */
        if (ralign < align) {
                ralign = align;
        }
        /* disable debug if necessary */
        if (ralign > __alignof__(unsigned long long))
                flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
        /*
         * 4) Store it.
         */
        align = ralign;

        if (slab_is_available())
                gfp = GFP_KERNEL;
        else
                gfp = GFP_NOWAIT;

        /* Get cache's description obj. */
        /* Allocate a new kmem_cache instance from the cache_cache cache. */
        cachep = kmem_cache_zalloc(&cache_cache, gfp);
        if (!cachep)
                goto oops;

#if DEBUG
        cachep->obj_size = size;

        /*
         * Both debugging options require word-alignment which is calculated
         * into align above.
         */
        if (flags & SLAB_RED_ZONE) {
                /* add space for red zone words */
                cachep->obj_offset += sizeof(unsigned long long);
                size += 2 * sizeof(unsigned long long);
        }
        if (flags & SLAB_STORE_USER) {
                /* user store requires one word storage behind the end of
                 * the real object. But if the second red zone needs to be
                 * aligned to 64 bits, we must allow that much space.
                 */
                if (flags & SLAB_RED_ZONE)
                        size += REDZONE_ALIGN;
                else
                        size += BYTES_PER_WORD;
        }
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
        if (size >= malloc_sizes[INDEX_L3 + 1].cs_size
            && cachep->obj_size > cache_line_size() && size < PAGE_SIZE) {
                cachep->obj_offset += PAGE_SIZE - size;
                size = PAGE_SIZE;
        }
#endif
#endif

        /*
         * Determine if the slab management is 'on' or 'off' slab.
         * (bootstrapping cannot cope with offslab caches so don't do
         * it too early on.)
         */
        /* Decide how the slab management structures are stored: on-slab or
         * off-slab. The off-slab scheme is normally used once objects reach
         * 512 bytes. During initialization only the on-slab scheme is used
         * (kmem_cache_init clears slab_early_init right after the first two
         * general caches have been created). */
        if ((size >= (PAGE_SIZE >> 3)) && !slab_early_init)
                /*
                 * Size is large, assume best to place the slab management obj
                 * off-slab (should allow better packing of objs).
                 */
                flags |= CFLGS_OFF_SLAB;
        size = ALIGN(size, align);
        /* Work out the fragment size, how many pages one slab spans, and how
         * many objects each slab holds. */
        left_over = calculate_slab_order(cachep, size, align, flags);
        if (!cachep->num) {
                printk(KERN_ERR
                       "kmem_cache_create: couldn't create cache %s.\n", name);
                kmem_cache_free(&cache_cache, cachep);
                cachep = NULL;
                goto oops;
        }
        /* Size of the slab management structures: the struct slab itself
         * plus the kmem_bufctl_t array. */
        slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
                          + sizeof(struct slab), align);

        /*
         * If the slab has been placed off-slab, and we have enough space then
         * move it on-slab. This is at the expense of any extra colouring.
         */
        /* If this would be an off-slab slab but the fragment is big enough to
         * hold the management structures, move them on-slab and turn it into
         * an on-slab slab. */
        if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
                /* Clear the off-slab flag. */
                flags &= ~CFLGS_OFF_SLAB;
                /* The fragment just got smaller! */
                left_over -= slab_size;
        }

        /* For a genuinely off-slab slab there is no need to align the
         * management structures; restore their pre-alignment size. */
        if (flags & CFLGS_OFF_SLAB) {
                /* really off slab. No need for manual alignment */
                slab_size =
                    cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
#ifdef CONFIG_PAGE_POISONING
                /* If we're going to use the generic kernel_map_pages()
                 * poisoning, then it's going to smash the contents of
                 * the redzone and userword anyhow, so switch them off.
                 */
                if (size % PAGE_SIZE == 0 && flags & SLAB_POISON)
                        flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif
        }

        /* Colour unit: the cache line size, typically 32 bytes. */
        cachep->colour_off = cache_line_size();
        /* Offset must be a multiple of the alignment. */
        if (cachep->colour_off < align)
                cachep->colour_off = align;
        /* Number of colour units the fragment area can hold. */
        cachep->colour = left_over / cachep->colour_off;
        /* Size of the management structures. */
        cachep->slab_size = slab_size;
        cachep->flags = flags;
        cachep->gfpflags = 0;
        /* On ARM the following if can be ignored: no DMA concerns there. */
        if (CONFIG_ZONE_DMA_FLAG && (flags & SLAB_CACHE_DMA))
                cachep->gfpflags |= GFP_DMA;
        /* Size of one slab object. */
        cachep->buffer_size = size;
        /* Reciprocal of the object size, used when computing an object's
         * index within its slab; see obj_to_index(). */
        cachep->reciprocal_buffer_size = reciprocal_value(size);
        /* For an off-slab slab, find the general cache that will hold the
         * slab management structures and record it in slabp_cache; for an
         * on-slab slab this pointer stays NULL. */
        if (flags & CFLGS_OFF_SLAB) {
                cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
                /*
                 * This is a possibility for one of the malloc_sizes caches.
                 * But since we go off slab only for object size greater than
                 * PAGE_SIZE/8, and malloc_sizes gets created in ascending order,
                 * this should not happen at all.
                 * But leave a BUG_ON for some lucky dude.
                 */
                BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
        }
        /* The cache's constructor and name. */
        cachep->ctor = ctor;
        cachep->name = name;

        /* Set up the per-CPU local caches and the three slab lists. */
        if (setup_cpu_cache(cachep, gfp)) {
                __kmem_cache_destroy(cachep);
                cachep = NULL;
                goto oops;
        }

        /* cache setup completed, link it into the list */
        list_add(&cachep->next, &cache_chain);
oops:
        if (!cachep && (flags & SLAB_PANIC))
                panic("kmem_cache_create(): failed to create slab `%s'\n",
                      name);
        if (slab_is_available()) {
                mutex_unlock(&cache_chain_mutex);
                put_online_cpus();
        }
        return cachep;
}

---------------------------------------------------------------------------------------------------------------------------------

Everything up to the point of "if (slab_is_available()) gfp = GFP_KERNEL;" can be skimmed: it is just checks on the running environment and the parameters (note that this function may sleep, so it must never be called from interrupt context), plus a pile of alignment handling. Take a look at this fragment:

if (slab_is_available())
        gfp = GFP_KERNEL;
else
        gfp = GFP_NOWAIT;

Here the value of gfp is chosen according to whether slab initialization has completed. gfp should look familiar: it dictates where and how memory is obtained from the buddy system. Once slab is up, gfp is GFP_KERNEL, which is why this function may sleep; before slab initialization completes, gfp is GFP_NOWAIT, which never sleeps.

---------------------------------------------------------------------------------------------------------------------------------

Next, a kmem_cache structure is obtained by calling kmem_cache_zalloc. Its only difference from kmem_cache_alloc is that it zeroes the allocated region, i.e. it adds the __GFP_ZERO flag to the gfp parameter passed to kmem_cache_alloc; otherwise the two are identical.
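For reference, in kernels of this vintage the wrapper is essentially this one-liner (a sketch from memory; treat the exact header location as an assumption):

static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
{
        /* kmem_cache_alloc with zeroing requested via __GFP_ZERO */
        return kmem_cache_alloc(k, flags | __GFP_ZERO);
}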

From the analysis in section 2.1 we already know that to obtain memory of some length through the slab allocator, a "rule" for that length must first be created. Now a piece of memory the size of a kmem_cache structure is needed, so a "rule" for that length is needed too. That rule is likewise created in the initialization function kmem_cache_init, and the result of creating it is the global variable cache_cache: whenever a kmem_cache-sized piece of memory is requested, it is allocated through this already-created kmem_cache instance, cache_cache.

That said, cache_cache is not a good example for understanding slab creation, for reasons that will become clear later. To understand slabs, keep following kmem_cache_create. The next step decides how the slab management structures are stored:

if ((size >= (PAGE_SIZE >> 3)) && !slab_early_init)
        /*
         * Size is large, assume best to place the slab management obj
         * off-slab (should allow better packing of objs).
         */
        flags |= CFLGS_OFF_SLAB;

This introduces the two storage schemes for the slab management structures: on-slab and off-slab. Put simply, on-slab means the management part lives in the same memory region as the memory actually handed out; off-slab means the management part gets its own separately allocated region, distinct from the memory being handed out. The "management part" comprises the struct slab itself and the object descriptors, described in detail later. The meaning of this if is: once slab initialization is complete, the off-slab scheme is used whenever the object length of the "rule" being created reaches (PAGE_SIZE >> 3) = 512 bytes; otherwise the on-slab scheme is used, and before initialization completes the on-slab scheme is always used.
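As a quick illustration, the decision boils down to a one-line predicate. The helper below is hypothetical (not in the kernel) and assumes 4 KB pages:

/* Hypothetical helper restating the rule above: off-slab once the aligned
 * object size reaches PAGE_SIZE >> 3 (4096 >> 3 == 512 bytes) and early
 * bootstrap is over (slab_early_init == 0). */
static int would_be_off_slab(size_t size, int slab_early_init)
{
        return (size >= (PAGE_SIZE >> 3)) && !slab_early_init;
}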

---------------------------------------------------------------------------------------------------------------------------------

Next comes left_over = calculate_slab_order(cachep, size, align, flags); this computes, for the object length size of the "rule" being created, how many physical pages the resulting slab should span, how many size-sized objects it holds, and how much is left over as a fragment (the fragment being the part of the acquired memory that remains unusable once the objects are laid out):

static size_t calculate_slab_order(struct kmem_cache *cachep,
                        size_t size, size_t align, unsigned long flags)
{
        unsigned long offslab_limit;
        size_t left_over = 0;
        int gfporder;

        for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
                unsigned int num;
                size_t remainder;

                /* Compute the object count and wasted space per slab.
                 * Parameters: gfporder: the slab spans 2^gfporder pages
                 *             buffer_size: object size
                 *             align: object alignment
                 *             flags: off-slab or on-slab
                 *             remainder: wasted space (fragment) in the slab
                 *             num: number of objects in the slab */
                cache_estimate(gfporder, size, align, flags, &remainder, &num);
                if (!num)
                        continue;

                if (flags & CFLGS_OFF_SLAB) {
                        /*
                         * Max number of objs-per-slab for caches which
                         * use off-slab slabs. Needed to avoid a possible
                         * looping condition in cache_grow().
                         */
                        offslab_limit = size - sizeof(struct slab);
                        offslab_limit /= sizeof(kmem_bufctl_t);
                        if (num > offslab_limit)
                                break;
                }

                /* Found something acceptable - save it away */
                /* Record the object count, page order (at most 2^1 = 2 pages)
                 * and wasted space. */
                cachep->num = num;
                cachep->gfporder = gfporder;
                left_over = remainder;

                /*
                 * A VFS-reclaimable slab tends to have most allocations
                 * as GFP_NOFS and we really don't want to have to be allocating
                 * higher-order pages when we are unable to shrink dcache.
                 */
                /* SLAB_RECLAIM_ACCOUNT marks this slab's pages as reclaimable:
                 * they are counted when the kernel checks whether enough pages
                 * are available to satisfy user-space demand, and they can be
                 * released via kmem_freepages(). Being reclaimable, the
                 * fragmentation checks below are unnecessary. */
                if (flags & SLAB_RECLAIM_ACCOUNT)
                        break;

                /*
                 * Large number of objects is good, but very large slabs are
                 * currently bad for the gfp()s.
                 */
                /* slab_break_gfp_order is initialized to 1, i.e. a slab is at
                 * most 2^1 = 2 pages. */
                if (gfporder >= slab_break_gfp_order)
                        break;

                /*
                 * Acceptable internal fragmentation?
                 */
                /* If the slab's pages total at least 8 times the fragment
                 * size, page utilization is good enough to accept this
                 * order. */
                if (left_over * 8 <= (PAGE_SIZE << gfporder))
                        break;
        }
        /* Return the fragment size. */
        return left_over;
}

 

static void cache_estimate(unsigned long gfporder, size_t buffer_size,
                           size_t align, int flags, size_t *left_over,
                           unsigned int *num)
{
        int nr_objs;
        size_t mgmt_size;
        /* The slab spans 2^gfporder pages. */
        size_t slab_size = PAGE_SIZE << gfporder;

        /*
         * The slab management structure can be either off the slab or
         * on it. For the latter case, the memory allocated for a
         * slab is used for:
         *
         * - The struct slab
         * - One kmem_bufctl_t for each object
         * - Padding to respect alignment of @align
         * - @buffer_size bytes for each object
         *
         * If the slab management structure is off the slab, then the
         * alignment will already be calculated into the size. Because
         * the slabs are all pages aligned, the objects will be at the
         * correct alignment when allocated.
         */
        /* For an off-slab slab there are no management structures here, so
         * the object count is simply the available space divided by the
         * object size. */
        if (flags & CFLGS_OFF_SLAB) {
                /* No management structures on the slab: all of it stores
                 * objects. */
                mgmt_size = 0;
                /* So object count = slab size / object size. */
                nr_objs = slab_size / buffer_size;
                /* The object count must not exceed the limit. */
                if (nr_objs > SLAB_LIMIT)
                        nr_objs = SLAB_LIMIT;
        }
        /* For an on-slab slab, the space taken by the management structures
         * (the struct slab plus one object descriptor per object) must be
         * subtracted. */
        else {
                /*
                 * Ignore padding for the initial guess. The padding
                 * is at most @align-1 bytes, and @buffer_size is at
                 * least @align. In the worst case, this result will
                 * be one greater than the number of objects that fit
                 * into the memory allocation when taking the padding
                 * into account.
                 */
                /* On-slab: the management structures sit next to the objects.
                 * The slab's pages then contain: one struct slab,
                 * one kmem_bufctl_t array (with as many entries as there are
                 * objects), and the objects themselves.
                 * The management size must come out of the slab size, so the
                 * object count is the remaining size divided by (object size
                 * + sizeof(kmem_bufctl_t)). */
                nr_objs = (slab_size - sizeof(struct slab)) /
                          (buffer_size + sizeof(kmem_bufctl_t));

                /*
                 * This calculated number will be either the right
                 * amount, or one greater than what we want.
                 */
                /* If the aligned total exceeds the slab size, drop one
                 * object. */
                if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
                       > slab_size)
                        nr_objs--;
                /* The object count must not exceed the limit. */
                if (nr_objs > SLAB_LIMIT)
                        nr_objs = SLAB_LIMIT;
                /* The aligned total size of the management structures. */
                mgmt_size = slab_mgmt_size(nr_objs, align);
        }
        /* The slab's final object count. */
        *num = nr_objs;
        /* With the management size known (0 for off-slab, computed above for
         * on-slab), the final wasted space follows. */
        *left_over = slab_size - nr_objs*buffer_size - mgmt_size;
}

calculate_slab_order, by calling cache_estimate in a for loop, finally determines everything for the "rule" of length size being created: how many physical pages each such slab spans, how many objects of this size each slab holds, and how big each slab's fragment is. Each slab is in fact at most 2 physical pages; how many size-sized objects fit depends on whether the slab is off-slab or on-slab: off-slab, the management structures take none of the acquired space; on-slab, they do. The management structures consist of the struct slab plus one object descriptor per object.

Summary: after calculate_slab_order returns, we know how many physical pages such a slab should request from the buddy system (at most 2; cachep->gfporder), how many objects of the desired length (size) it holds (cachep->num), and how big each slab's fragment is (the variable left_over).
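To make the arithmetic concrete, here is a small user-space sketch of the on-slab branch of cache_estimate; the 4 KB page and the 28-byte struct slab / 4-byte kmem_bufctl_t sizes are assumptions for illustration only:

#include <stdio.h>

#define PAGE_SZ    4096UL
#define SLAB_HDR     28UL       /* assumed sizeof(struct slab) */
#define BUFCTL_SZ     4UL       /* assumed sizeof(kmem_bufctl_t) */

static unsigned long align_up(unsigned long n, unsigned long a)
{
        return (n + a - 1) & ~(a - 1);
}

int main(void)
{
        unsigned long buffer_size = 128, align = 4;     /* one example "rule" */
        unsigned long slab_size = PAGE_SZ;              /* gfporder == 0 */

        /* Initial guess, ignoring alignment padding (mirrors cache_estimate). */
        unsigned long nr = (slab_size - SLAB_HDR) / (buffer_size + BUFCTL_SZ);
        unsigned long mgmt = align_up(SLAB_HDR + nr * BUFCTL_SZ, align);

        if (mgmt + nr * buffer_size > slab_size) {      /* guess one too big? */
                nr--;
                mgmt = align_up(SLAB_HDR + nr * BUFCTL_SZ, align);
        }
        printf("num = %lu, left_over = %lu\n",
               nr, slab_size - nr * buffer_size - mgmt);
        return 0;
}

With these assumptions it prints num = 30, left_over = 108: exactly the kind of numbers calculate_slab_order stores into cachep->num and returns as left_over.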

---------------------------------------------------------------------------------------------------------------------------------

Next comes a piece that, depending on the fragment size, may convert an off-slab slab into an on-slab one; no need to dwell on it. It tends to happen when, because of the requested length size and the alignment unit align, switching to on-slab would save a lot of space, i.e. shrink the fragment considerably. The result is the final (on-slab or off-slab) management-structure size slab_size and fragment size left_over (source omitted here).

Next, a number of attributes of this "rule"'s slab are set:

/* Colour unit: the cache line size, typically 32 bytes. */
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < align)
        cachep->colour_off = align;
/* Number of colour units the fragment area can hold. */
cachep->colour = left_over / cachep->colour_off;
/* Size of the management structures. */
cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
/* On ARM the following if can be ignored: no DMA concerns there. */
if (CONFIG_ZONE_DMA_FLAG && (flags & SLAB_CACHE_DMA))
        cachep->gfpflags |= GFP_DMA;
/* Size of one slab object. */
cachep->buffer_size = size;
/* Reciprocal of the object size, used when computing an object's index
 * within its slab; see obj_to_index(). */
cachep->reciprocal_buffer_size = reciprocal_value(size);
/* For an off-slab slab, find the general cache that will hold the slab
 * management structures and record it in slabp_cache; for an on-slab slab
 * this pointer stays NULL. */
if (flags & CFLGS_OFF_SLAB) {
        cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
        /*
         * This is a possibility for one of the malloc_sizes caches.
         * But since we go off slab only for object size greater than
         * PAGE_SIZE/8, and malloc_sizes gets created in ascending order,
         * this should not happen at all.
         * But leave a BUG_ON for some lucky dude.
         */
        BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
}
/* The cache's constructor and name. */
cachep->ctor = ctor;
cachep->name = name;

The first few assignments concern colouring, which will be described later. Personally I would not dwell on it: know its principle and how it operates, but colouring has real drawbacks in practice and makes slab management very complex; later Linux versions increasingly replace slab, and its colouring machinery, with slub.
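For intuition, the colouring fields end up being used roughly as sketched below when a new slab is grown. In the real code of this era the cursor lives in the kmem_list3 (l3->colour_next) and is advanced inside cache_grow, so treat this as a simplified restatement; the numbers assume colour_off = 32 and left_over = 108 (so colour = 3):

/* Simplified sketch: successive slabs of a cache start their objects at
 * offsets 0, colour_off, 2*colour_off, ... wrapping after `colour` steps.
 * With colour = 3 and colour_off = 32 the offsets cycle 0, 32, 64, 0, ...
 * spreading equal-index objects across different cache lines. */
static size_t next_colour_offset(struct kmem_cache *cachep, unsigned int *cursor)
{
        size_t offset = *cursor * cachep->colour_off;

        if (++*cursor >= cachep->colour)
                *cursor = 0;    /* wrap around */
        return offset;
}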

---------------------------------------------------------------------------------------------------------------------------------

Next come the settings for the management-structure size slab_size, the slab's allocation flags, the object size buffer_size (and its reciprocal), the buddy-system interface gfpflags, the constructor ctor, and the slab name name.

Note where the management structures of an off-slab slab live: as established, for an off-slab slab they are not inside the acquired region but in a separately requested one. In the source this is the slabp_cache member: for an off-slab slab it records the general cache from which a slab_size-byte area will be allocated to store the management structures; for an on-slab slab this member can be ignored and stays NULL.

---------------------------------------------------------------------------------------------------------------------------------

Next is a key step: creating this slab's local caches (one per CPU) and its three slab lists, via the function setup_cpu_cache.

At this point it is worth looking at struct kmem_cache itself:

struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
        struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
        unsigned int batchcount;
        unsigned int limit;
        unsigned int shared;
        unsigned int buffer_size;
        u32 reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
        unsigned int flags;             /* constant flags */
        unsigned int num;               /* # of objs per slab */
/* 4) cache_grow/shrink */
        /* order of pgs per slab (2^n) */
        unsigned int gfporder;
        /* force GFP flags, e.g. GFP_DMA */
        gfp_t gfpflags;
        size_t colour;                  /* cache colouring range */
        unsigned int colour_off;        /* colour offset */
        struct kmem_cache *slabp_cache;
        unsigned int slab_size;
        unsigned int dflags;            /* dynamic flags */
        /* constructor func */
        void (*ctor)(void *obj);
/* 5) cache creation/removal */
        const char *name;
        struct list_head next;
/* 6) statistics */
#ifdef CONFIG_DEBUG_SLAB
        unsigned long num_active;
        unsigned long num_allocations;
        unsigned long high_mark;
        unsigned long grown;
        unsigned long reaped;
        unsigned long errors;
        unsigned long max_freeable;
        unsigned long node_allocs;
        unsigned long node_frees;
        unsigned long node_overflow;
        atomic_t allochit;
        atomic_t allocmiss;
        atomic_t freehit;
        atomic_t freemiss;

        /*
         * If debugging is enabled, then the allocator can add additional
         * fields and/or padding to every object. buffer_size contains the total
         * object size including these internal fields, the following two
         * variables contain the offset to the user object and its size.
         */
        int obj_offset;
        int obj_size;
#endif /* CONFIG_DEBUG_SLAB */

        /*
         * We put nodelists[] at the end of kmem_cache, because we want to size
         * this array to nr_node_ids slots instead of MAX_NUMNODES
         * (see kmem_cache_init())
         * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache
         * is statically defined, so we reserve the max number of nodes.
         */
        struct kmem_list3 *nodelists[MAX_NUMNODES];
        /*
         * Do not add fields after nodelists[]
         */
};

The members not yet set at this point are:

struct array_cache *array[NR_CPUS];
unsigned int batchcount;        /* number of objects moved in/out per batch */
unsigned int limit;             /* upper bound on free objects in the local cache */
unsigned int shared;            /* whether a shared per-CPU cache exists */
unsigned int dflags;            /* dynamic flags (can be ignored for now) */
struct kmem_list3 *nodelists[MAX_NUMNODES];

batchcount and limit relate to the actual handing out of memory, shared only matters on multi-CPU systems, and dflags can be ignored for now. The members to focus on are array and nodelists, which embody the mechanism by which the requested memory is distributed. When allocation actually starts, each CPU takes the memory it needs from the array member of kmem_cache. If there is nothing there (either used up, or the very first use, in which the array always starts empty), it must be obtained from the buddy system. That happens by way of the three slab lists in nodelists ("three slab lists" is a name I picked up from another article: it refers to the three lists of fully free, partially free, and full slabs): physical pages are obtained from the buddy system through them, and the corresponding addresses are then fed into array. Notice that struct kmem_cache has one array per CPU (NR_CPUS is the number of CPUs). The reason for this design is that if every CPU took pages through the slab lists alone, multiple CPUs would contend for the slab spinlock, hurting efficiency. Notice also how similar this array mechanism is to the buddy system's hot/cold page-frame mechanism. The details of how array and the three slab lists hand out memory are discussed later; for now the rough idea is enough. Let's continue with the function setup_cpu_cache:
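As a preview of that mechanism, the allocation fast path boils down to something like the following simplified sketch (not the kernel's exact ____cache_alloc, which also handles statistics and debug hooks):

/* Simplified per-CPU fast path: hit the local array_cache first; fall back
 * to refilling from the three slab lists only on a miss. */
static void *alloc_sketch(struct kmem_cache *cachep, gfp_t flags)
{
        struct array_cache *ac = cpu_cache_get(cachep); /* this CPU's array */

        if (likely(ac->avail)) {
                ac->touched = 1;
                return ac->entry[--ac->avail];  /* lock-free: our CPU only */
        }
        return cache_alloc_refill(cachep, flags);       /* slow path: slab lists */
}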

static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
{
        /* Initialization has fully completed (start_kernel calls
         * kmem_cache_init_late after mm_init): just enable the local
         * cache. */
        if (g_cpucache_up == FULL)
                return enable_cpucache(cachep, gfp);

        /* Otherwise we are still in the initialization phase. g_cpucache_up
         * records how far the general caches have come: PARTIAL_AC means the
         * cache holding struct array_cache has been created, PARTIAL_L3 means
         * the cache holding struct kmem_list3 has been created; note the
         * order in which these two caches are created. During initialization
         * only the boot CPU's local cache and slab lists need configuring.
         *
         * g_cpucache_up == NONE means the cache for sizeof(struct array)
         * has not been created yet: we enter this path while creating that
         * very cache during initialization. The general cache holding
         * struct arraycache_init does not exist yet, so only the statically
         * allocated global initarray_generic can serve as the local cache. */
        if (g_cpucache_up == NONE) {
                /*
                 * Note: the first kmem_cache_create must create the cache
                 * that's used by kmalloc(24), otherwise the creation of
                 * further caches will BUG().
                 */
                cachep->array[smp_processor_id()] = &initarray_generic.cache;

                /*
                 * If the cache that's used by kmalloc(sizeof(kmem_list3)) is
                 * the first cache, then we need to set up all its list3s,
                 * otherwise the creation of further caches will BUG().
                 */
                /* The cache holding struct kmem_list3 is created after the
                 * one holding struct array_cache, so it certainly does not
                 * exist yet either; the global initkmem_list3 must be used. */
                set_up_list3s(cachep, SIZE_AC);

                /* Reaching this point means the struct array_cache cache is
                 * complete. If struct kmem_list3 and struct array_cache lived
                 * in the same general cache it would not be created again
                 * (though that clearly cannot happen); either way,
                 * g_cpucache_up advances one step. */
                if (INDEX_AC == INDEX_L3)
                        g_cpucache_up = PARTIAL_L3;
                else
                        g_cpucache_up = PARTIAL_AC;
        }
        /* We get here when g_cpucache_up is at least PARTIAL_AC: the general
         * cache holding struct arraycache_init exists, so the array can be
         * allocated with kmalloc. */
        else {
                cachep->array[smp_processor_id()] =
                        kmalloc(sizeof(struct arraycache_init), gfp);

                /* The cache holding struct kmem_list3 is still missing; the
                 * global slab lists must still be used. */
                if (g_cpucache_up == PARTIAL_AC) {
                        set_up_list3s(cachep, SIZE_L3);
                        g_cpucache_up = PARTIAL_L3;
                }
                /* Both the struct kmem_list3 cache and the struct array_cache
                 * cache exist; no global variables are needed. */
                else {
                        int node;
                        for_each_online_node(node) {
                                /* Allocate the struct kmem_list3 with
                                 * kmalloc. */
                                cachep->nodelists[node] =
                                    kmalloc_node(sizeof(struct kmem_list3),
                                                 gfp, node);
                                BUG_ON(!cachep->nodelists[node]);
                                /* Initialize the three slab lists. */
                                kmem_list3_init(cachep->nodelists[node]);
                        }
                }
        }
        cachep->nodelists[numa_node_id()]->next_reap =
                        jiffies + REAPTIMEOUT_LIST3 +
                        ((unsigned long)cachep) % REAPTIMEOUT_LIST3;

        cpu_cache_get(cachep)->avail = 0;
        cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
        cpu_cache_get(cachep)->batchcount = 1;
        cpu_cache_get(cachep)->touched = 0;
        cachep->batchcount = 1;
        cachep->limit = BOOT_CPUCACHE_ENTRIES;
        return 0;
}

Start with the initial "if (g_cpucache_up == FULL)". This involves the slab initialization progress: the static global g_cpucache_up, defined right in slab.c, records how far slab initialization has come, as follows:

NONE: the starting state;
PARTIAL_AC: after the cache for sizeof(struct arraycache_init) has been created;
PARTIAL_L3: after the cache for sizeof(struct kmem_list3) has been created;
EARLY: at the end of kmem_cache_init;
FULL: when start_kernel calls kmem_cache_init_late.

Once FULL is reached, enable_cpucache is called. It determines the upper bound limit from the size of the objects being requested, then calls do_tune_cpucache and alloc_kmemlist to create the cache's array and three slab lists. I analysed those two functions as well, but they need not be a focus: by that point they can simply kmalloc a sizeof(struct arraycache_init)-sized area and a sizeof(struct kmem_list3)-sized area to store the array and the slab lists respectively. What deserves attention is the situation before slab initialization completes:
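For reference, the limit picked by enable_cpucache is a simple staircase over the object size. This sketch restates it from memory of this kernel era, so treat the exact thresholds as assumptions:

/* Rough restatement of enable_cpucache's limit selection: the bigger the
 * object, the fewer entries the per-CPU local cache is allowed to hold. */
static int pick_limit(size_t buffer_size)
{
        if (buffer_size > 131072)
                return 1;
        if (buffer_size > PAGE_SIZE)
                return 8;
        if (buffer_size > 1024)
                return 24;
        if (buffer_size > 256)
                return 54;
        return 120;
}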

We already know that g_cpucache_up still being NONE means the kernel has not yet created the cache for sizeof(struct arraycache_init); the first branch, "if (g_cpucache_up == NONE)", is therefore the path taken while that very cache is being created:

cachep->array[smp_processor_id()] = &initarray_generic.cache;

Notice that the cache's array is assigned from a global variable. Why? Because at this point the kernel has never created a cache of size sizeof(struct arraycache_init), so the first array cannot come from kmalloc; it can only be simulated with a global. Here is that global, initarray_generic:

static struct arraycache_init initarray_generic =
        { {0, BOOT_CPUCACHE_ENTRIES, 1, 0} };

Its type is:

struct arraycache_init {
        struct array_cache cache;
        void *entries[BOOT_CPUCACHE_ENTRIES];
};

BOOT_CPUCACHE_ENTRIES is 1, so the cache member of initarray_generic (of type struct array_cache) is initialized to {0, 1, 1, 0} and entries to NULL, and this serves as the first sizeof(struct arraycache_init)-sized cache. Immediately afterwards, set_up_list3s does the same trick, using the global initkmem_list3 to provide the first three slab lists. After that, g_cpucache_up advances to PARTIAL_AC, since INDEX_AC and INDEX_L3 cannot be equal. Incidentally, here is what INDEX_AC and INDEX_L3 are:

#define INDEX_AC index_of(sizeof(struct arraycache_init))
#define INDEX_L3 index_of(sizeof(struct kmem_list3))

static __always_inline int index_of(const size_t size)
{
        extern void __bad_size(void);

        if (__builtin_constant_p(size)) {
                int i = 0;

#define CACHE(x) \
        if (size <= x) \
                return i; \
        else \
                i++;
#include <linux/kmalloc_sizes.h>
#undef CACHE
                __bad_size();
        } else
                __bad_size();
        return 0;
}

This reuses the CACHE(x) declarations from include/linux/kmalloc_sizes.h with a fresh macro definition; expanded, it becomes 21 if/else tests. What it actually determines is which of the roughly 20 size classes sizeof(struct arraycache_init) and sizeof(struct kmem_list3) fall into. In fact, during initialization, kmem_cache_init deliberately creates caches ("rules") for exactly these two lengths. Back in setup_cpu_cache, the comparison asks whether the two structures land in the same size class; they certainly do not, so g_cpucache_up is set to PARTIAL_AC at this point.
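To see what the expansion amounts to, here is a user-space sketch; the size-class list here is an assumption standing in for kmalloc_sizes.h:

/* Sketch of index_of(): walk the size classes in ascending order and return
 * the index of the first class that fits. In the kernel this unrolls at
 * compile time into a chain of constant if/else tests. */
static const size_t classes[] = { 32, 64, 96, 128, 192, 256, 512, 1024 };

static int index_of_sketch(size_t size)
{
        int i;

        for (i = 0; i < (int)(sizeof(classes) / sizeof(classes[0])); i++)
                if (size <= classes[i])
                        return i;
        return -1;      /* the kernel calls __bad_size() instead */
}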

So by this point the kernel has a cache whose "rule" is the length of struct arraycache_init. When the next cache of a new length is created, its array member no longer needs a global variable; it can simply be kmalloc'ed. Indeed, during initialization the cache for the length of struct kmem_list3 is created right afterwards, taking this function's else branch, where the array member is obtained directly with kmalloc.

Moreover, when g_cpucache_up is PARTIAL_AC at this point, the cache whose "rule" is the length of struct kmem_list3 is in the middle of being created: the kernel does not yet have a cache of size sizeof(struct kmem_list3), so the nodelists member must still be obtained through the global variable, i.e. by calling set_up_list3s, after which g_cpucache_up advances to PARTIAL_L3. And of course, if g_cpucache_up is EARLY, kmem_cache_init has already run to completion, i.e. the cache for the length of struct kmem_list3 has been created, so nodelists can be requested directly with kmalloc.

Finally, note the initialization at the bottom, which only runs while slab is not fully initialized (g_cpucache_up not yet FULL). The next_reap member of nodelists can be ignored for now. The avail member of array, the count of currently available objects, is initialized to 0: this says that although a cache for this "rule" length now exists, no physical pages have actually been requested from the buddy system yet. limit, the cap on free objects in the local cache, is 1 (BOOT_CPUCACHE_ENTRIES). batchcount is the number of objects moved in or out at a time; slab allocation and release work in units of batchcount, as will become evident later. touched records whether the local cache was recently used and need not concern us for now. Last, the cache's own limit and batchcount are set.

At this point we should have an initial picture of how the kernel slab allocator works. A cache ("rule") must be created for each required length; thereafter, to request memory of that length, kmalloc/kmem_cache_alloc can be used directly and the slab service takes over. Merely creating the cache for a length does not actually allocate memory or create any slab; the real allocation and slab creation happen when kmalloc/kmem_cache_alloc is called. Caches of different lengths differ in their number of physical pages, number of objects, and fragment size. Slabs come in on-slab and off-slab flavours, the difference being where the slab management structures are stored (on-slab: together with the slab's own memory; off-slab: in a separately allocated area); the management structures comprise the struct slab plus one object descriptor per object.
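To close the loop, here is a minimal usage sketch of the API this section dissected; struct foo and the cache name are made up for illustration:

/* Create a cache ("rule") for a hypothetical object type, allocate one
 * object from it, and free it again. The first kmem_cache_alloc() is what
 * actually pulls pages from the buddy system and builds the first slab. */
struct foo {
        int id;
        char payload[60];
};

static struct kmem_cache *foo_cache;

static int __init foo_init(void)
{
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, SLAB_HWCACHE_ALIGN, NULL);
        if (!foo_cache)
                return -ENOMEM;
        return 0;
}

static void foo_demo(void)
{
        struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

        if (f)
                kmem_cache_free(foo_cache, f);
}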