在系统启动阶段,buddy系统和slab分配器建立之前,系统的每个节点都拥有自己的bootmem allocator来实现内存的分配,当启动阶段结束后,bootmem allocator将被销毁,而相应的空闲内存会提交给buddy系统来管理,因此bootmem allocator所存在的时间是短暂的,它的宗旨是简单,而非高效!bootmem allocator的基本思想是在一个节点中建立一片位图区域,每一位对应该节点的低端内存的一个页框,通过一个bit来标记一个页的状态,实现页面的分配与回收。
首先了解一下bootmem的核心数据结构
typedef struct bootmem_data {
unsigned long node_min_pfn;
unsigned long node_low_pfn;
void *node_bootmem_map;
unsigned long last_end_off;
unsigned long hint_idx;
struct list_head list;
} bootmem_data_t;
-
node_min_pfn:节点的最小页框编号
-
node_low_pfn:节点的低端内存最大页框编号
-
node_bootmem_map:节点的位图起始地址
-
last_end_off:上次分配内存的最后一个字节相对于其所属页面末端的偏移,这个变量内存分配的时候用到,用于防止产生碎片
-
hint_idx:用于内存分配时确定分配的起始地址
-
list:用于将该节点的bootmem链入所有节点的bootmem链表
下面结合具体的代码就以下几个主要的方面介绍bootmem allocator的工作过程
1.bootmem allocator的初始化
2.bootmem allocator保留内存和释放内存
3.bootmem allocator分配内存
4.bootmem allocator的销毁
1.bootmem allocator的初始化
在arch_setup(),通过initmem_init()-->setup_bootmem_allocator()-->setup_node_bootmem()-->init_bootmem_node()来建立节点中的bootmem allocator. 还有一个初始化的函数是init_bootmem(),其和init_bootmem_node()一样,都是对init_bootmem_core()的封装,区别是前者只针对单节点系统,而后者指定了一个节点,在后面其他操作中都用到了类似的封装方法。
unsigned long __init init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn,
unsigned long startpfn, unsigned long endpfn)
{
return init_bootmem_core(pgdat->bdata, freepfn, startpfn, endpfn);
}
unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
{
max_low_pfn = pages;
min_low_pfn = start;
return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
}
下面来看看bootmem初始化的核心函数init_bootmem_core()
static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
unsigned long mapstart, unsigned long start, unsigned long end)
{
unsigned long mapsize;
mminit_validate_memmodel_limits(&start, &end);
bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));/*存储位图起始地址的虚拟地址*/
bdata->node_min_pfn = start;/*节点中的起始页*/
bdata->node_low_pfn = end; /*节点中的终止页*/
link_bootmem(bdata);/*将该bdata按顺序链入bdata_list中*/
/*
* Initially all pages are reserved - setup_arch() has to
* register free RAM areas explicitly.
*/
mapsize = bootmap_bytes(end - start);
memset(bdata->node_bootmem_map, 0xff, mapsize);/*将位图全部置1,保留所有页*/
bdebug("nid=%td start=%lx map=%lx end=%lx mapsize=%lx\n",
bdata - bootmem_node_data, start, mapstart, end, mapsize);
return mapsize;/*返回位图大小*/
}
我们可以看到在init_bootmem_core()中,主要的工作就是初始化bdata中的变量,以及将位图全部置1,这些参数的确定是在前面列举的函数中完成的。
2.bootmem allocator保留内存和释放内存
保留内存和释放内存是两个相对的概念,bootmem allocator分配出去的内存的会被标记为保留状态,也就是对应的位图区域都为1,这些内存在bootmem allocator销毁后是不会被buddy系统接管的,而释放内存很好理解,就是将相应的页面置于空闲状态,这些页面可以被bootmem allocator分配,空闲的页面在bootmem allocator销毁后会被buddy系统接管。
先来看看保留内存的处理,调用reserve_bootmem_node()函数可以将指定节点中的指定范围页面置为保留状态
int __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
unsigned long size, int flags)
{
unsigned long start, end;
start = PFN_DOWN(physaddr); /*获得起始页框*/
end = PFN_UP(physaddr + size); /*获得终止页框*/
return mark_bootmem_node(pgdat->bdata, start, end, 1, flags);
}
下面来看核心函数mark_bootmem_node()
static int __init mark_bootmem_node(bootmem_data_t *bdata,
unsigned long start, unsigned long end,
int reserve, int flags)
{
unsigned long sidx, eidx;
bdebug("nid=%td start=%lx end=%lx reserve=%d flags=%x\n",
bdata - bootmem_node_data, start, end, reserve, flags);
/*条件判断*/
BUG_ON(start < bdata->node_min_pfn);
BUG_ON(end > bdata->node_low_pfn);
/*计算出start index,end index,即start和end相对于节点最小页框号的偏移量*/
sidx = start - bdata->node_min_pfn;
eidx = end - bdata->node_min_pfn;
if (reserve) /*如果选择保留页框*/
return __reserve(bdata, sidx, eidx, flags);
else /*选择释放页框*/
__free(bdata, sidx, eidx);
return 0;
}
再看__reserve()
static int __init __reserve(bootmem_data_t *bdata, unsigned long sidx,
unsigned long eidx, int flags)
{
unsigned long idx;
int exclusive = flags & BOOTMEM_EXCLUSIVE;
bdebug("nid=%td start=%lx end=%lx flags=%x\n",
bdata - bootmem_node_data,
sidx + bdata->node_min_pfn,
eidx + bdata->node_min_pfn,
flags);
for (idx = sidx; idx < eidx; idx++)/*遍历sidx-->eidx的页框对应的位图区域*/
if (test_and_set_bit(idx, bdata->node_bootmem_map)) {/*把位图的相关位置1*/
if (exclusive) {
__free(bdata, sidx, idx);
return -EBUSY;
}
bdebug("silent double reserve of PFN %lx\n",
idx + bdata->node_min_pfn);
}
return 0;
}
可以看到,保留页面的关键操作就是调用test_and_set_bit()将位图的相关区域置1.
释放内存和保留内存的过程基本相同,只不过传递给mark_bootmem_node()的reserve参数为0,表示释放相应页面,因此在mark_bootmem_node()中会调用__free()
static void __init __free(bootmem_data_t *bdata,
unsigned long sidx, unsigned long eidx)
{
unsigned long idx;
bdebug("nid=%td start=%lx end=%lx\n", bdata - bootmem_node_data,
sidx + bdata->node_min_pfn,
eidx + bdata->node_min_pfn);
if (bdata->hint_idx > sidx)
bdata->hint_idx = sidx;/*保证hint_idx指向最低的空闲页*/
for (idx = sidx; idx < eidx; idx++)/*遍历相关的位图区域*/
if (!test_and_clear_bit(idx, bdata->node_bootmem_map))/*清零*/
BUG();
}
__free()相较__reserve()多了一处对bdata->hint_idx的操作,这个地方是为了保证hint_idx指向最低的空闲页,因为在进行分配的时候,boot allocator是保证从最低的空闲页开始分配
3.bootmem allocator分配内存
bootmem allocator分配内存相对于前面的操作来说要复杂一些,这里面主要考虑的一个问题就是内存碎片。设我们的页面大小为4KB,假如我们上一次分配内存的范围是从第4个页面开始到第8个页面的2KB处,而这次要求分配的起始地址处于第九个页面,如果从第九个页面开始分配的话,那么至少会产生2KB的内存碎片,这样无疑会产生大量的浪费。这也是为什么我们之前介绍的bootmem关键数据结构中引入last_end_off这个变量,它记录了上次分配的末端地址离页尾的偏移,在我们这个例子中该值为2KB,那么如果这次我们从第9个页面开始分配,我们就要考虑将这2KB整合到这次分配中去。
分配内存的核心函数是alloc_bootmem_core(),具体代码如下:
static void * __init alloc_bootmem_core(struct bootmem_data *bdata,
unsigned long size, unsigned long align,
unsigned long goal, unsigned long limit)
{
unsigned long fallback = 0;
unsigned long min, max, start, sidx, midx, step;
bdebug("nid=%td size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n",
bdata - bootmem_node_data, size, PAGE_ALIGN(size) >> PAGE_SHIFT,
align, goal, limit);
BUG_ON(!size); /*检测size*/
BUG_ON(align & (align - 1)); /*检测对齐数是否为2的指数幂*/
BUG_ON(limit && goal + size > limit); /*如果limit不为0则检测goal+size是否超过limit*/
if (!bdata->node_bootmem_map)
return NULL;
/*得到该节点的最小最大低端内存页框号*/
min = bdata->node_min_pfn;
max = bdata->node_low_pfn;
/*将goal和limit从地址转化为页框号*/
goal >>= PAGE_SHIFT;
limit >>= PAGE_SHIFT;
if (limit && max > limit)
max = limit;
if (max <= min)
return NULL;
/*设定步进,以页面为单位*/
step = max(align >> PAGE_SHIFT, 1UL);
/*确定起始页框*/
if (goal && min < goal && goal < max)
start = ALIGN(goal, step);
else
start = ALIGN(min, step);
/*确定起始页框和最大页框的偏移量*/
sidx = start - bdata->node_min_pfn;
midx = max - bdata->node_min_pfn;
if (bdata->hint_idx > sidx) { /*sidx小于hint_idx的话则要下调至hint_idx对齐后的结果*/
/*
* Handle the valid case of sidx being zero and still
* catch the fallback below.
*/
fallback = sidx + 1;
sidx = align_idx(bdata, bdata->hint_idx, step);
}
while (1) {
int merge;
void *region;
unsigned long eidx, i, start_off, end_off;
find_block:
sidx = find_next_zero_bit(bdata->node_bootmem_map, midx, sidx); /*找到下一个0位作为起始地址*/
sidx = align_idx(bdata, sidx, step); /*按step进行对齐*/
eidx = sidx + PFN_UP(size);
if (sidx >= midx || eidx > midx)
break;
for (i = sidx; i < eidx; i++)
if (test_bit(i, bdata->node_bootmem_map)) { /*遇到了保留位,则表明无法找到一块连续的空闲区域*/
sidx = align_idx(bdata, i, step); /*调整sidx*/
if (sidx == i)
sidx += step;
goto find_block; /*重新开始检索bitmap*/
}
/*如果 1.上次分配的PAGE还有剩余的空间
2.PAGE_SIZE-1>0
3.上次分配的PAGE是在这次要求分配的PAGE的相邻并在前面*/
if (bdata->last_end_off & (PAGE_SIZE - 1) &&
PFN_DOWN(bdata->last_end_off) + 1 == sidx)
start_off = align_off(bdata, bdata->last_end_off, align);/*start_off从上次的PAGE剩余处开始,取对齐后的结果,将上次分配的页面剩余的部分整合到这次分配的内存中来*/
else
start_off = PFN_PHYS(sidx);/*不满足上述条件,则从要求的起始PAGE开始*/
merge = PFN_DOWN(start_off) < sidx; /*确定merge的值为0或1*/
end_off = start_off + size;
/*重新确定last_end_off和hint_idx*/
bdata->last_end_off = end_off;
bdata->hint_idx = PFN_UP(end_off);
/*
* Reserve the area now:
*/
if (__reserve(bdata, PFN_DOWN(start_off) + merge, /*保留相关的区域*/
PFN_UP(end_off), BOOTMEM_EXCLUSIVE))
BUG();
region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + /*得到起始地址的虚拟地址*/
start_off);
memset(region, 0, size);/*将申请到的区域清空*/
/*
* The min_count is set to 0 so that bootmem allocated blocks
* are never reported as leaks.
*/
kmemleak_alloc(region, size, 0, 0);
return region;
}
if (fallback) {
sidx = align_idx(bdata, fallback - 1, step);
fallback = 0;
goto find_block;
}
return NULL;
}
4.bootmem allocator的销毁
bootmem allocator销毁后,其空闲的内存将交由buddy system接管,核心函数为free_all_bootmem_core()
static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
{
int aligned;
struct page *page;
unsigned long start, end, pages, count = 0;
if (!bdata->node_bootmem_map)/*bitmap不存在,表示该节点已经释放*/
return 0;
/*获得低端内存的起始页框和终止页框*/
start = bdata->node_min_pfn;
end = bdata->node_low_pfn;
/*
* If the start is aligned to the machines wordsize, we might
* be able to free pages in bulks of that order.
*/
aligned = !(start & (BITS_PER_LONG - 1));/*得到start是否为2的指数幂*/
bdebug("nid=%td start=%lx end=%lx aligned=%d\n",
bdata - bootmem_node_data, start, end, aligned);
/*************************************
* 第一步:释放空闲页 *
*************************************/
while (start < end) {
unsigned long *map, idx, vec;
map = bdata->node_bootmem_map;
idx = start - bdata->node_min_pfn;
vec = ~map[idx / BITS_PER_LONG];/*将idx所处的long字段的位图部分进行取反*/
/*如果:1.起始地址是2的整数幂
2.该long字段的位图全为0,即空闲状态
3.start+BITS_PER_LONG未超过范围*/
if (aligned && vec == ~0UL && start + BITS_PER_LONG < end) {
int order = ilog2(BITS_PER_LONG);/*得到Long的长度为2的多少次幂*/
__free_pages_bootmem(pfn_to_page(start), order);/*直接将整块内存释放*/
count += BITS_PER_LONG;
} else {/*否则只能逐页释放*/
unsigned long off = 0;
while (vec && off < BITS_PER_LONG) {/*判断该字段内的空闲页是否已经释放完*/
if (vec & 1) { /*vec的最低位为1,也就是说start+off对应的page为空闲*/
page = pfn_to_page(start + off);
__free_pages_bootmem(page, 0);
count++;
}
vec >>= 1;
off++;
}
}
start += BITS_PER_LONG;
}
/*****************************
* 第二步:释放保存bitmap的页 *
******************************/
page = virt_to_page(bdata->node_bootmem_map);/*得到bitmap起始地址的所属页*/
pages = bdata->node_low_pfn - bdata->node_min_pfn;
pages = bootmem_bootmap_pages(pages);/*得到bitmap的大小,以页为单位*/
count += pages;
while (pages--)/*逐页释放*/
__free_pages_bootmem(page++, 0);
bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
return count;/*返回释放的页框数*/
}