The core function for freeing pages is free_pages(), which ultimately calls __free_pages().
__free_pages() first checks whether a single page or a larger block is being freed. A single page is not returned to the buddy system immediately; instead it is placed in the per-CPU cache, and a page that is likely still resident in the CPU cache goes onto the hot-page list. For this purpose the kernel provides free_hot_cold_page().
If free_hot_cold_page() finds that the number of pages in the per-CPU cache exceeds pcp->high, a batch of pcp->batch pages is returned to the buddy system. This strategy is called lazy coalescing: if single pages were handed straight back to the buddy system, merges would occur that would then have to be undone to satisfy later allocation requests. Lazy coalescing therefore avoids a large number of merge operations that would likely be wasted effort. free_pcppages_bulk() is the function that actually returns the pages to the buddy system.
If the lazy-coalescing limit is not exceeded, the page simply stays in the per-CPU cache. It is important, however, that the page's private member is set to the page's migrate type; as described earlier, this allows single pages to be allocated from the per-CPU cache with the correct migrate type.
void __free_pages(struct page *page, unsigned int order)
{
	if (put_page_testzero(page)) {
		if (order == 0)
			free_hot_cold_page(page, false);
		else
			__free_pages_ok(page, order);
	}
}
If multiple pages are freed, __free_pages() delegates the work to __free_pages_ok(), which eventually reaches __free_one_page(). That function handles the freeing of single pages as well as compound pages; besides returning pages to the buddy system, it also takes care of merging adjacent free blocks.
The core job when freeing pages is to add them to the appropriate free_area list of the buddy system. When a block is freed, the kernel checks whether its neighboring block is also free; if so, the two are merged into one larger block and moved to the free_area list one order higher. If the neighbor at that level can be merged as well, the process continues to ever higher orders, repeating until no further merging is possible.
Let's first look at two helper functions:
/*
* Locate the struct page for both the matching buddy in our
* pair (buddy1) and the combined O(n+1) page they form (page).
*
* 1) Any buddy B1 will have an order O twin B2 which satisfies
* the following equation:
* B2 = B1 ^ (1 << O)
* For example, if the starting buddy (buddy2) is #8 its order
* 1 buddy is #10:
* B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
*
* 2) Any buddy B will have an order O+1 parent P which
* satisfies the following equation:
* P = B & ~(1 << O)
*
* Assumption: *_mem_map is contiguous at least up to MAX_ORDER
In short: XOR-ing page_idx with (1 << order) yields the index of the buddy block.
*/
static inline unsigned long
__find_buddy_index(unsigned long page_idx, unsigned int order)
{
return page_idx ^ (1 << order);
}
/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is not in a hole &&
* (b) the buddy is in the buddy system &&
* (c) a page and its buddy have the same order &&
* (d) a page and its buddy are in the same zone.
*
* For recording whether a page is in the buddy system, we set ->_mapcount
* PAGE_BUDDY_MAPCOUNT_VALUE.
* Setting, clearing, and testing _mapcount PAGE_BUDDY_MAPCOUNT_VALUE is
* serialized by zone->lock.
*
* For recording page's order, we use page_private(page).
*/
static inline int page_is_buddy(struct page *page, struct page *buddy,
							unsigned int order)
{
	if (!pfn_valid_within(page_to_pfn(buddy)))
		return 0;

	if (page_is_guard(buddy) && page_order(buddy) == order) {
		if (page_zone_id(page) != page_zone_id(buddy))
			return 0;

		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);

		return 1;
	}

	if (PageBuddy(buddy) && page_order(buddy) == order) {
		/*
		 * zone check is done late to avoid uselessly
		 * calculating zone/node ids for pages that could
		 * never merge.
		 */
		if (page_zone_id(page) != page_zone_id(buddy))
			return 0;

		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);

		return 1;
	}
	return 0;
}
/*
* Freeing function for a buddy system allocator.
*
* The concept of a buddy system is to maintain direct-mapped table
* (containing bit values) for memory blocks of various "orders".
* The bottom level table contains the map for the smallest allocatable
* units of memory (here, pages), and each level above it describes
* pairs of units from the levels below, hence, "buddies".
* At a high level, all that happens here is marking the table entry
* at the bottom level available, and propagating the changes upward
* as necessary, plus some accounting needed to play nicely with other
* parts of the VM system.
* At each level, we keep a list of pages, which are heads of continuous
* free pages of length of (1 << order) and marked with _mapcount
* PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
* field.
* So when we are allocating or freeing one, we can derive the state of the
* other. That is, if we allocate a small block, and both were
* free, the remainder of the region must be split into blocks.
* If a block is freed, and its buddy is also free, then this
* triggers coalescing into a block of larger size.
*
* -- nyc
*/
static inline void __free_one_page(struct page *page,
		unsigned long pfn,
		struct zone *zone, unsigned int order,
		int migratetype)
{
	unsigned long page_idx;
	unsigned long combined_idx;
	unsigned long uninitialized_var(buddy_idx);
	struct page *buddy;
	int max_order = MAX_ORDER;

	VM_BUG_ON(!zone_is_initialized(zone));
	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

	VM_BUG_ON(migratetype == -1);
	if (is_migrate_isolate(migratetype)) {
		/*
		 * We restrict max order of merging to prevent merge
		 * between freepages on isolate pageblock and normal
		 * pageblock. Without this, pageblock isolation
		 * could cause incorrect freepage accounting.
		 */
		max_order = min(MAX_ORDER, pageblock_order + 1);
	} else {
		__mod_zone_freepage_state(zone, 1 << order, migratetype);
	}

	page_idx = pfn & ((1 << max_order) - 1);

	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
	VM_BUG_ON_PAGE(bad_range(zone, page), page);

	while (order < max_order - 1) {
		buddy_idx = __find_buddy_index(page_idx, order);
		buddy = page + (buddy_idx - page_idx);
		if (!page_is_buddy(page, buddy, order))
			break;
		/*
		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
		 * merge with it and move up one order.
		 */
		if (page_is_guard(buddy)) {
			clear_page_guard(zone, buddy, order, migratetype);
		} else {
			list_del(&buddy->lru);
			zone->free_area[order].nr_free--;
			rmv_page_order(buddy);
		}
		combined_idx = buddy_idx & page_idx;
		page = page + (combined_idx - page_idx);
		page_idx = combined_idx;
		order++;
	}
	set_page_order(page, order);

	/*
	 * If this is not the largest possible page, check if the buddy
	 * of the next-highest order is free. If it is, it's possible
	 * that pages are being freed that will coalesce soon. In case,
	 * that is happening, add the free page to the tail of the list
	 * so it's less likely to be used soon and more likely to be merged
	 * as a higher order page
	 */
	if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
		struct page *higher_page, *higher_buddy;
		combined_idx = buddy_idx & page_idx;
		higher_page = page + (combined_idx - page_idx);
		buddy_idx = __find_buddy_index(combined_idx, order + 1);
		higher_buddy = higher_page + (buddy_idx - combined_idx);
		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
			list_add_tail(&page->lru,
				&zone->free_area[order].free_list[migratetype]);
			goto out;
		}
	}

	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
	zone->free_area[order].nr_free++;
}
The code above is the core of merging adjacent buddy blocks. Let's walk through it with a concrete example. Suppose we are freeing a block A of 2 pages whose starting page frame number is 0x8e010, i.e. its order is 1, as shown in the figure:

- First, page_idx is computed as 0x10, meaning the block sits at offset 0x10 within its pageblock.
- In the first iteration of the while loop, __find_buddy_index() computes buddy_idx as 0x12.
- So buddy is block A's neighboring block B, which starts at offset 0x12 within the pageblock.
- Next, page_is_buddy() checks whether block B is a free block. If B is in the buddy system and has the same order, the function returns 1.
- Since block B is indeed free and its order is also 1, we have found a matching free buddy. It is removed from its free list so that it can be merged with block A into the free list one order higher.
- At this point combined_idx points to the start of block A. order++ means we keep looking nearby for further blocks to merge, this time at order 2, i.e. blocks of 4 pages.
- Step 2 is then repeated to look for a matching free block of order 2.
- If block C at offset 0x14 does not satisfy the merge conditions, for example because C is not a free page or C's order is not 2 (in the figure above, C's order is 3, which clearly does not match), then no order-2 merge is possible. Blocks A and B alone are merged, and the resulting block is added to the free list:
list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
__free_pages() treats order == 0 as a special case. Each zone has a zone->pageset member that provides one per-CPU struct per_cpu_pageset for every CPU. When an order-0 page is freed, it is first placed on the matching list in pcp->lists rather than going straight back to the buddy system.
__free_pages()->free_hot_cold_page()
/*
* Free a 0-order page
* cold == true ? free a cold page : free a hot page
*/
void free_hot_cold_page(struct page *page, bool cold)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp;
	unsigned long flags;
	unsigned long pfn = page_to_pfn(page);
	int migratetype;

	if (!free_pages_prepare(page, 0))
		return;

	migratetype = get_pfnblock_migratetype(page, pfn);
	set_freepage_migratetype(page, migratetype);
	local_irq_save(flags);
	__count_vm_event(PGFREE);

	/*
	 * We only track unmovable, reclaimable and movable on pcp lists.
	 * Free ISOLATE pages back to the allocator because they are being
	 * offlined but treat RESERVE as movable pages so we can get those
	 * areas back if necessary. Otherwise, we may have to free
	 * excessively into the page allocator
	 */
	if (migratetype >= MIGRATE_PCPTYPES) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			free_one_page(zone, page, pfn, 0, migratetype);
			goto out;
		}
		migratetype = MIGRATE_MOVABLE;
	}

	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	if (!cold)
		list_add(&page->lru, &pcp->lists[migratetype]);
	else
		list_add_tail(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		unsigned long batch = ACCESS_ONCE(pcp->batch);
		free_pcppages_bulk(zone, batch, pcp);
		pcp->count -= batch;
	}

out:
	local_irq_restore(flags);
}
The per_cpu_pageset and per_cpu_pages data structures are defined as follows:
struct per_cpu_pageset {
	struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
	s8 expire;
#endif
#ifdef CONFIG_SMP
	s8 stat_threshold;
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

struct per_cpu_pages {
	int count;	/* number of pages currently on the per-CPU lists */
	int high;	/* high watermark: when count exceeds this, pages are returned to the buddy system */
	int batch;	/* chunk size: number of pages returned to the buddy system at a time */

	/* Lists of pages, one per migrate type stored on the pcp-lists */
	struct list_head lists[MIGRATE_PCPTYPES];
};
The batch value is computed by zone_batchsize(). On the ARM Vexpress platform, batch is 31 and high is 186 (pageset_set_high_and_batch() sets high to 6 * batch).
setup_zone_pageset()->zone_pageset_init()->pageset_set_high_and_batch()
static int zone_batchsize(struct zone *zone)
{
#ifdef CONFIG_MMU
	int batch;

	/*
	 * The per-cpu-pages pools are set to around 1000th of the
	 * size of the zone. But no more than 1/2 of a meg.
	 *
	 * OK, so we don't know how big the cache is. So guess.
	 */
	batch = zone->managed_pages / 1024;
	if (batch * PAGE_SIZE > 512 * 1024)
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;		/* We effectively *= 4 below */
	if (batch < 1)
		batch = 1;

	/*
	 * Clamp the batch to a 2^n - 1 value. Having a power
	 * of 2 value was found to be more likely to have
	 * suboptimal cache aliasing properties in some cases.
	 *
	 * For example if 2 tasks are alternately allocating
	 * batches of pages, one task can end up with a lot
	 * of pages of one half of the possible page colors
	 * and the other with pages of the other colors.
	 */
	batch = rounddown_pow_of_two(batch + batch/2) - 1;

	return batch;
#else
	/* The deferral and batching of frees should be suppressed under NOMMU
	 * conditions.
	 *
	 * The problem is that NOMMU needs to be able to allocate large chunks
	 * of contiguous memory as there's no hardware page translation to
	 * assemble apparent contiguous memory from discontiguous pages.
	 *
	 * Queueing large contiguous runs of pages for batching, however,
	 * causes the pages to actually be freed in smaller chunks. As there
	 * can be a significant delay between the individual batches being
	 * recycled, this leads to the once large chunks of space being
	 * fragmented and becoming unavailable for high-order allocations.
	 */
	return 0;
#endif
}
Back in free_hot_cold_page(): once count reaches high, free_pcppages_bulk() is called to return a batch of pages from the per-CPU lists to the buddy system.
__free_pages()->free_hot_cold_page()->free_pcppages_bulk()->__free_one_page()
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
*
* If the zone was previously in an "all pages pinned" state then look to
* see if this freeing clears that state.
*
* And clear the zone's pages_scanned counter, to hold off the "all pages are
* pinned" detection logic.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
					struct per_cpu_pages *pcp)
{
	int migratetype = 0;
	int batch_free = 0;
	int to_free = count;
	unsigned long nr_scanned;

	spin_lock(&zone->lock);
	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
	if (nr_scanned)
		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);

	while (to_free) {
		struct page *page;
		struct list_head *list;

		/*
		 * Remove pages from lists in a round-robin fashion. A
		 * batch_free count is maintained that is incremented when an
		 * empty list is encountered. This is so more pages are freed
		 * off fuller lists instead of spinning excessively around empty
		 * lists
		 */
		do {
			batch_free++;
			if (++migratetype == MIGRATE_PCPTYPES)
				migratetype = 0;
			list = &pcp->lists[migratetype];
		} while (list_empty(list));

		/* This is the only non-empty list. Free them all. */
		if (batch_free == MIGRATE_PCPTYPES)
			batch_free = to_free;

		do {
			int mt;	/* migratetype of the to-be-freed page */

			page = list_entry(list->prev, struct page, lru);
			/* must delete as __free_one_page list manipulates */
			list_del(&page->lru);
			mt = get_freepage_migratetype(page);
			if (unlikely(has_isolate_pageblock(zone)))
				mt = get_pageblock_migratetype(page);

			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
			trace_mm_page_pcpu_drain(page, 0, mt);
		} while (--to_free && --batch_free && !list_empty(list));
	}
	spin_unlock(&zone->lock);
}
Ultimately, __free_one_page() is called to add the freed pages back into the buddy system.
Summary:
The page allocator is the most fundamental allocator in Linux kernel memory management, built on the buddy system algorithm and a zone-based design. To understand it, focus on the following aspects:

- The basic principles of the buddy system.
- How the GFP mask determines which zones memory may be allocated from and which MIGRATE_TYPES the allocation belongs to.
- The direction in which zones are scanned during allocation.
- How zone watermarks are checked.