The core function for freeing pages is free_pages(), which ultimately calls __free_pages().
__free_pages() first checks whether a single page or a larger block is being freed. A single page is not returned to the buddy system immediately; instead it is placed in the per-CPU cache, and a page that is likely still resident in the CPU cache goes onto the hot-page list. For this purpose the kernel provides free_hot_cold_page().
If free_hot_cold_page() finds that the number of pages in the per-CPU cache exceeds pcp->high, a batch of pcp->batch pages is returned to the buddy system. This strategy is called lazy coalescing: if single pages were handed straight back to the buddy system, merges would occur that would then have to be undone to satisfy later allocation requests. Lazy coalescing therefore avoids a large number of merge operations that would likely be wasted effort. free_pcppages_bulk() is the function that actually returns the pages to the buddy system.
If the lazy-coalescing limit is not exceeded, the page simply stays in the per-CPU cache. It is important, however, that the page's private member is set to the page's migrate type; as described earlier, this allows single pages to be allocated from the per-CPU cache with the correct migrate type.
void __free_pages(struct page *page, unsigned int order)
{
	if (put_page_testzero(page)) {
		if (order == 0)
			free_hot_cold_page(page, false);
		else
			__free_pages_ok(page, order);
	}
}
If multiple pages are freed, __free_pages() delegates the work to __free_pages_ok(), which eventually reaches __free_one_page(). That function handles the freeing of single pages as well as compound pages; besides returning pages to the buddy system, it also takes care of merging adjacent free blocks.
The core job when freeing pages is to add them to the appropriate free_area list of the buddy system. When a block is freed, the kernel checks whether its neighboring block is also free; if so, the two are merged into one larger block and moved to the free_area list one order higher. If the neighbor at that level can be merged as well, the process continues to ever higher orders, repeating until no further merging is possible.
Let's first look at two helper functions:
/*
* Locate the struct page for both the matching buddy in our
* pair (buddy1) and the combined O(n+1) page they form (page).
*
* 1) Any buddy B1 will have an order O twin B2 which satisfies
* the following equation:
* B2 = B1 ^ (1 << O)
* For example, if the starting buddy (buddy2) is #8 its order
* 1 buddy is #10:
* B2 = 8 ^ (1 << 1) = 8 ^ 2 = 10
*
* 2) Any buddy B will have an order O+1 parent P which
* satisfies the following equation:
* P = B & ~(1 << O)
*
* Assumption: *_mem_map is contiguous at least up to MAX_ORDER
In short: XOR-ing page_idx with (1 << order) yields the index of the buddy block.
*/
static inline unsigned long
__find_buddy_index(unsigned long page_idx, unsigned int order)
{
return page_idx ^ (1 << order);
}
/*
* This function checks whether a page is free && is the buddy
* we can do coalesce a page and its buddy if
* (a) the buddy is not in a hole &&
* (b) the buddy is in the buddy system &&
* (c) a page and its buddy have the same order &&
* (d) a page and its buddy are in the same zone.
*
* For recording whether a page is in the buddy system, we set ->_mapcount
* PAGE_BUDDY_MAPCOUNT_VALUE.
* Setting, clearing, and testing _mapcount PAGE_BUDDY_MAPCOUNT_VALUE is
* serialized by zone->lock.
*
* For recording page's order, we use page_private(page).
*/
static inline int page_is_buddy(struct page *page, struct page *buddy,
							unsigned int order)
{
	if (!pfn_valid_within(page_to_pfn(buddy)))
		return 0;

	if (page_is_guard(buddy) && page_order(buddy) == order) {
		if (page_zone_id(page) != page_zone_id(buddy))
			return 0;

		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);

		return 1;
	}

	if (PageBuddy(buddy) && page_order(buddy) == order) {
		/*
		 * zone check is done late to avoid uselessly
		 * calculating zone/node ids for pages that could
		 * never merge.
		 */
		if (page_zone_id(page) != page_zone_id(buddy))
			return 0;

		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);

		return 1;
	}
	return 0;
}
/*
* Freeing function for a buddy system allocator.
*
* The concept of a buddy system is to maintain direct-mapped table
* (containing bit values) for memory blocks of various "orders".
* The bottom level table contains the map for the smallest allocatable
* units of memory (here, pages), and each level above it describes
* pairs of units from the levels below, hence, "buddies".
* At a high level, all that happens here is marking the table entry
* at the bottom level available, and propagating the changes upward
* as necessary, plus some accounting needed to play nicely with other
* parts of the VM system.
* At each level, we keep a list of pages, which are heads of continuous
* free pages of length of (1 << order) and marked with _mapcount
* PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
* field.
* So when we are allocating or freeing one, we can derive the state of the
* other. That is, if we allocate a small block, and both were
* free, the remainder of the region must be split into blocks.
* If a block is freed, and its buddy is also free, then this
* triggers coalescing into a block of larger size.
*
* -- nyc
*/
static inline void __free_one_page(struct page *page,
		unsigned long pfn,
		struct zone *zone, unsigned int order,
		int migratetype)
{
	unsigned long page_idx;
	unsigned long combined_idx;
	unsigned long uninitialized_var(buddy_idx);
	struct page *buddy;
	int max_order = MAX_ORDER;

	VM_BUG_ON(!zone_is_initialized(zone));
	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

	VM_BUG_ON(migratetype == -1);
	if (is_migrate_isolate(migratetype)) {
		/*
		 * We restrict max order of merging to prevent merge
		 * between freepages on isolate pageblock and normal
		 * pageblock. Without this, pageblock isolation
		 * could cause incorrect freepage accounting.
		 */
		max_order = min(MAX_ORDER, pageblock_order + 1);
	} else {
		__mod_zone_freepage_state(zone, 1 << order, migratetype);
	}

	page_idx = pfn & ((1 << max_order) - 1);

	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
	VM_BUG_ON_PAGE(bad_range(zone, page), page);

	while (order < max_order - 1) {
		buddy_idx = __find_buddy_index(page_idx, order);
		buddy = page + (buddy_idx - page_idx);
		if (!page_is_buddy(page, buddy, order))
			break;
		/*
		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
		 * merge with it and move up one order.
		 */
		if (page_is_guard(buddy)) {
			clear_page_guard(zone, buddy, order, migratetype);
		} else {
			list_del(&buddy->lru);
			zone->free_area[order].nr_free--;
			rmv_page_order(buddy);
		}
		combined_idx = buddy_idx & page_idx;
		page = page + (combined_idx - page_idx);
		page_idx = combined_idx;
		order++;
	}
	set_page_order(page, order);

	/*
	 * If this is not the largest possible page, check if the buddy
	 * of the next-highest order is free. If it is, it's possible
	 * that pages are being freed that will coalesce soon. In case,
	 * that is happening, add the free page to the tail of the list
	 * so it's less likely to be used soon and more likely to be merged
	 * as a higher order page
	 */
	if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
		struct page *higher_page, *higher_buddy;
		combined_idx = buddy_idx & page_idx;
		higher_page = page + (combined_idx - page_idx);
		buddy_idx = __find_buddy_index(combined_idx, order + 1);
		higher_buddy = higher_page + (buddy_idx - combined_idx);
		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
			list_add_tail(&page->lru,
				&zone->free_area[order].free_list[migratetype]);
			goto out;
		}
	}

	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
	zone->free_area[order].nr_free++;
}
The code above is the core of merging adjacent buddy blocks. Let's walk through it with a concrete example. Suppose we are freeing a block A of 2 pages whose starting page frame number is 0x8e010, i.e. its order is 1, as shown in the figure:

- First, page_idx is computed as 0x10, meaning the block sits at offset 0x10 within its pageblock.
- In the first iteration of the while loop, __find_buddy_index() computes buddy_idx as 0x12.
- So buddy is block A's neighboring block B, which starts at offset 0x12 within the pageblock.
- Next, page_is_buddy() checks whether block B is a free block. If B is in the buddy system and has the same order, the function returns 1.
- Since block B is indeed free and its order is also 1, we have found a matching free buddy. It is removed from its free list so that it can be merged with block A into the free list one order higher.
- At this point combined_idx points to the start of block A. order++ means we keep looking nearby for further blocks to merge, this time at order 2, i.e. blocks of 4 pages.
- Step 2 is then repeated to look for a matching free block of order 2.
- If block C at offset 0x14 does not satisfy the merge conditions, for example because C is not a free page or C's order is not 2 (in the figure above, C's order is 3, which clearly does not match), then no order-2 merge is possible. Blocks A and B alone are merged, and the resulting block is added to the free list:
list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
__free_pages() treats order == 0 as a special case. Each zone has a zone->pageset member that provides one per-CPU struct per_cpu_pageset for every CPU. When an order-0 page is freed, it is first placed on the matching list in pcp->lists rather than going straight back to the buddy system.
__free_pages()->free_hot_cold_page()
/*
* Free a 0-order page
* cold == true ? free a cold page : free a hot page
*/
void free_hot_cold_page(struct page *page, bool cold)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp;
	unsigned long flags;
	unsigned long pfn = page_to_pfn(page);
	int migratetype;

	if (!free_pages_prepare(page, 0))
		return;

	migratetype = get_pfnblock_migratetype(page, pfn);
	set_freepage_migratetype(page, migratetype);
	local_irq_save(flags);
	__count_vm_event(PGFREE);

	/*
	 * We only track unmovable, reclaimable and movable on pcp lists.
	 * Free ISOLATE pages back to the allocator because they are being
	 * offlined but treat RESERVE as movable pages so we can get those
	 * areas back if necessary. Otherwise, we may have to free
	 * excessively into the page allocator
	 */
	if (migratetype >= MIGRATE_PCPTYPES) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			free_one_page(zone, page, pfn, 0, migratetype);
			goto out;
		}
		migratetype = MIGRATE_MOVABLE;
	}

	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	if (!cold)
		list_add(&page->lru, &pcp->lists[migratetype]);
	else
		list_add_tail(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		unsigned long batch = ACCESS_ONCE(pcp->batch);
		free_pcppages_bulk(zone, batch, pcp);
		pcp->count -= batch;
	}

out:
	local_irq_restore(flags);
}
The per_cpu_pageset and per_cpu_pages data structures are defined as follows:
struct per_cpu_pageset {
	struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
	s8 expire;
#endif
#ifdef CONFIG_SMP
	s8 stat_threshold;
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

struct per_cpu_pages {
	int count;	/* number of pages currently on the per-CPU lists */
	int high;	/* high watermark: when count exceeds this, pages are returned to the buddy system */
	int batch;	/* chunk size: number of pages returned to the buddy system at a time */

	/* Lists of pages, one per migrate type stored on the pcp-lists */
	struct list_head lists[MIGRATE_PCPTYPES];
};
The batch value is computed by zone_batchsize(). On the ARM Vexpress platform, batch is 31 and high is 186 (pageset_set_high_and_batch() sets high to 6 * batch).
setup_zone_pageset()->zone_pageset_init()->pageset_set_high_and_batch()
static int zone_batchsize(struct zone *zone)
{
#ifdef CONFIG_MMU
	int batch;

	/*
	 * The per-cpu-pages pools are set to around 1000th of the
	 * size of the zone. But no more than 1/2 of a meg.
	 *
	 * OK, so we don't know how big the cache is. So guess.
	 */
	batch = zone->managed_pages / 1024;
	if (batch * PAGE_SIZE > 512 * 1024)
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;		/* We effectively *= 4 below */
	if (batch < 1)
		batch = 1;

	/*
	 * Clamp the batch to a 2^n - 1 value. Having a power
	 * of 2 value was found to be more likely to have
	 * suboptimal cache aliasing properties in some cases.
	 *
	 * For example if 2 tasks are alternately allocating
	 * batches of pages, one task can end up with a lot
	 * of pages of one half of the possible page colors
	 * and the other with pages of the other colors.
	 */
	batch = rounddown_pow_of_two(batch + batch/2) - 1;

	return batch;
#else
	/* The deferral and batching of frees should be suppressed under NOMMU
	 * conditions.
	 *
	 * The problem is that NOMMU needs to be able to allocate large chunks
	 * of contiguous memory as there's no hardware page translation to
	 * assemble apparent contiguous memory from discontiguous pages.
	 *
	 * Queueing large contiguous runs of pages for batching, however,
	 * causes the pages to actually be freed in smaller chunks. As there
	 * can be a significant delay between the individual batches being
	 * recycled, this leads to the once large chunks of space being
	 * fragmented and becoming unavailable for high-order allocations.
	 */
	return 0;
#endif
}
Back in free_hot_cold_page(): once count reaches high, free_pcppages_bulk() is called to return a batch of pages from the per-CPU lists to the buddy system.
__free_pages()->free_hot_cold_page()->free_pcppages_bulk()->__free_one_page()
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
*
* If the zone was previously in an "all pages pinned" state then look to
* see if this freeing clears that state.
*
* And clear the zone's pages_scanned counter, to hold off the "all pages are
* pinned" detection logic.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
					struct per_cpu_pages *pcp)
{
	int migratetype = 0;
	int batch_free = 0;
	int to_free = count;
	unsigned long nr_scanned;

	spin_lock(&zone->lock);
	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
	if (nr_scanned)
		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);

	while (to_free) {
		struct page *page;
		struct list_head *list;

		/*
		 * Remove pages from lists in a round-robin fashion. A
		 * batch_free count is maintained that is incremented when an
		 * empty list is encountered. This is so more pages are freed
		 * off fuller lists instead of spinning excessively around empty
		 * lists
		 */
		do {
			batch_free++;
			if (++migratetype == MIGRATE_PCPTYPES)
				migratetype = 0;
			list = &pcp->lists[migratetype];
		} while (list_empty(list));

		/* This is the only non-empty list. Free them all. */
		if (batch_free == MIGRATE_PCPTYPES)
			batch_free = to_free;

		do {
			int mt;	/* migratetype of the to-be-freed page */

			page = list_entry(list->prev, struct page, lru);
			/* must delete as __free_one_page list manipulates */
			list_del(&page->lru);
			mt = get_freepage_migratetype(page);
			if (unlikely(has_isolate_pageblock(zone)))
				mt = get_pageblock_migratetype(page);

			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
			trace_mm_page_pcpu_drain(page, 0, mt);
		} while (--to_free && --batch_free && !list_empty(list));
	}
	spin_unlock(&zone->lock);
}
Ultimately, __free_one_page() is called to add the freed pages back into the buddy system.
Summary:
The page allocator is the most fundamental allocator in Linux kernel memory management, built on the buddy system algorithm and a zone-based design. To understand it, focus on the following aspects:

- The basic principles of the buddy system.
- How the GFP mask determines which zones memory may be allocated from and which MIGRATE_TYPES the allocation belongs to.
- The direction in which zones are scanned during allocation.
- How zone watermarks are checked.