14.7 Tracking LRU Activity and the Refault Distance Algorithm

Tracking LRU activity:

    If a page on an LRU list is freed by another process, how does the LRU list learn that the page is gone?

    An LRU list is just a doubly linked list. Protecting its members from being freed concurrently by other kernel paths is a concurrency problem that the page-reclaim design must handle. The _count reference counter in struct page plays the key role here.

    Take shrink_active_list(), which isolates pages onto the temporary list l_hold, as an example:

shrink_active_list()
    ->isolate_lru_pages()
        ->page = lru_to_page() //take a page off the LRU list
        ->get_page_unless_zero(page) //increment the page->_count refcount
        ->ClearPageLRU(page) //clear the PG_LRU flag

    So when a page is isolated from the LRU list, its page->_count reference count is incremented.

    Putting the isolated pages back onto the LRU list looks like this:

shrink_active_list()
    ->move_active_pages_to_lru()
        ->list_move(&page->lru, &lruvec->lists[lru]); //add the page back to the LRU list
        ->put_page_testzero(page)
Here page->_count is decremented; if it drops to zero, the page has already been released by another process, so PG_LRU is cleared and the page is removed from the LRU list.
The Refault Distance algorithm:
    Work on optimizing page-reclaim algorithms has never stopped, in academia or in the Linux kernel community. The Refault Distance algorithm was merged in Linux 3.15, contributed by community expert Johannes Weiner, and currently applies only to page-cache pages. For the page-cache LRU, two lists deserve attention: the active list and the inactive list. Newly created pages are always added to the head of the inactive list, and page reclaim always starts from the tail of the inactive list. A page on the inactive list is promoted to the active list on its second access, protecting it from reclaim; conversely, if the active list grows too fast, active pages are demoted to the inactive list.
In practice, however, there are workloads in which certain pages are accessed frequently yet are reclaimed and freed from the inactive list before their next access, forcing those page-cache pages to be read back from storage. This is thrashing.
    Observing the behaviour of the file-cache inactive list reveals some interesting characteristics.
  • When a page-cache page is accessed for the first time, it is added to the head of the inactive list and then slowly slides from head toward tail; pages at the tail are kicked off the LRU list and freed. This process is called eviction.
  • On its second access, a page-cache page is promoted to the active LRU list, which also frees one slot on the inactive list. This process is called activation.
  • On the macroscopic timeline, the number of pages handled by eviction plus the number handled by activation equals the length of the inactive list, NR_inactive.
  • Releasing one page from the inactive list requires moving N pages, where N is the length of the inactive list.
    Combining these observations yields the concept of Refault Distance. The first access to a page-cache page is called a fault, and the second access a refault. The moment a page-cache page is first evicted from the LRU list and reclaimed is E; the moment it is accessed again is R. The number of pages that must be moved between R and E is the Refault Distance.
    Adding the time of the first read, the distance between the first and second reads (read_distance) can be expressed by a formula:
                            read_distance = NR_inactive + (R - E)
    If a page is to stay on the LRU list, its read_distance must not be larger than the size of memory; otherwise the page will always be evicted from the LRU list. The formula can therefore be rewritten as:
                            NR_inactive + (R - E) <= NR_inactive + NR_active
                                (R - E) <= NR_active
    In other words, the Refault Distance can be understood as the inactive list's "fiscal deficit": if the inactive list were longer by at least the Refault Distance, the page-cache page would be guaranteed not to be evicted from the LRU list and freed before its second read; otherwise the page cache must be re-added to the active list for protection against thrashing. In the ideal case, a page cache's average access distance is larger than the inactive list but smaller than total memory.
    The discussion above covers the case where the distance between the two reads is at most the size of memory, i.e. NR_inactive + (R - E) <= NR_inactive + NR_active. What if the distance between the two reads is larger than memory? That special case is beyond what the Refault Distance algorithm can solve: by the second read the page will always have been evicted from the LRU list, since the second read may lie arbitrarily far in the future and nothing can guarantee the page stays on the LRU list that long. The Refault Distance algorithm targets the former case: on the second read, it deliberately adds the page cache to the active list, preventing the thrashing that eviction from the LRU list would cause.
As the figure above shows, T0 marks the first access to a page-cache page; add_to_page_cache_lru() is called, and a shadow slot is allocated to store the zone->inactive_age value. inactive_age is incremented every time a page is promoted to the active list, and again every time a page is evicted from the inactive list. T1 is when the page is evicted from the LRU list and reclaimed; the zone->inactive_age value at T1 is encoded and stored into the shadow entry. T2 is the second read of the page, when the Refault Distance is computed as the inactive_age at T2 minus the value stored at T1. If Refault Distance <= NR_active, the page-cache page is very likely to be evicted from the LRU list again before its next read, so it is deliberately activated and added to the active list.
    That is the full description of the Refault Distance algorithm; now let us look at the implementation.
(1) A new atomic member, inactive_age, is added to struct zone to count eviction and activation operations on the file-cache inactive list.
struct zone {
    ......
    /* Evictions & activations on the inactive file list */
    atomic_long_t       inactive_age;
    ......
}

(2) The code run when a page-cache page joins the inactive list for the first time is as follows.

When a page cache is first inserted into the radix tree, a slot is allocated to hold inactive_age, with shadow pointing at that slot. On the first insertion the shadow value is still NULL and there is no Refault Distance yet, so the page joins the inactive LRU list.

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
                pgoff_t offset, gfp_t gfp_mask)
{
    void *shadow = NULL;
    int ret;

    __set_page_locked(page);
    ret = __add_to_page_cache_locked(page, mapping, offset,
                     gfp_mask, &shadow);
    if (unlikely(ret))
        __clear_page_locked(page);
    else {
        /*
         * The page might have been evicted from cache only
         * recently, in which case it should be activated like
         * any other repeatedly accessed page.
         */
        if (shadow && workingset_refault(shadow)) {
            SetPageActive(page);
            workingset_activation(page);
        } else
            ClearPageActive(page);
        lru_cache_add(page);
    }
    return ret;
}

(3) When a page on the file-cache inactive list is read again, mark_page_accessed() is called.

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced    ->  inactive,referenced
 * inactive,referenced      ->  active,unreferenced
 * active,unreferenced      ->  active,referenced
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void mark_page_accessed(struct page *page)
{
    if (!PageActive(page) && !PageUnevictable(page) &&
            PageReferenced(page)) {

        /*
         * If the page is on the LRU, queue it for activation via
         * activate_page_pvecs. Otherwise, assume the page is on a
         * pagevec, mark it active and it'll be moved to the active
         * LRU on the next drain.
         */
        if (PageLRU(page))
            activate_page(page);
        else
            __lru_cache_activate_page(page);
        ClearPageReferenced(page);
        if (page_is_file_cache(page))
            workingset_activation(page);
    } else if (!PageReferenced(page)) {
        SetPageReferenced(page);
    }
}

On the second read, workingset_activation() is called to increment the zone->inactive_age counter.

/**
 * workingset_activation - note a page activation
 * @page: page that is being activated
 */
void workingset_activation(struct page *page)
{
    atomic_long_inc(&page_zone(page)->inactive_age);
}

(4) Pages at the tail of the inactive list are evicted from the LRU list and freed.

/*
 * Same as remove_mapping, but if the page is removed from the mapping, it
 * gets returned with a refcount of 0.
 */
static int __remove_mapping(struct address_space *mapping, struct page *page,
                bool reclaimed)
{
    BUG_ON(!PageLocked(page));
    BUG_ON(mapping != page_mapping(page));

    spin_lock_irq(&mapping->tree_lock);
    /*
     * The non racy check for a busy page.
     *
     * Must be careful with the order of the tests. When someone has
     * a ref to the page, it may be possible that they dirty it then
     * drop the reference. So if PageDirty is tested before page_count
     * here, then the following race may occur:
     *
     * get_user_pages(&page);
     * [user mapping goes away]
     * write_to(page);
     *              !PageDirty(page)    [good]
     * SetPageDirty(page);
     * put_page(page);
     *              !page_count(page)   [good, discard it]
     *
     * [oops, our write_to data is lost]
     *
     * Reversing the order of the tests ensures such a situation cannot
     * escape unnoticed. The smp_rmb is needed to ensure the page->flags
     * load is not satisfied before that of page->_count.
     *
     * Note that if SetPageDirty is always performed via set_page_dirty,
     * and thus under tree_lock, then this ordering is not required.
     */
    if (!page_freeze_refs(page, 2))
        goto cannot_free;
    /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
    if (unlikely(PageDirty(page))) {
        page_unfreeze_refs(page, 2);
        goto cannot_free;
    }

    if (PageSwapCache(page)) {
        swp_entry_t swap = { .val = page_private(page) };
        mem_cgroup_swapout(page, swap);
        __delete_from_swap_cache(page);
        spin_unlock_irq(&mapping->tree_lock);
        swapcache_free(swap);
    } else {
        void (*freepage)(struct page *);
        void *shadow = NULL;

        freepage = mapping->a_ops->freepage;
        /*
         * Remember a shadow entry for reclaimed file cache in
         * order to detect refaults, thus thrashing, later on.
         *
         * But don't store shadows in an address space that is
         * already exiting.  This is not just an optizimation,
         * inode reclaim needs to empty out the radix tree or
         * the nodes are lost.  Don't plant shadows behind its
         * back.
         */
        if (reclaimed && page_is_file_cache(page) &&
            !mapping_exiting(mapping))
            shadow = workingset_eviction(mapping, page);
        __delete_from_page_cache(page, shadow);
        spin_unlock_irq(&mapping->tree_lock);

        if (freepage != NULL)
            freepage(page);
    }

    return 1;

cannot_free:
    spin_unlock_irq(&mapping->tree_lock);
    return 0;
}

    At eviction from the LRU list, workingset_eviction() saves the current zone->inactive_age count into the shadow entry in the page's radix-tree slot.

/**
 * workingset_eviction - note the eviction of a page from memory
 * @mapping: address space the page was backing
 * @page: the page being evicted
 *
 * Returns a shadow entry to be stored in @mapping->page_tree in place
 * of the evicted @page so that a later refault can be detected.
 */
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
    struct zone *zone = page_zone(page);
    unsigned long eviction;

    eviction = atomic_long_inc_return(&zone->inactive_age);
    return pack_shadow(eviction, zone);
}

/*
 *      Double CLOCK lists
 *
 * Per zone, two clock lists are maintained for file pages: the
 * inactive and the active list.  Freshly faulted pages start out at
 * the head of the inactive list and page reclaim scans pages from the
 * tail.  Pages that are accessed multiple times on the inactive list
 * are promoted to the active list, to protect them from reclaim,
 * whereas active pages are demoted to the inactive list when the
 * active list grows too big.
 *
 *   fault ------------------------+
 *                                 |
 *              +--------------+   |            +-------------+
 *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
 *              +--------------+                +-------------+    |
 *                     |                                           |
 *                     +-------------- promotion ------------------+
 *
 *
 *      Access frequency and refault distance
 *
 * A workload is thrashing when its pages are frequently used but they
 * are evicted from the inactive list every time before another access
 * would have promoted them to the active list.
 *
 * In cases where the average access distance between thrashing pages
 * is bigger than the size of memory there is nothing that can be
 * done - the thrashing set could never fit into memory under any
 * circumstance.
 *
 * However, the average access distance could be bigger than the
 * inactive list, yet smaller than the size of memory.  In this case,
 * the set could fit into memory if it weren't for the currently
 * active pages - which may be used more, hopefully less frequently:
 *
 *      +-memory available to cache-+
 *      |                           |
 *      +-inactive------+-active----+
 *  a b | c d e f g h i | J K L M N |
 *      +---------------+-----------+
 *
 * It is prohibitively expensive to accurately track access frequency
 * of pages.  But a reasonable approximation can be made to measure
 * thrashing on the inactive list, after which refaulting pages can be
 * activated optimistically to compete with the existing active pages.
 *
 * Approximating inactive page access frequency - Observations:
 *
 * 1. When a page is accessed for the first time, it is added to the
 *    head of the inactive list, slides every existing inactive page
 *    towards the tail by one slot, and pushes the current tail page
 *    out of memory.
 *
 * 2. When a page is accessed for the second time, it is promoted to
 *    the active list, shrinking the inactive list by one slot.  This
 *    also slides all inactive pages that were faulted into the cache
 *    more recently than the activated page towards the tail of the
 *    inactive list.
 *
 * Thus:
 *
 * 1. The sum of evictions and activations between any two points in
 *    time indicate the minimum number of inactive pages accessed in
 *    between.
 *
 * 2. Moving one inactive page N page slots towards the tail of the
 *    list requires at least N inactive page accesses.
 *
 * Combining these:
 *
 * 1. When a page is finally evicted from memory, the number of
 *    inactive pages accessed while the page was in cache is at least
 *    the number of page slots on the inactive list.
 *
 * 2. In addition, measuring the sum of evictions and activations (E)
 *    at the time of a page's eviction, and comparing it to another
 *    reading (R) at the time the page faults back into memory tells
 *    the minimum number of accesses while the page was not cached.
 *    This is called the refault distance.
 *
 * Because the first access of the page was the fault and the second
 * access the refault, we combine the in-cache distance with the
 * out-of-cache distance to get the complete minimum access distance
 * of this page:
 *
 *      NR_inactive + (R - E)
 *
 * And knowing the minimum access distance of a page, we can easily
 * tell if the page would be able to stay in cache assuming all page
 * slots in the cache were available:
 *
 *   NR_inactive + (R - E) <= NR_inactive + NR_active
 *
 * which can be further simplified to
 *
 *   (R - E) <= NR_active
 *
 * Put into words, the refault distance (out-of-cache) can be seen as
 * a deficit in inactive list space (in-cache).  If the inactive list
 * had (R - E) more page slots, the page would not have been evicted
 * in between accesses, but activated instead.  And on a full system,
 * the only thing eating into inactive list space is active pages.
 *
 *
 *      Activating refaulting pages
 *
 * All that is known about the active list is that the pages have been
 * accessed more than once in the past.  This means that at any given
 * time there is actually a good chance that pages on the active list
 * are no longer in active use.
 *
 * So when a refault distance of (R - E) is observed and there are at
 * least (R - E) active pages, the refaulting page is activated
 * optimistically in the hope that (R - E) active pages are actually
 * used less frequently than the refaulting page - or even not used at
 * all anymore.
 *
 * If this is wrong and demotion kicks in, the pages which are truly
 * used more frequently will be reactivated while the less frequently
 * used once will be evicted from memory.
 *
 * But if this is right, the stale pages will be pushed out of memory
 * and the used pages get to stay in cache.
 *
 *
 *      Implementation
 *
 * For each zone's file LRU lists, a counter for inactive evictions
 * and activations is maintained (zone->inactive_age).
 *
 * On eviction, a snapshot of this counter (along with some bits to
 * identify the zone) is stored in the now empty page cache radix tree
 * slot of the evicted page.  This is called a shadow entry.
 *
 * On cache misses for which there are shadow entries, an eligible
 * refault distance will immediately activate the refaulting page.
 */

static void *pack_shadow(unsigned long eviction, struct zone *zone)
{
    eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
    eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
    eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

    return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

(5) When the page cache is read a second time, add_to_page_cache_lru() is called again. In that path workingset_refault() computes the Refault Distance and decides whether the page cache should join the active list, so that it is not evicted from the LRU list before the next read.

/**
 * workingset_refault - evaluate the refault of a previously evicted page
 * @shadow: shadow entry of the evicted page
 *
 * Calculates and evaluates the refault distance of the previously
 * evicted page in the context of the zone it was allocated in.
 *
 * Returns %true if the page should be activated, %false otherwise.
 */
bool workingset_refault(void *shadow)
{
    unsigned long refault_distance;
    struct zone *zone;

    unpack_shadow(shadow, &zone, &refault_distance);
    inc_zone_state(zone, WORKINGSET_REFAULT);

    /* If refault_distance is less than or equal to the length of the
     * active LRU list, the page is very likely to be evicted from the
     * LRU list before its next access, so return true.
     * add_to_page_cache_lru() then calls SetPageActive(page) to set
     * the PG_active flag and adds the page to the active LRU list,
     * avoiding the thrashing that another eviction before the next
     * access would cause. */
    if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
        inc_zone_state(zone, WORKINGSET_ACTIVATE);
        return true;
    }
    return false;
}

    unpack_shadow() simply decodes the shadow value stored for this page cache, recovering the inactive_age value from time T1 in the figure, then subtracts it from the current inactive_age to obtain the Refault Distance.

static void unpack_shadow(void *shadow,
              struct zone **zone,
              unsigned long *distance)
{
    unsigned long entry = (unsigned long)shadow;
    unsigned long eviction;
    unsigned long refault;
    unsigned long mask;
    int zid, nid;

    entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
    zid = entry & ((1UL << ZONES_SHIFT) - 1);
    entry >>= ZONES_SHIFT;
    nid = entry & ((1UL << NODES_SHIFT) - 1);
    entry >>= NODES_SHIFT;
    eviction = entry;

    *zone = NODE_DATA(nid)->node_zones + zid;

    refault = atomic_long_read(&(*zone)->inactive_age);
    mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
            RADIX_TREE_EXCEPTIONAL_SHIFT);
    /*
     * The unsigned subtraction here gives an accurate distance
     * across inactive_age overflows in most cases.
     *
     * There is a special case: usually, shadow entries have a
     * short lifetime and are either refaulted or reclaimed along
     * with the inode before they get too old.  But it is not
     * impossible for the inactive_age to lap a shadow entry in
     * the field, which can then can result in a false small
     * refault distance, leading to a false activation should this
     * old entry actually refault again.  However, earlier kernels
     * used to deactivate unconditionally with *every* reclaim
     * invocation for the longest time, so the occasional
     * inappropriate activation leading to pressure on the active
     * list is not a problem.
     */
    *distance = (refault - eviction) & mask;
}
