跟踪LRU活动情况:
如果在LRU链表中,页面被其他的进程释放了,那么LRU链表如何知道页面已经被释放了?
LRU只是一个双向链表,如何保护链表中的成员不被其他内核路径释放时在设计页面回收功能需要考虑的并发问题。在这个过程中,struct page数据结构中的_count引用计数起到重要的作用。
以shrink_active_list()中分离页面到临时链表l_hold为例:
shrink_active_list()
->isolate_lru_pages()
->page = lru_to_page() //从LRU链表中摘取一个页面
->get_page_unless_zero(page) //对page->_count引用计数加1
->ClearPageLRU(page) //清楚PG_LRU标志位
这里从LRU摘取一个页面时,对该页的page->_count 引用计数加1。
把分离好的页面放回LRU链表的情况如下。
shrink_active_list()
->move_active_pages_to_lru()
->list_move(&page->lru, &lruvec->lists[lru]); //把该页面添加会LRU链表
->put_page_testzero(page)
-
有一个page cache页面第一次访问时,它加入到不活跃链表头,然后慢慢从链表头向链表尾移动,链表尾的page cache 会被踢出LRU链表并且释放页面,这个过程叫做eviction(移出)。
-
当第二次访问时,page cache被升级到活跃LRU链表,这样不活跃链表也空出一个位置,这个过程叫作activation(激活)。
-
从宏观时间轴来看,eviction过程处理的页面数量与activation过程处理的页数量的和等于不活跃链表的长度NR_inactive。
-
要从不活跃链表中释放一个页面,需要移动N个页面(N = 不活跃链表长度)。
struct zone {
......
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
......
}
(2) page cache第一次加入不活跃链表时代码如下:
page cache第一次加入radix tree时会分配一个slot来存放inactive_age,这里使用shadow指向slot。因此第一次加入时shadow值为空,还没有Refault Distance,因此加入不活跃LRU链表。
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
void *shadow = NULL;
int ret;
__set_page_locked(page);
ret = __add_to_page_cache_locked(page, mapping, offset,
gfp_mask, &shadow);
if (unlikely(ret))
__clear_page_locked(page);
else {
/*
* The page might have been evicted from cache only
* recently, in which case it should be activated like
* any other repeatedly accessed page.
*/
if (shadow && workingset_refault(shadow)) {
SetPageActive(page);
workingset_activation(page);
} else
ClearPageActive(page);
lru_cache_add(page);
}
return ret;
}
(3) 当在文件缓存不活跃链表里的页面被再一次读取时,会调用mark_page_accessed()函数。
/*
* Mark a page as having seen activity.
*
* inactive,unreferenced -> inactive,referenced
* inactive,referenced -> active,unreferenced
* active,unreferenced -> active,referenced
*
* When a newly allocated page is not yet visible, so safe for non-atomic ops,
* __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
*/
void mark_page_accessed(struct page *page)
{
if (!PageActive(page) && !PageUnevictable(page) &&
PageReferenced(page)) {
/*
* If the page is on the LRU, queue it for activation via
* activate_page_pvecs. Otherwise, assume the page is on a
* pagevec, mark it active and it'll be moved to the active
* LRU on the next drain.
*/
if (PageLRU(page))
activate_page(page);
else
__lru_cache_activate_page(page);
ClearPageReferenced(page);
if (page_is_file_cache(page))
workingset_activation(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
}
第二次读时会调用workingset_activation()函数来增加zone->inactive_age计数。
/**
* workingset_activation - note a page activation
* @page: page that is being activated
*/
void workingset_activation(struct page *page)
{
atomic_long_inc(&page_zone(page)->inactive_age);
}
(4)在不活跃链表末尾的页面会被踢出LRU链表并被释放。
/*
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
*
* Must be careful with the order of the tests. When someone has
* a ref to the page, it may be possible that they dirty it then
* drop the reference. So if PageDirty is tested before page_count
* here, then the following race may occur:
*
* get_user_pages(&page);
* [user mapping goes away]
* write_to(page);
* !PageDirty(page) [good]
* SetPageDirty(page);
* put_page(page);
* !page_count(page) [good, discard it]
*
* [oops, our write_to data is lost]
*
* Reversing the order of the tests ensures such a situation cannot
* escape unnoticed. The smp_rmb is needed to ensure the page->flags
* load is not satisfied before that of page->_count.
*
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
if (!page_freeze_refs(page, 2))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
page_unfreeze_refs(page, 2);
goto cannot_free;
}
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
swapcache_free(swap);
} else {
void (*freepage)(struct page *);
void *shadow = NULL;
freepage = mapping->a_ops->freepage;
/*
* Remember a shadow entry for reclaimed file cache in
* order to detect refaults, thus thrashing, later on.
*
* But don't store shadows in an address space that is
* already exiting. This is not just an optizimation,
* inode reclaim needs to empty out the radix tree or
* the nodes are lost. Don't plant shadows behind its
* back.
*/
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping))
shadow = workingset_eviction(mapping, page);
__delete_from_page_cache(page, shadow);
spin_unlock_irq(&mapping->tree_lock);
if (freepage != NULL)
freepage(page);
}
return 1;
cannot_free:
spin_unlock_irq(&mapping->tree_lock);
return 0;
}
在被踢出LRU链表时,通过workingset_eviction()函数把当前的zone->inactive_age计数保存到该页对应的radix_tree的shadow中。
/**
* workingset_eviction - note the eviction of a page from memory
* @mapping: address space the page was backing
* @page: the page being evicted
*
* Returns a shadow entry to be stored in @mapping->page_tree in place
* of the evicted @page so that a later refault can be detected.
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
struct zone *zone = page_zone(page);
unsigned long eviction;
eviction = atomic_long_inc_return(&zone->inactive_age);
return pack_shadow(eviction, zone);
}
/*
* Double CLOCK lists
*
* Per zone, two clock lists are maintained for file pages: the
* inactive and the active list. Freshly faulted pages start out at
* the head of the inactive list and page reclaim scans pages from the
* tail. Pages that are accessed multiple times on the inactive list
* are promoted to the active list, to protect them from reclaim,
* whereas active pages are demoted to the inactive list when the
* active list grows too big.
*
* fault ------------------------+
* |
* +--------------+ | +-------------+
* reclaim <- | inactive | <-+-- demotion | active | <--+
* +--------------+ +-------------+ |
* | |
* +-------------- promotion ------------------+
*
*
* Access frequency and refault distance
*
* A workload is thrashing when its pages are frequently used but they
* are evicted from the inactive list every time before another access
* would have promoted them to the active list.
*
* In cases where the average access distance between thrashing pages
* is bigger than the size of memory there is nothing that can be
* done - the thrashing set could never fit into memory under any
* circumstance.
*
* However, the average access distance could be bigger than the
* inactive list, yet smaller than the size of memory. In this case,
* the set could fit into memory if it weren't for the currently
* active pages - which may be used more, hopefully less frequently:
*
* +-memory available to cache-+
* | |
* +-inactive------+-active----+
* a b | c d e f g h i | J K L M N |
* +---------------+-----------+
*
* It is prohibitively expensive to accurately track access frequency
* of pages. But a reasonable approximation can be made to measure
* thrashing on the inactive list, after which refaulting pages can be
* activated optimistically to compete with the existing active pages.
*
* Approximating inactive page access frequency - Observations:
*
* 1. When a page is accessed for the first time, it is added to the
* head of the inactive list, slides every existing inactive page
* towards the tail by one slot, and pushes the current tail page
* out of memory.
*
* 2. When a page is accessed for the second time, it is promoted to
* the active list, shrinking the inactive list by one slot. This
* also slides all inactive pages that were faulted into the cache
* more recently than the activated page towards the tail of the
* inactive list.
*
* Thus:
*
* 1. The sum of evictions and activations between any two points in
* time indicate the minimum number of inactive pages accessed in
* between.
*
* 2. Moving one inactive page N page slots towards the tail of the
* list requires at least N inactive page accesses.
*
* Combining these:
*
* 1. When a page is finally evicted from memory, the number of
* inactive pages accessed while the page was in cache is at least
* the number of page slots on the inactive list.
*
* 2. In addition, measuring the sum of evictions and activations (E)
* at the time of a page's eviction, and comparing it to another
* reading (R) at the time the page faults back into memory tells
* the minimum number of accesses while the page was not cached.
* This is called the refault distance.
*
* Because the first access of the page was the fault and the second
* access the refault, we combine the in-cache distance with the
* out-of-cache distance to get the complete minimum access distance
* of this page:
*
* NR_inactive + (R - E)
*
* And knowing the minimum access distance of a page, we can easily
* tell if the page would be able to stay in cache assuming all page
* slots in the cache were available:
*
* NR_inactive + (R - E) <= NR_inactive + NR_active
*
* which can be further simplified to
*
* (R - E) <= NR_active
*
* Put into words, the refault distance (out-of-cache) can be seen as
* a deficit in inactive list space (in-cache). If the inactive list
* had (R - E) more page slots, the page would not have been evicted
* in between accesses, but activated instead. And on a full system,
* the only thing eating into inactive list space is active pages.
*
*
* Activating refaulting pages
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
* time there is actually a good chance that pages on the active list
* are no longer in active use.
*
* So when a refault distance of (R - E) is observed and there are at
* least (R - E) active pages, the refaulting page is activated
* optimistically in the hope that (R - E) active pages are actually
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
*
* But if this is right, the stale pages will be pushed out of memory
* and the used pages get to stay in cache.
*
*
* Implementation
*
* For each zone's file LRU lists, a counter for inactive evictions
* and activations is maintained (zone->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
* identify the zone) is stored in the now empty page cache radix tree
* slot of the evicted page. This is called a shadow entry.
*
* On cache misses for which there are shadow entries, an eligible
* refault distance will immediately activate the refaulting page.
*/
static void *pack_shadow(unsigned long eviction, struct zone *zone)
{
eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
(5) 当page cache第二次读取时,还会调用add_to_page_cache_lru()函数。代码中的workingset_refault()会计算Refault Distance,并且判断是否需要把page cache加入到活跃链表中,以避免下一次读之前被踢出LRU链表。
/**
* workingset_refault - evaluate the refault of a previously evicted page
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the zone it was allocated in.
*
* Returns %true if the page should be activated, %false otherwise.
*/
bool workingset_refault(void *shadow)
{
unsigned long refault_distance;
struct zone *zone;
unpack_shadow(shadow, &zone, &refault_distance);
inc_zone_state(zone, WORKINGSET_REFAULT);
/*判断refault_distance是否小于活跃LRU链表的长度,如果是,则说明该页在下一次访问
前极有可能会踢出LRU链表,因此返回true。在add_to_page_cache_lru()函数中调用
SetPageActive(page)设置该页的PG_active标志位并加入到活跃LRU链表中,从而避免低
三次访问时该页被踢出LRU链表锁产生的内存颠簸*/
if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
inc_zone_state(zone, WORKINGSET_ACTIVATE);
return true;
}
return false;
}
unpack_shadow()函数只把该page cache之前存放的shadow值重新解码,得出了图中T1时刻的inactive_age值,然后把当前的inactive_age减去T1,得到Refault Distance。
static void unpack_shadow(void *shadow,
struct zone **zone,
unsigned long *distance)
{
unsigned long entry = (unsigned long)shadow;
unsigned long eviction;
unsigned long refault;
unsigned long mask;
int zid, nid;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
zid = entry & ((1UL << ZONES_SHIFT) - 1);
entry >>= ZONES_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
eviction = entry;
*zone = NODE_DATA(nid)->node_zones + zid;
refault = atomic_long_read(&(*zone)->inactive_age);
mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
RADIX_TREE_EXCEPTIONAL_SHIFT);
/*
* The unsigned subtraction here gives an accurate distance
* across inactive_age overflows in most cases.
*
* There is a special case: usually, shadow entries have a
* short lifetime and are either refaulted or reclaimed along
* with the inode before they get too old. But it is not
* impossible for the inactive_age to lap a shadow entry in
* the field, which can then can result in a false small
* refault distance, leading to a false activation should this
* old entry actually refault again. However, earlier kernels
* used to deactivate unconditionally with *every* reclaim
* invocation for the longest time, so the occasional
* inappropriate activation leading to pressure on the active
* list is not a problem.
*/
*distance = (refault - eviction) & mask;
}