啃老师

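Exploring the Linux Kernel, Memory Management (Part 2): Memory Organization in Linux (Nodes, Zones, and Page Frames)

This article draws mainly on Section 3.2 of "Professional Linux Kernel Architecture" (《深入linux内核架构》) and on the Linux 3.18.3 kernel source.

Abstract: This article describes the core data structures of memory management: the node (pg_data_t), the zone (struct zone), and the page frame, i.e. the physical page (struct page), together with the basic concepts behind them.

1. Overview

Memory is divided into nodes. Each node is associated with one processor in the system and is represented in the kernel by pg_data_t.

Each node is further divided into zones, such as the DMA zone, the normal zone, and the high-memory zone.

The kernel enumerates the zone types as follows: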

enum zone_type {
#ifdef CONFIG_ZONE_DMA
    ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
    ZONE_DMA32,
#endif
    ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
    ZONE_HIGHMEM,
#endif
    ZONE_MOVABLE,
    __MAX_NR_ZONES
};

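ZONE_DMA: memory suitable for DMA. The size of this region depends on the processor type; on IA-32 it is typically 16 MiB.

ZONE_DMA32: DMA-suitable memory that is addressable with 32-bit address words. The two DMA zones differ only on 64-bit systems; on 32-bit machines this zone is empty.

ZONE_NORMAL: normal memory. This zone is guaranteed to exist on every architecture, but there is no guarantee that its address range is backed by actual physical memory.

ZONE_HIGHMEM: high memory, i.e. physical memory beyond what the kernel segment can map directly (for example above 896 MiB, where the kernel address space cannot map all of physical memory). 64-bit systems do not need high memory.

ZONE_MOVABLE: a pseudo zone used to prevent fragmentation of physical memory.

Each zone is associated with an array that organizes the physical pages (page frames) belonging to it. For every page frame, a struct page instance is allocated together with the required management data.

The memory nodes are kept on a singly linked list that the kernel can traverse.

2. Data structures

(1) Nodes and node states

Node management

pg_data_t represents a node and is defined as follows:

include/linux/mmzone.h: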

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    struct zonelist node_zonelists[MAX_ZONELISTS];
    int nr_zones;
    struct page *node_mem_map;
    struct page_cgroup *node_page_cgroup;
    struct bootmem_data *bdata;
    spinlock_t node_size_lock;
    unsigned long node_start_pfn;
    unsigned long node_present_pages; /* total number of physical pages */
    unsigned long node_spanned_pages; /* total size of physical page
                                         range, including holes */
    int node_id;
    wait_queue_head_t kswapd_wait;
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;       /* Protected by
                                         mem_hotplug_begin/end() */
    int kswapd_max_order;
    enum zone_type classzone_idx;
    spinlock_t numabalancing_migrate_lock;
    unsigned long numabalancing_migrate_next_window;
    unsigned long numabalancing_migrate_nr_pages;
} pg_data_t;

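node_zones is an array holding the data structures of the zones in the node.

node_zonelists specifies the fallback nodes and their zones, so that memory can be allocated from a fallback node when the current node has no space left.

nr_zones holds the number of different zones in the node.

node_mem_map points to an array of page instances that describes all physical pages of the node, covering the pages of every zone in the node.

bdata points to an instance of the boot memory allocator's data structure. The kernel needs memory during boot, before the memory management subsystem has been initialized, and uses the boot memory allocator for this.

node_start_pfn is the logical number of the first page frame of the NUMA node. The page frames of all nodes are numbered consecutively, so every page frame number is globally unique, not just unique within its node. On a UMA system this value is always 0.

node_present_pages is the number of page frames in the node.

node_spanned_pages is the size of the node in page frames, including holes.

node_id is the global node ID.

pgdat_next links to the next memory node; all nodes in the system are chained on a singly linked list terminated by a null pointer. (This member belongs to the older kernel described in the book and does not appear in the 3.18.3 definition listed above.)

kswapd_wait is the wait queue of the swap daemon, used when page frames are swapped out of the node. kswapd points to the task_struct of the swap daemon responsible for the node. kswapd_max_order is used by the page-swapping subsystem to define the size of the area to be freed.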
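
The relationship between node_start_pfn, node_spanned_pages and node_present_pages can be illustrated with a small user-space sketch. The struct node_span type and both helper names are hypothetical stand-ins made up for this example; they are not kernel definitions.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for pg_data_t (not the kernel type). */
struct node_span {
    unsigned long node_start_pfn;     /* first page frame number of the node */
    unsigned long node_present_pages; /* page frames backed by real memory */
    unsigned long node_spanned_pages; /* page frames covered, holes included */
};

/* One past the last page frame number covered by the node. */
static unsigned long node_end_pfn(const struct node_span *n)
{
    return n->node_start_pfn + n->node_spanned_pages;
}

/* Number of page frames inside the span that are holes. */
static unsigned long node_hole_pages(const struct node_span *n)
{
    assert(n->node_present_pages <= n->node_spanned_pages);
    return n->node_spanned_pages - n->node_present_pages;
}
```

A node that starts at frame 0x1000, spans 1024 frames and has 900 of them present would therefore end at frame 0x1400 and contain 124 hole frames.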

Node state management

include/linux/nodemask.h:

/*
 * Bitmasks that are kept for all the nodes.
 */
enum node_states {
    N_POSSIBLE,         /* The node could become online at some point */
    N_ONLINE,           /* The node is online */
    N_NORMAL_MEMORY,    /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
    N_HIGH_MEMORY,      /* The node has regular or high memory */
#else
    N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
#ifdef CONFIG_MOVABLE_NODE
    N_MEMORY,           /* The node has memory(regular, high, movable) */
#else
    N_MEMORY = N_HIGH_MEMORY,
#endif
    N_CPU,              /* The node has one or more cpus */
    NR_NODE_STATES
};

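The states N_POSSIBLE, N_ONLINE and N_CPU are used for CPU and memory hotplug. The flags essential for memory management are N_HIGH_MEMORY and N_NORMAL_MEMORY. N_HIGH_MEMORY is set if the node has normal or high memory; N_NORMAL_MEMORY is set only if the node has non-highmem (regular) memory.

static inline void node_set_state(int node, enum node_states state); sets the bit of a given node in the bitmask of a particular state.

static inline void node_clear_state(int node, enum node_states state); clears the bit of a given node in the bitmask of a particular state.

#define for_each_node_state(__node, __state) for_each_node_mask((__node), node_states[__state]) This macro iterates over all nodes that are in a given state.

#define for_each_online_node(node) for_each_node_state(node, N_ONLINE) This macro iterates over all online nodes.

If the kernel is built to support only a single node, the operations above are empty.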
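
The idea of one bitmask per state can be sketched in user space as follows. This is a simplified illustration, not the kernel's nodemask implementation: MAX_NODES, the node_states_mask array, the node_state() helper and the shortened enum are assumptions made for the example, and the function names merely mirror the kernel ones.

```c
#include <assert.h>

#define MAX_NODES 64 /* hypothetical limit for this sketch */

/* Shortened state list; the kernel enum has more entries. */
enum node_states { N_POSSIBLE, N_ONLINE, N_NORMAL_MEMORY, N_CPU, NR_NODE_STATES };

/* One bitmask per state; bit n set means node n is in that state. */
static unsigned long long node_states_mask[NR_NODE_STATES];

static void node_set_state(int node, enum node_states state)
{
    assert(node >= 0 && node < MAX_NODES);
    node_states_mask[state] |= 1ULL << node;
}

static void node_clear_state(int node, enum node_states state)
{
    assert(node >= 0 && node < MAX_NODES);
    node_states_mask[state] &= ~(1ULL << node);
}

/* Returns 1 if the node's bit is set for the given state, else 0. */
static int node_state(int node, enum node_states state)
{
    return (int)((node_states_mask[state] >> node) & 1ULL);
}

/* Visit every node whose bit is set in the given state's mask. */
#define for_each_node_state(node, state)               \
    for ((node) = 0; (node) < MAX_NODES; (node)++)     \
        if (node_state((node), (state)))
```

Setting nodes 0, 3 and 5 online and then clearing node 3 leaves the loop visiting only nodes 0 and 5.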

(2) Zones

The kernel uses struct zone to describe a zone. It is defined as follows:

enum zone_watermarks {
    WMARK_MIN,
    WMARK_LOW,
    WMARK_HIGH,
    NR_WMARK
};

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

struct zone {
    unsigned long watermark[NR_WMARK];
    long lowmem_reserve[MAX_NR_ZONES];
    int node;
    unsigned int inactive_ratio;
    struct pglist_data *zone_pgdat;
    struct per_cpu_pageset __percpu *pageset;
    unsigned long dirty_balance_reserve;
    unsigned long *pageblock_flags;
    unsigned long min_unmapped_pages;
    unsigned long min_slab_pages;
    unsigned long zone_start_pfn;
    unsigned long managed_pages;
    unsigned long spanned_pages;
    unsigned long present_pages;
    const char *name;
    int nr_migrate_reserve_block;
    unsigned long nr_isolate_pageblock;
    seqlock_t span_seqlock;
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;

    ZONE_PADDING(_pad1_)

    spinlock_t lock;
    struct free_area free_area[MAX_ORDER];
    unsigned long flags;

    ZONE_PADDING(_pad2_)

    spinlock_t lru_lock;
    struct lruvec lruvec;
    atomic_long_t inactive_age;
    unsigned long percpu_drift_mark;
    unsigned long compact_cached_free_pfn;
    unsigned long compact_cached_migrate_pfn[2];
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
    bool compact_blockskip_flush;

    ZONE_PADDING(_pad3_)

    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

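The structure is split into several sections by ZONE_PADDING because struct zone is accessed very frequently. On multiprocessor systems, different CPUs commonly try to access members of the structure at the same time, so locks are used to keep them from interfering with one another and causing errors and inconsistencies. Because the structure is accessed so often, its two spinlocks, zone->lock and zone->lru_lock, are acquired very frequently.

Data is processed faster when it is held in the CPU cache. The cache is divided into lines, each covering a different region of memory. The kernel uses the ZONE_PADDING macro to insert padding into the structure so that each spinlock sits in its own cache line. The ____cacheline_internodealigned_in_smp keyword requests optimal cache alignment.

The last two sections of the structure are also separated from each other by padding; the main purpose is to keep the data in one cache line for fast access.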
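
The effect of such padding can be reproduced in portable C11 with alignas. The sketch below is only an illustration of the idea, not how the kernel implements ZONE_PADDING: struct mini_zone, its members, the helper cache_line_of() and the 64-byte CACHE_LINE value are all assumptions made for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache-line size; the real value is CPU-specific */

/* Hypothetical miniature of struct zone: two hot locks kept apart. */
struct mini_zone {
    unsigned long watermark[3];       /* read-mostly fields */
    alignas(CACHE_LINE) int lock;     /* stands in for zone->lock */
    unsigned long free_pages;
    alignas(CACHE_LINE) int lru_lock; /* stands in for zone->lru_lock */
    unsigned long nr_active;
};

/* Each lock must start on a cache-line boundary of its own. */
static_assert(offsetof(struct mini_zone, lock) % CACHE_LINE == 0,
              "lock is not cache-line aligned");
static_assert(offsetof(struct mini_zone, lru_lock) % CACHE_LINE == 0,
              "lru_lock is not cache-line aligned");

/* Index of the cache line that holds a given byte offset. */
static size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE;
}
```

With this layout, a CPU spinning on one lock does not invalidate the cache line holding the other lock, which is the contention the padding is meant to avoid.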

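Main members of the first section:

watermark[NR_WMARK] holds the watermarks used when pages are swapped out. When memory runs short the kernel can write pages to disk, and these three values influence the behavior of the swap daemon.

If more than watermark[WMARK_HIGH] pages are free, the zone is in an ideal state.

If the number of free pages falls below watermark[WMARK_LOW], the kernel starts swapping pages out to disk.

If the number of free pages falls below watermark[WMARK_MIN], page reclaim comes under heavy pressure because the zone urgently needs free pages.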
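
The three rules above can be written down as a small classification function. This is a sketch of the described thresholds only: enum zone_pressure and classify() are hypothetical names invented for the example, and the kernel's real watermark checks take more inputs (allocation order, lowmem_reserve and so on).

```c
#include <assert.h>

enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical labels for this sketch; the kernel has no such enum. */
enum zone_pressure { ZONE_OK, ZONE_BELOW_HIGH, ZONE_RECLAIM, ZONE_CRITICAL };

static enum zone_pressure classify(unsigned long free_pages,
                                   const unsigned long wmark[NR_WMARK])
{
    assert(wmark[WMARK_MIN] <= wmark[WMARK_LOW] &&
           wmark[WMARK_LOW] <= wmark[WMARK_HIGH]);

    if (free_pages < wmark[WMARK_MIN])
        return ZONE_CRITICAL;   /* reclaim is under heavy pressure */
    if (free_pages < wmark[WMARK_LOW])
        return ZONE_RECLAIM;    /* pages start being swapped out */
    if (free_pages <= wmark[WMARK_HIGH])
        return ZONE_BELOW_HIGH; /* acceptable, but not yet ideal */
    return ZONE_OK;             /* more than WMARK_HIGH pages free */
}
```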

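lowmem_reserve[MAX_NR_ZONES] specifies, for each zone, a number of pages reserved for critical allocations that must not fail under any circumstances.

pageset implements the per-CPU lists of hot and cold page frames. The kernel uses these lists to hold "fresh" pages that can be used to satisfy allocation requests. Hot and cold page frames differ in their cache state: page frames that are most likely still in the CPU cache can be accessed quickly and are called hot; page frames that are not cached are called cold.

free_area[MAX_ORDER] is an array of data structures of the same name and implements the buddy system. Each array element (#define MAX_ORDER 11) stands for contiguous memory areas of one fixed size. free_area is the starting point for managing the free pages contained in each area.

Main members of the second section:

The members of the second section catalog the pages used in the zone according to their activity. A page that is accessed frequently is regarded as active by the kernel. The distinction matters when pages have to be swapped out: if possible, frequently used pages should be left alone, while surplus inactive pages can be swapped out.

flags describes the current state of the zone. The following flags are allowed: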

/* zone flags, see below */
unsigned long flags;

enum zone_flags {
    ZONE_RECLAIM_LOCKED,    /* prevents concurrent reclaim */
    ZONE_OOM_LOCKED,        /* zone is in OOM killer zonelist */
    ZONE_CONGESTED,         /* zone has many dirty pages backed by
                             * a congested BDI */
    ZONE_DIRTY,             /* reclaim scanning has recently found
                             * many dirty file pages at the tail
                             * of the LRU */
    ZONE_WRITEBACK,         /* reclaim scanning has recently found
                             * many pages under writeback */
    ZONE_FAIR_DEPLETED,     /* fair zone policy batch depleted */
};

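vm_stat[NR_VM_ZONE_STAT_ITEMS] maintains a large amount of statistical information about the zone.

wait_queue_head_t *wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;

These three members implement a wait queue for processes waiting for a page to become available. Processes queue up to wait for some condition; when the condition becomes true, the kernel notifies them so that they can resume work.

struct pglist_data *zone_pgdat; links a zone to its parent node: zone_pgdat points to the corresponding pglist_data instance.

unsigned long zone_start_pfn; is the index of the first page frame of the zone: zone_start_pfn == zone_start_paddr >> PAGE_SHIFT.

const char *name; is a string holding the conventional name of the zone. The book lists three options: Normal, DMA and HighMem (kernels that define ZONE_DMA32 and ZONE_MOVABLE also use DMA32 and Movable).

unsigned long spanned_pages; is the total number of pages spanned by the zone. Not all of them need be usable, because the zone may contain memory holes.

unsigned long present_pages; is the number of pages actually present; it is usually equal to spanned_pages.

3. Calculating zone watermarks

Before computing the watermarks, the kernel first determines how much memory must be kept free for critical allocations. This value grows nonlinearly with the amount of available memory and is stored in the global variable min_free_kbytes.

User space can read and modify this minimum through the file /proc/sys/vm/min_free_kbytes. Typical values of min_free_kbytes for various main memory sizes are: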

Main memory size    min_free_kbytes
16MB                512KB
32MB                724KB
256MB               2MB
512MB               2896KB
1024MB              4MB
2048MB              5792KB
4096MB              8MB
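
The values in the table follow a square-root rule: the kernel's default is the integer square root of 16 times the low memory size in KiB, clamped to the range 128 to 65536. The sketch below reproduces the table under that rule; isqrt() and min_free_kbytes_for() are names made up for this user-space example (the kernel uses its own int_sqrt helper), and lowmem_kbytes is treated as equal to the main memory size, which is a simplification.

```c
#include <assert.h>

/* Integer square root: largest r with r * r <= x (bit-by-bit method). */
static unsigned long isqrt(unsigned long long x)
{
    unsigned long long r = 0, bit = 1ULL << 62;

    while (bit > x)
        bit >>= 2;
    while (bit) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return (unsigned long)r;
}

/* Default reserve in KiB for a given amount of low memory in KiB. */
static unsigned long min_free_kbytes_for(unsigned long lowmem_kbytes)
{
    unsigned long v = isqrt((unsigned long long)lowmem_kbytes * 16);

    if (v < 128)
        v = 128;    /* lower clamp */
    if (v > 65536)
        v = 65536;  /* upper clamp */
    return v;
}
```

For example, 32 MiB is 32768 KiB, and the integer square root of 32768 * 16 = 524288 is 724, which matches the 724KB row.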

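The watermarks in the structure are filled in by init_per_zone_wmark_min (named init_per_zone_pages_min in older kernels such as the one described in the book). The kernel invokes this function during boot; it does not need to be called explicitly.

__setup_per_zone_wmarks sets watermark[WMARK_MIN], watermark[WMARK_LOW] and watermark[WMARK_HIGH] of each struct zone: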

static void __setup_per_zone_wmarks(void) /* mm/page_alloc.c */
{
    unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
    unsigned long lowmem_pages = 0;
    struct zone *zone;
    unsigned long flags;

    /* Calculate total number of !ZONE_HIGHMEM pages */
    for_each_zone(zone) {
        if (!is_highmem(zone))
            lowmem_pages += zone->managed_pages;
    }

    for_each_zone(zone) {
        u64 tmp;

        spin_lock_irqsave(&zone->lock, flags);
        tmp = (u64)pages_min * zone->managed_pages;
        do_div(tmp, lowmem_pages);
        if (is_highmem(zone)) {
            unsigned long min_pages;

            min_pages = zone->managed_pages / 1024;
            min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
            zone->watermark[WMARK_MIN] = min_pages;
        } else {
            zone->watermark[WMARK_MIN] = tmp;
        }

        zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
        zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

        __mod_zone_page_state(zone, NR_ALLOC_BATCH,
            high_wmark_pages(zone) - low_wmark_pages(zone) -
            atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));

        setup_zone_migrate_reserve(zone);
        spin_unlock_irqrestore(&zone->lock, flags);
    }

    /* update totalreserve_pages */
    calculate_totalreserve_pages();
}
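
To make the arithmetic of the listing concrete, the per-zone calculation can be isolated into a user-space function. This is a sketch that mirrors only the watermark formulas; struct wmarks and zone_wmarks() are hypothetical names, locking and statistics updates are omitted, and SWAP_CLUSTER_MAX is taken to be 32.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL /* batch size of the reclaim code */

struct wmarks { unsigned long min, low, high; };

/* Watermarks of one zone, given the global reserve in pages (pages_min),
 * the zone's managed pages and the total of all non-highmem pages. */
static struct wmarks zone_wmarks(unsigned long pages_min,
                                 unsigned long managed_pages,
                                 unsigned long lowmem_pages,
                                 int is_highmem)
{
    struct wmarks w;
    unsigned long long tmp;

    assert(lowmem_pages > 0);
    /* The zone's proportional share of the global reserve. */
    tmp = (unsigned long long)pages_min * managed_pages / lowmem_pages;

    if (is_highmem) {
        /* Highmem zones get a small fixed-range minimum instead. */
        unsigned long min_pages = managed_pages / 1024;

        if (min_pages < SWAP_CLUSTER_MAX)
            min_pages = SWAP_CLUSTER_MAX;
        if (min_pages > 128UL)
            min_pages = 128UL;
        w.min = min_pages;
    } else {
        w.min = (unsigned long)tmp;
    }
    w.low = w.min + (unsigned long)(tmp >> 2);  /* min + 25% of share */
    w.high = w.min + (unsigned long)(tmp >> 1); /* min + 50% of share */
    return w;
}
```

With min_free_kbytes = 4096 and 4 KiB pages, pages_min is 1024. A single non-highmem zone of 262144 pages then gets the watermarks 1024, 1280 and 1536.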

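SWAP_CLUSTER_MAX, the lower bound for high-memory zones, is an important value for the whole page reclaim subsystem. The reclaim code often processes pages in batches, and SWAP_CLUSTER_MAX defines the batch size.

lowmem_reserve[MAX_NR_ZONES] is computed by setup_per_zone_lowmem_reserve(). The kernel iterates over all nodes of the system and computes the reserves for every zone of each node: for each lower zone, the reserve is the number of managed pages in the zones above it, up to and including the target zone, divided by sysctl_lowmem_reserve_ratio[idx]. The default divisor is 256 for low-memory zones and 32 for high-memory zones.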

/*
 * setup_per_zone_lowmem_reserve - called whenever
 *	sysctl_lower_zone_reserve_ratio changes.  Ensures that each zone
 *	has a correct pages reserved value, so an adequate number of
 *	pages are left in the zone after a successful __alloc_pages().
 */
static void setup_per_zone_lowmem_reserve(void) /* mm/page_alloc.c */
{
    struct pglist_data *pgdat;
    enum zone_type j, idx;

    for_each_online_pgdat(pgdat) {
        for (j = 0; j < MAX_NR_ZONES; j++) {
            struct zone *zone = pgdat->node_zones + j;
            unsigned long managed_pages = zone->managed_pages;

            zone->lowmem_reserve[j] = 0;

            idx = j;
            while (idx) {
                struct zone *lower_zone;

                idx--;

                if (sysctl_lowmem_reserve_ratio[idx] < 1)
                    sysctl_lowmem_reserve_ratio[idx] = 1;

                lower_zone = pgdat->node_zones + idx;
                lower_zone->lowmem_reserve[j] = managed_pages /
                    sysctl_lowmem_reserve_ratio[idx];
                managed_pages += lower_zone->managed_pages;
            }
        }
    }

    /* update totalreserve_pages */
    calculate_totalreserve_pages();
}
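
The nested loop is easier to follow on a single node with three zones. The following user-space sketch mirrors only the loop structure of the listing; compute_lowmem_reserve(), NR_ZONES and the two-dimensional reserve array are assumptions made for the example (in the kernel each zone carries its own lowmem_reserve array).

```c
#include <assert.h>

#define NR_ZONES 3 /* DMA, NORMAL, HIGHMEM in this sketch */

/* reserve[i][j]: pages that zone i holds back from allocations whose
 * preferred zone is j. Entries with i > j are not written here, so the
 * caller is expected to zero the array beforehand. */
static void compute_lowmem_reserve(const unsigned long managed[NR_ZONES],
                                   const int ratio[NR_ZONES],
                                   unsigned long reserve[NR_ZONES][NR_ZONES])
{
    for (int j = 0; j < NR_ZONES; j++) {
        unsigned long pages = managed[j];
        int idx = j;

        reserve[j][j] = 0;
        while (idx) {
            idx--;
            assert(ratio[idx] >= 1);
            reserve[idx][j] = pages / ratio[idx];
            pages += managed[idx]; /* accumulate for the next lower zone */
        }
    }
}
```

With managed pages {4096, 225280, 32768} and ratios {256, 32, 32}, the DMA zone reserves 880 pages against NORMAL allocations and 1008 pages against HIGHMEM allocations, while NORMAL reserves 1024 pages against HIGHMEM allocations.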

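4. Hot and cold pages

The pageset member of struct zone implements the hot-n-cold allocator. A page is hot if it has been loaded into a CPU cache, so that its data can be accessed faster than data in main memory. A page is cold if it is not in the cache. On multiprocessor systems every CPU has one or more caches, so the lists must be managed separately for each CPU.

A zone may belong to a particular NUMA node and therefore be associated with a particular CPU, but the caches of other CPUs may still contain pages from that zone. Every processor can access all pages in the system, albeit at different speeds. The zone-specific data structure must therefore account not only for the CPUs of the NUMA node the zone belongs to but also for every other CPU in the system.

In older kernels pageset is an array; in the 3.18.3 kernel it is a pointer: struct per_cpu_pageset __percpu *pageset. Either way, a uniprocessor system has only one element, whereas in a kernel compiled for SMP the number of elements may lie between 2 and 32. This value is not the number of CPUs actually present in the system but the maximum number of CPUs the kernel supports.

The element type is defined in

include/linux/mmzone.h: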

struct per_cpu_pageset {
    struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
    s8 expire;
#endif
#ifdef CONFIG_SMP
    s8 stat_threshold;
    s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

The structure consists mainly of struct per_cpu_pages:

struct per_cpu_pages {
    int count;  /* number of pages in the list */
    int high;   /* high watermark, emptying needed */
    int batch;  /* chunk size for buddy add/remove */

    /* Lists of pages, one per migrate type stored on the pcp-lists */
    struct list_head lists[MIGRATE_PCPTYPES];
};

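count records the number of pages on the lists, and high is a watermark: if count exceeds high, the lists hold too many pages. There is no explicit watermark for the low-fill state: when a list is empty, it is refilled.

lists holds doubly linked lists of the current CPU's hot and cold pages, which can be handled with the kernel's standard list functions.

Where possible, the per-CPU lists are not filled one page at a time but in chunks of several pages; batch is a guideline for the number of pages added at a time.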
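
How count, high and batch interact can be shown with counters alone. The sketch below is a simplified model, not kernel code: struct pcp_sketch and both functions are hypothetical, no actual pages or lists are involved, and the drain condition count >= high is an assumption of this model rather than something stated in the text above.

```c
#include <assert.h>

/* Counter-only model of struct per_cpu_pages. */
struct pcp_sketch {
    int count; /* pages currently on the per-CPU list */
    int high;  /* at or above this, pages go back to the buddy system */
    int batch; /* pages moved per refill or drain */
};

/* Free one page to the list; returns pages drained to the buddy system. */
static int pcp_free_page(struct pcp_sketch *pcp)
{
    int drained = 0;

    pcp->count++;
    if (pcp->count >= pcp->high) {
        drained = pcp->batch < pcp->count ? pcp->batch : pcp->count;
        pcp->count -= drained;
    }
    return drained;
}

/* Allocate one page; returns pages pulled from the buddy system. */
static int pcp_alloc_page(struct pcp_sketch *pcp)
{
    int refilled = 0;

    assert(pcp->batch > 0);
    if (pcp->count == 0) {
        refilled = pcp->batch; /* empty list: refill a whole batch */
        pcp->count += refilled;
    }
    pcp->count--;
    return refilled;
}
```

Moving pages in batches means the zone lock of the buddy system is taken once per batch rather than once per page, which is the point of the per-CPU lists.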

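5. Page frames

A page frame is the smallest unit of system memory, and an instance of struct page is created for every page in memory. struct page must therefore be kept as small as possible, because memory is broken into a very large number of pages: even with only 384 MiB of main memory and a 4 KiB page size there are about 100,000 pages (98,304 to be exact). In typical configurations the number of pages is huge, so even a small change to struct page can sharply increase the physical memory needed to hold all page instances.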
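
The cost of a larger struct page is simple arithmetic, sketched below. Both function names are made up for this example, and the struct sizes used in the note that follows (64 and 72 bytes) are assumed figures, since the real size depends on architecture and kernel configuration.

```c
#include <assert.h>

/* Number of page frames needed to cover mem_bytes of RAM. */
static unsigned long long nr_page_frames(unsigned long long mem_bytes,
                                         unsigned long page_size)
{
    assert(page_size > 0);
    return mem_bytes / page_size;
}

/* Memory consumed by the page array: one struct page per frame. */
static unsigned long long page_array_bytes(unsigned long long frames,
                                           unsigned long page_struct_size)
{
    return frames * page_struct_size;
}
```

For 384 MiB of RAM and 4 KiB pages this gives 98,304 frames. At an assumed 64 bytes per struct page the array occupies 6 MiB, and growing the structure by a single 8-byte field would add another 768 KiB.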

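The widespread use of pages makes it harder to keep the structure small. Many parts of memory management use pages for a variety of purposes. One part of the kernel may depend entirely on a piece of information in struct page that is completely useless to another part. This is why struct page uses unions.

Definition of struct page:

include/linux/mm_types.h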

/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 *
 * The objects in struct page are organized in double word blocks in
 * order to allows us to use atomic double word operations on portions
 * of struct page. That is currently only used by slub but the arrangement
 * allows the use of atomic double word operations on the flags/mapping
 * and lru list pointers also.
 */
struct page {
    /* First double word block */
    unsigned long flags;        /* Atomic flags, some possibly
                                 * updated asynchronously */
    union {
        struct address_space *mapping;  /* If low bit clear, points to
                                         * inode address_space, or NULL.
                                         * If page mapped as anonymous
                                         * memory, low bit is set, and
                                         * it points to anon_vma object:
                                         * see PAGE_MAPPING_ANON. */
        void *s_mem;            /* slab first object */
    };

    /* Second double word */
    struct {
        union {
            pgoff_t index;      /* Our offset within mapping. */
            void *freelist;     /* SLUB: freelist req. slab lock */
            bool pfmemalloc;
        };

        union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
            /* Used for cmpxchg_double in slub */
            unsigned long counters;
#else
            unsigned counters;
#endif
            struct {
                union {
                    atomic_t _mapcount; /* Count of ptes mapped in mms,
                                         * to show when page is mapped
                                         * & limit reverse map searches. */
                    struct { /* SLUB */
                        unsigned inuse:16;  /* number of objects in use */
                        unsigned objects:15;
                        unsigned frozen:1;
                    };
                    int units;  /* SLOB */
                };
                atomic_t _count;    /* Usage count */
            };
            unsigned int active;    /* SLAB */
        };
    };

    /* Third double word block */
    union {
        struct list_head lru;   /* Pageout list, eg. active_list
                                 * protected by zone->lru_lock */
        struct {
            struct page *next;
#ifdef CONFIG_64BIT
            int pages;      /* Nr of partial slabs left */
            int pobjects;   /* Approximate # of objects */
#else
            short int pages;
            short int pobjects;
#endif
        };

        struct slab *slab_page; /* slab fields */
        struct rcu_head rcu_head;   /* Used by SLAB
                                     * when destroying via RCU
                                     */
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
        pgtable_t pmd_huge_pte; /* protected by page->ptl */
#endif
    };

    /* Remainder is not double word aligned */
    union {
        unsigned long private;  /* Mapping-private opaque data:
                                 * usually used for buffer_heads
                                 * if PagePrivate set; used for
                                 * swp_entry_t if PageSwapCache;
                                 * indicates order in the buddy
                                 * system if PG_buddy is set. */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t *ptl;
#else
        spinlock_t ptl;
#endif
#endif
        struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
        struct page *first_page;    /* Compound tail pages: points
                                     * to the head page */
    };

#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                             * not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
    unsigned long debug_flags;  /* Use atomic bitops on this */
#endif
#ifdef CONFIG_KMEMCHECK
    void *shadow;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
};

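The definition is complicated for two reasons. Every physical page frame needs one of these structures, so there are a great many instances and the structure must stay small. At the same time it serves many purposes, each with its own requirements, which leads to many members. The result is the convoluted definition above. Only the most important parts are kept here.

The layout of the structure is architecture-independent: it does not depend on the CPU type, and every page frame is described by it. Besides the slab-related members (slab, freelist and inuse), struct page contains a number of other members. They are only outlined here and are covered in more detail later.

flags: stores architecture-independent flags that describe the attributes of the page.

_count: a usage count giving the number of references to the page within the kernel. When it drops to 0, the kernel knows that the page instance is currently unused and can be removed. While it is greater than 0, the instance is never removed from memory.

_mapcount: the number of page table entries that point to the page.

lru: a list head used to keep the page on various lists so that pages can be grouped into categories, the most important being active and inactive pages.

The kernel can combine several adjacent pages into a larger compound page. The first page of the group is the head page and all the others are tail pages. In the page instance of every tail page, first_page points to the head page.

mapping specifies the address space in which the page frame is located, and index is the offset of the page frame within the mapping. An address space is a very general concept; for example, when a file is read into memory, it connects the file contents with the memory areas into which the data are loaded. mapping can hold more than a pointer: its lowest bit carries extra information that tells whether the page belongs to an anonymous memory area not associated with an address space.

private is a pointer to "private" data that virtual memory management ignores. It can be used in different ways depending on what the page is used for; in most cases it links the page to data buffers.

virtual is used for pages in the high-memory area, i.e. pages that cannot be mapped directly into kernel memory; it stores the virtual address of such a page.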
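
The trick of storing a flag in the lowest bit of mapping works because the pointed-to objects are aligned, so that bit of a valid pointer is always 0. A user-space sketch of the idea follows; the two structs are empty placeholders and the three helper functions are hypothetical names, only PAGE_MAPPING_ANON is taken from the kernel comment quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON 1UL /* low bit set: mapping is an anon_vma */

/* Placeholder types; int members guarantee at least 2-byte alignment. */
struct address_space { int dummy; };
struct anon_vma { int dummy; };

/* Store an anon_vma pointer with the tag bit set. */
static void *tag_anon(struct anon_vma *av)
{
    assert(((uintptr_t)av & PAGE_MAPPING_ANON) == 0);
    return (void *)((uintptr_t)av | PAGE_MAPPING_ANON);
}

/* Does this mapping value describe anonymous memory? */
static int mapping_is_anon(const void *mapping)
{
    return ((uintptr_t)mapping & PAGE_MAPPING_ANON) != 0;
}

/* Strip the tag bit to recover the original anon_vma pointer. */
static struct anon_vma *mapping_anon_vma(void *mapping)
{
    return (struct anon_vma *)((uintptr_t)mapping &
                               ~(uintptr_t)PAGE_MAPPING_ANON);
}
```

A single pointer-sized field can thus serve file-backed and anonymous pages alike without an extra type member, which fits the goal of keeping struct page small.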

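The attributes of a page are described by a set of page flags stored as individual bits in the flags member of struct page. The flags are independent of the architecture in use. They are defined in include/linux/page-flags.h, along with macros for setting, clearing and testing them.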

/*
 * Various page->flags bits:
 *
 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
 * private allocations for its own usage.
 *
 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
 * is set before writeback starts and cleared when it finishes.
 *
 * PG_locked also pins a page in pagecache, and blocks truncation of the file
 * while it is held.
 *
 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
 * to become unlocked.
 *
 * PG_uptodate tells whether the page's contents is valid.  When a read
 * completes, the page becomes uptodate, unless a disk I/O error happened.
 *
 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
 * file-backed pagecache (see mm/vmscan.c).
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit.  The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.
 *
 * PG_highmem pages are not permanently mapped into the kernel virtual address
 * space, they need to be kmapped separately for doing IO on the pages.  The
 * struct page (these bits with information) are always mapped into kernel
 * address space...
 *
 * PG_hwpoison indicates that a page got corrupted in hardware and contains
 * data with incorrect ECC bits that triggered a machine check. Accessing is
 * not safe since it may cause another machine check. Don't touch!
 */

/*
 * Don't use the *_dontuse flags.  Use the macros.  Otherwise you'll break
 * locked- and dirty-page accounting.
 *
 * The page flags field is split into two parts, the main flags area
 * which extends from the low bits upwards, and the fields area which
 * extends from the high bits downwards.
 *
 *  | FIELD | ... | FLAGS |
 *  N-1           ^       0
 *               (NR_PAGEFLAGS)
 *
 * The fields area is reserved for fields mapping zone, node (for NUMA) and
 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like
 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP).
 */
enum pageflags {
    PG_locked,          /* Page is locked. Don't touch. */
    PG_error,
    PG_referenced,
    PG_uptodate,
    PG_dirty,
    PG_lru,
    PG_active,
    PG_slab,
    PG_owner_priv_1,    /* Owner use. If pagecache, fs may use*/
    PG_arch_1,
    PG_reserved,
    PG_private,         /* If pagecache, has fs-private data */
    PG_private_2,       /* If pagecache, has fs aux data */
    PG_writeback,       /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
    PG_head,            /* A head page */
    PG_tail,            /* A tail page */
#else
    PG_compound,        /* A compound page */
#endif
    PG_swapcache,       /* Swap page: swp_entry_t in private */
    PG_mappedtodisk,    /* Has blocks allocated on-disk */
    PG_reclaim,         /* To be reclaimed asap */
    PG_swapbacked,      /* Page is backed by RAM/swap */
    PG_unevictable,     /* Page is "unevictable" */
#ifdef CONFIG_MMU
    PG_mlocked,         /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
    PG_uncached,        /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
    PG_hwpoison,        /* hardware poisoned page. Don't touch */
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    PG_compound_lock,
#endif
    __NR_PAGEFLAGS,

    /* Filesystems */
    PG_checked = PG_owner_priv_1,

    /* Two page bits are conscripted by FS-Cache to maintain local caching
     * state.  These bits are set on pages belonging to the netfs's inodes
     * when those inodes are being locally cached.
     */
    PG_fscache = PG_private_2,  /* page backed by cache */

    /* XEN */
    PG_pinned = PG_owner_priv_1,
    PG_savepinned = PG_dirty,

    /* SLOB */
    PG_slob_free = PG_private,
};

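The kernel defines standard macros to test whether a page has a particular bit set or to manipulate a bit. Their names follow a fixed pattern:

PageXXX(page) tests whether the page has the PG_XXX bit set; for example, PageDirty checks the PG_dirty bit and PageActive checks the PG_active bit.

SetPageXXX sets the bit if it is not already set.

ClearPageXXX clears the bit unconditionally.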
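
The naming pattern lends itself to macro generation. The following user-space sketch generates the three accessors per flag with C11 atomics; it is an illustration of the pattern only. struct page_sketch and the shortened enum are assumptions for this example, and while the macro name PAGEFLAG echoes the kernel's, the body here is not the kernel's implementation (which uses its own atomic bit operations).

```c
#include <assert.h>
#include <stdatomic.h>

/* Shortened flag list for the sketch. */
enum pageflags { PG_locked, PG_error, PG_referenced, PG_uptodate, PG_dirty };

/* Hypothetical stand-in for struct page: only the flags word. */
struct page_sketch {
    atomic_ulong flags;
};

/* Generate PageXXX, SetPageXXX and ClearPageXXX for one flag. */
#define PAGEFLAG(uname, lname)                                          \
static int Page##uname(struct page_sketch *p)                           \
{ return (int)((atomic_load(&p->flags) >> PG_##lname) & 1UL); }         \
static void SetPage##uname(struct page_sketch *p)                       \
{ atomic_fetch_or(&p->flags, 1UL << PG_##lname); }                      \
static void ClearPage##uname(struct page_sketch *p)                     \
{ atomic_fetch_and(&p->flags, ~(1UL << PG_##lname)); }

PAGEFLAG(Locked, locked)
PAGEFLAG(Dirty, dirty)
```

After these two expansions, PageDirty(), SetPageDirty(), ClearPageDirty() and the corresponding Locked variants exist as ordinary functions, and each one touches only its own bit of the flags word.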

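These operations are implemented atomically. In many situations the kernel must wait for the state of a page to change before it can resume work. Two helper functions are provided for waiting on such a change: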

/*
 * Wait for a page to be unlocked.
 *
 * This must be called with the caller "holding" the page,
 * ie with increased "page->count" so that the page won't
 * go away during the wait..
 */
static inline void wait_on_page_locked(struct page *page)
{
    if (PageLocked(page))
        wait_on_page_bit(page, PG_locked);
}

/*
 * Wait for a page to complete writeback
 */
static inline void wait_on_page_writeback(struct page *page)
{
    if (PageWriteback(page))
        wait_on_page_bit(page, PG_writeback);
}

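Suppose one part of the kernel needs to wait until a locked page is unlocked. wait_on_page_locked provides this. If it is called while the page is locked, the caller goes to sleep; once the page is unlocked, the sleeping process is woken automatically and continues its work.

wait_on_page_writeback waits until all pending writeback operations on the page have finished, i.e. until the data in the page have been synchronized to the block device.

Summary: This article described the core data structures of memory management: the node (pg_data_t), the zone (struct zone), and the page frame, i.e. the physical page (struct page), together with the basic concepts behind them.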
