Rereading Kernel Memory Management (2): Related Data Structures

 
快乐虾
http://blog.csdn.net/lights_joy/
lights@hb165.com
  
 
This article applies to:
ADI bf561 DSP
uclinux-2008r1-rc8 (ported to vdsp5)
Visual DSP++ 5.0
  
 
Feel free to repost, but please keep the author information.
 
pglist_data is defined in include/linux/mmzone.h:
 
/*
 * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
 * (mostly NUMA machines?) to denote a higher-level memory zone than the
 * zone denotes.
 *
 * On NUMA machines, each NUMA node would have a pg_data_t to describe
 * it's memory layout.
 *
 * Memory statistics and page replacement data structures are maintained on a
 * per-zone basis.
 */
struct bootmem_data;
typedef struct pglist_data {
     struct zone node_zones[MAX_NR_ZONES];
     struct zonelist node_zonelists[MAX_NR_ZONES];
     int nr_zones;
     struct page *node_mem_map;
     struct bootmem_data *bdata;
     unsigned long node_start_pfn;
     unsigned long node_present_pages; /* total number of physical pages */
     unsigned long node_spanned_pages; /* total size of physical page
                            range, including holes */
     int node_id;
     wait_queue_head_t kswapd_wait;
     struct task_struct *kswapd;
     int kswapd_max_order;
} pg_data_t;
This structure describes the layout of the available memory.
• bdata
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
From this definition we can also see that bdata always points to a fixed location, contig_bootmem_data, and that this member is no longer used once mem_init has been called.
• node_zones
Of the zones in this structure the kernel actually uses only ZONE_DMA (index 0); it covers the range from the end of the kernel code to the end of physical memory.
• node_id
Since the whole kernel uses only one node, node_id is 0.
• node_start_pfn
This will be 0.
• node_spanned_pages / node_present_pages
Both members are initialized in calculate_node_totalpages. Their value is the total number of SDRAM pages, including unused regions and the kernel code itself, so at this point the two are equal. For 64MB of memory (actually limited to 60MB) the value is 0x3bff (see the arithmetic sketch after this list).
• node_mem_map
Every 4KB page managed by the kernel has a struct page associated with it; this member points to the start of that page array, which is allocated during initialization by alloc_node_mem_map (using bootmem).
• nr_zones
This holds the highest index of a usable zone plus one. On the BF561 only ZONE_DMA is used, so the value is 1.
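The 0x3bff figure quoted for node_spanned_pages / node_present_pages is simply the span of page frames covered by the node. Below is a minimal user-space sketch of that arithmetic, using the end PFN reported in the text (which corresponds to a memory_end just below 60MB) rather than values read from a running kernel:
/* Sketch of what calculate_node_totalpages() ends up with on this board:
 * both counters equal end_pfn - start_pfn because the node has no holes. */
#include <stdio.h>

int main(void)
{
	unsigned long node_start_pfn = 0;        /* SDRAM starts at physical address 0 */
	unsigned long node_end_pfn   = 0x3bff;   /* value quoted above for 64MB/60MB   */
	unsigned long spanned        = node_end_pfn - node_start_pfn;

	printf("node_spanned_pages = %#lx\n", spanned);
	printf("node_present_pages = %#lx\n", spanned);  /* no holes, so identical */
	return 0;
}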
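For a combined view of these fields, a hypothetical kernel-side helper (dump_node0() is made up for illustration; the symbols it reads are the real ones discussed above) might look like this:
#include <linux/kernel.h>
#include <linux/mmzone.h>

/* Dump the node-0 fields discussed above; purely illustrative. */
static void dump_node0(void)
{
	pg_data_t *pgdat = &contig_page_data;

	printk(KERN_INFO "node %d: start_pfn=%lu spanned=%#lx present=%#lx nr_zones=%d\n",
	       pgdat->node_id, pgdat->node_start_pfn,
	       pgdat->node_spanned_pages, pgdat->node_present_pages,
	       pgdat->nr_zones);
}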
 
The per_cpu_pageset structure, together with the types it relies on (zone_stat_item and per_cpu_pages), is defined in include/linux/mmzone.h:
enum zone_stat_item {
     /* First 128 byte cacheline (assuming 64 bit words) */
     NR_FREE_PAGES,
     NR_INACTIVE,
     NR_ACTIVE,
     NR_ANON_PAGES,     /* Mapped anonymous pages */
     NR_FILE_MAPPED,    /* pagecache pages mapped into pagetables.
                 only modified from process context */
     NR_FILE_PAGES,
     NR_FILE_DIRTY,
     NR_WRITEBACK,
     /* Second 128 byte cacheline */
     NR_SLAB_RECLAIMABLE,
     NR_SLAB_UNRECLAIMABLE,
     NR_PAGETABLE,      /* used for pagetables */
     NR_UNSTABLE_NFS,   /* NFS unstable pages */
     NR_BOUNCE,
     NR_VMSCAN_WRITE,
     NR_VM_ZONE_STAT_ITEMS };
 
struct per_cpu_pages {
     int count;         /* number of pages in the list */
     int high;     /* high watermark, emptying needed */
     int batch;         /* chunk size for buddy add/remove */
     struct list_head list; /* the list of pages */
};
 
struct per_cpu_pageset {
     struct per_cpu_pages pcp[2];     /* 0: hot. 1: cold */
     s8 stat_threshold;
     s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_aligned_in_smp;
This structure is used only inside struct zone, and is accessed through the zone_pcp macro:
#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
The kernel frequently needs to allocate and free single pages. To improve performance, every memory zone defines a per-CPU page cache. Each cache holds a few pre-allocated pages that are used to satisfy single-page requests issued by the local CPU.
In fact two caches are provided per zone and per CPU: a hot cache, whose page frames are likely still present in the CPU's hardware cache, and a cold cache.
The kernel keeps the size of each cache in check with the count, high and batch fields: when a cache runs empty, it is refilled with batch single pages taken from the buddy system; conversely, when count climbs past the high watermark, batch page frames are released from the cache back to the buddy system. A toy restatement of this rule is sketched below.
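The following user-space toy (not kernel code; refill() and drain_if_needed() are made-up stand-ins for the work done by buffered_rmqueue() and free_hot_cold_page() in mm/page_alloc.c) restates that refill/drain rule:
#include <stdio.h>

struct per_cpu_pages { int count, high, batch; };

/* allocation path: an empty per-CPU list is refilled with 'batch' pages */
static void refill(struct per_cpu_pages *pcp)
{
	if (pcp->count == 0) {
		pcp->count += pcp->batch;
		printf("refilled %d pages from the buddy system\n", pcp->batch);
	}
}

/* free path: once 'count' reaches 'high', 'batch' pages go back to the buddy system */
static void drain_if_needed(struct per_cpu_pages *pcp)
{
	if (pcp->count >= pcp->high) {
		pcp->count -= pcp->batch;
		printf("returned %d pages to the buddy system\n", pcp->batch);
	}
}

int main(void)
{
	struct per_cpu_pages hot = { .count = 0, .high = 18, .batch = 3 };  /* high = 6*batch */

	refill(&hot);           /* a single-page allocation finds the list empty  */
	hot.count = 18;         /* pretend frees have filled the list up to 'high' */
	drain_if_needed(&hot);
	return 0;
}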
The structure is initialized by setup_pageset:
inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
     struct per_cpu_pages *pcp;
 
     memset(p, 0, sizeof(*p));
 
     pcp = &p->pcp[0];      /* hot */
     pcp->count = 0;
     pcp->high = 6 * batch;
     pcp->batch = max(1UL, 1 * batch);
     INIT_LIST_HEAD(&pcp->list);
 
     pcp = &p->pcp[1];      /* cold*/
     pcp->count = 0;
     pcp->high = 2 * batch;
     pcp->batch = max(1UL, batch/2);
     INIT_LIST_HEAD(&pcp->list);
}
For 64MB of SDRAM, batch works out to 3.
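Where that 3 comes from can be reproduced in user space; the sketch below redoes the calculation performed by zone_batchsize() in mm/page_alloc.c of this kernel generation (rounddown_pow_of_two() here is a simplified stand-in for the kernel's helper), using the present_pages value quoted later in this article:
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* simplified stand-in for the kernel's rounddown_pow_of_two() */
static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	unsigned long present_pages = 0x3b6a;   /* ZONE_DMA pages for 64MB/60MB SDRAM */
	unsigned long batch;

	batch = present_pages / 1024;           /* aim for ~1/1000 of the zone...     */
	if (batch * PAGE_SIZE > 512 * 1024)     /* ...but no more than half a meg     */
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	batch = rounddown_pow_of_two(batch + batch / 2) - 1;   /* clamp to 2^n - 1 */

	printf("batch = %lu\n", batch);         /* prints 3 */
	return 0;
}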
 
struct zone itself is defined in include/linux/mmzone.h:
struct zone {
     /* Fields commonly accessed by the page allocator */
     unsigned long      pages_min, pages_low, pages_high;
     /*
      * We don't know if the memory that we're going to allocate will be freeable
      * or/and it will be released eventually, so to avoid totally wasting several
      * GB of ram we must reserve some of the lower zone memory (otherwise we risk
      * to run OOM on the lower zones despite there's tons of freeable ram
      * on the higher zones). This array is recalculated at runtime if the
      * sysctl_lowmem_reserve_ratio sysctl changes.
      */
     unsigned long      lowmem_reserve[MAX_NR_ZONES];
 
     struct per_cpu_pageset pageset[NR_CPUS];
     /*
      * free areas of different sizes
      */
     spinlock_t         lock;
     struct free_area   free_area[MAX_ORDER];
 
     ZONE_PADDING(_pad1_)
 
     /* Fields commonly accessed by the page reclaim scanner */
     spinlock_t         lru_lock;
     struct list_head   active_list;
     struct list_head   inactive_list;
     unsigned long      nr_scan_active;
     unsigned long      nr_scan_inactive;
     unsigned long      pages_scanned;        /* since last reclaim */
     int           all_unreclaimable; /* All pages pinned */
 
     /* A count of how many reclaimers are scanning this zone */
     atomic_t      reclaim_in_progress;
 
     /* Zone statistics */
     atomic_long_t      vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
     /*
      * prev_priority holds the scanning priority for this zone. It is
      * defined as the scanning priority at which we achieved our reclaim
      * target at the previous try_to_free_pages() or balance_pgdat()
      * invokation.
      *
      * We use prev_priority as a measure of how much stress page reclaim is
      * under - it drives the swappiness decision: whether to unmap mapped
      * pages.
      *
      * Access to both this field is quite racy even on uniprocessor. But
      * it is expected to average out OK.
      */
     int prev_priority;
 
 
     ZONE_PADDING(_pad2_)
     /* Rarely used or read-mostly fields */
 
     /*
      * wait_table      -- the array holding the hash table
      * wait_table_hash_nr_entries    -- the size of the hash table array
      * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
      *
      * The purpose of all these is to keep track of the people
      * waiting for a page to become available and make them
      * runnable again when possible. The trouble is that this
      * consumes a lot of space, especially when so few things
      * wait on pages at a given time. So instead of using
      * per-page waitqueues, we use a waitqueue hash table.
      *
      * The bucket discipline is to sleep on the same queue when
      * colliding and wake all in that wait queue when removing.
      * When something wakes, it must check to be sure its page is
      * truly available, a la thundering herd. The cost of a
      * collision is great, but given the expected load of the
      * table, they should be so rare as to be outweighed by the
      * benefits from the saved space.
      *
      * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
      * primary users of these fields, and in mm/page_alloc.c
      * free_area_init_core() performs the initialization of them.
      */
     wait_queue_head_t * wait_table;
     unsigned long      wait_table_hash_nr_entries;
     unsigned long      wait_table_bits;
 
     /*
      * Discontig memory support fields.
      */
     struct pglist_data *zone_pgdat;
     /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
     unsigned long      zone_start_pfn;
 
     /*
      * zone_start_pfn, spanned_pages and present_pages are all
      * protected by span_seqlock. It is a seqlock because it has
      * to be read outside of zone->lock, and it is done in the main
      * allocator path. But, it is written quite infrequently.
      *
      * The lock is declared along with zone->lock because it is
      * frequently read in proximity to zone->lock. It's good to
      * give them a chance of being in the same cacheline.
      */
     unsigned long      spanned_pages;     /* total size, including holes */
     unsigned long      present_pages;     /* amount of memory (excluding holes) */
 
     /*
      * rarely used fields:
      */
     const char         *name;
} ____cacheline_internodealigned_in_smp;
Every zone the kernel actually touches is contig_page_data.node_zones[0]; node_zones[1] has no usable space and can be ignored.
• spanned_pages / present_pages
These give the number of usable SDRAM pages, covering the range from 0 to 60MB. present_pages is spanned_pages minus the pages occupied by the page array. For 64MB of SDRAM with MTD disabled, spanned_pages is 0x3bff and present_pages is 0x3b6a.
• zone_pgdat
Points to the single global pglist_data, contig_page_data.
• prev_priority
Initialized to DEF_PRIORITY:
/*
 * The "priority" of VM scanning is how much of the queues we will scan in one
 * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
 * queues ("queue_length >> 12") during an aging round.
 */
#define DEF_PRIORITY 12
• wait_table_hash_nr_entries / wait_table_bits
For 64MB of memory, wait_table_hash_nr_entries is 0x40 and wait_table_bits is 6. The wait_table itself occupies wait_table_hash_nr_entries * sizeof(wait_queue_head_t) bytes (see the sizing sketch after this list).
• free_area[MAX_ORDER]
struct free_area {
     struct list_head   free_list;
     unsigned long      nr_free;
};
In the buddy algorithm, free pages are organized into 11 block lists; the blocks on these lists contain 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 contiguous pages respectively. This member holds those lists.
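The 0x40 and 6 quoted for the wait table can likewise be reproduced in user space. This is only a minimal re-creation of wait_table_hash_nr_entries() and wait_table_bits() from mm/page_alloc.c, assuming PAGES_PER_WAITQUEUE is 256 as in this kernel:
#include <stdio.h>

#define PAGES_PER_WAITQUEUE 256UL

int main(void)
{
	unsigned long pages = 0x3bff;   /* pages covered by the 60MB zone       */
	unsigned long size = 1, bits = 0;

	pages /= PAGES_PER_WAITQUEUE;   /* roughly one queue per 256 pages      */
	while (size < pages)
		size <<= 1;             /* round up to a power of two           */
	if (size < 4)
		size = 4;               /* the kernel also clamps this to 4096  */

	while ((1UL << bits) < size)
		bits++;

	printf("wait_table_hash_nr_entries = %#lx\n", size);   /* 0x40 */
	printf("wait_table_bits            = %lu\n",  bits);   /* 6    */
	return 0;
}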
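To make the free_area layout concrete, here is a small, purely illustrative program that prints the block size managed by each order, assuming MAX_ORDER is 11 and 4KB pages as on this platform:
#include <stdio.h>

#define MAX_ORDER 11
#define PAGE_SIZE 4096UL

int main(void)
{
	int order;

	for (order = 0; order < MAX_ORDER; order++)
		printf("free_area[%2d]: blocks of %4lu contiguous pages (%5lu KB)\n",
		       order, 1UL << order, ((1UL << order) * PAGE_SIZE) / 1024);
	return 0;
}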
struct zonelist is defined in include/linux/mmzone.h:
/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 *
 * If zlcache_ptr is not NULL, then it is just the address of zlcache,
 * as explained above. If zlcache_ptr is NULL, there is no zlcache.
 */
 
struct zonelist {
     struct zonelist_cache *zlcache_ptr;            // NULL or &zlcache
     struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
};
Here,
/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
Its value here is 2.
• zones
This structure is used only inside pglist_data. After initialization, zones effectively contains a single entry, pointing to ZONE_DMA, i.e. contig_page_data.node_zones[0].
• zlcache_ptr
This field is not actually used and remains NULL.
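To illustrate the "goal zone first, then fallbacks" walk described in the comment above, here is a toy user-space model (try_zone() and the 100-page zone are made up for the example; on this board the NULL-delimited list holds a single zone, so the loop runs at most once):
#include <stdio.h>
#include <stddef.h>

struct zone { const char *name; int free_pages; };

/* made-up stand-in for asking one zone's buddy allocator for pages */
static struct zone *try_zone(struct zone *z, int pages)
{
	return z->free_pages >= pages ? z : NULL;
}

int main(void)
{
	struct zone dma = { "DMA", 100 };
	struct zone *zones[] = { &dma, NULL };   /* NULL-delimited, like zonelist->zones */
	struct zone **z;

	for (z = zones; *z != NULL; z++) {       /* goal zone first, then fallbacks */
		if (try_zone(*z, 4)) {
			printf("allocated from zone %s\n", (*z)->name);
			break;
		}
	}
	return 0;
}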
 
struct page is defined in include/linux/mm_types.h:
/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 */
struct page {
     unsigned long flags;        /* Atomic flags, some possibly
                        * updated asynchronously */
     atomic_t _count;       /* Usage count, see below. */
     union {
         atomic_t _mapcount;    /* Count of ptes mapped in mms,
                        * to show when page is mapped
                        * & limit reverse map searches.
                        */
         struct { /* SLUB uses */
              short unsigned int inuse;
              short unsigned int offset;
         };
     };
     union {
         struct {
         unsigned long private;      /* Mapping-private opaque data:
                             * usually used for buffer_heads
                             * if PagePrivate set; used for
                             * swp_entry_t if PageSwapCache;
                             * indicates order in the buddy
                             * system if PG_buddy is set.
                             */
         struct address_space *mapping;   /* If low bit clear, points to
                             * inode address_space, or NULL.
                             * If page mapped as anonymous
                             * memory, low bit is set, and
                             * it points to anon_vma object:
                             * see PAGE_MAPPING_ANON below.
                             */
         };
         spinlock_t ptl;
         struct {           /* SLUB uses */
         void **lockless_freelist;
         struct kmem_cache *slab;    /* Pointer to slab */
         };
         struct {
         struct page *first_page;    /* Compound pages */
         };
     };
     union {
         pgoff_t index;         /* Our offset within mapping. */
         void *freelist;        /* SLUB: freelist req. slab lock */
     };
     struct list_head lru;       /* Pageout list, eg. active_list
                        * protected by zone->lru_lock !
                        */
};
The kernel describes every 4KB memory page with one page structure; the virt_to_page macro quickly finds the page structure for a given physical address.
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array. That memory is allocated with bootmem in alloc_node_mem_map, but unlike other bootmem allocations it is never reclaimed.
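As a hypothetical usage sketch (not taken from the original text; show_page_of_buffer() is a made-up helper), this is how kernel code could map a kmalloc'ed buffer back to its struct page with virt_to_page:
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Find the struct page behind a small kernel buffer and print its PFN. */
static void show_page_of_buffer(void)
{
	void *buf = kmalloc(128, GFP_KERNEL);   /* kernel address == physical here */
	struct page *pg;

	if (!buf)
		return;

	pg = virt_to_page(buf);                 /* mem_map + (addr >> PAGE_SHIFT)  */
	printk(KERN_INFO "buffer %p is in pfn %lu, flags %#lx\n",
	       buf, page_to_pfn(pg), pg->flags);
	kfree(buf);
}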
The individual bits of this structure's flags member are defined in include/linux/page-flags.h:
/*
 * Various page->flags bits:
 *
 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
 * private allocations for its own usage.
 *
 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
 * is set before writeback starts and cleared when it finishes.
 *
 * PG_locked also pins a page in pagecache, and blocks truncation of the file
 * while it is held.
 *
 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
 * to become unlocked.
 *
 * PG_uptodate tells whether the page's contents is valid. When a read
 * completes, the page becomes uptodate, unless a disk I/O error happened.
 *
 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
 * file-backed pagecache (see mm/vmscan.c).
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit. The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.
 *
 * PG_highmem pages are not permanently mapped into the kernel virtual address
 * space, they need to be kmapped separately for doing IO on the pages. The
 * struct page (these bits with information) are always mapped into kernel
 * address space...
 *
 * PG_buddy is set to indicate that the page is free and in the buddy system
 * (see mm/page_alloc.c).
 *
 */
 
/*
 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
 * locked- and dirty-page accounting.
 *
 * The page flags field is split into two parts, the main flags area
 * which extends from the low bits upwards, and the fields area which
 * extends from the high bits downwards.
 *
 * | FIELD | ... | FLAGS |
 * N-1     ^             0
 *          (N-FLAGS_RESERVED)
 *
 * The fields area is reserved for fields mapping zone, node and SPARSEMEM
 * section. The boundry between these two areas is defined by
 * FLAGS_RESERVED which defines the width of the fields section
 * (see linux/mmzone.h). New flags must _not_ overlap with this area.
 */
#define PG_locked      0   /* Page is locked. Don't touch. */
#define PG_error       1
#define PG_referenced       2
#define PG_uptodate         3
 
#define PG_dirty        4
#define PG_lru              5
#define PG_active      6
#define PG_slab             7   /* slab debug (Suparna wants this) */
 
#define PG_owner_priv_1     8   /* Owner use. If pagecache, fs may use*/
#define PG_arch_1      9
#define PG_reserved         10
#define PG_private     11   /* If pagecache, has fs-private data */
 
#define PG_writeback        12   /* Page is under writeback */
#define PG_compound         14   /* Part of a compound page */
#define PG_swapcache        15   /* Swap page: swp_entry_t in private */
 
#define PG_mappedtodisk     16   /* Has blocks allocated on-disk */
#define PG_reclaim     17   /* To be reclaimed asap */
#define PG_buddy       19   /* Page is free, on buddy lists */
 
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked     PG_owner_priv_1 /* Used by some filesystems */
 
At initialization, flags is set to PG_reserved for every page.
The initialization of the page structures can be seen in memmap_init_zone, in mm/page_alloc.c:
/*
 * Initially all pages are reserved - free ones are freed
 * up by free_all_bootmem() once the early boot process is
 * done. Non-atomic initialization, single-pass.
 */
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
         unsigned long start_pfn, enum memmap_context context)
{
     struct page *page;
     unsigned long end_pfn = start_pfn + size;
     unsigned long pfn;
 
     for (pfn = start_pfn; pfn < end_pfn; pfn++) {
         /*
          * There can be holes in boot-time mem_map[]s
          * handed to this function. They do not
          * exist on hotplugged memory.
          */
         if (context == MEMMAP_EARLY) {
              if (!early_pfn_valid(pfn))
                   continue;
              if (!early_pfn_in_nid(pfn, nid))
                   continue;
         }
         page = pfn_to_page(pfn);
         set_page_links(page, zone, nid, pfn);
         init_page_count(page);
         reset_page_mapcount(page);
         SetPageReserved(page);
         INIT_LIST_HEAD(&page->lru);
     }
}
In the function above, because both the zone and nid arguments are 0, set_page_links() contributes nothing to flags, so as far as flags is concerned only
SetPageReserved(page);
takes effect, leaving the flags member equal to 0x400 (PG_reserved).
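The 0x400 value follows directly from the bit position of PG_reserved; a trivial check:
#include <stdio.h>

#define PG_reserved 10   /* bit number from include/linux/page-flags.h */

int main(void)
{
	printf("flags with only PG_reserved set = %#lx\n", 1UL << PG_reserved);  /* 0x400 */
	return 0;
}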
Below are several macros that operate on pages.
virt_to_pfn is defined in include/asm/page.h:
#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
#define __pa(vaddr)         virt_to_phys((void *)(vaddr))
#define virt_to_phys(vaddr) ((unsigned long) (vaddr))
From these three definitions we can see that virt_to_pfn takes an address (on this MMU-less platform virtual and physical addresses are identical) and computes the index of its page in the overall page array, i.e. its page frame number.
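A tiny user-space illustration of that arithmetic (the address below is arbitrary; with PAGE_OFFSET equal to 0, the PFN is simply the address shifted right by PAGE_SHIFT):
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long addr = 0x00123456;            /* arbitrary example address   */
	unsigned long pfn  = addr >> PAGE_SHIFT;    /* what virt_to_pfn() computes */

	printf("address %#lx lies in page frame %#lx\n", addr, pfn);
	return 0;
}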
page_to_virt is defined in include/asm/page.h:
#define page_to_virt(page) ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
Here mem_map again points to the first element of the page array, which is allocated with bootmem in alloc_node_mem_map.
Here,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a pointer to a page structure and computes the address of the physical memory page that the structure represents.
 
page_to_pfn is defined in include/asm/page.h:
#define page_to_pfn(page)   virt_to_pfn(page_to_virt(page))
From the virt_to_pfn and page_to_virt definitions, page_to_pfn takes a pointer to a page structure and yields that page's frame number within the overall page array.
In fact,
#define page_to_pfn(page)   ((page) - mem_map)
would achieve the same result.
pfn_to_virt is defined in include/asm/page.h:
#define pfn_to_virt(pfn)    __va((pfn) << PAGE_SHIFT)
#define __va(paddr)         phys_to_virt((unsigned long)(paddr))
#define phys_to_virt(vaddr) ((void *) (vaddr))
From these three definitions, this macro takes a page frame number and returns the physical address of the page with that number.
virt_to_page is defined in include/asm/page.h:
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array.
Here,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a physical address and computes a pointer to the page structure representing the page that contains that address.
pfn_to_page is defined in include/asm/page.h:
#define pfn_to_page(pfn)    virt_to_page(pfn_to_virt(pfn))
From the pfn_to_virt and virt_to_page definitions, pfn_to_page takes a page frame number and returns a pointer to the corresponding page structure.
In fact,
#define pfn_to_page(pfn)    (mem_map + (pfn))
would achieve the same result.
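That equivalence can be checked with a toy user-space model of mem_map (the trivial struct page stand-in and the 16-page mem_map below are made up for the example):
#include <stdio.h>

struct page { unsigned long flags; };          /* trivial stand-in for struct page */

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0UL

static struct page mem_map[16];                /* pretend the node has 16 pages */

#define page_to_virt(page) ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
#define virt_to_pfn(kaddr) (((unsigned long)(kaddr)) >> PAGE_SHIFT)
#define page_to_pfn(page)  virt_to_pfn(page_to_virt(page))

int main(void)
{
	struct page *p = &mem_map[5];

	/* the macro chain and the plain pointer subtraction agree */
	printf("page_to_pfn via the macros : %lu\n", page_to_pfn(p));
	printf("page - mem_map             : %ld\n", (long)(p - mem_map));

	/* and mem_map + pfn recovers the same struct page pointer */
	printf("round trip ok              : %d\n", (mem_map + page_to_pfn(p)) == p);
	return 0;
}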