快乐虾
http://blog.csdn.net/lights_joy/
lights@hb165.com
This article applies to:
ADI BF561 DSP
uclinux-2008r1-rc8 (ported to vdsp5)
Visual DSP++ 5.0
Reprinting is welcome, but please retain the author information.
struct pglist_data is defined in include/linux/mmzone.h:
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
* zone denotes.
*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* it's memory layout.
*
* Memory statistics and page replacement data structures are maintained on a
* per-zone basis.
*/
struct bootmem_data;

typedef struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	struct zonelist node_zonelists[MAX_NR_ZONES];
	int nr_zones;
	struct page *node_mem_map;
	struct bootmem_data *bdata;
	unsigned long node_start_pfn;
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */
	int node_id;
	wait_queue_head_t kswapd_wait;
	struct task_struct *kswapd;
	int kswapd_max_order;
} pg_data_t;
This structure describes the layout of the available memory.
- bdata

static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };

From this definition we can also see that bdata always points to one fixed location, contig_bootmem_data, and that after mem_init() has been called this member is no longer used.
- zone

Of the zones in this structure, the kernel actually uses only ZONE_DMA (index 0), which spans from the end of the kernel code to the end of physical memory.
- node_id

Since the whole kernel uses only a single node, node_id in this structure is 0.
- node_start_pfn

This will be 0.
- node_spanned_pages and node_present_pages

Both members are initialized in calculate_node_totalpages. Their value is the number of SDRAM pages, including unused regions, the kernel code, and so on, and the two are equal. For 64MB of memory (actually limited to 60MB), the value is 0x3bff.
- node_mem_map

Every 4K page in the kernel has a struct page associated with it; this member points to the first element of that page array. The array is allocated at initialization time by alloc_node_mem_map (using bootmem).
- nr_zones

This is the index of the highest populated zone, plus one. Since the BF561 only uses ZONE_DMA, this value is 1.
struct per_cpu_pageset and its supporting types are defined in include/linux/mmzone.h:
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_INACTIVE,
NR_ACTIVE,
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
NR_FILE_PAGES,
NR_FILE_DIRTY,
NR_WRITEBACK,
/* Second 128 byte cacheline */
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
NR_PAGETABLE, /* used for pagetables */
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
NR_VM_ZONE_STAT_ITEMS };
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
struct list_head list; /* the list of pages */
};
struct per_cpu_pageset {
struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
s8 stat_threshold;
s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_aligned_in_smp;
This structure is used only inside struct zone, and is accessed through the zone_pcp macro:
#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
The kernel frequently requests and releases single page frames. To improve system performance, each memory zone defines a per-CPU page cache; each cache holds some pre-allocated pages, which are used to satisfy single-page requests issued by the local CPU.

In fact, each zone provides two caches per CPU: a hot cache, whose page frames most likely still reside in the CPU's hardware cache, and a cold cache.

The kernel uses the high watermark and the batch size to manage each cache: when a cache runs empty, it is refilled with batch single pages taken from the buddy system; conversely, when the number of cached page frames rises above high, batch page frames are released from the cache back to the buddy system.
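This refill/drain policy can be sketched as a small stand-alone simulation (a simplified model for illustration only, not kernel code: the buddy system is treated as an inexhaustible pool, and the field names mirror struct per_cpu_pages):

```c
#include <assert.h>

/* Simplified model of one per-CPU page cache: count pages are held
 * locally; batch pages at a time move between cache and buddy system. */
struct pcp_sim {
	int count;	/* pages currently in the list */
	int high;	/* high watermark, emptying needed */
	int batch;	/* chunk size for buddy add/remove */
};

/* Take one page from the cache, refilling batch pages from the
 * buddy system (treated as inexhaustible) when the cache is empty. */
static void pcp_alloc(struct pcp_sim *p)
{
	if (p->count == 0)
		p->count += p->batch;	/* refill from buddy */
	p->count--;
}

/* Return one page to the cache, releasing batch pages back to the
 * buddy system once the high watermark is exceeded. */
static void pcp_free(struct pcp_sim *p)
{
	p->count++;
	if (p->count > p->high)
		p->count -= p->batch;	/* drain to buddy */
}
```

With the hot-cache parameters set up below (high = 6 * batch), the cache level always stays between 0 and high, so bursts of single-page traffic rarely touch the buddy allocator.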
This structure is initialized by the setup_pageset function:
inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
	struct per_cpu_pages *pcp;

	memset(p, 0, sizeof(*p));

	pcp = &p->pcp[0];		/* hot */
	pcp->count = 0;
	pcp->high = 6 * batch;
	pcp->batch = max(1UL, 1 * batch);
	INIT_LIST_HEAD(&pcp->list);

	pcp = &p->pcp[1];		/* cold */
	pcp->count = 0;
	pcp->high = 2 * batch;
	pcp->batch = max(1UL, batch/2);
	INIT_LIST_HEAD(&pcp->list);
}
For 64MB of SDRAM, batch is 3.
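Where this 3 comes from can be reproduced with a sketch modelled on zone_batchsize() in mm/page_alloc.c (a simplified re-implementation for illustration; present_pages = 0x3b6a and a 4K page size are the figures used elsewhere in this article):

```c
/* Round n down to the nearest power of two (helper, mirrors the
 * kernel's rounddown_pow_of_two()). */
static unsigned long rounddown_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

/* Sketch of the batch-size heuristic: roughly a quarter of one
 * thousandth of the zone, capped at 512KB worth of pages, then
 * rounded to (power of two) - 1. */
static int zone_batchsize_sim(unsigned long present_pages,
			      unsigned long page_size)
{
	int batch;

	batch = present_pages / 1024;
	if ((unsigned long)batch * page_size > 512 * 1024)
		batch = (512 * 1024) / page_size;
	batch /= 4;
	if (batch < 1)
		batch = 1;

	return rounddown_pow2(batch + batch / 2) - 1;
}
```

For present_pages = 0x3b6a (15210 pages): 15210 / 1024 = 14, then 14 / 4 = 3, and rounddown_pow2(3 + 1) - 1 = 3.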
struct zone is defined in include/linux/mmzone.h:
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long pages_min, pages_low, pages_high;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
struct per_cpu_pageset pageset[NR_CPUS];
/*
* free areas of different sizes
*/
spinlock_t lock;
struct free_area free_area[MAX_ORDER];
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
/* A count of how many reclaimers are scanning this zone */
atomic_t reclaim_in_progress;
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
* invokation.
*
* We use prev_priority as a measure of how much stress page reclaim is
* under - it drives the swappiness decision: whether to unmap mapped
* pages.
*
* Access to both this field is quite racy even on uniprocessor. But
* it is expected to average out OK.
*/
int prev_priority;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* zone_start_pfn, spanned_pages and present_pages are all
* protected by span_seqlock. It is a seqlock because it has
* to be read outside of zone->lock, and it is done in the main
* allocator path. But, it is written quite infrequently.
*
* The lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*/
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
/*
* rarely used fields:
*/
const char *name;
} ____cacheline_internodealigned_in_smp;
Every zone the kernel accesses is in fact contig_page_data->node_zones[0]; node_zones[1] has no usable space and can be ignored.
- spanned_pages and present_pages

These give the number of usable SDRAM pages, covering the memory range from 0 to 60MB. present_pages is spanned_pages minus the number of pages occupied by the page array. For 64MB of SDRAM with MTD disabled, spanned_pages is 0x3bff and present_pages is 0x3b6a.
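The gap between the two values (0x3bff - 0x3b6a = 0x95 pages) is the space occupied by the page array itself. That calculation can be sketched generically (descr_size stands for sizeof(struct page), which depends on the kernel configuration, so no particular size is assumed here):

```c
/* Pages needed to hold one struct page descriptor per page frame,
 * rounded up to whole pages (generic formula, not kernel code). */
static unsigned long memmap_pages(unsigned long spanned,
				  unsigned long descr_size,
				  unsigned long page_size)
{
	return (spanned * descr_size + page_size - 1) / page_size;
}
```

For example, with a hypothetical 32-byte descriptor, 1024 page frames would need 8 pages of mem_map.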
- zone_pgdat

This points to the single global pglist_data: contig_page_data.
- prev_priority

Initialized to DEF_PRIORITY:

/*
 * The "priority" of VM scanning is how much of the queues we will scan in one
 * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
 * queues ("queue_length >> 12") during an aging round.
 */
#define DEF_PRIORITY 12
- wait_table_hash_nr_entries and wait_table_bits

For 64MB of memory, wait_table_hash_nr_entries is 0x40 and wait_table_bits is 6. The wait_table then occupies wait_table_hash_nr_entries * sizeof(wait_queue_head_t) bytes.
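These two figures can be cross-checked against the invariant quoted in the struct zone comment, wait_table_size == (1 << wait_table_bits) (a small helper written just for this check, not kernel code):

```c
/* Smallest number of bits such that (1 << bits) >= entries. */
static unsigned long wt_bits(unsigned long entries)
{
	unsigned long bits = 0;

	while ((1UL << bits) < entries)
		bits++;
	return bits;
}
```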
- free_area[MAX_ORDER]

struct free_area {
	struct list_head free_list;
	unsigned long nr_free;
};
In the buddy algorithm, free pages are organized into 11 block lists, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous pages respectively; this member represents those lists.
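A request for n contiguous pages is served from the smallest list whose block size covers n; the order lookup can be sketched as (illustrative helper, not kernel code):

```c
/* Smallest buddy order whose block of (1 << order) pages holds n pages. */
static unsigned int order_for_pages(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}
```

With MAX_ORDER = 11, valid results range from order 0 (a single page) up to order 10 (1024 pages).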
struct zonelist is defined in include/linux/mmzone.h:
/*
* One allocation request operates on a zonelist. A zonelist
* is a list of zones, the first one is the 'goal' of the
* allocation, the other zones are fallback zones, in decreasing
* priority.
*
* If zlcache_ptr is not NULL, then it is just the address of zlcache,
* as explained above. If zlcache_ptr is NULL, there is no zlcache.
*/
struct zonelist {
struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
struct zone *zones[MAX_ZONES_PER_ZONELIST + 1]; // NULL delimited
};
Here we have:

/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)

whose value is 2.
- zones

This structure is used only in pglist_data. After initialization, zones effectively has a single element, pointing to ZONE_DMA, i.e. contig_page_data->node_zones[0].
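Since the zones array is NULL-delimited, the allocator's fallback scan is simply a walk until NULL; a minimal sketch (illustrative only, using void * in place of struct zone *):

```c
#include <stddef.h>

/* Count usable entries in a NULL-delimited fallback list, the way
 * the allocator walks zonelist->zones[]. */
static int count_zones(void *zones[])
{
	int n = 0;

	while (zones[n] != NULL)
		n++;
	return n;
}
```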
- zlcache_ptr

This field is not actually used here; it is NULL.
struct page is defined in include/linux/mm_types.h:
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
*/
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
union {
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
*/
struct { /* SLUB uses */
short unsigned int inuse;
short unsigned int offset;
};
};
union {
struct {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
};
spinlock_t ptl;
struct { /* SLUB uses */
void **lockless_freelist;
struct kmem_cache *slab; /* Pointer to slab */
};
struct {
struct page *first_page; /* Compound pages */
};
};
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
*/
};
The kernel describes every 4K memory page with a page structure. The virt_to_page macro quickly finds the page structure corresponding to a given physical address:
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array. This memory is allocated in alloc_node_mem_map using bootmem but, unlike other memory allocated with bootmem, it is never reclaimed.
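The arithmetic inside virt_to_page is just a page-frame-number calculation; a sketch (taking PAGE_OFFSET as 0, which is an assumption for this flat-memory uClinux port):

```c
#define SIM_PAGE_SHIFT	12	/* 4K pages */
#define SIM_PAGE_OFFSET	0UL	/* assumption: flat memory starting at 0 */

/* Index into the mem_map array for a given address. */
static unsigned long virt_to_pfn(unsigned long addr)
{
	return (addr - SIM_PAGE_OFFSET) >> SIM_PAGE_SHIFT;
}
```

virt_to_page(addr) is then simply mem_map + virt_to_pfn(addr).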
The meaning of each bit of the flags member of this structure is defined in include/linux/page-flags.h:
/*
* Various page->flags bits:
*
* PG_reserved is set for special pages, which can never be swapped out. Some
* of them might not even exist (eg empty_bad_page)...
*
* The PG_private bitflag is set on pagecache pages if they contain filesystem
* specific data (which is normally at page->private). It can be used by
* private allocations for its own usage.
*
* During initiation of disk I/O, PG_locked is set. This bit is set before I/O
* and cleared when writeback _starts_ or when read _completes_. PG_writeback
* is set before writeback starts and cleared when it finishes.
*
* PG_locked also pins a page in pagecache, and blocks truncation of the file
* while it is held.
*
* page_waitqueue(page) is a wait queue of all tasks waiting for the page
* to become unlocked.
*
* PG_uptodate tells whether the page's contents is valid. When a read
* completes, the page becomes uptodate, unless a disk I/O error happened.
*
* PG_referenced, PG_reclaim are used for page reclaim for anonymous and
* file-backed pagecache (see mm/vmscan.c).
*
* PG_error is set to indicate that an I/O error occurred on this page.
*
* PG_arch_1 is an architecture specific page state bit. The generic code
* guarantees that this bit is cleared for a page when it first is entered into
* the page cache.
*
* PG_highmem pages are not permanently mapped into the kernel virtual address
* space, they need to be kmapped separately for doing IO on the pages. The
* struct page (these bits with information) are always mapped into kernel
* address space...
*
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
*/
/*
* Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
* locked- and dirty-page accounting.
*
* The page flags field is split into two parts, the main flags area
* which extends from the low bits upwards, and the fields area which
* extends from the high bits downwards.
*
* | FIELD | ... | FLAGS |
* N-1 ^ 0
* (N-FLAGS_RESERVED)
*
* The fields area is reserved for fields mapping zone, node and SPARSEMEM
* section. The boundry between these two areas is defined by
* FLAGS_RESERVED which defines the width of the fields section
* (see linux/mmzone.h). New flags must _not_ overlap with this area.
*/
#define PG_locked		 0	/* Page is locked. Don't touch. */
#define PG_error		 1
#define PG_referenced		 2
#define PG_uptodate		 3
#define PG_dirty		 4
#define PG_lru			 5
#define PG_active		 6
#define PG_slab			 7	/* slab debug (Suparna wants this) */
#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
#define PG_arch_1		 9
#define PG_reserved		10
#define PG_private		11	/* If pagecache, has fs-private data */
#define PG_writeback		12	/* Page is under writeback */
#define PG_compound		14	/* Part of a compound page */
#define PG_swapcache		15	/* Swap page: swp_entry_t in private */
#define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
#define PG_reclaim		17	/* To be reclaimed asap */
#define PG_buddy		19	/* Page is free, on buddy lists */

/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
At initialization time, flags is set to PG_reserved.
The initialization of each page can be seen in memmap_init_zone, in mm/page_alloc.c:
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
		unsigned long start_pfn, enum memmap_context context)
{
	struct page *page;
	unsigned long end_pfn = start_pfn + size;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/*
		 * There can be holes in boot-time mem_map[]s
		 * handed to this function.  They do not
		 * exist on hotplugged memory.
		 */
		if (context == MEMMAP_EARLY) {
			if (!early_pfn_valid(pfn))
				continue;
			if (!early_pfn_in_nid(pfn, nid))
				continue;
		}
		page = pfn_to_page(pfn);
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		reset_page_mapcount(page);
		SetPageReserved(page);
		INIT_LIST_HEAD(&page->lru);
	}
}
In the function above, since the zone parameter is 0, effectively only the SetPageReserved(page) statement affects flags, setting the member to 0x400 (PG_reserved).
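The 0x400 value follows directly from the bit position: SetPageReserved sets bit PG_reserved (10) in flags, and 1 << 10 == 0x400. A minimal sketch of the flag operation (simplified; the kernel performs this as an atomic set_bit on page->flags):

```c
#define SIM_PG_reserved	10	/* bit index, as in page-flags.h */

/* Set the reserved bit in a flags word (non-atomic sketch). */
static unsigned long set_page_reserved(unsigned long flags)
{
	return flags | (1UL << SIM_PG_reserved);
}
```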