Having covered the basic concepts of Linux memory management, this article introduces the data structures the Linux kernel uses to describe physical memory.
Data structures for physical memory
Physical memory is managed at three levels: bank (node), zone, and page. A node contains one to three zones, and each zone contains many fixed-size pages. The following sections describe how Linux represents each of these levels in software.
node
A memory node represents one bank of physical memory. It is the topmost and largest memory unit: a UMA system has exactly one node, while a NUMA system has several, all linked together on the pgdat_list list. The structure is as follows:
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
* zone_struct denotes.
*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* its memory layout.
*
* XXX: we need to move the global memory statistics (active_list, ...)
* into the pg_data_t to properly support NUMA.
*/
typedef struct pglist_data {
zone_t node_zones[MAX_NR_ZONES]; // zones contained in this node
zonelist_t node_zonelists[GFP_ZONEMASK+1]; // preferred zone order for allocations
int nr_zones; // number of zones in this node
struct page *node_mem_map; // first page of the node, somewhere inside the mem_map array
unsigned long *valid_addr_bitmap; // bitmap of valid addresses, for memory with holes
struct bootmem_data *bdata; // boot-time memory allocator data
unsigned long node_start_paddr; // starting physical address of the node
unsigned long node_start_mapnr; // offset into the global mem_map array
unsigned long node_size; // number of pages in the node
int node_id; // node ID, starting from 0
struct pglist_data *node_next; // next node in pgdat_list
} pg_data_t;
Under the UMA architecture there is only one node: the statically defined contig_page_data in mm/page_alloc.c.
zone
Each node is divided into zones; there are currently three types: DMA, Normal, and HighMem. Not every system has all three. The structure is as follows:
typedef struct zone_struct {
/*
* Commonly accessed fields:
*/
spinlock_t lock;
unsigned long free_pages; // number of free pages in this zone
unsigned long pages_min, pages_low, pages_high; // watermarks
int need_balance; // set when the zone needs balancing
/*
* free areas of different sizes
*/
free_area_t free_area[MAX_ORDER]; // per-order free lists and bitmaps, used by the buddy allocator
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
* wait_table_shift -- wait_table_size
* == BITS_PER_LONG (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
unsigned long wait_table_shift; //the number of bits a page address must be shifted right to return an index within the table
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat; // back-pointer to the parent node
struct page *zone_mem_map; // first page of this zone in mem_map
unsigned long zone_start_paddr; // starting physical address of the zone
unsigned long zone_start_mapnr; // offset of this zone into the global mem_map array
/*
* rarely used fields:
*/
char *name; // string name of the zone: "DMA", "Normal" or "HighMem"
unsigned long size;// the size of the zone in pages
} zone_t;
watermarks
Each zone keeps three watermarks, pages_min, pages_low and pages_high, which drive page reclaim: when free_pages drops below pages_low, the kswapd daemon is woken to free pages; below pages_min, the allocator must reclaim memory itself, synchronously; once free_pages rises back above pages_high, kswapd can go back to sleep.
page
A zone contains many pages, all of the same size. The page size depends on the hardware architecture; some hardware supports several page sizes, in which case the operating system picks one, but only one size can be in effect at a time, so from the system's point of view the page size is fixed.
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page.
*
* Try to keep the most commonly accessed fields in single cache lines
* here (16 bytes or greater). This ordering should be particularly
* beneficial on 32-bit processors.
*
* The first line is data used in page cache lookup, the second line
* is used for linear searches (eg. clock algorithm scans).
*
* TODO: make this structure smaller, it could be as small as 32 bytes.
*/
typedef struct page {
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct page *next_hash; /* Next page sharing our hash bucket in
the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */
} mem_map_t;