The main memory-related structures in the kernel are struct pglist_data, struct zone, struct page, struct mm_struct, and struct vm_area_struct.
1. struct pglist_data:
In the kernel a struct pglist_data is called a "node". It sits at the top of the kernel's memory hierarchy and represents the memory belonging to one NUMA node, so there is one pglist_data per node (a UMA machine has exactly one). Each pglist_data is further divided into zones (ranges of physical memory): ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. The definition is:
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[GFP_ZONETYPES];
int nr_zones;
struct page *node_mem_map;
struct bootmem_data *bdata;
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
struct pglist_data *pgdat_next;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
} pg_data_t;
The members are roughly as follows:
node_zones:
the zones making up this node, here ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM;
node_zonelists:
arrays that order the zones; they are the policy input for deciding which zone an allocation should come from (see the comment at GFP_ZONETYPES);
nr_zones:
the number of zones in this node, usually 1 to 3;
node_mem_map:
the first struct page of this node;
bdata:
the bootmem allocator data used for boot-time memory management;
node_start_pfn:
pfn is short for page frame number; this is the frame number of the node's first page in physical memory;
node_present_pages:
the total number of physical pages in this node;
pgdat_next:
the next struct pglist_data;
kswapd_wait:
the wait queue head on which kswapd sleeps;
kswapd:
the kernel thread that keeps page allocation in this node's zones balanced by reclaiming memory.
2. struct zone:
Represents one memory zone, such as ZONE_DMA, ZONE_NORMAL, or ZONE_HIGHMEM:
/*
* On machines where it is needed (eg PCs) we divide physical memory
* into multiple physical zones. On a PC we have 3 zones:
*
* ZONE_DMA < 16 MB ISA DMA capable memory
* ZONE_NORMAL 16-896 MB direct mapped by the kernel
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
* from the zone. i.e., for a GFP_KERNEL alloc of "order" there must
* be "(1<<order) + protection[ZONE_NORMAL]" free pages in the zone
* for us to choose to allocate the page from that zone.
*
* It uses both min_free_kbytes and sysctl_lower_zone_protection.
* The protection values are recalculated if either of these values
* change. The array elements are in zonelist order:
* [0] == GFP_DMA, [1] == GFP_KERNEL, [2] == GFP_HIGHMEM.
*/
unsigned long protection[MAX_NR_ZONES];
struct per_cpu_pageset pageset[NR_CPUS];
/*
* free areas of different sizes
*/
spinlock_t lock;
struct free_area free_area[MAX_ORDER];
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long nr_active;
unsigned long nr_inactive;
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
* invokation.
*
* We use prev_priority as a measure of how much stress page reclaim is
* under - it drives the swappiness decision: whether to unmap mapped
* pages.
*
* temp_priority is used to remember the scanning priority at which
* this zone was successfully refilled to free_pages == pages_high.
*
* Access to both these fields is quite racy even on uniprocessor. But
* it is expected to average out OK.
*/
int temp_priority;
int prev_priority;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
struct page *zone_mem_map;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
/*
* rarely used fields:
*/
char *name;
} ____cacheline_maxaligned_in_smp;
The members have the following meanings:
free_pages:
the number of free pages in the zone;
pages_min, pages_low, pages_high:
the zone's watermarks: kswapd is woken when free_pages drops below pages_low, allocators must reclaim synchronously once it falls below pages_min, and kswapd sleeps again when free_pages climbs back to pages_high;
protection[MAX_NR_ZONES]:
per-allocation-class reserves that decide whether an allocation may take pages from this zone, indexed in zonelist order as the comment describes:
protection[0]: reserve checked for GFP_DMA allocations;
protection[1]: reserve checked for GFP_KERNEL allocations;
protection[2]: reserve checked for GFP_HIGHMEM allocations.
pageset[NR_CPUS]:
per-CPU caches of pages belonging to this zone, kept so each CPU can allocate and free single pages without taking the zone lock;
lock:
spinlock protecting the zone, chiefly free_area;
free_area[MAX_ORDER]:
the buddy allocator's free lists, one list of free blocks for each allocation order;
active_list:
LRU list of active (recently referenced) pages;
inactive_list:
LRU list of inactive pages, the first candidates for reclaim;
nr_scan_active:
number of active pages to scan;
nr_scan_inactive:
number of inactive pages to scan;
nr_active:
number of pages on the active list;
nr_inactive:
number of pages on the inactive list;
pages_scanned:
number of pages scanned since the last successful reclaim;
zone_pgdat:
the struct pglist_data this zone belongs to;
zone_mem_map:
the first struct page of the region this zone covers;
zone_start_pfn:
analogous to node_start_pfn: the frame number of the zone's first page in physical memory;
spanned_pages:
total pages spanned by the zone, including holes;
present_pages:
pages actually present in the zone, excluding holes;
name:
the zone's name, such as "DMA", "Normal", or "HighMem".
The code path that initializes zones is roughly as follows:
asmlinkage void __init start_kernel(void)
-->
void __init setup_arch(char **cmdline_p)
-->
void __init paging_init(struct machine_desc *mdesc)
-->
void __init bootmem_init(void)
-->
static void __init bootmem_free_node(int node, struct meminfo *mi)
-->
free_area_init_core(pgdat, zones_size, zholes_size);
3. struct page:
struct page represents each physical page in the system and records everything the kernel needs to know about how that page is being used:
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
*/
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
union {
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
*/
struct { /* SLUB */
u16 inuse;
u16 objects;
};
};
union {
struct {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
};
#if USE_SPLIT_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
};
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
*/
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
unsigned long debug_flags; /* Use atomic bitops on this */
#endif
#ifdef CONFIG_KMEMCHECK
/*
* kmemcheck wants to track the status of each byte in a page; this
* is a pointer to such a status block. NULL if not tracked.
*/
void *shadow;
#endif
};
The members most relevant to driver work are:
flags:
a set of bit flags describing the page's status, including PG_locked, which indicates the page is locked in memory, and PG_reserved, which keeps the memory-management system away from the page;
_count:
the page's reference count; when it drops to 0 the page is returned to the free list;
virtual:
the page's kernel virtual address if it is mapped, NULL otherwise (i.e. a highmem page that is not kmapped); low memory is always mapped, high memory usually is not.
4. struct mm_struct:
The memory descriptor. It represents a process's address space and holds all information related to that address space. Each process has its own mm_struct, shared by all threads of that process:
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
unsigned long mmap_base; /* base of mmap area */
unsigned long task_size; /* size of task vm space */
unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
/* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
mm_counter_t _file_rss;
mm_counter_t _anon_rss;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
unsigned long total_vm, locked_vm, shared_vm, exec_vm;
unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
struct linux_binfmt *binfmt;
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
mm_context_t context;
/* Swap token stuff */
/*
* Last value of global fault stamp as seen by this process.
* In other words, this value gives an indication of how long
* it has been since this task got the token.
* Look at mm/thrash.c
*/
unsigned int faultstamp;
unsigned int token_priority;
unsigned int last_interval;
unsigned long flags; /* Must use atomic bitops to access the bits */
struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct *owner;
#endif
#ifdef CONFIG_PROC_FS
/* store ref to file /proc/<pid>/exe symlink points to */
struct file *exe_file;
unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};
The main members are:
mm_users:
the number of users of this address space; if two threads share the space, mm_users is 2;
mmlist:
all mm_structs are linked into a doubly linked list through this field; the list is headed by init_mm, the memory descriptor of the init process's address space.
5. struct vm_area_struct:
The memory-region descriptor. It describes a single contiguous interval within a given address space. Note that a vm_area_struct covers one interval: each one corresponds to a unique region of the process address space, such as the process's code segment, data segment, or bss:
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, see mm.h. */
struct rb_node vm_rb;
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
struct raw_prio_tree_node prio_tree_node;
} shared;
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
};
Its main members are:
vm_start, vm_end:
the start and end addresses of the region; vm_end is the first byte after the end, so the region is the half-open interval [vm_start, vm_end);
vm_flags:
flags describing the VMA's behavior; for example, to share this region among processes, VM_SHARED must be set in vm_flags;
vm_ops:
a vm_area_struct can be viewed as an object, and vm_ops is its operations table, invoked for events such as page faults.
Let's inspect a process's VMAs with a simple example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(int argc,char **argv)
{
while(1)
{
usleep(1000);
}
return 0;
}
Compile and run it, then inspect its maps with cat /proc/<pid>/maps:
004dd000-004fd000 r-xp 00000000 fd:00 201453 /lib/ld-2.10.1.so
004fd000-004fe000 r--p 0001f000 fd:00 201453 /lib/ld-2.10.1.so
004fe000-004ff000 rw-p 00020000 fd:00 201453 /lib/ld-2.10.1.so
00501000-0066c000 r-xp 00000000 fd:00 201496 /lib/libc-2.10.1.so
0066c000-0066d000 ---p 0016b000 fd:00 201496 /lib/libc-2.10.1.so
0066d000-0066f000 r--p 0016b000 fd:00 201496 /lib/libc-2.10.1.so
0066f000-00670000 rw-p 0016d000 fd:00 201496 /lib/libc-2.10.1.so
00670000-00673000 rw-p 00670000 00:00 0
0094b000-0094c000 r-xp 0094b000 00:00 0 [vdso]
08048000-08049000 r-xp 00000000 fd:00 1743273 /home/seven/learn/vma
08049000-0804a000 rw-p 00000000 fd:00 1743273 /home/seven/learn/vma
b7f32000-b7f34000 rw-p b7f32000 00:00 0
bf835000-bf84a000 rw-p bffeb000 00:00 0 [stack]
The columns mean:
start-end  permissions  offset  major:minor  inode  file
Each line corresponds to one vm_area_struct. For example:
the fourth line is the code segment of the libc library;
the seventh line is the data segment of the libc library;
the eighth line is the bss segment of the libc library;
the last line is the process stack, 84 KB in size (0xbf84a000 - 0xbf835000 = 0x15000 = 84 KB).
The pmap tool shows the same information more clearly:
[root@seven linux-2.6.34]# pmap 2546
2546: ./vma
004dd000 128K r-x-- /lib/ld-2.10.1.so
004fd000 4K r---- /lib/ld-2.10.1.so
004fe000 4K rw--- /lib/ld-2.10.1.so
00501000 1452K r-x-- /lib/libc-2.10.1.so
0066c000 4K ----- /lib/libc-2.10.1.so
0066d000 8K r---- /lib/libc-2.10.1.so
0066f000 4K rw--- /lib/libc-2.10.1.so
00670000 12K rw--- [ anon ]
0094b000 4K r-x-- [ anon ]
08048000 4K r-x-- /home/seven/learn/vma
08049000 4K rw--- /home/seven/learn/vma
b7f32000 8K rw--- [ anon ]
bf835000 84K rw--- [ stack ]
total 1720K
[root@seven linux-2.6.34]#
[Note:] The process's own private region here takes only 84 KB, yet the total mapped space is 1720 KB, of which 1452 KB belongs to the C library. That region is shared: every process using the C library maps the same region instead of each being allocated an extra 1452 KB of memory.