The kernel cannot use memory with the abandon that user space can, so allocating memory inside the kernel is not as easy as it is elsewhere.
- Pages
The kernel treats physical pages as the basic unit of memory management. Although the processor's smallest addressable unit is usually a word (or even a byte), the memory management unit (MMU, the hardware that manages memory and translates virtual addresses into physical ones) typically deals in pages, and it manages the system's page tables with page-sized granularity. From the standpoint of virtual memory, therefore, the page is the smallest unit that matters.
Different architectures support different page sizes: most 32-bit architectures use 4KB pages, while 64-bit architectures generally use 8KB pages.
The kernel represents every physical page in the system with a struct page:
- In <mm_types.h (include/linux)>
- struct address_space;
- /*
- * Each physical page in the system has a struct page associated with
- * it to keep track of whatever it is we are using the page for at the
- * moment. Note that we have no way to track which tasks are using
- * a page, though if it is a pagecache page, rmap structures can tell us
- * who is mapping it.
- */
- struct page {
- unsigned long flags; /* Atomic flags, some possibly
- * updated asynchronously */
- atomic_t _count; /* Usage count, see below. */ Reference count: when it reaches 0, the page is not referenced anywhere in the kernel and can be handed out in a new allocation. Kernel code should not read this field directly; use page_count() to check it instead.
- atomic_t _mapcount; /* Count of ptes mapped in mms,
- * to show when page is mapped
- * & limit reverse map searches.
- */
- union {
- struct {
- unsigned long private; /* Mapping-private opaque data:
- * usually used for buffer_heads
- * if PagePrivate set; used for
- * swp_entry_t if PageSwapCache;
- * indicates order in the buddy
- * system if PG_buddy is set.
- */
- struct address_space *mapping; /* If low bit clear, points to
- * inode address_space, or NULL.
- * If page mapped as anonymous
- * memory, low bit is set, and
- * it points to anon_vma object:
- * see PAGE_MAPPING_ANON below.
- */ A page may be used by the page cache, in which case mapping points to the address_space object associated with the page.
- };
- #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
- spinlock_t ptl;
- #endif
- };
- pgoff_t index; /* Our offset within mapping. */
- struct list_head lru; /* Pageout list, eg. active_list
- * protected by zone->lru_lock !
- */
- /*
- * On machines where all RAM is mapped into kernel address space,
- * we can simply calculate the virtual address. On machines with
- * highmem some memory is mapped into kernel virtual memory
- * dynamically, so we need a place to store that address.
- * Note that this field could be 16 bits on x86 ... ;)
- *
- * Architectures with slow multiplication can define
- * WANT_PAGE_VIRTUAL in asm/page.h
- */
- #if defined(WANT_PAGE_VIRTUAL)
- /* The page's virtual address. Some memory (high memory, for example) is not permanently mapped into the kernel address space; in that case this field is NULL, and the pages are mapped in dynamically when needed */
- void *virtual; /* Kernel virtual address (NULL if
- not kmapped, ie. highmem) */
- #endif /* WANT_PAGE_VIRTUAL */
- };
- In <mm.h (include/linux)>
- static inline int page_count(struct page *page)
- {
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
- return atomic_read(&page->_count);
- }
- #define page_private(page) ((page)->private)
A page can be used by the page cache (in which case the mapping field points to the address_space object associated with the page), as private data (pointed to by private), or as a mapping in a process's page tables.
The page structure is associated with physical pages, not virtual pages, so what it describes is transient at best: even if the data contained in a page continues to exist, it may not always be associated with the same struct page because of swapping and the like. The kernel uses this structure only to describe whatever is stored in the associated physical page at the moment; its purpose is to describe physical memory itself, not the data contained in it.
The kernel uses this structure to keep track of every page in the system, because it needs to know whether a page is free (that is, not allocated). If a page has been allocated, the kernel also needs to know who owns it; possible owners include user-space processes, dynamically allocated kernel data, static kernel code, the page cache, and so on.
Each physical page in the system is assigned one of these structures.
- Zones
Because of hardware limitations, some pages reside at particular physical addresses in memory and cannot be used for certain tasks. Linux must deal with two memory-addressing problems caused by hardware shortcomings:
Some hardware devices can perform DMA only to certain memory addresses
Some architectures can physically address far more memory than they can virtually address; consequently, some memory is not permanently mapped into the kernel address space
Because of these constraints, the kernel divides pages into different zones and uses the zones to group pages of similar properties. Linux uses three zones:
ZONE_DMA, which contains pages that can undergo DMA
ZONE_NORMAL, which contains normally mapped pages
ZONE_HIGHMEM, which contains "high memory": pages not permanently mapped into the kernel's address space
The actual use and layout of the zones is architecture-dependent.
Some architectures can perform DMA to any memory address without problems; on those, ZONE_DMA is empty and ZONE_NORMAL can be used directly for allocations.
x86 is the opposite: ISA devices cannot perform DMA across the full 32-bit address space because they can access only the first 16MB of physical memory, so on x86 ZONE_DMA contains only the pages in the 0-16MB range.
ZONE_HIGHMEM works the same way: whether memory can be directly mapped depends on the architecture. On x86, ZONE_HIGHMEM is all physical memory above 896MB; on architectures where all memory is directly mapped, ZONE_HIGHMEM is empty.
The memory contained in ZONE_HIGHMEM is called high memory; the rest of the system's memory is called low memory.
Whatever is left after ZONE_DMA and ZONE_HIGHMEM take their shares belongs to ZONE_NORMAL. On x86, ZONE_NORMAL is all physical memory from 16MB to 896MB; on other architectures it is all available memory.
- In <mmzone.h (include/linux)>
- enum zone_type {
- #ifdef CONFIG_ZONE_DMA
- /*
- * ZONE_DMA is used when there are devices that are not able
- * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
- * carve out the portion of memory that is needed for these devices.
- * The range is arch specific.
- *
- * Some examples
- *
- * Architecture Limit
- * ---------------------------
- * parisc, ia64, sparc <4G
- * s390 <2G
- * arm26 <48M
- * arm Various
- * alpha Unlimited or 0-16MB.
- *
- * i386, x86_64 and multiple other arches
- * <16M.
- */
- ZONE_DMA,
- #endif
- #ifdef CONFIG_ZONE_DMA32
- /*
- * x86_64 needs two ZONE_DMAs because it supports devices that are
- * only able to do DMA to the lower 16M but also 32 bit devices that
- * can only do DMA areas below 4G.
- */
- ZONE_DMA32,
- #endif
- /*
- * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
- * performed on pages in ZONE_NORMAL if the DMA devices support
- * transfers to all addressable memory.
- */
- ZONE_NORMAL,
- #ifdef CONFIG_HIGHMEM
- /*
- * A memory area that is only addressable by the kernel through
- * mapping portions into its own address space. This is for example
- * used by i386 to allow the kernel to address the memory beyond
- * 900MB. The kernel will set up special mappings (page
- * table entries on i386) for each page that the kernel needs to
- * access.
- */
- ZONE_HIGHMEM,
- #endif
- MAX_NR_ZONES
- };
Zones on x86
Zone | Description | Physical Memory |
ZONE_DMA | DMA-able pages | <16MB |
ZONE_NORMAL | Normally addressable pages | 16-896MB |
ZONE_HIGHMEM | Dynamically mapped pages | >896MB |
Allocations for a particular use need not come from the matching zone; general-purpose memory, for example, can be satisfied from ZONE_DMA just as well as from ZONE_NORMAL.
Each zone is represented by a struct zone:
- struct zone {
- /* Fields commonly accessed by the page allocator */
- unsigned long pages_min, pages_low, pages_high;
- /*
- * We don't know if the memory that we're going to allocate will be freeable
- * or/and it will be released eventually, so to avoid totally wasting several
- * GB of ram we must reserve some of the lower zone memory (otherwise we risk
- * to run OOM on the lower zones despite there's tons of freeable ram
- * on the higher zones). This array is recalculated at runtime if the
- * sysctl_lowmem_reserve_ratio sysctl changes.
- */
- unsigned long lowmem_reserve[MAX_NR_ZONES];
- #ifdef CONFIG_NUMA
- int node;
- /*
- * zone reclaim becomes active if more unmapped pages exist.
- */
- unsigned long min_unmapped_pages;
- unsigned long min_slab_pages;
- struct per_cpu_pageset *pageset[NR_CPUS];
- #else
- struct per_cpu_pageset pageset[NR_CPUS];
- #endif
- /*
- * free areas of different sizes
- */
- spinlock_t lock; /* Protects this structure from concurrent access. Note that it protects only the structure itself, not all the pages resident in the zone. */
- #ifdef CONFIG_MEMORY_HOTPLUG
- /* see spanned/present_pages for more description */
- seqlock_t span_seqlock;
- #endif
- struct free_area free_area[MAX_ORDER];
- ZONE_PADDING(_pad1_)
- /* Fields commonly accessed by the page reclaim scanner */
- spinlock_t lru_lock;
- struct list_head active_list;
- struct list_head inactive_list;
- unsigned long nr_scan_active;
- unsigned long nr_scan_inactive;
- unsigned long pages_scanned; /* since last reclaim */
- int all_unreclaimable; /* All pages pinned */
- /* A count of how many reclaimers are scanning this zone */
- atomic_t reclaim_in_progress;
- /* Zone statistics */
- atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
- /*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invokation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
- ZONE_PADDING(_pad2_)
- /* Rarely used or read-mostly fields */
- /*
- * wait_table -- the array holding the hash table
- * wait_table_hash_nr_entries -- the size of the hash table array
- * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
- *
- * The purpose of all these is to keep track of the people
- * waiting for a page to become available and make them
- * runnable again when possible. The trouble is that this
- * consumes a lot of space, especially when so few things
- * wait on pages at a given time. So instead of using
- * per-page waitqueues, we use a waitqueue hash table.
- *
- * The bucket discipline is to sleep on the same queue when
- * colliding and wake all in that wait queue when removing.
- * When something wakes, it must check to be sure its page is
- * truly available, a la thundering herd. The cost of a
- * collision is great, but given the expected load of the
- * table, they should be so rare as to be outweighed by the
- * benefits from the saved space.
- *
- * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
- * primary users of these fields, and in mm/page_alloc.c
- * free_area_init_core() performs the initialization of them.
- */
- wait_queue_head_t * wait_table;
- unsigned long wait_table_hash_nr_entries;
- unsigned long wait_table_bits;
- /*
- * Discontig memory support fields.
- */
- struct pglist_data *zone_pgdat;
- /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
- unsigned long zone_start_pfn;
- /*
- * zone_start_pfn, spanned_pages and present_pages are all
- * protected by span_seqlock. It is a seqlock because it has
- * to be read outside of zone->lock, and it is done in the main
- * allocator path. But, it is written quite infrequently.
- *
- * The lock is declared along with zone->lock because it is
- * frequently read in proximity to zone->lock. It's good to
- * give them a chance of being in the same cacheline.
- */
- unsigned long spanned_pages; /* total size, including holes */
- unsigned long present_pages; /* amount of memory (excluding holes) */
- /*
- * rarely used fields:
- */
- const char *name; /* NULL-terminated string naming the zone; initialized during boot in mm/page_alloc.c */
- } ____cacheline_internodealigned_in_smp;
There are only three zones in the system, so there are only three of these structures.
- Getting Pages
The kernel provides one low-level mechanism for requesting memory, along with several interfaces for accessing it. All of these interfaces allocate memory in page-sized units and are declared in <linux/gfp.h>. The core function is alloc_pages(), which allocates 2^order contiguous physical pages and returns a pointer:
- #ifdef CONFIG_NUMA
- extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
- /* Allocate 2^order pages and return a pointer to the first page's page structure */
- static inline struct page *
- alloc_pages(gfp_t gfp_mask, unsigned int order)
- {
- if (unlikely(order >= MAX_ORDER))
- return NULL;
- return alloc_pages_current(gfp_mask, order);
- }
- extern struct page *alloc_page_vma(gfp_t gfp_mask,
- struct vm_area_struct *vma, unsigned long addr);
- #else
- #define alloc_pages(gfp_mask, order) \
- alloc_pages_node(numa_node_id(), gfp_mask, order)
- #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
- #endif
- typedef unsigned __bitwise__ gfp_t;
- In <mempolicy.c (mm)>
- /**
- * alloc_pages_current - Allocate pages.
- *
- * @gfp:
- * %GFP_USER user allocation,
- * %GFP_KERNEL kernel allocation,
- * %GFP_HIGHMEM highmem allocation,
- * %GFP_FS don't call back into a file system.
- * %GFP_ATOMIC don't sleep.
- * @order: Power of two of allocation size in pages. 0 is a single page.
- *
- * Allocate a page from the kernel page pool. When not in
- * interrupt context and apply the current process NUMA policy.
- * Returns NULL when no page can be allocated.
- *
- * Don't call cpuset_update_task_memory_state() unless
- * 1) it's ok to take cpuset_sem (can WAIT), and
- * 2) allocating for current task (not interrupt).
- */
- struct page *alloc_pages_current(gfp_t gfp, unsigned order)
- {
- struct mempolicy *pol = current->mempolicy;
- if ((gfp & __GFP_WAIT) && !in_interrupt())
- cpuset_update_task_memory_state();
- if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
- pol = &default_policy;
- if (pol->policy == MPOL_INTERLEAVE)
- return alloc_page_interleave(gfp, order, interleave_nodes(pol));
- return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
- }
The page_address() function converts a given page into its logical address:
- In <mm.h (include/linux)>
- #if defined(WANT_PAGE_VIRTUAL)
- #define page_address(page) ((page)->virtual)
- #define set_page_address(page, address) \
- do { \
- (page)->virtual = (address); \
- } while(0)
- #define page_address_init() do { } while(0)
- #endif
- #if defined(HASHED_PAGE_VIRTUAL)
- void *page_address(struct page *page);
- void set_page_address(struct page *page, void *virtual);
- void page_address_init(void);
- #endif
- #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
- #define page_address(page) lowmem_page_address(page)
- #define set_page_address(page, address) do { } while(0)
- #define page_address_init() do { } while(0)
- #endif
- In <highmem.c (mm)>
- void *page_address(struct page *page)
- {
- unsigned long flags;
- void *ret;
- struct page_address_slot *pas;
- if (!PageHighMem(page))
- return lowmem_page_address(page);
- pas = page_slot(page);
- ret = NULL;
- spin_lock_irqsave(&pas->lock, flags);
- if (!list_empty(&pas->lh)) {
- struct page_address_map *pam;
- list_for_each_entry(pam, &pas->lh, list) {
- if (pam->page == page) {
- ret = pam->virtual;
- goto done;
- }
- }
- }
- done:
- spin_unlock_irqrestore(&pas->lock, flags);
- return ret;
- }
- /*
- * Hash table bucket
- */
- static struct page_address_slot {
- struct list_head lh; /* List of page_address_maps */
- spinlock_t lock; /* Protect this bucket's list */
- } ____cacheline_aligned_in_smp page_address_htable[1<<PA_HASH_ORDER];
- static struct page_address_slot *page_slot(struct page *page)
- {
- return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
- }
If you have no need for the struct page, you can call __get_free_pages(), which directly returns the logical address of the first requested page. Because the pages are contiguous, the other pages follow immediately after the first:
- In <page_alloc.c (mm)>
- /*
- * Common helper functions.
- */
- /* Allocate 2^order pages and return a pointer to the first page's logical address */
- fastcall unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
- {
- struct page * page;
- page = alloc_pages(gfp_mask, order);
- if (!page)
- return 0;
- return (unsigned long) page_address(page);
- }
If you need only one page, two convenience functions are available:
- In <gfp.h (include/linux)>
- /* Allocate a single page and return a pointer to its page structure */
- #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
- /* Allocate a single page and return a pointer to its logical address */
- #define __get_free_page(gfp_mask) \
- __get_free_pages((gfp_mask),0)
To get a page filled with zeros:
- In <page_alloc.c (mm)>
- /* Allocate a single page, zero its contents, and return a pointer to its logical address */
- fastcall unsigned long get_zeroed_page(gfp_t gfp_mask)
- {
- struct page * page;
- /*
- * get_zeroed_page() returns a 32-bit address, which cannot represent
- * a highmem page
- */
- VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);
- page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
- if (page)
- return (unsigned long) page_address(page);
- return 0;
- }
- Freeing Pages
When pages are no longer needed, they can be freed with the following functions:
- In <page_alloc.c (mm)>
- fastcall void free_pages(unsigned long addr, unsigned int order)
- {
- if (addr != 0) {
- VM_BUG_ON(!virt_addr_valid((void *)addr));
- __free_pages(virt_to_page((void *)addr), order);
- }
- }
- fastcall void __free_pages(struct page *page, unsigned int order)
- {
- if (put_page_testzero(page)) {
- if (order == 0)
- free_hot_page(page);
- else
- __free_pages_ok(page, order);
- }
- }
- In <gfp.h (include/linux)>
- #define __free_page(page) __free_pages((page), 0)
- #define free_page(addr) free_pages((addr),0)
Be careful when freeing pages: you may free only pages that belong to you. Passing the wrong struct page or address, or the wrong order, can corrupt the system.
These low-level page functions are useful when you need page-sized chunks of physically contiguous memory, especially when you need exactly one or two pages. For the more common byte-sized allocations, the kernel provides kmalloc().