Today we start chewing through the memory management subsystem. OK, let's go!
I. Virtual Memory
Linux runs on virtual memory addresses, unlike U-Boot, which uses physical addresses; that is why the MMU is initialized during Linux boot. Once the MMU is up, Linux works entirely with virtual addresses.
The CPU issues a virtual address (VA) –> it is transformed into a modified virtual address (MVA) –> the MMU translates it –> physical address (PA).
A 32-bit CPU can address 4GB. Typically the kernel uses the 3GB–4GB range and user space uses 0–3GB. Each process has its own private user address space but shares the kernel space; the boundary is set by the PAGE_OFFSET macro and can be configured via make menuconfig. Translating between virtual and physical addresses requires a translation table: the page table. Page tables are consumed by the MMU: the CPU issues a VA, and looking the VA up in the table yields the PA. The translation itself is performed by the MMU, but building the page tables is the CPU's (i.e. the kernel's) job.
Figure:
A process uses virtual addresses (VA). Because different processes' virtual addresses would collide, the address the CPU puts on the bus has already undergone one transformation into an MVA (modified virtual address); the MVA is what is fed to the MMU, the MMU walks the page table to obtain the PA, and the PA is then sent to the RAM.
The kernel's text and data segments (the basic sections a running kernel needs) live in lowmem and are linearly mapped; the vmalloc region, the remap (ioremap) region, and the DMA (CMA) space occupy fixed windows of kernel address space that are filled in through page tables. User space is non-linear and mapped dynamically through page tables. When a system has more RAM than the kernel's direct-map window can cover, the excess is called high memory (highmem); it cannot be accessed through the linear mapping and is likewise reached via dynamic mappings.
1. Page tables and physical page frames
1.1 Linux's three-level page table
The Linux kernel uses a three-level page table, with the data structures PGD, PMD and PTE. Some MMUs support only two levels; there the middle level is folded away and the PMD is effectively the PGD. Physical memory is described by struct page; one page is 4KB.
Physical address = page frame number + offset within the page.
The MMU has two main jobs: 1. address mapping/translation; 2. permission checking.
A page table entry therefore needs two pieces of information: an index to the physical address, and the access-control bits.
** Page table structure diagram:**
Page tables and page table entries:
Page table entry: pte
With 4KB pages, only the high 20 bits are needed for the page frame number; the low 12 bits (exactly covering the 4KB page) are free to store the control bits.
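In code terms, here is a generic sketch (not the kernel's exact macros) of how the two halves of a PTE are used:
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT) /* 4KB */
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* A PTE carries two pieces of information: the frame address in the
 * high 20 bits and the control/permission bits in the low 12. */
static inline uint32_t pte_frame(uint32_t pte) { return pte & PAGE_MASK; }
static inline uint32_t pte_flags(uint32_t pte) { return pte & ~PAGE_MASK; }

/* Translating a mapped VA is then just recombining frame and offset: */
static inline uint32_t translate(uint32_t pte, uint32_t va)
{
	return pte_frame(pte) | (va & ~PAGE_MASK);
}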
1.2 Why multi-level page tables
With a single-level table you need one entry per physical page. A 4GB virtual space with 4KB pages means 1M entries, i.e. 4MB per process just to hold the table, and it would have to be one large contiguous block (even though page tables live at fixed-mapped addresses in kernel space). Hence multi-level page tables: the bottom-level tables are stored page by page, and the level above stores the addresses of those table pages. Page tables can then be built from scattered memory, and tables for unused regions need not exist at all, shrinking the memory overhead.
Fields within a virtual address va: pgd index, pmd index, pte index.
The structures in code:
In the code, pgd, pmd and pte values are long-sized integers.
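For reference, here is roughly what 32-bit ARM (non-LPAE) defines in arch/arm/include/asm/pgtable-2level-types.h — an abridged sketch; the file actually offers both a plain-typedef variant and this struct-wrapped variant, selected by STRICT_MM_TYPECHECKS:
typedef u32 pteval_t;
typedef u32 pmdval_t;

/* Wrapping the raw values in one-member structs lets the compiler
 * catch accidental mixing of the levels. */
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;
typedef struct { pmdval_t pgd[2]; } pgd_t; /* one Linux pgd covers two hardware L1 entries */

#define pte_val(x) ((x).pte)
#define pmd_val(x) ((x).pmd)
#define pgd_val(x) ((x).pgd[0])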
2. The TLB ("fast table")
Once the system touches a page, it tends to keep working within that page for a while. So, to speed up page-table access, the hardware provides a small cache holding the most recently used translations; when virtual memory is accessed, this cache is consulted first, which is much faster than walking the tables in RAM. This cache of current translations is the TLB (translation lookaside buffer), known in Chinese texts as the "fast table".
3. ARM's two-level page tables
The ARM MMU hardware only understands two-level page tables.
The top 12 bits of the virtual address index the L1 (first-level) table, so the L1 table has 4096 entries; each entry is 4 bytes, making the L1 table 16KB.
** L1 table entries (items):**
There are 4 kinds of first-level entries; the lowest 2 bits [1:0] of each entry say what kind it is.
The two kinds that matter here:
- an entry pointing to a second-level (L2) page table;
- a section entry pointing directly at 1MB of physical memory — so the 4096 entries can cover the full 4GB.
The L1 table sits at 0xc0004000 while the Linux kernel itself starts at 0xc0008000: the gap is exactly the 16KB the L1 table needs.
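To make the entry layout concrete, here is a minimal sketch (not kernel code) of assembling a 1MB section entry in the ARMv7 short-descriptor format; the permission flag value is illustrative only:
#include <stdint.h>

#define SECTION_SHIFT 20          /* one section maps 1MB           */
#define SECTION_TYPE  0x2         /* descriptor bits [1:0] = 0b10   */
#define SECTION_AP_RW (0x3 << 10) /* AP bits: full access (illustrative) */

/* An L1 section descriptor: bits [31:20] hold the section's physical
 * base address, the low bits hold the type and permission flags. */
static uint32_t make_section_entry(uint32_t pa, uint32_t flags)
{
	return (pa & ~((1u << SECTION_SHIFT) - 1)) | flags | SECTION_TYPE;
}

/* The MMU then translates va -> pa as:
 *   entry = l1_table[va >> 20];
 *   pa    = (entry & 0xfff00000) | (va & 0x000fffff);
 */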
Linux's three-level page table on ARM's two-level hardware
To fit Linux's three levels onto ARM's two, the pud level is empty and the pgd is the pmd (the middle level is folded away).
Code layout: mm/ and arch/arm/mm/
Building the page tables:
During early bring-up the kernel builds a temporary page table; that L1 table uses the second kind of entry, each pointing at a 1MB physical section.
__create_page_tables in arch/arm/kernel/head.S performs this early page-table setup for use during kernel initialization — the tables are in place before start_kernel, before the MMU is even enabled.
4. Page table initialization
The page tables built above are overwritten during kernel initialization, i.e.
setup_arch()
–>paging_init()
–>-->map_lowmem() is where the page tables are rebuilt.
paging_init() is defined in arch/arm/mm/mmu.c
/*
* paging_init() sets up the page tables, initialises the zone memory
* maps, and sets up the zero page, bad page and bad page tables.
*/
void __init paging_init(const struct machine_desc *mdesc)
{
void *zero_page;
prepare_page_table();
map_lowmem(); /* map low memory */
memblock_set_current_limit(arm_lowmem_limit);
dma_contiguous_remap();
early_fixmap_shutdown();
devicemaps_init(mdesc);
kmap_init();
tcm_init();
top_pmd = pmd_off_k(0xffff0000);
/* allocate the zero page. */
zero_page = early_alloc(PAGE_SIZE);
bootmem_init();
empty_zero_page = virt_to_page(zero_page);
__flush_dcache_page(NULL, empty_zero_page);
/* Compute the virt/idmap offset, mostly for the sake of KVM */
kimage_voffset = (unsigned long)&kimage_voffset - virt_to_idmap(&kimage_voffset);
}
5. Creating mappings: create_mapping
The function that builds page-table mappings dynamically. When the kernel wants to access high memory beyond its directly mapped window (beyond the kernel's ~1GB), a dynamic mapping must be created.
In arch/arm/mm/mmu.c:
static void __init __create_mapping(struct mm_struct *mm, struct map_desc *md,
void *(*alloc)(unsigned long sz),
bool ng)
{
unsigned long addr, length, end;
phys_addr_t phys;
const struct mem_type *type;
pgd_t *pgd;
type = &mem_types[md->type];
#ifndef CONFIG_ARM_LPAE
/*
* Catch 36-bit addresses
*/
if (md->pfn >= 0x100000) {
create_36bit_mapping(mm, md, type, ng);
return;
}
#endif
addr = md->virtual & PAGE_MASK;
phys = __pfn_to_phys(md->pfn);
length = PAGE_ALIGN(md->length + (md->virtual & ~PAGE_MASK));
if (type->prot_l1 == 0 && ((addr | phys | length) & ~SECTION_MASK)) {
pr_warn("BUG: map for 0x%08llx at 0x%08lx can not be mapped using pages, ignoring.\n",
(long long)__pfn_to_phys(md->pfn), addr);
return;
}
pgd = pgd_offset(mm, addr);
end = addr + length;
do {
unsigned long next = pgd_addr_end(addr, end);
alloc_init_pud(pgd, addr, next, phys, type, alloc, ng);
phys += next - addr;
addr = next;
} while (pgd++, addr != end);
}
This is how the MMU turns a virtual address into a physical one.
Page-table construction, code walk-through (only the linear mapping of kernel lowmem is analyzed):
start_kernel
–>setup_arch() //arch/arm/kernel/setup.c
–>-->paging_init() //arch/arm/mm/mmu.c
–>-->–> map_lowmem()
–>-->–>-->create_mapping()
First, the architecture-specific setup function setup_arch().
init_mm, in mm/init-mm.c, is the mm_struct used by the init task:
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir, /* the init task's page table */
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
.cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
INIT_MM_CONTEXT(init_mm)
};
void __init setup_arch(char **cmdline_p)
{
....
init_mm.start_code = (unsigned long) _text; /*init进程的代码段*/
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = (unsigned long) _end;
.....
early_fixmap_init();
early_ioremap_init();
parse_early_param();
#ifdef CONFIG_MMU
early_mm_init(mdesc);
#endif
setup_dma_zone(mdesc);
xen_early_init();
efi_init();
/*
* Make sure the calculation for lowmem/highmem is set appropriately
* before reserving/allocating any memory
*/
adjust_lowmem_bounds();
arm_memblock_init(mdesc);
/* Memory may have been removed so recalculate the bounds. */
adjust_lowmem_bounds();
early_ioremap_reset();
paging_init(mdesc); /* initialize the page tables and set up the mappings */
}
First, let's look at what runs before paging_init().
early_fixmap_init()
The fixmap region is a fixed-mapping area. For example, the FDT sits in physical memory and is needed before paging_init() has run, so it is mapped through the fixmap region. early_fixmap_init() only builds the skeleton of the mapping, i.e. it sets up the pmd; the pte entries are filled in later, when the FDT is actually mapped, by fixmap_remap_fdt(). If you inspect the device's memory layout, the "fixed" entry is the fixmap region.
void __init early_fixmap_init(void)
{
pmd_t *pmd;
/*
* The early fixmap range spans multiple pmds, for which
* we are not prepared:
*/
BUILD_BUG_ON((__fix_to_virt(__end_of_early_ioremap_region) >> PMD_SHIFT)
!= FIXADDR_TOP >> PMD_SHIFT);
pmd = fixmap_pmd(FIXADDR_TOP);
pmd_populate_kernel(&init_mm, pmd, bm_pte);
pte_offset_fixmap = pte_offset_early_fixmap;
}
static inline pmd_t * __init fixmap_pmd(unsigned long addr)
{
pgd_t *pgd = pgd_offset_k(addr);
pud_t *pud = pud_offset(pgd, addr);
pmd_t *pmd = pmd_offset(pud, addr);
return pmd;
}
pgd_offset_k is in arch/arm/include/asm/pgtable.h:
extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; /* the init task's first-level page table */
/* to find an entry in a page-table-directory */
#define pgd_index(addr) ((addr) >> PGDIR_SHIFT)
#define pgd_offset(mm, addr) ((mm)->pgd + pgd_index(addr))
/* to find an entry in a kernel page-table-directory */
#define pgd_offset_k(addr) pgd_offset(&init_mm, addr)
pud_offset is simply the pgd:
#define pud_offset(pgd, start) (pgd)
In a two-level layout, pmd_offset likewise just returns the pud, which is the pgd.
In arch/arm/include/asm/pgtable-2level.h:
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
{
return (pmd_t *)pud;
}
Next, pmd_populate_kernel, in arch/arm/include/asm/pgalloc.h:
pmd_populate_kernel
–>__pmd_populate
–>-->flush_pmd_entry
static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t pte,
pmdval_t prot)
{
pmdval_t pmdval = (pte + PTE_HWTABLE_OFF) | prot;
pmdp[0] = __pmd(pmdval);
#ifndef CONFIG_ARM_LPAE
pmdp[1] = __pmd(pmdval + 256 * sizeof(pte_t));
#endif
flush_pmd_entry(pmdp);
}
flush_pmd_entry, in arch/arm/include/asm/tlbflush.h, cleans the pmd entry out of the data cache so the MMU's table walker sees the update:
static inline void flush_pmd_entry(void *pmd)
{
const unsigned int __tlb_flag = __cpu_tlb_flags;
tlb_op(TLB_DCLEAN, "c7, c10, 1 @ flush_pmd", pmd);
tlb_l2_op(TLB_L2CLEAN_FR, "c15, c9, 1 @ L2 flush_pmd", pmd);
if (tlb_flag(TLB_WB))
dsb(ishst);
}
The __fix_to_virt macro is in include/asm-generic/fixmap.h:
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
#define __virt_to_fix(x) ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)
FIXADDR_TOP is in arch/arm/include/asm/:
#define FIXADDR_START 0xffc00000UL
#define FIXADDR_END 0xfff00000UL
#define FIXADDR_TOP (FIXADDR_END - PAGE_SIZE)
ARM uses two-level tables, so the pgd is the pmd; that is why only the pmd is set up here.
Fixmap region diagram — the region serves the FDT mapping and early I/O mappings.
Before the FDT has been parsed, the DDR memory layout is unknown and everything relies on compile-time addresses, so the FDT must be parsed as early as possible to obtain the memory information. The fixmap is used for exactly that: mapping the DTB and providing the early fixed I/O mappings.
early_ioremap_init()
arch/arm/mm/ioremap.c
early_ioremap_init()
–>early_ioremap_setup() //mm/early_ioremap.c
void __init early_ioremap_setup(void)
{
int i;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
if (WARN_ON(prev_map[i]))
break;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}
Early I/O mappings also live in the fixmap region and are managed as slots, with at most FIX_BTMAPS_SLOTS (7) slots. slot_virt records each slot's virtual address, and the prev_map array tracks the virtual addresses that are currently handed out.
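As a hedged sketch of how early boot code might use this slot-based API before normal ioremap() is available (the UART physical address below is made up):
/* Illustrative early-boot use of the API above; 0x10009000 is a
 * made-up UART base, real callers pass their SoC's address. */
#include <linux/sizes.h>
#include <asm/early_ioremap.h>

static void __init early_uart_probe(void)
{
	void __iomem *base;

	base = early_ioremap(0x10009000, SZ_4K); /* takes one fixmap slot */
	if (!base)
		return;

	/* ... access the device through 'base' ... */

	early_iounmap(base, SZ_4K); /* slots must be released while still in early boot */
}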
setup_dma_zone()
arch/arm/mm/init.c
void __init setup_dma_zone(const struct machine_desc *mdesc)
{
#ifdef CONFIG_ZONE_DMA
if (mdesc->dma_zone_size) {
arm_dma_zone_size = mdesc->dma_zone_size;
arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1;
} else
arm_dma_limit = 0xffffffff;
arm_dma_pfn_limit = arm_dma_limit >> PAGE_SHIFT;
#endif
}
This only initializes a few variables; we will return to them in the memzone section.
arm_memblock_init()
arch/arm/mm/init.c
void __init arm_memblock_init(const struct machine_desc *mdesc)
{
/* Register the kernel text, kernel data and initrd with memblock. */
memblock_reserve(__pa(KERNEL_START), KERNEL_END - KERNEL_START); /* reserve the region occupied by the kernel image; it can no longer be handed out */
arm_initrd_init(); /* initrd: the temporary in-memory root filesystem, mounted before the real rootfs */
arm_mm_memblock_reserve(); /* reserve the region of init's pgd, the swapper_pg_dir page table */
/* reserve any platform specific memblock areas */
if (mdesc->reserve)
mdesc->reserve();
early_init_fdt_reserve_self();
early_init_fdt_scan_reserved_mem();
/* reserve memory for DMA contiguous allocations */
dma_contiguous_reserve(arm_dma_limit);
arm_memblock_steal_permitted = false;
memblock_dump_all();
}
static void __init arm_initrd_init(void)
{
#ifdef CONFIG_BLK_DEV_INITRD
phys_addr_t start;
unsigned long size;
/* FDT scan will populate initrd_start */
if (initrd_start && !phys_initrd_size) {
phys_initrd_start = __virt_to_phys(initrd_start);
phys_initrd_size = initrd_end - initrd_start;
}
initrd_start = initrd_end = 0;
if (!phys_initrd_size)
return;
/*
* Round the memory region to page boundaries as per free_initrd_mem()
* This allows us to detect whether the pages overlapping the initrd
* are in use, but more importantly, reserves the entire set of pages
* as we don't want these pages allocated for other purposes.
*/
start = round_down(phys_initrd_start, PAGE_SIZE);
size = phys_initrd_size + (phys_initrd_start - start);
size = round_up(size, PAGE_SIZE);
if (!memblock_is_region_memory(start, size)) {
pr_err("INITRD: 0x%08llx+0x%08lx is not a memory region - disabling initrd\n",
(u64)start, size);
return;
}
if (memblock_is_region_reserved(start, size)) {
pr_err("INITRD: 0x%08llx+0x%08lx overlaps in-use memory region - disabling initrd\n",
(u64)start, size);
return;
}
memblock_reserve(start, size);
/* Now convert initrd to virtual addresses */
initrd_start = __phys_to_virt(phys_initrd_start);
initrd_end = initrd_start + phys_initrd_size;
#endif
}
arm_mm_memblock_reserve() is in arch/arm/mm/mmu.c:
void __init arm_mm_memblock_reserve(void)
{
/*
* Reserve the page tables. These are already in use,
* and can only be in node 0.
*/
memblock_reserve(__pa(swapper_pg_dir), SWAPPER_PG_DIR_SIZE);
#ifdef CONFIG_SA1111
/*
* Because of the SA1111 DMA bug, we want to preserve our
* precious DMA-able memory...
*/
memblock_reserve(PHYS_OFFSET, __pa(swapper_pg_dir) - PHYS_OFFSET);
#endif
}
memblock_reserve(), mm/memblock.c
Adds a block to the reserved regions:
int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
phys_addr_t end = base + size - 1;
memblock_dbg("memblock_reserve: [%pa-%pa] %pF\n",
&base, &end, (void *)_RET_IP_);
return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
}
paging_init()
Back to paging_init().
prepare_page_table() clears the pmd mappings around the kernel image.
map_lowmem() maps low memory.
static inline void prepare_page_table(void)
{
unsigned long addr;
phys_addr_t end;
/*
* Clear out all the mappings below the kernel image.
*/
for (addr = 0; addr < MODULES_VADDR; addr += PMD_SIZE)
pmd_clear(pmd_off_k(addr));
#ifdef CONFIG_XIP_KERNEL
/* The XIP kernel is mapped in the module area -- skip over it */
addr = ((unsigned long)_exiprom + PMD_SIZE - 1) & PMD_MASK;
#endif
for ( ; addr < PAGE_OFFSET; addr += PMD_SIZE)
pmd_clear(pmd_off_k(addr));
/*
* Find the end of the first block of lowmem.
*/
end = memblock.memory.regions[0].base + memblock.memory.regions[0].size;
if (end >= arm_lowmem_limit)
end = arm_lowmem_limit;
/*
* Clear out all the kernel space mappings, except for the first
* memory bank, up to the vmalloc region.
*/
for (addr = __phys_to_virt(end);
addr < VMALLOC_START; addr += PMD_SIZE)
pmd_clear(pmd_off_k(addr));
}
static void __init map_lowmem(void)
{
struct memblock_region *reg;
phys_addr_t kernel_x_start = round_down(__pa(KERNEL_START), SECTION_SIZE);
phys_addr_t kernel_x_end = round_up(__pa(__init_end), SECTION_SIZE);
/* Map all the lowmem memory banks. */
for_each_memblock(memory, reg) { /* map every memory region registered with memblock */
phys_addr_t start = reg->base;
phys_addr_t end = start + reg->size;
struct map_desc map;
if (memblock_is_nomap(reg))
continue;
if (end > arm_lowmem_limit)
end = arm_lowmem_limit;
if (start >= end)
break;
if (end < kernel_x_start) {
map.pfn = __phys_to_pfn(start);
map.virtual = __phys_to_virt(start);
map.length = end - start;
map.type = MT_MEMORY_RWX;
create_mapping(&map);
} else if (start >= kernel_x_end) {
map.pfn = __phys_to_pfn(start);
map.virtual = __phys_to_virt(start);
map.length = end - start;
map.type = MT_MEMORY_RW;
create_mapping(&map);
} else {
/* This better cover the entire kernel */
if (start < kernel_x_start) {
map.pfn = __phys_to_pfn(start);
map.virtual = __phys_to_virt(start);
map.length = kernel_x_start - start;
map.type = MT_MEMORY_RW;
create_mapping(&map);
}
map.pfn = __phys_to_pfn(kernel_x_start);
map.virtual = __phys_to_virt(kernel_x_start);
map.length = kernel_x_end - kernel_x_start;
map.type = MT_MEMORY_RWX;
create_mapping(&map);
if (kernel_x_end < end) {
map.pfn = __phys_to_pfn(kernel_x_end);
map.virtual = __phys_to_virt(kernel_x_end);
map.length = end - kernel_x_end;
map.type = MT_MEMORY_RW;
create_mapping(&map);
}
}
}
}
The map_desc structure
arch/arm/include/asm/mach/map.h
struct map_desc {
unsigned long virtual; /* virtual address */
unsigned long pfn; /* page frame number */
unsigned long length;
unsigned int type;
};
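A minimal sketch of filling in a map_desc, loosely modeled on what devicemaps_init() does; the physical address is a placeholder, and note that create_mapping() is static to arch/arm/mm/mmu.c — board code would normally go through iotable_init():
/* Sketch: statically map one 1MB device region the way the mmu.c
 * helpers do. The physical address is a placeholder, not a real
 * board's value. */
struct map_desc map;

map.pfn     = __phys_to_pfn(0x80000000); /* placeholder PA */
map.virtual = 0xf0000000;                /* VA chosen in the kernel's static-map area */
map.length  = SZ_1M;
map.type    = MT_DEVICE;                 /* device memory attributes */

create_mapping(&map);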
6. Page faults
7. swap: page swapping and the role of kswapd
II. Memory Management
1. The big picture: the memory management framework
How node, memzone, the page frame allocator and the slab allocator relate.
Diagram:
2. node
The node concept: a NUMA system has multiple memory nodes; a UMA system has a single node.
3. memzone
Each node has several zones; on a UMA system the zones are DMA, NORMAL, HIGHMEM, and so on.
See mm/mmzone.c:
/*
* next_zone - helper magic for for_each_zone()
*/
struct zone *next_zone(struct zone *zone)
{
pg_data_t *pgdat = zone->zone_pgdat;
if (zone < pgdat->node_zones + MAX_NR_ZONES - 1)
zone++;
else {
pgdat = next_online_pgdat(pgdat);
if (pgdat)
zone = pgdat->node_zones;
else
zone = NULL;
}
return zone;
}
mmzone.c contains only this next_zone() helper; most of the related functions live in include/linux/mmzone.h.
The pg_data_t structure, include/linux/mmzone.h:
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
#endif
#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
* or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
spinlock_t node_size_lock;
#endif
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
int kswapd_order;
enum zone_type kswapd_classzone_idx;
int kswapd_failures; /* Number of 'reclaimed == 0' runs */
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
wait_queue_head_t kcompactd_wait;
struct task_struct *kcompactd;
#endif
/*
* This is a per-node reserve of pages that are not available
* to userspace allocations.
*/
unsigned long totalreserve_pages;
#ifdef CONFIG_NUMA
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* If memory initialisation on large machines is deferred then this
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
/* Number of non-deferred pages */
unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
spinlock_t split_queue_lock;
struct list_head split_queue;
unsigned long split_queue_len;
#endif
/* Fields commonly accessed by the page reclaim scanner */
struct lruvec lruvec;
unsigned long flags;
ZONE_PADDING(_pad2_)
/* Per-node vmstats */
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;
4. The bootmem allocator
At early boot Linux used the bootmem allocator, which manages pages with a bitmap and a first-fit policy — inefficient and prone to fragmentation. Once the buddy system is ready, memory management is handed over to it.
The core structure, bootmem_data:
include/linux/bootmem.h
typedef struct bootmem_data {
unsigned long node_min_pfn;
unsigned long node_low_pfn; /* frame number of the last page of low memory */
void *node_bootmem_map; /* where the bitmap lives in memory */
unsigned long last_end_off; /* offset within the last allocated page; 0 if that page is fully used */
unsigned long hint_idx;
struct list_head list;
} bootmem_data_t;
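To see why first-fit over a bitmap is slow and fragments, here is a toy version of the idea (illustrative only, not the kernel's code) — every allocation rescans the map from the start:
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 1024
static bool used[NPAGES]; /* one flag per page frame, like bootmem's bitmap */

/* First-fit: scan from page 0 for a run of 'count' free pages.
 * Returns the first pfn of the run, or -1 on failure. */
static long bitmap_alloc_pages(size_t count)
{
	for (size_t base = 0; base + count <= NPAGES; base++) {
		size_t i;
		for (i = 0; i < count && !used[base + i]; i++)
			;
		if (i == count) { /* found a run: mark it used */
			for (i = 0; i < count; i++)
				used[base + i] = true;
			return (long)base;
		}
		base += i; /* skip past the used page we hit */
	}
	return -1;
}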
Initialization:
paging_init()
–>bootmem_init()
bootmem_init() is in arch/arm/mm/init.c.
The architecture-independent bootmem initialization is init_bootmem_core() in mm/bootmem.c — how do the two connect? Let's see.
void __init bootmem_init(void)
{
unsigned long min, max_low, max_high;
memblock_allow_resize();
max_low = max_high = 0;
find_limits(&min, &max_low, &max_high);
early_memtest((phys_addr_t)min << PAGE_SHIFT,
(phys_addr_t)max_low << PAGE_SHIFT);
/*
* Sparsemem tries to allocate bootmem in memory_present(),
* so must be done after the fixed reservations
*/
arm_memory_present();
/*
* sparse_init() needs the bootmem allocator up and running.
*/
sparse_init();
/*
* Now free the memory - free_area_init_node needs
* the sparse mem_map arrays initialized by sparse_init()
* for memmap_init_zone(), otherwise all PFNs are invalid.
*/
zone_sizes_init(min, max_low, max_high);
/*
* This doesn't seem to be used by the Linux memory manager any
* more, but is used by ll_rw_block. If we can get rid of it, we
* also get rid of some of the stuff above as well.
*/
min_low_pfn = min;
max_low_pfn = max_low;
max_pfn = max_high;
}
Note that bootmem_init() never calls init_bootmem_node() — so is bootmem simply not used on ARM?
Indeed: ARM does not use the bootmem allocator (memblock fills that role) and goes straight to the buddy system, whose initialization starts at zone_sizes_init(min, max_low, max_high).
5. The buddy system: allocations of one page or more
The page frame allocator. Each zone has its own buddy system, and allocation falls back in order from HIGHMEM to NORMAL to DMA. The buddy system manages page frames in powers of two: each allocation hands out exactly 2^order page frames.
In struct zone — every zone carries one page frame allocator.
The zone's free_area array holds the zone's free pages:
one list of 2^order-page blocks per order.
struct zone {
...
struct free_area free_area[MAX_ORDER];
}
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
free_area behaves like a two-dimensional array: for each order there is an array of per-migrate-type list heads, each linking blocks of 2^order page frames.
To fight fragmentation, buddy-system page blocks are given migrate types.
When one migrate type runs out of memory, the allocator borrows from the other types' lists, in a default order given by
the fallbacks array in mm/page_alloc.c:
static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
[MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};
Two globals:
pageblock_order is the allocation order the kernel considers "large";
pageblock_nr_pages is the number of pages of that order.
So the buddy system is really direct management of the zone's free_area, one free_area entry per order. The buddy arithmetic itself is tiny, as the sketch below shows.
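A block's buddy is found by flipping a single bit of its pfn — the same arithmetic as the kernel's __find_buddy_pfn(); a minimal sketch:
/* The buddy of the block starting at 'pfn' of size 2^order is the
 * neighboring same-size block; its pfn differs in exactly bit 'order'.
 * Mirrors the kernel's __find_buddy_pfn(). */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* Example: pfn 8 at order 2 (a 4-page block) -> buddy starts at pfn 12,
 * and vice versa. When both are free they merge into the 8-page block
 * at combined_pfn = buddy_pfn & pfn = 8. */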
5.1 Buddy system initialization
Call chain:
bootmem_init() //arch/arm/mm/init.c
–>zone_sizes_init()
–>-->free_area_init_node() //mm/page_alloc.c
–>-->–>free_area_init_core()
–>-->–>-->memmap_init()/memmap_init_zone()
memmap_init_zone is the buddy system's initialization function:
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context,
struct vmem_altmap *altmap)
{
unsigned long end_pfn = start_pfn + size;
pg_data_t *pgdat = NODE_DATA(nid);
unsigned long pfn;
unsigned long nr_initialised = 0;
struct page *page;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
struct memblock_region *r = NULL, *tmp;
#endif
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
/*
* Honor reservation requested by the driver for this ZONE_DEVICE
* memory
*/
if (altmap && start_pfn == altmap->base_pfn)
start_pfn += altmap->reserve;
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s handed to this
* function. They do not exist on hotplugged memory.
*/
if (context != MEMMAP_EARLY)
goto not_early;
if (!early_pfn_valid(pfn))
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
break;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
/*
* Check given memblock attribute by firmware which can affect
* kernel memory layout. If zone==ZONE_MOVABLE but memory is
* mirrored, it's an overlapped memmap init. skip it.
*/
if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
if (!r || pfn >= memblock_region_memory_end_pfn(r)) {
for_each_memblock(memory, tmp)
if (pfn < memblock_region_memory_end_pfn(tmp))
break;
r = tmp;
}
if (pfn >= memblock_region_memory_base_pfn(r) &&
memblock_is_mirror(r)) {
/* already initialized as NORMAL */
pfn = memblock_region_memory_end_pfn(r);
continue;
}
}
#endif
not_early:
page = pfn_to_page(pfn);
__init_single_page(page, pfn, zone, nid); /* only ties the page to its zone, marking the page as owned by that zone */
if (context == MEMMAP_HOTPLUG)
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
* kernel allocations are made.
*
* bitmap is created for zone's valid pfn range. but memmap
* can be created for invalid pages (for alignment)
* check here not to call set_pageblock_migratetype() against
* pfn out of zone.
*
* Please note that MEMMAP_HOTPLUG path doesn't clear memmap
* because this is done early in sparse_add_one_section
*/
if (!(pfn & (pageblock_nr_pages - 1))) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
}
}
Since the buddy system manages the zone's free_area, where does free_area itself get populated? Let's find out.
We never actually saw free_area being filled above: pages are first all marked busy and then released one by one into free_area. The bootmem_init() above is called from start_kernel()/setup_arch()/paging_init(); the real initialization of the buddy system happens in start_kernel()/mm_init():
mm_init
–>mem_init()
–>-->free_all_bootmem()
–>-->–>free_low_memory_core_early()
–>-->–>-->__free_memory_core()
–>-->–>-->–>__free_pages_memory()
–>-->–>-->–>-->__free_pages_bootmem()
–>-->–>-->–>-->–>__free_pages_boot_core()
–>-->–>-->–>-->–>-->__free_pages()
That completes the hand-over into the buddy system. Right after initialization only the highest order's free_area holds anything; the lower-order free_areas are empty, and allocations split blocks off the high-order lists on demand.
5.2 Allocation functions
alloc_pages() include/linux/gfp.h
–>alloc_pages_node()
–>-->__alloc_pages_node()
–>-->–>__alloc_pages()
–>-->–>-->__alloc_pages_nodemask() //mm/page_alloc.c
Now the heart of the zoned buddy allocator, __alloc_pages_nodemask():
/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = { };
/*
* There are several places where we assume that the order value is sane
* so bail out early if the request is out of bound.
*/
if (unlikely(order >= MAX_ORDER)) {
WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
return NULL;
}
gfp_mask &= gfp_allowed_mask;
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
finalise_ac(gfp_mask, &ac);
/* First allocation attempt */
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (likely(page))
goto out;
/*
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
* resp. GFP_NOIO which has to be inherited for all allocation requests
* from a particular context which has been marked by
* memalloc_no{fs,io}_{save,restore}.
*/
alloc_mask = current_gfp_context(gfp_mask);
ac.spread_dirty_pages = false;
/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
*/
if (unlikely(ac.nodemask != nodemask))
ac.nodemask = nodemask;
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
out:
if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
__free_pages(page, order);
page = NULL;
}
trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
return page;
}
From the allocation function you can see that allocation happens in two steps:
1. Find the right zone; if that zone has no space, fall back to the next zone (there is a fallback zone array).
2. Look at the free list of the requested order; if it is empty, borrow from a higher order — e.g. if no 4-page blocks are left, take an 8-page block and split it into two 4-page halves, and if there are no 8-page blocks either, go higher still. A toy model of this loop follows.
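A toy model of step 2 (illustrative, not kernel code — the kernel's real loop is __rmqueue_smallest() plus expand(), shown further below):
#include <stdio.h>

#define MAX_ORDER 11
#define PER_ORDER 64

/* free_area[o] is a stack of starting pfns of free 2^o-page blocks. */
static unsigned long free_area[MAX_ORDER][PER_ORDER];
static int nr_free[MAX_ORDER];

static long toy_alloc(unsigned int order)
{
	for (unsigned int o = order; o < MAX_ORDER; o++) {
		if (nr_free[o] == 0)
			continue; /* this order is empty, try a higher one */
		unsigned long pfn = free_area[o][--nr_free[o]];
		while (o > order) { /* split: give the upper half back */
			o--;
			free_area[o][nr_free[o]++] = pfn + (1UL << o);
		}
		return (long)pfn;
	}
	return -1; /* no block large enough */
}

int main(void)
{
	free_area[3][nr_free[3]++] = 0; /* seed: one free 8-page block at pfn 0 */
	printf("got pfn %ld\n", toy_alloc(2)); /* takes pfns 0-3, frees 4-7 at order 2 */
	return 0;
}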
Now the key function, get_page_from_freelist():
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac)
{
struct zoneref *z = ac->preferred_zoneref;
struct zone *zone;
struct pglist_data *last_pgdat_dirty_limit = NULL;
/*
* Scan zonelist, looking for a zone with enough free.
* See also __cpuset_node_allowed() comment in kernel/cpuset.c.
*/
for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
struct page *page;
unsigned long mark;
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
/*
* When allocating a page cache page for writing, we
* want to get it from a node that is within its dirty
* limit, such that no single node holds more than its
* proportional share of globally allowed dirty pages.
* The dirty limits take into account the node's
* lowmem reserves and high watermark so that kswapd
* should be able to balance it without having to
* write pages from its LRU list.
*
* XXX: For now, allow allocations to potentially
* exceed the per-node dirty limit in the slowpath
* (spread_dirty_pages unset) before going into reclaim,
* which is important when on a NUMA setup the allowed
* nodes are together not big enough to reach the
* global limit. The proper fix for these situations
* will require awareness of nodes in the
* dirty-throttling and the flusher threads.
*/
if (ac->spread_dirty_pages) {
if (last_pgdat_dirty_limit == zone->zone_pgdat)
continue;
if (!node_dirty_ok(zone->zone_pgdat)) {
last_pgdat_dirty_limit = zone->zone_pgdat;
continue;
}
}
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_fast(zone, order, mark,
ac_classzone_idx(ac), alloc_flags)) {
int ret;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* Watermark failed for this zone, but see if we can
* grow this zone if it contains deferred pages.
*/
if (static_branch_unlikely(&deferred_pages)) {
if (_deferred_grow_zone(zone, order))
goto try_this_zone;
}
#endif
/* Checked here to keep the fast path fast */
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
if (node_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
continue;
ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
switch (ret) {
case NODE_RECLAIM_NOSCAN:
/* did not scan */
continue;
case NODE_RECLAIM_FULL:
/* scanned but unreclaimable */
continue;
default:
/* did we reclaim enough */
if (zone_watermark_ok(zone, order, mark,
ac_classzone_idx(ac), alloc_flags))
goto try_this_zone;
continue;
}
}
try_this_zone:
/* take the pages from this zone */
page = rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags);
/*
* If this is a high-order atomic allocation then check
* if the pageblock should be reserved for the future
*/
if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
reserve_highatomic_pageblock(page, zone, order);
return page;
} else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/* Try again if zone has deferred pages */
if (static_branch_unlikely(&deferred_pages)) {
if (_deferred_grow_zone(zone, order))
goto try_this_zone;
}
#endif
}
}
return NULL;
}
rmqueue() does the actual fetch:
/*
* Allocate a page from the given zone. Use pcplists for order-0 allocations.
*/
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
unsigned long flags;
struct page *page;
if (likely(order == 0)) {
/* for a single page, take it from the CPU's per-CPU cache */
page = rmqueue_pcplist(preferred_zone, zone, order,
gfp_flags, migratetype);
goto out;
}
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
spin_lock_irqsave(&zone->lock, flags);
do {
page = NULL;
if (alloc_flags & ALLOC_HARDER) {
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (page)
trace_mm_page_alloc_zone_locked(page, order, migratetype);
}
if (!page)
page = __rmqueue(zone, order, migratetype);
} while (page && check_new_pages(page, order));
spin_unlock(&zone->lock);
if (!page)
goto failed;
__mod_zone_freepage_state(zone, -(1 << order),
get_pcppage_migratetype(page));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone);
local_irq_restore(flags);
out:
VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
return page;
failed:
local_irq_restore(flags);
return NULL;
}
For a single page, rmqueue_pcplist() takes it from the per-CPU cache; for any other order we go to the zone's free_area.
/*
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
*/
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
{
unsigned int current_order;
struct free_area *area;
struct page *page;
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]); /* get this order's free_area */
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, lru);
if (!page)
continue;
list_del(&page->lru); /* unlink from the free list */
rmv_page_order(page);
area->nr_free--; /* one fewer free block */
/* if the order taken is larger than requested, split the big block back down to this order */
expand(zone, page, order, current_order, area, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
}
return NULL;
}
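For reference, the splitting helper expand() boils down to the following (lightly abridged sketch of the kernel function; the debug guard-page handling is omitted):
/* Abridged sketch of mm/page_alloc.c:expand(): 'high' is the order we
 * removed from the free lists, 'low' is the order requested. Each pass
 * hands the upper half of the block back to the next lower free list,
 * until only 2^low pages remain in 'page'. */
static inline void expand(struct zone *zone, struct page *page,
			  int low, int high, struct free_area *area,
			  int migratetype)
{
	unsigned long size = 1 << high;

	while (high > low) {
		area--; /* step down to free_area[high - 1] */
		high--;
		size >>= 1;
		list_add(&page[size].lru, &area->free_list[migratetype]);
		area->nr_free++;
		set_page_order(&page[size], high);
	}
}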
5.3 Free functions
free_pages() //mm/page_alloc.c
–>__free_pages()
–>-->free_the_page()
–>-->–>__free_pages_ok()
–>-->–>-->free_one_page()
–>-->–>-->–>__free_one_page()
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
int migratetype)
{
unsigned long combined_pfn;
unsigned long uninitialized_var(buddy_pfn);
struct page *buddy;
unsigned int max_order;
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
__mod_zone_freepage_state(zone, 1 << order, migratetype);
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
buddy = page + (buddy_pfn - pfn);
if (!pfn_valid_within(buddy_pfn))
goto done_merging;
if (!page_is_buddy(page, buddy, order))
goto done_merging;
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
if (page_is_guard(buddy)) {
clear_page_guard(zone, buddy, order, migratetype);
} else {
list_del(&buddy->lru);
zone->free_area[order].nr_free--;
rmv_page_order(buddy);
}
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
order++;
}
if (max_order < MAX_ORDER) {
/* If we are here, it means order is >= pageblock_order.
* We want to prevent merge between freepages on isolate
* pageblock and normal pageblock. Without this, pageblock
* isolation could cause incorrect freepage or CMA accounting.
*
* We don't want to hit this code for the more frequent
* low-order merging.
*/
if (unlikely(has_isolate_pageblock(zone))) {
int buddy_mt;
buddy_pfn = __find_buddy_pfn(pfn, order);
buddy = page + (buddy_pfn - pfn);
buddy_mt = get_pageblock_migratetype(buddy);
if (migratetype != buddy_mt
&& (is_migrate_isolate(migratetype) ||
is_migrate_isolate(buddy_mt)))
goto done_merging;
}
max_order++;
goto continue_merging;
}
done_merging:
set_page_order(page, order);
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
* that pages are being freed that will coalesce soon. In case,
* that is happening, add the free page to the tail of the list
* so it's less likely to be used soon and more likely to be merged
* as a higher order page
*/
if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
struct page *higher_page, *higher_buddy;
combined_pfn = buddy_pfn & pfn;
higher_page = page + (combined_pfn - pfn);
buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
higher_buddy = higher_page + (buddy_pfn - combined_pfn);
if (pfn_valid_within(buddy_pfn) &&
page_is_buddy(higher_page, higher_buddy, order + 1)) {
list_add_tail(&page->lru,
&zone->free_area[order].free_list[migratetype]);
goto out;
}
}
list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
zone->free_area[order].nr_free++;
}
Freeing is where defragmentation happens: as the code shows, the freed block is first merged with its free buddies into ever larger orders, and only then added to a free_list. A worked example of the merge arithmetic:
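Tracing __free_one_page()'s bit tricks with concrete numbers (standalone toy code, same arithmetic as the kernel):
#include <stdio.h>

/* Freeing the order-2 block at pfn 12 while its buddy (pfn 8) is
 * already free merges the two into the order-3 block at pfn 8. */
int main(void)
{
	unsigned long pfn = 12, order = 2;
	unsigned long buddy_pfn = pfn ^ (1UL << order);  /* 12 ^ 4 = 8 */
	unsigned long combined_pfn = buddy_pfn & pfn;    /* 8 & 12 = 8 */

	printf("buddy of %lu at order %lu is %lu; merged block: pfn %lu, order %lu\n",
	       pfn, order, buddy_pfn, combined_pfn, order + 1);
	return 0;
}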
The per-CPU page caches hold single pages, exploiting the allocation/free pattern of order-0 pages. In the allocation path, when order == 0, rmqueue_pcplist() hands out a page from the per-CPU list. Let's look inside:
/* Lock and remove page from the per-cpu list */
static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, int migratetype)
{
struct per_cpu_pages *pcp;
struct list_head *list;
struct page *page;
unsigned long flags;
local_irq_save(flags);
pcp = &this_cpu_ptr(zone->pageset)->pcp; /* get this CPU's pcp lists */
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone);
}
local_irq_restore(flags);
return page;
}
Now the zone->pageset percpu member:
struct zone {
struct per_cpu_pageset __percpu *pageset;
}
struct per_cpu_pageset {
struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
s8 expire;
u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
#endif
#ifdef CONFIG_SMP
s8 stat_threshold;
s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[MIGRATE_PCPTYPES]; /* lists of single pages, one per migrate type */
};
Where do these pcp lists get set up?
free_area_init_core()
–>zone_init_internals()
–>-->zone_pcp_init(zone)
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
static __meminit void zone_pcp_init(struct zone *zone)
{
/*
* per cpu subsystem is not up at this point. The following code
* relies on the ability of the linker to provide the
* offset of a (static) per cpu variable into the per cpu area.
*/
zone->pageset = &boot_pageset;
if (populated_zone(zone))
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
zone->name, zone->present_pages,
zone_batchsize(zone));
}
And where is boot_pageset itself set up?
start_kernel()
–>build_all_zonelists()
–>-->build_all_zonelists_init()
static noinline void __init
build_all_zonelists_init(void)
{
int cpu;
__build_all_zonelists(NULL);
/*
* Initialize the boot_pagesets that are going to be used
* for bootstrapping processors. The real pagesets for
* each zone will be allocated later when the per cpu
* allocator is available.
*
* boot_pagesets are used also for bootstrapping offline
* cpus if the system is already booted because the pagesets
* are needed to initialize allocators on a specific cpu too.
* F.e. the percpu allocator needs the page allocator which
* needs the percpu allocator in order to allocate its pagesets
* (a chicken-egg dilemma).
*/
for_each_possible_cpu(cpu)
setup_pageset(&per_cpu(boot_pageset, cpu), 0);
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
}
Here too everything starts out at zero; the lists fill up as pages are freed, because a freed single page goes to the per-CPU list first:
static inline void free_the_page(struct page *page, unsigned int order)
{
if (order == 0) /* Via pcp? single pages go through the per-CPU lists */
free_unref_page(page);
else
__free_pages_ok(page, order);
}
6. The sub-page allocator: the slab allocator
The slab allocator serves allocations smaller than one page — colloquially, many small objects "share a bunk" within the same pages. The generic slabs hand out objects of 2^n bytes; there are also dedicated slabs for frequently used structures such as dentry, which get their own caches to improve efficiency. A slab pre-allocates a fixed set of pages to carve into objects (an 8-byte slab, for example); when a slab runs out, it fetches more pages from the page frame allocator.
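Code: a minimal sketch of the dedicated-slab API the way a driver might use it (struct my_obj is invented for illustration):
#include <linux/module.h>
#include <linux/slab.h>

/* Hypothetical object type, standing in for things like dentry that
 * get their own dedicated cache. */
struct my_obj {
	int id;
	char name[32];
};

static struct kmem_cache *my_cache;

static int __init my_init(void)
{
	struct my_obj *obj;

	my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
				     0, SLAB_HWCACHE_ALIGN, NULL);
	if (!my_cache)
		return -ENOMEM;

	obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
	if (obj)
		kmem_cache_free(my_cache, obj); /* returns to the cache, not to the buddy system */
	return 0;
}

static void __exit my_exit(void)
{
	kmem_cache_destroy(my_cache);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");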
- When user code calls malloc (which enters the kernel through the brk/mmap system calls): the memory allocation flow.
- Inside the kernel: kmalloc is backed by the slab allocator; DMA memory allocation; the ioremap flow.
Kernel-space address ranges
The Linux virtual address space:
Diagram:
Mapping between virtual and physical addresses
- the vmalloc area
- the kmalloc area
- single-page allocations
- DMA space: CMA memory
- the process kernel stack
User space
- Memory map:
The stack sits at high addresses and grows downward, i.e. down from the 3GB mark. - A program's segments: code, rodata, data, bss, heap, stack.
- In task_struct:
mm_struct: manages the address space as regions;
the vma structure describes one range of virtual memory. - The mmap system call can map a range (a file, or via a driver even kernel memory) into user space.
- What happens when a program touches a stack variable versus a heap variable.
The stack's length is fixed when the process is created and its page tables are set up then, which is why stack overflow is possible. The heap is the C-library-managed memory behind malloc; when a heap virtual address turns out to be unmapped, a page table entry is created on demand via the page fault path, as the experiment below shows.
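A small user-space experiment (illustrative) showing that heap pages are only mapped on first touch — the malloc itself is instant, the faults arrive as the pages are written:
/* Illustrative demo: a large malloc returns immediately; physical
 * pages are wired up by minor page faults on first touch.
 * Linux-specific (getrusage). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long minor_faults(void)
{
	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	size_t len = 64 * 1024 * 1024;
	long before = minor_faults();

	char *p = malloc(len); /* only reserves virtual address space */
	long after_alloc = minor_faults();

	memset(p, 1, len); /* first touch: faults map the pages in */
	long after_touch = minor_faults();

	printf("alloc: +%ld faults, touch: +%ld faults\n",
	       after_alloc - before, after_touch - after_alloc);
	free(p);
	return 0;
}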