The lowmem_reserve member of the zone structure in the 2.6 kernel
struct zone {
/* Fields commonly accessed by the page allocator */
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
...
};
When the kernel allocates memory, several zones may be involved. The allocator first tries the first zone in the zonelist; if that fails, it falls back to the next, lower zone ("lower" here refers only to the zone's position in physical memory; in practice the low-address zones are the scarcer resource). Imagine an application that maps memory backed by Highmem and pins it with mlock. If the Highmem zone cannot satisfy the request, the allocator falls back to the Normal zone. This creates a problem: requests that belong in Highmem can exhaust the Normal zone, and because the pages are mlocked they can never be reclaimed. The end result is that the Normal zone has nothing left for the kernel's own allocations, while Highmem still holds plenty of reclaimable memory that cannot be put to use.
For exactly this case, lowmem_reserve lets the Normal zone declare, when it sees a request that fell back from Highmem: you may use my memory, but you must leave lowmem_reserve[ZONE_HIGHMEM] pages for my own use (the array is indexed by the class of the requesting allocation, as the setup code below shows).
Likewise, when allocation from Normal fails, the allocator tries the DMA zone next in the zonelist; the DMA zone's lowmem_reserve[ZONE_NORMAL] and lowmem_reserve[ZONE_HIGHMEM] entries limit requests falling back from Normal and Highmem respectively.
/*
* results with 256, 32 in the lowmem_reserve sysctl:
* 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
* 1G machine -> (16M dma, 784M normal, 224M high)
* NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
* HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
* HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
*
* TBD: should special case ZONE_DMA32 machines here - in those we normally
* don't need any ZONE_NORMAL reservation
*/
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
256,
#endif
#ifdef CONFIG_ZONE_DMA32
256,
#endif
#ifdef CONFIG_HIGHMEM
32,
#endif
32,
};
If you do not want a lower zone to be used by higher-class allocations at all, set its ratio to 1: since the reserve is size / ratio, a ratio of 1 reserves the full amount and gives the strongest protection.
The way the value is computed looks odd, though: for an allocation falling back from NORMAL, lowmem_reserve[NORMAL] in the DMA zone is normal_size / ratio, i.e. it uses the Normal zone's size rather than the DMA zone's, which I never quite figured out. One plausible reading is that the reserve is meant to scale with the amount of memory above DMA that could spill down into it, not with DMA's own size.
The lowmem_reserve array
When a user changes the lowmem reserve via /proc/sys/vm/lowmem_reserve_ratio, the kernel calls the setup_per_zone_lowmem_reserve function:
5637 for_each_online_pgdat(pgdat) {
5638 for (j = 0; j < MAX_NR_ZONES; j++) {
5639 struct zone *zone = pgdat->node_zones + j;
5640 unsigned long managed_pages = zone->managed_pages;
5641
5642 zone->lowmem_reserve[j] = 0;
5643
5644 idx = j;
5645 while (idx) {
5646 struct zone *lower_zone;
5647
5648 idx--;
5649
5650 if (sysctl_lowmem_reserve_ratio[idx] < 1)
5651 sysctl_lowmem_reserve_ratio[idx] = 1;
5652
5653 lower_zone = pgdat->node_zones + idx;
5654 lower_zone->lowmem_reserve[j] = managed_pages /
5655 sysctl_lowmem_reserve_ratio[idx];
5656 managed_pages += lower_zone->managed_pages;
5657 }
5658 }
5659 }
5660
5661 /* update totalreserve_pages */
5662 calculate_totalreserve_pages();
Assume our system contains only two zones: a Normal zone and a Highmem zone.
Line 5642 sets zone->lowmem_reserve[j] to 0; that is, normal_zone->lowmem_reserve[0] and highmem_zone->lowmem_reserve[1] become 0.
In other words, the Normal zone places no restriction on NORMAL-class requests, and the Highmem zone places no restriction on HIGHMEM-class requests.
The inner loop at lines 5645~5657 sets lowmem_reserve in each lower zone based on the pages managed by the current zone; note line 5656, which accumulates the managed pages of every zone walked so far, so the reserve in a still-lower zone covers all the zones above it.
In addition, /Documentation/sysctl/vm.txt in newer kernel source trees gives a very precise description of lowmem_reserve.