Implementation of the swap mechanism in Linux 1.0.0
Which pages can actually be swapped out?
To answer this, go back to an early version of Linux: the implementation of try_to_swap_out in Linux 1.0.0 offers some clues.
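The code shows that a page marked with the RESERVED flag is never swapped out. When does a page get marked RESERVED? Linux 1.0.0 itself has examples, for instance when a module needs a block of reserved memory.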
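Reading further down the code: high memory cannot be swapped either, a page that has just been accessed cannot be swapped, and a page that has only recently entered the LRU cache cannot be swapped. Everything else can be swapped, including the stack and the heap. The code segment is the exception: by the logic of this function a code page is never dirty, so the kernel simply sets its page table entry to 0, which triggers a page fault on the next access. The page fault handler then calls do_no_page, which reads the contents back from the file to fill the page again.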
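The decision order described above can be summarized in a small user-space model. This is only a sketch of the prose, not the Linux 1.0.0 source: the struct page_state fields, the enum swap_action values and the function decide_swap_out are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; these field names are illustrative only. */
struct page_state {
    bool reserved;        /* page carries the RESERVED flag */
    bool high_memory;     /* page lies in high memory */
    bool accessed;        /* page was touched recently */
    bool recently_cached; /* page only just entered the LRU cache */
    bool dirty;           /* page was modified since it was loaded */
};

enum swap_action {
    KEEP_PAGE,        /* leave the page where it is */
    WRITE_TO_SWAP,    /* dirty page: write it to the swap area */
    DROP_AND_REFAULT  /* clean page: zero the PTE, reload from file on fault */
};

/* Apply the checks in the order the text lists them. */
enum swap_action decide_swap_out(const struct page_state *p)
{
    assert(p != NULL);
    if (p->reserved || p->high_memory)
        return KEEP_PAGE;
    if (p->accessed || p->recently_cached)
        return KEEP_PAGE;
    if (p->dirty)
        return WRITE_TO_SWAP;
    return DROP_AND_REFAULT;
}
```

In this model a clean code page falls through to DROP_AND_REFAULT, while a dirty heap or stack page ends in WRITE_TO_SWAP, which matches the two outcomes the text distinguishes.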
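Both swap-in and re-reading a page from its file happen in do_no_page, so how does do_no_page tell the two cases apart? On swap-out, the kernel writes the ID of the swap device (there may be several swap partitions or swap files) into the page table entry. In the non-swap case (for example, when DIRTY was not set) the entry is 0. do_no_page uses this value to decide whether the current page fault should be served from the file or from the swap area.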
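Reading from the file goes through the nopage handler. A page that is swapped back in has the DIRTY attribute restored in its page table entry. The reason is that an anonymous page has no backing file: all of its data is produced in memory at run time, so there is no "original state" to compare against and its contents always count as dirty.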
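To make the distinction concrete, here is a minimal user-space sketch of how a non-present page table entry could be classified. The bit layout (present flag in bit 0, a 7-bit swap device index starting at bit 1, the swap offset starting at bit 12) is an assumption chosen for illustration, and all names below (make_swap_pte, classify_fault, and so on) are hypothetical rather than kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative PTE layout; not taken from the kernel source. */
#define PTE_PRESENT       0x1u   /* bit 0: page is in memory */
#define SWAP_TYPE_SHIFT   1      /* swap device index starts at bit 1 */
#define SWAP_TYPE_MASK    0x7fu  /* 7 bits of device index */
#define SWAP_OFFSET_SHIFT 12     /* swap slot number starts at bit 12 */

enum fault_source { FAULT_NONE, FAULT_FROM_FILE, FAULT_FROM_SWAP };

/* Build the value stored in the PTE when a page is swapped out. */
uint32_t make_swap_pte(unsigned type, uint32_t offset)
{
    assert(type <= SWAP_TYPE_MASK);
    /* Offset 0 is excluded so a swap entry can never be all zeroes. */
    assert(offset != 0 && offset < (1u << 20));
    return ((uint32_t)type << SWAP_TYPE_SHIFT) | (offset << SWAP_OFFSET_SHIFT);
}

unsigned swap_type(uint32_t pte)
{
    return (pte >> SWAP_TYPE_SHIFT) & SWAP_TYPE_MASK;
}

uint32_t swap_offset(uint32_t pte)
{
    return pte >> SWAP_OFFSET_SHIFT;
}

/* Decide where a faulting page has to come from. */
enum fault_source classify_fault(uint32_t pte)
{
    if (pte & PTE_PRESENT)
        return FAULT_NONE;       /* page is resident, nothing to load */
    if (pte == 0)
        return FAULT_FROM_FILE;  /* empty entry: go through the nopage handler */
    return FAULT_FROM_SWAP;      /* non-zero entry: it names a swap slot */
}
```

For example, make_swap_pte(1, 5) yields a non-zero, non-present entry that classify_fault reports as FAULT_FROM_SWAP, while an all-zero entry is reported as FAULT_FROM_FILE.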
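Page reclaim means freeing pages that have already been allocated and handing them back to the buddy system for later allocation requests. Several kinds of pages can be reclaimed: 1. file-mapped pages in a process; 2. anonymous pages in a process; 3. disk cache pages. Pages dynamically allocated inside the kernel cannot be reclaimed.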
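Reclaimable pages fall into two broad classes. The first consists of file-mapped pages and disk cache pages: these have a backing store, so reclaiming such a page only requires writing it back to that store (if it is dirty) before freeing it. The second is anonymous pages: the system needs a swap partition, and a page must be written to it before the page can be freed. In other words, only pages that have no corresponding file can be swapped out.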
Addendum: only pages that have no corresponding file can be swapped out through the swap partition.
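The classification above can be written down as a small lookup. This is a sketch of the categories named in the text, not kernel code; the enum values and the functions reclaim_method_for and needs_swap_area are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* The four page categories named in the text (illustrative names). */
enum page_kind { FILE_MAPPED, ANONYMOUS, DISK_CACHE, KERNEL_DYNAMIC };

enum reclaim_method {
    NOT_RECLAIMABLE,     /* kernel dynamic allocations */
    WRITE_BACK_TO_FILE,  /* page has a backing store */
    WRITE_TO_SWAP_AREA   /* page has no file behind it */
};

/* Map each page category to the way it can be reclaimed. */
enum reclaim_method reclaim_method_for(enum page_kind kind)
{
    switch (kind) {
    case FILE_MAPPED:
    case DISK_CACHE:
        return WRITE_BACK_TO_FILE;
    case ANONYMOUS:
        return WRITE_TO_SWAP_AREA;
    case KERNEL_DYNAMIC:
        return NOT_RECLAIMABLE;
    }
    assert(!"unknown page kind");
    return NOT_RECLAIMABLE;
}

/* True only for pages that must go through the swap area. */
bool needs_swap_area(enum page_kind kind)
{
    return reclaim_method_for(kind) == WRITE_TO_SWAP_AREA;
}
```

Only ANONYMOUS makes needs_swap_area return true, which is the point of the addendum.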
To test the swap partition, first enable it:
caozilong@caozilong-Vostro-3268:~/Workspace$ sudo swapon -a
caozilong@caozilong-Vostro-3268:~/Workspace$ sudo swapon -v
NAME TYPE SIZE USED PRIO
/dev/sdb7 partition 6.7G 711.5M -2
caozilong@caozilong-Vostro-3268:~/Workspace$
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile$ free
              total        used        free      shared  buff/cache   available
Mem:        8058628      766840     6044272      277584     1247516     6721384
Swap:       6972412           0     6972412
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile$
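The same swap totals can also be read programmatically while the test runs. The sketch below assumes the usual /proc/meminfo layout of "Key:   value kB" lines with the SwapTotal and SwapFree fields; the helper names meminfo_field_kb and read_meminfo_kb are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of "key" from meminfo-style text, or -1 if absent. */
long meminfo_field_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* The key must be followed directly by ':' to avoid prefix matches. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;
            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;  /* step past the newline to the next line */
    }
    return -1;
}

/* Read one field from the live /proc/meminfo; -1 on any failure. */
long read_meminfo_kb(const char *key)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_field_kb(buf, key);
}
```

Calling read_meminfo_kb("SwapTotal") and read_meminfo_kb("SwapFree") periodically and subtracting the two gives the amount of swap in use as the allocation loop progresses.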
Next, add a debug print to __swap_writepage.
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile/linux-5.4.129$ git diff
diff --git a/mm/page_io.c b/mm/page_io.c
index bcf27d057..85290948b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -276,6 +276,9 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
int ret;
struct swap_info_struct *sis = page_swap_info(page);
+ printk("%s line %d, comm %s.\n", __func__, __LINE__, current->comm);
+ dump_stack();
+
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
if (sis->flags & SWP_FS) {
struct kiocb kiocb;
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile/linux-5.4.129$
Rebuild the kernel, reboot, and select the new kernel:
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile$ uname -r
5.4.129+
caozilong@caozilong-Vostro-3268:~/Workspace/linux-compile$
Write a user-space program that models a memory leak:
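#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024 * 100) /* 100 MiB per iteration */

int main(void)
{
	char *p = NULL;
	int count = 1;

	while (1) {
		/* Allocate another chunk and never free it (deliberate leak). */
		p = (char *)malloc(CHUNK_SIZE);
		if (!p) {
			printf("malloc error!\n");
			return -1;
		}
		/* Touch every byte so the kernel must back the chunk with real pages. */
		memset(p, 0, CHUNK_SIZE);
		printf("malloc %dM memory\n", 100 * count++);
		usleep(500000); /* pause 0.5 s between allocations */
	}
	return 0;
}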
Run the test program:
caozilong@caozilong-Vostro-3268:~/Workspace$ ./a.out
malloc 100M memory
malloc 200M memory
malloc 300M memory
malloc 400M memory
malloc 500M memory
malloc 600M memory
malloc 700M memory
malloc 800M memory
malloc 900M memory
malloc 1000M memory
malloc 1100M memory
malloc 1200M memory
malloc 1300M memory
malloc 1400M memory
malloc 1500M memory
malloc 1600M memory
malloc 1700M memory
malloc 1800M memory
malloc 1900M memory
malloc 2000M memory
malloc 2100M memory
malloc 2200M memory
malloc 2300M memory
malloc 2400M memory
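At this point the loop has run only a few iterations, so the debug print we added to __swap_writepage has not fired yet. Run the test program again and let it go further.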
caozilong@caozilong-Vostro-3268:~/Workspace$ ./a.out
malloc 100M memory
malloc 200M memory
malloc 300M memory
malloc 400M memory
malloc 500M memory
malloc 600M memory
malloc 700M memory
malloc 800M memory
malloc 900M memory
malloc 1000M memory
malloc 1100M memory
malloc 1200M memory
malloc 1300M memory
malloc 1400M memory
malloc 1500M memory
malloc 1600M memory
malloc 1700M memory
malloc 1800M memory
malloc 1900M memory
malloc 2000M memory
malloc 2100M memory
malloc 2200M memory
malloc 2300M memory
malloc 2400M memory
malloc 2500M memory
malloc 2600M memory
malloc 2700M memory
malloc 2800M memory
malloc 2900M memory
malloc 3000M memory
malloc 3100M memory
malloc 3200M memory
malloc 3300M memory
malloc 3400M memory
malloc 3500M memory
malloc 3600M memory
malloc 3700M memory
malloc 3800M memory
malloc 3900M memory
malloc 4000M memory
malloc 4100M memory
malloc 4200M memory
malloc 4300M memory
malloc 4400M memory
malloc 4500M memory
malloc 4600M memory
malloc 4700M memory
malloc 4800M memory
malloc 4900M memory
malloc 5000M memory
malloc 5100M memory
malloc 5200M memory
malloc 5300M memory
malloc 5400M memory
malloc 5500M memory
malloc 5600M memory
malloc 5700M memory
malloc 5800M memory
malloc 5900M memory
malloc 6000M memory
malloc 6100M memory
malloc 6200M memory
malloc 6300M memory
malloc 6400M memory
malloc 6500M memory
malloc 6600M memory
malloc 6700M memory
malloc 6800M memory
malloc 6900M memory
malloc 7000M memory
malloc 7100M memory
malloc 7200M memory
malloc 7300M memory
malloc 7400M memory
malloc 7500M memory
malloc 7600M memory
malloc 7700M memory
malloc 7800M memory
malloc 7900M memory
malloc 8000M memory
malloc 8100M memory
malloc 8200M memory
malloc 8300M memory
malloc 8400M memory
malloc 8500M memory
malloc 8600M memory
malloc 8700M memory
malloc 8800M memory
malloc 8900M memory
malloc 9000M memory
malloc 9100M memory
malloc 9200M memory
malloc 9300M memory
malloc 9400M memory
malloc 9500M memory
malloc 9600M memory
malloc 9700M memory
malloc 9800M memory
malloc 9900M memory
malloc 10000M memory
malloc 10100M memory
malloc 10200M memory
malloc 10300M memory
malloc 10400M memory
malloc 10500M memory
malloc 10600M memory
malloc 10700M memory
malloc 10800M memory
malloc 10900M memory
malloc 11000M memory
malloc 11100M memory
malloc 11200M memory
malloc 11300M memory
malloc 11400M memory
malloc 11500M memory
malloc 11600M memory
malloc 11700M memory
malloc 11800M memory
malloc 11900M memory
malloc 12000M memory
malloc 12100M memory
malloc 12200M memory
malloc 12300M memory
malloc 12400M memory
malloc 12500M memory
malloc 12600M memory
Killed
caozilong@caozilong-Vostro-3268:~/Workspace$
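As you can see, the OOM killer was triggered successfully.
So what happened inside the kernel?
Debugging shows that __swap_writepage is called from two places. The first is the kswapd0 thread: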
[ 460.690149] CPU: 3 PID: 102 Comm: kswapd0 Not tainted 5.4.129+ #25
[ 460.690149] Hardware name: Dell Inc. Vostro 3268/0TJYKK, BIOS 1.11.1 12/11/2018
[ 460.690150] Call Trace:
[ 460.690156] dump_stack+0x6d/0x8b
[ 460.690159] __swap_writepage+0x61/0x450
[ 460.690162] ? smp_call_function_many+0x1de/0x270
[ 460.690164] ? cpumask_next_and+0x1e/0x20
[ 460.690166] ? smp_call_function_many+0x1de/0x270
[ 460.690168] ? __frontswap_store+0x73/0x100
[ 460.690170] swap_writepage+0x34/0x90
[ 460.690173] pageout.isra.58+0x11d/0x350
[ 460.690175] shrink_page_list+0x9eb/0xbb0
[ 460.690177] shrink_inactive_list+0x204/0x3d0
[ 460.690179] shrink_node_memcg+0x3b4/0x820
[ 460.690182] shrink_node+0xb5/0x410
[ 460.690183] ? shrink_node+0xb5/0x410
[ 460.690185] balance_pgdat+0x293/0x5f0
[ 460.690188] kswapd+0x156/0x3c0
[ 460.690190] ? wait_woken+0x80/0x80
[ 460.690192] kthread+0x121/0x140
[ 460.690194] ? balance_pgdat+0x5f0/0x5f0
[ 460.690195] ? kthread_park+0x90/0x90
[ 460.690197] ret_from_fork+0x35/0x40
[ 460.692259] __swap_writepage line 279, comm kswapd0.
The second is the page fault path:
[ 460.721281] CPU: 0 PID: 273 Comm: systemd-journal Not tainted 5.4.129+ #25
[ 460.721282] Hardware name: Dell Inc. Vostro 3268/0TJYKK, BIOS 1.11.1 12/11/2018
[ 460.721282] Call Trace:
[ 460.721285] dump_stack+0x6d/0x8b
[ 460.721287] __swap_writepage+0x61/0x450
[ 460.721289] ? __frontswap_store+0x73/0x100
[ 460.721290] swap_writepage+0x34/0x90
[ 460.721291] pageout.isra.58+0x11d/0x350
[ 460.721293] shrink_page_list+0x9eb/0xbb0
[ 460.721294] shrink_inactive_list+0x204/0x3d0
[ 460.721296] shrink_node_memcg+0x3b4/0x820
[ 460.721298] shrink_node+0xb5/0x410
[ 460.721299] ? shrink_node+0xb5/0x410
[ 460.721300] do_try_to_free_pages+0xcf/0x380
[ 460.721301] try_to_free_pages+0xee/0x1d0
[ 460.721303] __alloc_pages_slowpath+0x417/0xe50
[ 460.721305] __alloc_pages_nodemask+0x2cd/0x320
[ 460.721306] alloc_pages_current+0x6a/0xe0
[ 460.721307] __page_cache_alloc+0x6a/0xa0
[ 460.721309] __do_page_cache_readahead+0xa5/0x190
[ 460.721310] filemap_fault+0x65c/0xb80
[ 460.721312] ? __switch_to+0x2ce/0x490
[ 460.721313] ? __switch_to+0x2ce/0x490
[ 460.721314] ? devkmsg_poll+0x6b/0xa0
[ 460.721316] ? xas_load+0xc/0x80
[ 460.721317] ? xas_find+0x16f/0x1b0
[ 460.721318] ? filemap_map_pages+0x181/0x3b0
[ 460.721320] ext4_filemap_fault+0x31/0x50
[ 460.721322] __do_fault+0x57/0x110
[ 460.721323] __handle_mm_fault+0xdae/0x1290
[ 460.721325] handle_mm_fault+0xcb/0x210
[ 460.721327] __do_page_fault+0x2a1/0x4d0
[ 460.721328] do_page_fault+0x2c/0xe0
[ 460.721329] page_fault+0x34/0x40
[ 460.721330] RIP: 0033:0x7f5ad4dc7a47
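Both call chains converge on shrink_node, the function that performs memory reclaim. In the first trace it is reached from kswapd through balance_pgdat; in the second it is reached from the page allocator's slow path through try_to_free_pages, while a page fault is allocating page cache pages. As free memory keeps shrinking, reclaim eventually ends up calling __swap_writepage.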
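When many such dumps pile up in the kernel log, the two entry points can be told apart mechanically by the function names that appear in the traces above. The following is a minimal sketch for post-processing trace text in user space; the enum reclaim_path and the function classify_reclaim_trace are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum reclaim_path {
    PATH_UNKNOWN,  /* trace does not pass through shrink_node */
    PATH_KSWAPD,   /* background reclaim driven by the kswapd thread */
    PATH_DIRECT    /* direct reclaim from the allocating task itself */
};

/* Classify one call trace (as plain text) by the symbols it contains. */
enum reclaim_path classify_reclaim_trace(const char *trace)
{
    assert(trace != NULL);
    if (strstr(trace, "shrink_node") == NULL)
        return PATH_UNKNOWN;
    /* try_to_free_pages only appears when the allocator reclaims by itself. */
    if (strstr(trace, "try_to_free_pages") != NULL)
        return PATH_DIRECT;
    /* balance_pgdat is the kswapd-side caller of shrink_node. */
    if (strstr(trace, "balance_pgdat") != NULL)
        return PATH_KSWAPD;
    return PATH_UNKNOWN;
}
```

Feeding it the first trace yields PATH_KSWAPD and the second yields PATH_DIRECT, so a count of each over a log gives a rough picture of how much swap-out is done in the background versus in the context of the allocating process.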
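The core function for swap-out is swap_writepage. It is referenced through the swap_aops object, which serves as the a_ops of the address_space structures that are central to the swap subsystem.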
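The indirection through a_ops is an ordinary table of function pointers. The sketch below mimics it in user space with heavily simplified types: the mock_ prefixed names are hypothetical, and the writepage hook here takes only a page, whereas the kernel's hook also receives a writeback control argument.

```c
#include <assert.h>
#include <stddef.h>

struct page;

/* Simplified operations table: one hook instead of the kernel's many. */
struct address_space_operations {
    int (*writepage)(struct page *page);
};

struct address_space {
    const struct address_space_operations *a_ops;
};

struct page {
    struct address_space *mapping; /* which address space owns this page */
    int written_to_swap;           /* set by the mock write-out hook */
};

/* Stand-in for swap_writepage: just record that it was called. */
static int mock_swap_writepage(struct page *page)
{
    page->written_to_swap = 1;
    return 0;
}

/* Stand-in for swap_aops: the table that points at the swap write hook. */
static const struct address_space_operations mock_swap_aops = {
    .writepage = mock_swap_writepage,
};

/* Stand-in for a swap address space that uses the table above. */
struct address_space mock_swapper_space = { .a_ops = &mock_swap_aops };

/* Caller side: reclaim only knows the page and dispatches via a_ops. */
int mock_pageout(struct page *page)
{
    assert(page && page->mapping && page->mapping->a_ops);
    if (!page->mapping->a_ops->writepage)
        return -1;
    return page->mapping->a_ops->writepage(page);
}
```

The caller never names the swap write function directly. This mirrors the traces above, where pageout appears immediately below swap_writepage.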
swap_aops is organized as follows: