当我们在开发内核功能或者验证定位问题时,经常需要模拟各种内核异常场景,来验证程序的健壮性或加速问题的复现,比如内存分配失败、磁盘IO错误、IO超时等等。Linux内核集成了一个比较实用的功能“Fault-injection”来帮助我们进行故障注入,从而构建一些通用的内核异常场景。它能够模拟内存slab分配失败、内存页分配失败、磁盘IO错误、磁盘IO超时、futex锁错误以及专门针对mmc的IO错误,用户也可以基于该机制添加自己需要的故障注入类型。本文主要从内存分配和磁盘IO两个方面介绍如何使用“Fault-injection”注入异常,并详细分析其实现。
内核版本:Linux 4.11.y
实验环境:Rpi 3
Fault-injection概述
故障注入类型
Fault-injection默认实现了6种错误注入方式,分别是failslab、fail_page_alloc、fail_futex、fail_make_request、fail_io_timeout和fail_mmc_request。它们分别的功能如下:
1)failslab
注入slab分配器内存分配错误,主要包括kmalloc()、kmem_cache_alloc()等等。
2)fail_page_alloc
注入内存页分配错误,主要包括alloc_pages()、get_free_pages()等等(较failslab更为底层)。
3)fail_futex
注入futex锁死锁和uaddr错误。
4)fail_make_request
注入磁盘IO错误。它对块核心层的generic_make_request()函数进行故障注入,可以通过/sys/block/<device>/make-it-fail或者/sys/block/<device>/<partition>/make-it-fail接口对指定的磁盘或分区进行注入。
5)fail_io_timeout
注入IO超时错误。它对IO处理流程中的完成函数blk_complete_request()进行故障注入,忽略IO完成“通知”。仅对使用通用超时处理流程的驱动有效,例如标准的scsi流程。
6)fail_mmc_request
注入mmc数据错误,仅对mmc设备有效。它通过让mmc core返回data error来注入错误,从而可以测试mmc块设备驱动的错误处理流程以及重试机制,可通过/sys/kernel/debug/mmcx/fail_mmc_request接口进行配置。
以上6种故障注入类型是内核中已经默认实现了的,用户也可以利用其核心框架依葫芦画瓢,按需自行修改添加。我这里挑选了使用最多的failslab、fail_page_alloc、fail_make_request和fail_io_timeout进行详细分析,其余两种大同小异。
故障注入debugfs配置
Fault-injection提供了内核选项可以开启debugfs控制接口,启停或者调整故障注入配置,主要包括如下一些文件接口:
1)/sys/kernel/debug/fail*/probability:
设置异常发生的比例,百分制。如果觉得最小值1%依然太频繁,可以设置该值为100,然后通过interval来调整异常触发的频率。默认值为0。
2)/sys/kernel/debug/fail*/interval:
设置异常发生的间隔。如果需要按固定间隔触发,则将其设置为大于1的值,并将probability设置为100。默认值为1。
3)/sys/kernel/debug/fail*/times:
设置异常发生的最大次数,超过该次数后将不会再发生异常了,设置为-1表示不设限。默认值为1。
4)/sys/kernel/debug/fail*/space:
设置异常的size余量。每次执行到故障注入点时,都会从space中递减本次的size值,直到space降为0后才会注入异常。其中size的含义对各种异常各不相同:对于IO异常表示本次IO的字节数,对于内存分配表示分配的内存大小。默认值为0。
5)/sys/kernel/debug/fail*/verbose:
格式:{ 0 | 1 | 2 }
设置异常触发后的内核打印信息输出方式。0表示不输出日志信息;1表示输出以“FAULT_INJECTION”开头的最基本信息,包括触发的类型、间隔、频率等等;2表示会追加backtrace的输出(这点对问题的定位很有用)。默认值为2。
6)/sys/kernel/debug/fail*/verbose_ratelimit_interval_ms、/sys/kernel/debug/fail*/verbose_ratelimit_burst
用于控制日志输出ratelimit的interval和burst这两个参数,可以用来调节日志输出的频率,若太过频繁会丢掉一些输出,默认值分别为0和10。
7)/sys/kernel/debug/fail*/task-filter:
格式:{ 'Y' | 'N' }
设置进程过滤。N表示不过滤;Y表示只对显式启用了make-it-fail的进程(通过向/proc/<pid>/make-it-fail写入1进行设置)注入故障,并且不在中断上下文中注入。默认值为N。
8)/sys/kernel/debug/fail*/require-start、 /sys/kernel/debug/fail*/require-end、 /sys/kernel/debug/fail*/reject-start、 /sys/kernel/debug/fail*/reject-end:
设置调用流程的虚拟地址空间过滤。只有当调用栈涉及的代码段(Text段)地址包含在[require-start, require-end)且不包含在[reject-start, reject-end)中时才注入异常,可以用来将故障注入限定到某个或某些模块。默认require范围为[0, ULONG_MAX)(即整个虚拟地址空间),reject范围为[0, 0)。
9)/sys/kernel/debug/fail*/stacktrace-depth:
设置[require-start, require-end)和[reject-start, reject-end)地址过滤时回溯调用栈的深度。默认值为32。
10)/sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem:
格式:{ 'Y' | 'N' }
设置页分配的高端内存过滤。设置为Y后,对包含__GFP_HIGHMEM的内存分配不启用故障注入。默认值为N。
11)/sys/kernel/debug/failslab/ignore-gfp-wait、 /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait
格式:{ 'Y' | 'N' }
设置内存分配的分配模式过滤。设置为Y后只对不可睡眠的内存分配(如GFP_ATOMIC)启用故障注入。默认值为N。
12)/sys/kernel/debug/fail_page_alloc/min-order:
设置页分配order的过滤限制,当分配页的order小于该设定值时不进行故障注入。默认值为1。
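这些参数可以组合使用,下面给出一个组合配置的示意(以failslab为例):每执行到注入点50次触发一次异常,最多触发100次,并输出完整的backtrace:
[root@centos-rpi3 failslab]# echo 50 > interval
[root@centos-rpi3 failslab]# echo 100 > probability
[root@centos-rpi3 failslab]# echo 100 > times
[root@centos-rpi3 failslab]# echo 2 > verbose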
故障注入启动参数配置
前文中提到的debugfs接口只在debugfs启用后才有效。对于内核启动阶段或没有开启debugfs配置选项的情况,Fault-injection的默认配置值可以通过启动参数传递,包括以下几个:
failslab=
fail_page_alloc=
fail_make_request=
fail_futex=
mmc_core.fail_request=<interval>,<probability>,<space>,<times>
通过启动参数传入的参数有限,目前只能接受interval、probability、space和times这4个参数(其他参数会被内核设置为默认的值),但是在一般情况下也够用了。
例如:如果想在内核启动阶段就启用failslab 100%无限故障注入,则可以传入内核启动参数:
failslab=1,100,0,-1
Fault-injection使用
配置内核选项
Fault-injection功能主要涉及以下几个内核配置选项,每一种注入模式一个配置选项,可按需开启:
CONFIG_FAULT_INJECTION:功能总开关
CONFIG_FAILSLAB:failslab故障注入功能配置
CONFIG_FAIL_PAGE_ALLOC:fail_page_alloc故障注入功能配置
CONFIG_FAIL_MAKE_REQUEST:fail_make_request故障注入功能配置
CONFIG_FAIL_IO_TIMEOUT:fail_io_timeout故障注入功能配置
CONFIG_FAIL_MMC_REQUEST:fail_mmc_request故障注入功能配置
CONFIG_FAIL_FUTEX:fail_futex故障注入功能配置
CONFIG_FAULT_INJECTION_DEBUG_FS:debugfs接口启用
这里我只介绍内存和IO相关的4个故障注入功能,因此需要开启CONFIG_FAULT_INJECTION、CONFIG_FAILSLAB、CONFIG_FAIL_PAGE_ALLOC、CONFIG_FAIL_IO_TIMEOUT和CONFIG_FAIL_MAKE_REQUEST这5个内核配置选项,与此同时,为了操作的方便,也设置CONFIG_FAULT_INJECTION_DEBUG_FS选项开启debugfs动态配置功能,然后重新编译安装内核。
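编译前可在.config中确认如下选项已经开启(示意片段;其中CONFIG_FAULT_INJECTION_STACKTRACE_FILTER为可选项,开启后才能使用前文提到的require/reject地址过滤):
CONFIG_FAULT_INJECTION=y
CONFIG_FAILSLAB=y
CONFIG_FAIL_PAGE_ALLOC=y
CONFIG_FAIL_MAKE_REQUEST=y
CONFIG_FAIL_IO_TIMEOUT=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y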
fail_make_request使用
进入debugfs的挂载点,可以看到出现了以下几个目录:
[root@centos-rpi3 debug]# ls | grep fail
fail_futex
fail_io_timeout
fail_make_request
fail_page_alloc
failslab
从名字就可以看出它们分别用于配置哪类故障注入,在fail_make_request目录下则有以下配置参数:
[root@centos-rpi3 fail_make_request]# ls
interval reject-start space times verbose_ratelimit_interval_ms
probability require-end stacktrace-depth verbose
reject-end require-start task-filter verbose_ratelimit_burst
这些配置参数前文中已经介绍过了,这里以100%无上限触发make request错误为例进行演示:
[root@centos-rpi3 fail_make_request]# echo 1 > interval
[root@centos-rpi3 fail_make_request]# echo -1 > times
[root@centos-rpi3 fail_make_request]# echo 100 > probability
这里触发比率设置为100%,无触发上限,其他参数无需修改使用默认值即可,这样fail_make_request的参数就算配置完成了,下面再打开设备级的开关:
在磁盘块设备及其分区的sysfs目录下都有一个make-it-fail文件,例如在我树莓派的sda和mmcblk1下:
[root@centos-rpi3 block]# find -name make-it-fail
./sda/sda2/make-it-fail
./sda/make-it-fail
./sda/sda1/make-it-fail
[root@centos-rpi3 mmcblk1]# find -name make-it-fail
./mmcblk1p3/make-it-fail
./make-it-fail
./mmcblk1p1/make-it-fail
./mmcblk1p4/make-it-fail
./mmcblk1p2/make-it-fail
这个make-it-fail文件就是对相应块设备的故障注入开关,对该文件写入1以后对该设备就正式启用故障注入了:
[root@centos-rpi3 sda]# echo 1 > make-it-fail
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct
[13744.902281] FAULT_INJECTION: forcing a failure.
[13744.902281] name fail_make_request, interval 1, probability 100, space 0, times -1
[13744.922972] CPU: 2 PID: 1649 Comm: dd Not tainted 4.11.0-v7+ #1
[13744.933280] Hardware name: BCM2835
[13744.941091] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[13744.957492] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[13744.973606] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[13744.989915] [<80490414>] (should_fail) from [<80433310>] (should_fail_request+0x28/0x30)
[13745.006664] [<80433310>] (should_fail_request) from [<804334ac>] (generic_make_request_checks+0xe4/0x668)
[13745.025003] [<804334ac>] (generic_make_request_checks) from [<80435868>] (generic_make_request+0x20/0x228)
[13745.043590] [<80435868>] (generic_make_request) from [<80435b18>] (submit_bio+0xa8/0x194)
[13745.060677] [<80435b18>] (submit_bio) from [<802b7cac>] (__blkdev_direct_IO_simple+0x158/0x2e0)
[13745.078294] [<802b7cac>] (__blkdev_direct_IO_simple) from [<802b8224>] (blkdev_direct_IO+0x3c4/0x400)
[13745.096168] [<802b8224>] (blkdev_direct_IO) from [<8021520c>] (generic_file_direct_write+0xac/0x1c0)
[13745.113872] [<8021520c>] (generic_file_direct_write) from [<802153e0>] (__generic_file_write_iter+0xc0/0x204)
[13745.132466] [<802153e0>] (__generic_file_write_iter) from [<802b8e50>] (blkdev_write_iter+0xb0/0x130)
[13745.150240] [<802b8e50>] (blkdev_write_iter) from [<80279d2c>] (__vfs_write+0xd4/0x124)
[13745.166717] [<80279d2c>] (__vfs_write) from [<8027b6d4>] (vfs_write+0xb0/0x1c4)
[13745.182487] [<8027b6d4>] (vfs_write) from [<8027ba40>] (SyS_write+0x4c/0x98)
[13745.198050] [<8027ba40>] (SyS_write) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)
可以看到,对sda2设备的fail_make_request异常已经被成功注入了。此时如果再去挂载该设备上的ext4文件系统,也会因IO错误而失败:
[root@centos-rpi3 sda]# mount /dev/sda1 /mnt/
mount: /dev/sda1: can't read superblock
fail_io_timeout使用
fail_io_timeout故障注入的用法同fail_make_request类似,在debugfs挂载点的fail_io_timeout目录下存在同样的几个配置文件,现按同样的方式进行配置:
[root@centos-rpi3 fail_io_timeout]# echo 1 > interval
[root@centos-rpi3 fail_io_timeout]# echo -1 > times
[root@centos-rpi3 fail_io_timeout]# echo 100 > probability
配置完成后,同样需要对块设备启用,启用的接口为/sys/block/sdx/io-timeout-fail,注意该异常只能对磁盘块设备(struct gendisk)注入而无法对分区注入。
[root@centos-rpi3 sda]# echo 1 > io-timeout-fail
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct
[15198.056490] FAULT_INJECTION: forcing a failure.
[15198.056490] name fail_io_timeout, interval 1, probability 100, space 0, times -1
[15198.081768] CPU: 0 PID: 1405 Comm: usb-storage Not tainted 4.11.0-v7+ #1
[15198.097541] Hardware name: BCM2835
[15198.105454] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[15198.122090] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[15198.138443] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[15198.155013] [<80490414>] (should_fail) from [<8043edbc>] (blk_should_fake_timeout+0x30/0x38)
[15198.172426] [<8043edbc>] (blk_should_fake_timeout) from [<8043ed68>] (blk_complete_request+0x20/0x44)
[15198.190450] [<8043ed68>] (blk_complete_request) from [<8053e2c4>] (scsi_done+0x24/0x98)
[15198.207148] [<8053e2c4>] (scsi_done) from [<805a5b34>] (usb_stor_control_thread+0x130/0x28c)
[15198.224387] [<805a5b34>] (usb_stor_control_thread) from [<8013c0b0>] (kthread+0x12c/0x168)
[15198.241451] [<8013c0b0>] (kthread) from [<80108268>] (ret_from_fork+0x14/0x2c)
由于complete完成回调被忽略,一般的IO会超时并重试,而dd命令会在下面打开块设备的同步调用流程中一直阻塞等待,从而进入D状态并引起hungtask:
[15235.156646] INFO: task dd:1738 blocked for more than 120 seconds.
[15235.167371] Not tainted 4.11.0-v7+ #1
[15235.176039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15235.393326] dd D 0 1738 1371 0x00000000
[15235.403207] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[15235.418950] [<8072447c>] (schedule) from [<80727bdc>] (schedule_timeout+0x1f8/0x338)
[15235.435528] [<80727bdc>] (schedule_timeout) from [<80724f44>] (wait_for_common+0xe8/0x190)
[15235.452572] [<80724f44>] (wait_for_common) from [<8072500c>] (wait_for_completion+0x20/0x24)
[15235.469657] [<8072500c>] (wait_for_completion) from [<80134cac>] (flush_work+0x11c/0x1a0)
[15235.486675] [<80134cac>] (flush_work) from [<801369a0>] (__cancel_work_timer+0x138/0x208)
[15235.503478] [<801369a0>] (__cancel_work_timer) from [<80136a8c>] (cancel_delayed_work_sync+0x1c/0x20)
[15235.521301] [<80136a8c>] (cancel_delayed_work_sync) from [<8044ad58>] (disk_block_events+0x74/0x78)
[15235.538733] [<8044ad58>] (disk_block_events) from [<802b9838>] (__blkdev_get+0x108/0x430)
[15235.555331] [<802b9838>] (__blkdev_get) from [<802b9cfc>] (blkdev_get+0x19c/0x310)
[15235.571404] [<802b9cfc>] (blkdev_get) from [<802ba40c>] (blkdev_open+0x7c/0x88)
[15235.587364] [<802ba40c>] (blkdev_open) from [<80276f08>] (do_dentry_open+0x100/0x30c)
[15235.603973] [<80276f08>] (do_dentry_open) from [<80278530>] (vfs_open+0x60/0x8c)
[15235.620265] [<80278530>] (vfs_open) from [<8028929c>] (path_openat+0x410/0xef4)
[15235.636666] [<8028929c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[15235.653430] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[15235.670269] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[15235.686748] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)
若是对挂载的ext4文件系统进行touch操作,会出现以下hungtask现象:
[ 1964.356864] INFO: task touch:1363 blocked for more than 120 seconds.
[ 1964.367450] Not tainted 4.11.0-v7+ #1
[ 1964.375846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1964.392079] touch D 0 1363 1078 0x00000000
[ 1964.401834] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[ 1964.417429] [<8072447c>] (schedule) from [<80149104>] (io_schedule+0x20/0x40)
[ 1964.432969] [<80149104>] (io_schedule) from [<80724998>] (bit_wait_io+0x1c/0x64)
[ 1964.448858] [<80724998>] (bit_wait_io) from [<80724d08>] (__wait_on_bit+0x94/0xcc)
[ 1964.465131] [<80724d08>] (__wait_on_bit) from [<80724e50>] (out_of_line_wait_on_bit+0x78/0x84)
[ 1964.482734] [<80724e50>] (out_of_line_wait_on_bit) from [<802b1dcc>] (__wait_on_buffer+0x3c/0x44)
[ 1964.500710] [<802b1dcc>] (__wait_on_buffer) from [<80308a44>] (ext4_read_inode_bitmap+0x6b0/0x758)
[ 1964.518775] [<80308a44>] (ext4_read_inode_bitmap) from [<803095b0>] (__ext4_new_inode+0x470/0x15dc)
[ 1964.536764] [<803095b0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[ 1964.553559] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[ 1964.570016] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[ 1964.586650] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[ 1964.603310] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[ 1964.619605] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)
关闭故障注入后,hungtask可恢复。
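关闭的方法与开启对称,将probability清零并复位设备开关即可(示意,沿用前文的接口路径):
[root@centos-rpi3 fail_io_timeout]# echo 0 > probability
[root@centos-rpi3 sda]# echo 0 > io-timeout-fail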
内存分配failslab使用
在debugfs挂载点的failslab目录下也同样有类似的几个配置文件,只是多了两个特有的配置:ignore-gfp-wait和cache-filter。前者是过滤__GFP_RECLAIM类型内存分配的开关,后者用于只对用户指定的slab缓存进行注入,避免一启用后整个系统立即大量报错而无法继续操作。
[root@centos-rpi3 failslab]# ls
cache-filter probability require-end stacktrace-depth verbose
ignore-gfp-wait reject-end require-start task-filter verbose_ratelimit_burst
interval reject-start space times verbose_ratelimit_interval_ms
用户可以在/sys/kernel/slab/xxx/failslab中配置要进行注入的kmem_cache类型:
[root@centos-rpi3 slab]# ls /sys/kernel/slab
:at-0000016 :t-0001024 dentry inotify_inode_mark nsproxy
:at-0000024 :t-0001536 dio ip4-frags pid
:at-0000032 :t-0002048 discard_cmd ip_dst_cache pid_namespace
:at-0000040 :t-0003072 discard_entry ip_fib_alias pool_workqueue
:at-0000048 :t-0004032 dmaengine-unmap-2 ip_fib_trie posix_timers_cache
:at-0000064 :t-0004096 dnotify_mark ip_mrt_cache proc_inode_cache
:at-0000072 :t-0008192 dnotify_struct jbd2_inode radix_tree_node
:at-0000104 :tA-0000032 dquot jbd2_journal_handle request_queue
:at-0000112 :tA-0000064 eventpoll_epi jbd2_journal_head request_sock_TCP
:at-0000184 :tA-0000088 eventpoll_pwq jbd2_revoke_record_s rpc_buffers
:at-0000192 :tA-0000128 ext4_allocation_context jbd2_revoke_table_s rpc_inode_cache
:atA-0000136 :tA-0000256 ext4_extent_status jbd2_transaction_s rpc_tasks
:atA-0000528 :tA-0000448 ext4_free_data kernfs_node_cache scsi_data_buffer
:t-0000024 :tA-0000704 ext4_groupinfo_4k key_jar scsi_sense_cache
:t-0000032 :tA-0003776 ext4_inode_cache kioctx sd_ext_cdb
:t-0000040 PING ext4_io_end kmalloc-1024 secpath_cache
:t-0000048 RAW ext4_prealloc_space kmalloc-128 sgpool-128
:t-0000056 TCP ext4_system_zone kmalloc-192 sgpool-16
:t-0000064 UDP f2fs_extent_node kmalloc-2048 sgpool-32
:t-0000080 UDP-Lite f2fs_extent_tree kmalloc-256 sgpool-64
:t-0000088 UNIX f2fs_ino_entry kmalloc-4096 sgpool-8
:t-0000112 aio_kiocb f2fs_inode_cache kmalloc-512 shmem_inode_cache
:t-0000120 anon_vma f2fs_inode_entry kmalloc-64 sighand_cache
:t-0000128 anon_vma_chain fanotify_event_info kmalloc-8192 signal_cache
:t-0000144 bdev_cache fasync_cache kmem_cache sigqueue
:t-0000152 bio-0 fat_cache kmem_cache_node sit_entry_set
:t-0000176 bio-1 fat_inode_cache mbcache skbuff_fclone_cache
:t-0000192 biovec-128 file_lock_cache mm_struct skbuff_head_cache
:t-0000208 biovec-16 file_lock_ctx mnt_cache sock_inode_cache
:t-0000256 biovec-256 files_cache mqueue_inode_cache task_delay_info
:t-0000320 biovec-64 filp names_cache task_group
:t-0000328 blkdev_ioc flow_cache nat_entry task_struct
:t-0000344 blkdev_requests free_nid nat_entry_set taskstats
:t-0000384 bsg_cmd fs_cache net_namespace tcp_bind_bucket
:t-0000448 buffer_head fscache_cookie_jar nfs_commit_data trace_event_file
:t-0000512 cachefiles_object_jar fsnotify_mark nfs_direct_cache tw_sock_TCP
:t-0000576 cfq_io_cq ftrace_event_field nfs_inode_cache uid_cache
:t-0000704 cfq_queue inet_peer_cache nfs_page user_namespace
:t-0000768 configfs_dir_cache inmem_page_entry nfs_read_data vm_area_struct
:t-0000904 cred_jar inode_cache nfs_write_data xfrm_dst_cache
这里:t-0000xxx形式的条目多为指定对象大小的通用slab缓存(例如kmalloc()使用的kmalloc-xxx),不少名字实际是指向它们的符号链接:
[root@centos-rpi3 slab]# ll kmalloc-1024
lrwxrwxrwx 1 root root 0 May 21 09:26 kmalloc-1024 -> :t-0001024
下面以ext4_inode_cache为例进行故障注入:
[root@centos-rpi3 failslab]# echo -1 > times
[root@centos-rpi3 failslab]# echo 100 > probability
[root@centos-rpi3 failslab]# echo 1 > cache-filter
[root@centos-rpi3 failslab]# echo 1 > /sys/kernel/slab/ext4_inode_cache/failslab
[root@centos-rpi3 failslab]# echo N > ignore-gfp-wait
启用以后可以在ext4文件系统中执行创建文件等命令,会打印如下故障注入信息:
[ 157.633204] FAULT_INJECTION: forcing a failure.
[ 157.633204] name failslab, interval 1, probability 100, space 0, times -1
[ 157.659029] CPU: 1 PID: 379 Comm: in:imjournal Not tainted 4.11.0-v7+ #1
[ 157.675420] Hardware name: BCM2835
[ 157.683660] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[ 157.701176] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[ 157.718337] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[ 157.735689] [<80490414>] (should_fail) from [<8026950c>] (should_failslab+0x60/0x8c)
[ 157.753309] [<8026950c>] (should_failslab) from [<80266120>] (kmem_cache_alloc+0x44/0x230)
[ 157.771433] [<80266120>] (kmem_cache_alloc) from [<80323390>] (ext4_alloc_inode+0x24/0x104)
[ 157.789647] [<80323390>] (ext4_alloc_inode) from [<80296254>] (alloc_inode+0x2c/0xb0)
[ 157.807213] [<80296254>] (alloc_inode) from [<80297d90>] (new_inode_pseudo+0x18/0x5c)
[ 157.824793] [<80297d90>] (new_inode_pseudo) from [<80297df0>] (new_inode+0x1c/0x30)
[ 157.842200] [<80297df0>] (new_inode) from [<803091d0>] (__ext4_new_inode+0x90/0x15dc)
[ 157.859751] [<803091d0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[ 157.877373] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[ 157.894604] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[ 157.911620] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[ 157.928965] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[ 157.945881] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)
内存分配fail_page_alloc使用
在debugfs挂载点的fail_page_alloc目录下也同样有类似的几个配置文件,只是fail_page_alloc多了3个特有的配置:min-order、ignore-gfp-wait和ignore-gfp-highmem。ignore-gfp-wait是过滤__GFP_DIRECT_RECLAIM类型内存分配的开关,ignore-gfp-highmem是过滤高端内存分配__GFP_HIGHMEM的开关,min-order是对故障注入最小分配order的过滤器,只有order不小于该参数的分配才会注入故障。
[root@centos-rpi3 fail_page_alloc]# ls
ignore-gfp-highmem probability require-start times
ignore-gfp-wait reject-end space verbose
interval reject-start stacktrace-depth verbose_ratelimit_burst
min-order require-end task-filter verbose_ratelimit_interval_ms
[root@centos-rpi3 fail_page_alloc]# echo 2 > times
[root@centos-rpi3 fail_page_alloc]# echo 100 > probability
[18950.321696] FAULT_INJECTION: forcing a failure.
[18950.321696] name fail_page_alloc, interval 1, probability 100, space 0, times -1
[18950.439402] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.11.0-v7+ #1
[18950.516300] Hardware name: BCM2835
[18950.553611] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[18950.628930] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[18950.703954] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[18950.780256] [<80490414>] (should_fail) from [<8021d208>] (__alloc_pages_nodemask+0xc0/0xf88)
[18950.858001] [<8021d208>] (__alloc_pages_nodemask) from [<8021e210>] (page_frag_alloc+0x68/0x188)
[18950.937115] [<8021e210>] (page_frag_alloc) from [<80625774>] (__netdev_alloc_skb+0xb0/0x154)
[18951.016021] [<80625774>] (__netdev_alloc_skb) from [<8055fe0c>] (rx_submit+0x3c/0x20c)
[18951.094485] [<8055fe0c>] (rx_submit) from [<80560450>] (rx_complete+0x1e0/0x204)
[18951.172404] [<80560450>] (rx_complete) from [<80569aa4>] (__usb_hcd_giveback_urb+0x80/0x154)
[18951.251386] [<80569aa4>] (__usb_hcd_giveback_urb) from [<80569cc8>] (usb_hcd_giveback_urb+0x4c/0xf4)
[18951.331268] [<80569cc8>] (usb_hcd_giveback_urb) from [<805931e8>] (completion_tasklet_func+0x6c/0x98)
[18951.411890] [<805931e8>] (completion_tasklet_func) from [<805a1060>] (tasklet_callback+0x20/0x24)
[18951.493310] [<805a1060>] (tasklet_callback) from [<80122f74>] (tasklet_hi_action+0x74/0x108)
[18951.575189] [<80122f74>] (tasklet_hi_action) from [<8010162c>] (__do_softirq+0x134/0x3ac)
[18951.658448] [<8010162c>] (__do_softirq) from [<80122b44>] (irq_exit+0xf8/0x164)
[18951.741056] [<80122b44>] (irq_exit) from [<80175384>] (__handle_domain_irq+0x68/0xc0)
[18951.824219] [<80175384>] (__handle_domain_irq) from [<801014f0>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xb0)
[18951.909282] [<801014f0>] (bcm2836_arm_irqchip_handle_irq) from [<807293fc>] (__irq_svc+0x5c/0x7c)
当然了,内存异常一般都不希望全局生效,而fail_page_alloc又没有类似cache-filter那样更精细的过滤器,因此往往需要用户对生效范围(如某个模块或者某段调用)和概率等进行设置,例如下面这个示意脚本。
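下面是参考内核文档(Documentation/fault-injection/fault-injection.txt)中思路整理的一个示意脚本,把fail_page_alloc限定在指定模块的代码段范围内触发(需要开启CONFIG_FAULT_INJECTION_STACKTRACE_FILTER,模块名由参数$1传入):
#!/bin/bash
# 示意:将fail_page_alloc限定在某个内核模块的代码段内触发
FAILTYPE=fail_page_alloc
MODULE=$1
modprobe $MODULE
# 以模块.text段起始地址作为require-start,以.data段起始地址近似作为require-end
cat /sys/module/$MODULE/sections/.text > /sys/kernel/debug/$FAILTYPE/require-start
cat /sys/module/$MODULE/sections/.data > /sys/kernel/debug/$FAILTYPE/require-end
echo 0 > /sys/kernel/debug/$FAILTYPE/min-order
echo 10 > /sys/kernel/debug/$FAILTYPE/interval
echo 100 > /sys/kernel/debug/$FAILTYPE/probability
echo -1 > /sys/kernel/debug/$FAILTYPE/times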
Fault-injection实现
核心数据结构
/*
* For explanation of the elements of this struct, see
* Documentation/fault-injection/fault-injection.txt
*/
struct fault_attr {
unsigned long probability;
unsigned long interval;
atomic_t times;
atomic_t space;
unsigned long verbose;
bool task_filter;
unsigned long stacktrace_depth;
unsigned long require_start;
unsigned long require_end;
unsigned long reject_start;
unsigned long reject_end;
unsigned long count;
struct ratelimit_state ratelimit_state;
struct dentry *dname;
};
该结构体是fault-injection实现的核心结构体,其中的大多数字段是否都有一种似曾相识的感觉 :) ,其实它们都对应debugfs中的各个配置接口文件。最后的三个字段用于功能实现控制:count用于统计故障注入点的执行次数,ratelimit_state用于日志输出频率控制,dname保存debugfs属性目录的dentry,其名字即故障类型(fail_make_request、failslab等等)。下面来跟踪程序流程,逐个详细分析fail_make_request、fail_io_timeout、failslab和fail_page_alloc的实现。
fail_make_request
static DECLARE_FAULT_ATTR(fail_make_request);
static int __init setup_fail_make_request(char *str)
{
return setup_fault_attr(&fail_make_request, str);
}
__setup("fail_make_request=", setup_fail_make_request);
首先代码中静态定义了一个struct fault_attr结构体实例fail_make_request,用于描述fail_make_request类型的故障注入,DECLARE_FAULT_ATTR是一个宏定义:
#define FAULT_ATTR_INITIALIZER { \
.interval = 1, \
.times = ATOMIC_INIT(1), \
.require_end = ULONG_MAX, \
.stacktrace_depth = 32, \
.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
.verbose = 2, \
.dname = NULL, \
}
#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
这里fail_make_request的一些通用的字段被初始化为默认的值。随后通过这里的__setup宏可知,在内核初始化阶段将处理“fail_make_request=xxx”的启动参数,注册的处理函数为setup_fail_make_request,它进一步调用通用函数setup_fault_attr,对fail_make_request结构体进一步初始化。
/*
* setup_fault_attr() is a helper function for various __setup handlers, so it
* returns 0 on error, because that is what __setup handlers do.
*/
int setup_fault_attr(struct fault_attr *attr, char *str)
{
unsigned long probability;
unsigned long interval;
int times;
int space;
/* "<interval>,<probability>,<space>,<times>" */
if (sscanf(str, "%lu,%lu,%d,%d",
&interval, &probability, &space, &times) < 4) {
printk(KERN_WARNING
"FAULT_INJECTION: failed to parse arguments\n");
return 0;
}
attr->probability = probability;
attr->interval = interval;
atomic_set(&attr->times, times);
atomic_set(&attr->space, space);
return 1;
}
EXPORT_SYMBOL_GPL(setup_fault_attr);
前文中已经介绍了,启动参数的配置只能接收interval、probability、space和times这4个参数,由setup_fault_attr()负责解析并赋值到fail_make_request结构体中。下面来看下debugfs的入口:
static int __init fail_make_request_debugfs(void)
{
struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
NULL, &fail_make_request);
return PTR_ERR_OR_ZERO(dir);
}
late_initcall(fail_make_request_debugfs);
该函数也在内核初始化阶段调用,它会在debugfs的目录下创建一个名为fail_make_request的attr目录,如下:
struct dentry *fault_create_debugfs_attr(const char *name,
struct dentry *parent, struct fault_attr *attr)
{
umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
struct dentry *dir;
dir = debugfs_create_dir(name, parent);
if (!dir)
return ERR_PTR(-ENOMEM);
if (!debugfs_create_ul("probability", mode, dir, &attr->probability))
goto fail;
if (!debugfs_create_ul("interval", mode, dir, &attr->interval))
goto fail;
if (!debugfs_create_atomic_t("times", mode, dir, &attr->times))
goto fail;
if (!debugfs_create_atomic_t("space", mode, dir, &attr->space))
goto fail;
if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose))
goto fail;
if (!debugfs_create_u32("verbose_ratelimit_interval_ms", mode, dir,
&attr->ratelimit_state.interval))
goto fail;
if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
&attr->ratelimit_state.burst))
goto fail;
if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter))
goto fail;
#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
if (!debugfs_create_stacktrace_depth("stacktrace-depth", mode, dir,
&attr->stacktrace_depth))
goto fail;
if (!debugfs_create_ul("require-start", mode, dir,
&attr->require_start))
goto fail;
if (!debugfs_create_ul("require-end", mode, dir, &attr->require_end))
goto fail;
if (!debugfs_create_ul("reject-start", mode, dir, &attr->reject_start))
goto fail;
if (!debugfs_create_ul("reject-end", mode, dir, &attr->reject_end))
goto fail;
#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
attr->dname = dget(dir);
return dir;
fail:
debugfs_remove_recursive(dir);
return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL_GPL(fault_create_debugfs_attr);
首先传入的parent为NULL,所以fail_make_request目录创建在debugfs的根目录下,然后在该目录下依次创建probability、interval、times等等之前看到的属性配置文件,最后将目录的dentry保存到attr->dname字段中。接下来再看一下块设备的开关接口:
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
#endif
...
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct hd_struct *p = dev_to_part(dev);
return sprintf(buf, "%d\n", p->make_it_fail);
}
ssize_t part_fail_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
struct hd_struct *p = dev_to_part(dev);
int i;
if (count > 0 && sscanf(buf, "%d", &i) > 0)
p->make_it_fail = (i == 0) ? 0 : 1;
return count;
}
#endif
基于sysfs的接口,当用户往/sys/block/sdx/make-it-fail写入非0时,对应设备struct hd_struct的make_it_fail字段就被置位为1,开关就打开了,否则设置为0,开关就关闭了。
了解了以上的参数配置接口,下面进入正题,fail_make_request究竟是如何注入故障的?如何判断是否需要注入?
首先来关注以下通用IO提交流程:submit_bio()->generic_make_request()->generic_make_request_checks():
static noinline_for_stack bool
generic_make_request_checks(struct bio *bio)
{
...
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
bio->bi_iter.bi_size))
goto end_io;
...
}
在IO提交流程的generic_make_request_checks()函数中会调用should_fail_request()函数进行故障注入的判断,如果这里返回true(注入),那IO的提交流程就不会继续下去,上层的generic_make_request()函数会返回cookie为BLK_QC_T_NONE,故障得以注入。
static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
{
return part->make_it_fail && should_fail(&fail_make_request, bytes);
}
判断是否注入故障主要取决于should_fail_request()函数,该函数返回true表示需要注入,返回false表示不注入。入参中第一个参数为磁盘设备hd_struct结构体,第二个参数为本次IO的字节数。这里可以看到hd_struct结构体中make_it_fail开关的作用了,另一个“与”条件为通用函数should_fail()的返回结果。should_fail()函数是整个故障注入条件判断的核心,它将根据struct fault_attr结构体中配置的参数进行评估。
/*
* This code is stolen from failmalloc-1.0
* http://www.nongnu.org/failmalloc/
*/
bool should_fail(struct fault_attr *attr, ssize_t size)
{
/* No need to check any other properties if the probability is 0 */
if (attr->probability == 0)
return false;
if (attr->task_filter && !fail_task(attr, current))
return false;
if (atomic_read(&attr->times) == 0)
return false;
if (atomic_read(&attr->space) > size) {
atomic_sub(size, &attr->space);
return false;
}
if (attr->interval > 1) {
attr->count++;
if (attr->count % attr->interval)
return false;
}
if (attr->probability <= prandom_u32() % 100)
return false;
if (!fail_stacktrace(attr))
return false;
fail_dump(attr);
if (atomic_read(&attr->times) != -1)
atomic_dec_not_zero(&attr->times);
return true;
}
EXPORT_SYMBOL_GPL(should_fail);
1)如果设置的probability为0,不注入;
2)如果设置了进程过滤,调用fail_task函数进行判断,如果当前进程的make_it_fail标记没有置位或者处于中断上下文中,不注入;
static bool fail_task(struct fault_attr *attr, struct task_struct *task)
{
return !in_interrupt() && task->make_it_fail;
}
3)如果注入次数已达上限(times递减到0),不注入;
4)如果预留的space余量大于本次的size(对IO而言为字节数),则递减余量,不注入;
5)如果设置的间隔数大于1,则计算调用次数统计,若间隔数未到则不注入;
6)判断注入比率,通过prandom_u32()%100得到一个100以内的随机值,以此实现注入比率;
7)如果配置了CONFIG_FAULT_INJECTION_STACKTRACE_FILTER内核选项,这里会在fail_stacktrace()函数中判断执行流程中调用栈的代码段与配置的[require-start, require-end)、[reject-start, reject-end)区间的对应关系(这个函数给出了获取调用栈地址的方法,是比较有用的工具函数,值得mark一下):
#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
static bool fail_stacktrace(struct fault_attr *attr)
{
struct stack_trace trace;
int depth = attr->stacktrace_depth;
unsigned long entries[MAX_STACK_TRACE_DEPTH];
int n;
bool found = (attr->require_start == 0 && attr->require_end == ULONG_MAX);
if (depth == 0)
return found;
trace.nr_entries = 0;
trace.entries = entries;
trace.max_entries = depth;
trace.skip = 1;
save_stack_trace(&trace);
for (n = 0; n < trace.nr_entries; n++) {
if (attr->reject_start <= entries[n] &&
entries[n] < attr->reject_end)
return false;
if (attr->require_start <= entries[n] &&
entries[n] < attr->require_end)
found = true;
}
return found;
}
#else
static inline bool fail_stacktrace(struct fault_attr *attr)
{
return true;
}
#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
首先,当require范围为默认值[0, ULONG_MAX)时found被初始化为true,若此时stacktrace_depth为0则直接返回,表示通过(可以注入)。否则调用save_stack_trace()向上逐级抓取调用栈,栈的深度由attr->stacktrace_depth给出,最大支持深度为32级,每一级调用函数的地址保存在trace.entries数组中,接下来就逐项判断:先判断[reject_start, reject_end),命中则直接拒绝;再判断[require_start, require_end),命中则置found为true。
回到should_fail()函数中,如果上边的各项判断都顺利通过了,就表示可以注入故障了,不过在注入故障之前,最后要做的就是打印日志信息:
static void fail_dump(struct fault_attr *attr)
{
if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
"name %pd, interval %lu, probability %lu, "
"space %d, times %d\n", attr->dname,
attr->interval, attr->probability,
atomic_read(&attr->space),
atomic_read(&attr->times));
if (attr->verbose > 1)
dump_stack();
}
}
这些打印信息在前文介绍使用时已经看到了。如果verbose为2还会调用dump_stack()打出内核调用栈。
至此fail_make_request类型的故障注入就分析完了,其中最核心的部分也已经分析清楚了,余下的三个故障注入类型也大同小异。
fail_io_timeout
static DECLARE_FAULT_ATTR(fail_io_timeout);
static int __init setup_fail_io_timeout(char *str)
{
return setup_fault_attr(&fail_io_timeout, str);
}
__setup("fail_io_timeout=", setup_fail_io_timeout);
fail_io_timeout的定义也由DECLARE_FAULT_ATTR宏给出,它的启动参数初始化接口由setup_fail_io_timeout()进行处理。
static int __init fail_io_timeout_debugfs(void)
{
struct dentry *dir = fault_create_debugfs_attr("fail_io_timeout",
NULL, &fail_io_timeout);
return PTR_ERR_OR_ZERO(dir);
}
debugfs的接口由fail_io_timeout_debugfs()函数负责在debugfs根目录创建。这两点同之前的fail_make_request是一样的。
int blk_should_fake_timeout(struct request_queue *q)
{
if (!test_bit(QUEUE_FLAG_FAIL_IO, &q->queue_flags))
return 0;
return should_fail(&fail_io_timeout, 1);
}
blk_should_fake_timeout()是判断是否要进行故障注入的接口,它在调用should_fail()之前先检查功能开关,即QUEUE_FLAG_FAIL_IO标记。该标记通过/sys/block/sdx/io-timeout-fail接口设置,对应的sysfs处理函数为:
ssize_t part_timeout_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct gendisk *disk = dev_to_disk(dev);
int set = test_bit(QUEUE_FLAG_FAIL_IO, &disk->queue->queue_flags);
return sprintf(buf, "%d\n", set != 0);
}
ssize_t part_timeout_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct gendisk *disk = dev_to_disk(dev);
int val;
if (count) {
struct request_queue *q = disk->queue;
char *p = (char *) buf;
val = simple_strtoul(p, &p, 10);
spin_lock_irq(q->queue_lock);
if (val)
queue_flag_set(QUEUE_FLAG_FAIL_IO, q);
else
queue_flag_clear(QUEUE_FLAG_FAIL_IO, q);
spin_unlock_irq(q->queue_lock);
}
return count;
}
当用户写入非0后,part_timeout_store函数对相应的磁盘所在的struct request_queue结构体的queue_flags设置QUEUE_FLAG_FAIL_IO标记,对应磁盘的fail_io_timeout故障注入开关也就打开了,否则就复位该标记(关闭该开关)。
blk_should_fake_timeout()函数调用的地方(即故障注入点)一共有2处,分别如下:
1、blk_complete_request
void blk_complete_request(struct request *req)
{
if (unlikely(blk_should_fake_timeout(req->q)))
return;
if (!blk_mark_rq_complete(req))
__blk_complete_request(req);
}
EXPORT_SYMBOL(blk_complete_request);
该函数在底层IO写入完成或出现错误以后会由底层硬件驱动进行回调,正常的执行流程下会调用__blk_complete_request(),然后提交BLOCK_SOFTIRQ类型的软中断:
void __blk_complete_request(struct request *req)
{
...
do_local:
if (list->next == &req->ipi_list)
raise_softirq_irqoff(BLOCK_SOFTIRQ);
...
}
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
{
...
while (!list_empty(&local_list)) {
struct request *rq;
rq = list_entry(local_list.next, struct request, ipi_list);
list_del_init(&rq->ipi_list);
rq->q->softirq_done_fn(rq);
}
}
BLOCK_SOFTIRQ类型的软中断由blk_done_softirq()负责处理,它回调注册到request_queue中softirq_done_fn函数指针的函数,例如对于SCSI设备接下来的流程就是scsi_softirq_done()->scsi_finish_command()->scsi_io_completion()->scsi_end_request()->blk_update_request()->req_bio_endio()->bio_endio()完成本次IO,最后通知上层。
但是如果在blk_complete_request函数中注入故障,主动丢弃complete回调的向上传递,那就会触发request请求超时,调用流程如下:
blk_timeout_work()->blk_rq_check_expired()->blk_rq_timed_out()->scsi_times_out(),由scsi驱动程序进行超时处理,后台工作队列会定期进行IO的重试操作。
2、blk_mq_complete_request
blk层多队列的完成回调函数,在使能内核多队列功能时IO的完成流程会走该分支。
scsi_mq_done()->blk_mq_complete_request()->__blk_mq_complete_request()->blk_mq_end_request()->blk_update_request()
failslab
failslab控制结构体:
static struct {
struct fault_attr attr;
bool ignore_gfp_reclaim;
bool cache_filter;
} failslab = {
.attr = FAULT_ATTR_INITIALIZER,
.ignore_gfp_reclaim = true,
.cache_filter = false,
};
failslab对通用的struct fault_attr结构体进行了封装,另外定义了两个单独的参数ignore_gfp_reclaim和cache_filter,前者是过滤__GFP_RECLAIM类型内存分配的开关,后者用于只对用户指定的slab缓存进行注入。之所以设置这两个参数,是为了满足用户针对某些特定kmem_cache注入的需要,同时也为了避免一启用failslab后整个系统报错而无法进一步调试。
failslab的启动配置初始化接口:
static int __init setup_failslab(char *str)
{
return setup_fault_attr(&failslab.attr, str);
}
__setup("failslab=", setup_failslab);
failslab的debugfs配置接口:
static int __init failslab_debugfs_init(void)
{
struct dentry *dir;
umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr);
if (IS_ERR(dir))
return PTR_ERR(dir);
if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
&failslab.ignore_gfp_reclaim))
goto fail;
if (!debugfs_create_bool("cache-filter", mode, dir,
&failslab.cache_filter))
goto fail;
return 0;
fail:
debugfs_remove_recursive(dir);
return -ENOMEM;
}
failslab在创建debugfs目录和配置文件时会多创建两个配置文件ignore-gfp-wait和cache-filter,分别对应于struct failslab结构体中的ignore_gfp_reclaim和cache_filter这两个配置参数。
failslab cache_filter过滤器配置接口:
#ifdef CONFIG_FAILSLAB
static ssize_t failslab_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB));
}
static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
size_t length)
{
if (s->refcount > 1)
return -EINVAL;
s->flags &= ~SLAB_FAILSLAB;
if (buf[0] == '1')
s->flags |= SLAB_FAILSLAB;
return length;
}
SLAB_ATTR(failslab);
#endif
该接口位于/sys/kernel/slab/xxx/failslab,设置为非0后会在kmem_cache的flags标识位上置位SLAB_FAILSLAB(否则清除该标识),SLAB_FAILSLAB标识在故障注入判断函数should_failslab()中进行确认:
bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
/* No fault-injection for bootstrap cache */
if (unlikely(s == kmem_cache))
return false;
if (gfpflags & __GFP_NOFAIL)
return false;
if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
return false;
if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB))
return false;
return should_fail(&failslab.attr, s->object_size);
}
首先若ignore_gfp_reclaim标识启用,则自动忽略__GFP_RECLAIM类型的内存分配;然后如果cache_filter过滤器被启用,则自动过滤SLAB_FAILSLAB标识未置位的内存分配;最后调用should_fail()进行通用条件判断。
接下来看一下failslab是在何处进行故障注入的:
static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
gfp_t flags)
{
flags &= gfp_allowed_mask;
lockdep_trace_alloc(flags);
might_sleep_if(gfpflags_allow_blocking(flags));
if (should_failslab(s, flags))
return NULL;
if (memcg_kmem_enabled() &&
((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
return memcg_kmem_get_cache(s);
return s;
}
正常的内存分配调用流程为:kmem_cache_alloc()->slab_alloc()->slab_alloc_node()->slab_pre_alloc_hook()(kmalloc的内存分配流程也类似),这里判断如果进行故障注入则直接返回NULL,表示内存分配失败,故障得以注入。
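从调用方的角度看,failslab注入的效果就是kmem_cache_alloc()/kmalloc()返回NULL,这正好可以用来检验调用方的错误处理路径。下面是一个极简的示意(foo_create、foo_cache与struct foo均为虚构的名字):
/* 示意:failslab注入后,分配返回NULL,走到错误处理分支 */
static int foo_create(void)
{
	struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

	if (!f)			/* 故障注入生效时会走到这里 */
		return -ENOMEM;	/* 健壮的驱动应在此正确回滚并向上报告错误 */
	...
	return 0;
}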
fail_page_alloc
fail_page_alloc的故障注入比failslab的更为底层,直接在内存伙伴系统中进行注入。它也有自己的结构体定义:
static struct {
struct fault_attr attr;
bool ignore_gfp_highmem;
bool ignore_gfp_reclaim;
u32 min_order;
} fail_page_alloc = {
.attr = FAULT_ATTR_INITIALIZER,
.ignore_gfp_reclaim = true,
.ignore_gfp_highmem = true,
.min_order = 1,
};
在通用struct fault_attr配置参数的基础之上又增加了ignore_gfp_reclaim、ignore_gfp_highmem和min_order这三个配置参数,第一个是过滤__GFP_DIRECT_RECLAIM类型内存分配的开关,第二个是过滤高端内存分配__GFP_HIGHMEM的开关,最后一个是对故障注入最小分配order的过滤器,只有order不小于该参数的分配才会注入故障。
fail_page_alloc的启动参数配置接口:
static int __init setup_fail_page_alloc(char *str)
{
return setup_fault_attr(&fail_page_alloc.attr, str);
}
__setup("fail_page_alloc=", setup_fail_page_alloc);
fail_page_alloc的debugfs配置接口:
static int __init fail_page_alloc_debugfs(void)
{
umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
struct dentry *dir;
dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
&fail_page_alloc.attr);
if (IS_ERR(dir))
return PTR_ERR(dir);
if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
&fail_page_alloc.ignore_gfp_reclaim))
goto fail;
if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
&fail_page_alloc.ignore_gfp_highmem))
goto fail;
if (!debugfs_create_u32("min-order", mode, dir,
&fail_page_alloc.min_order))
goto fail;
return 0;
fail:
debugfs_remove_recursive(dir);
return -ENOMEM;
}
它在通用配置文件的基础上多创建了ignore-gfp-wait、ignore-gfp-highmem和min-order这三个配置文件,分别对应于fail_page_alloc结构体的ignore_gfp_reclaim、ignore_gfp_highmem和min_order配置参数。
下面来分析故障注入判断函数:
static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
if (order < fail_page_alloc.min_order)
return false;
if (gfp_mask & __GFP_NOFAIL)
return false;
if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
return false;
if (fail_page_alloc.ignore_gfp_reclaim &&
(gfp_mask & __GFP_DIRECT_RECLAIM))
return false;
return should_fail(&fail_page_alloc.attr, 1 << order);
}
1)首先判断如果本次需要申请页的order值小于min_order过滤器的设置值则不注入故障;
2)然后如果分配请求置位了__GFP_NOFAIL标记,也不注入故障;
3)如果打开了高端内存开关,则对于置位了__GFP_HIGHMEM的高端页分配不注入故障;
4)如果打开了ignore_gfp_reclaim开关,则对置位了__GFP_DIRECT_RECLAIM的页分配不注入故障;
5)最后执行should_fail进行通用判断。
fail_page_alloc故障注入到内存页分配流程的prepare_alloc_pages()函数中:
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
{
ac->high_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = zonelist;
ac->nodemask = nodemask;
ac->migratetype = gfpflags_to_migratetype(gfp_mask);
if (cpusets_enabled()) {
*alloc_mask |= __GFP_HARDWALL;
if (!ac->nodemask)
ac->nodemask = &cpuset_current_mems_allowed;
else
*alloc_flags |= ALLOC_CPUSET;
}
lockdep_trace_alloc(gfp_mask);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
if (should_fail_alloc_page(gfp_mask, order))
return false;
if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
*alloc_flags |= ALLOC_CMA;
return true;
}
内存页分配的正常调用流程之一为:page_frag_alloc()->alloc_pages_node()->__alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->prepare_alloc_pages(),该函数返回true表示可以继续分配页;如果注入故障,这里就返回false,本次页分配直接失败,调用方需要进行异常处理。
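对照4.11内核中__alloc_pages_nodemask()的开头(节选,有删减)可以更直观地看到失败是如何向上传递的:
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	struct alloc_context ac = { };
	...
	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask,
				 &ac, &alloc_mask, &alloc_flags))
		return NULL;	/* 故障注入生效,本次页分配直接返回NULL */
	...
}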
总结
在我们编写和调试内核程序的时候,一般很容易只考虑到正常的执行流程,而对一些不常见的异常流程缺乏有效的处理机制,导致程序的健壮性不够,最终往往因此出现内核“卡死”、panic等用户不想见到的结果。本文介绍了内核中用于磁盘IO和内存分配的4种常见故障注入技术(Fault-injection),在程序的调试过程中或问题定位复现时可以用来模拟故障场景,大大提高软件开发验证的效率。
参考文献:
Documentation/fault-injection/fault-injection.txt