Core dump files and kernel.core_pattern

light_forest

Last modified 2024-10-12 11:16:32


First published 2024-10-10 18:24:32

Article link: https://blog.csdn.net/leiyanjie8995/article/details/142787301


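During product debugging and testing, a process will occasionally crash. Generating a core file that captures the scene of the failure is then a very effective and general debugging technique. So the first requirement on the Linux environment as a whole is that it can generate core files for processes.

For dpdk and spdk, it is also not rare for a product to crash on an abnormal condition at run time and then rely on hot recovery to be restarted and keep running. So we have to consider that, over the whole product life cycle, continuously generated core files can take up too much disk space and cause a system-level disaster.

Core file generation therefore needs operational rules that keep it under control.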

I. kernel.core_pattern

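The kernel makes core file generation configurable through the system-level setting core_pattern.

It can be set statically in /etc/sysctl.conf so that it takes effect on every boot.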

kernel.core_pattern=|/home/script/core_file_handle.sh core_%e_%p_sigwq_%s_time_%t

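To change it dynamically after the system has booted, write the pattern itself to /proc/sys/kernel/core_pattern as root. The kernel.core_pattern= prefix belongs only in sysctl.conf (or on a sysctl -w command line), not in the value written to the /proc file:

# cat /proc/sys/kernel/core_pattern
# echo "|/home/script/core_file_handle.sh core_%e_%p_sigwq_%s_time_%t" > /proc/sys/kernel/core_pattern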

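When a core file is generated, the kernel expands the format specifiers in this string: %e becomes the executable name, %p the PID of the dumped process, %s the number of the signal that caused the dump, and %t the time of the dump in seconds since the Epoch.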
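As a rough illustration of this expansion, the sketch below mimics it in bash for the four specifiers used in the pattern above. The function name expand_core_pattern and its argument order are made up for this example; the kernel does the real substitution internally and supports more specifiers than are shown here.

```bash
#!/bin/bash
# Hypothetical helper: mimic the kernel's expansion of %e, %p, %s and %t.
# Usage: expand_core_pattern PATTERN EXE PID SIGNAL EPOCH_SECONDS
expand_core_pattern() {
    local out=$1
    out=${out//\%e/"$2"}   # executable name
    out=${out//\%p/"$3"}   # PID of the dumped process
    out=${out//\%s/"$4"}   # number of the signal causing the dump
    out=${out//\%t/"$5"}   # time of the dump, seconds since the Epoch
    printf '%s\n' "$out"
}

expand_core_pattern 'core_%e_%p_sigwq_%s_time_%t' myapp 1234 11 1728555872
# prints: core_myapp_1234_sigwq_11_time_1728555872
```

The trailing time_%t part is what a cleanup script can later parse to learn when a core file was produced.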

II. Kernel execution path

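An abnormal exit always goes through the signal handling path; a common example is an out-of-bounds memory access, which raises SIGSEGV (segmentation fault).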
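That a crash is a signal-driven exit is also visible from user space: the shell reports a child killed by signal N with exit status 128 + N, so SIGSEGV (signal 11 on Linux) shows up as 139. A minimal sketch, in which the function name crash_exit_status is made up and a sleep process stands in for the crashing program:

```bash
#!/bin/bash
# Hypothetical demo: kill a child with SIGSEGV and report how it exited.
crash_exit_status() {
    sleep 30 >/dev/null &      # stand-in for a program that is about to crash
    local pid=$!
    kill -s SEGV "$pid"        # the same signal a bad memory access raises
    wait "$pid" 2>/dev/null    # wait returns 128 + signal number
    echo $?
}

crash_exit_status   # prints 139 (128 + 11)
```

Whether a core file is actually produced for the killed process depends on the core_pattern in effect and on the core size limit of the process.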

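Accordingly, in the kernel source the path that generates the core file also sits in the signal handling flow. Taking the x86 implementation as an example, the call path is: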

syscall_exit_to_user_mode     (system call)
irqentry_exit_to_user_mode    (interrupt)
asm_exit_to_user_mode         (do_fork)

 ->exit_to_user_mode_prepare
  ->exit_to_user_mode_loop
   ->arch_do_signal_or_restart
    ->get_signal
     ->do_coredump

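Signals are handled when the kernel switches back to user mode, which is why the entry points above are:

1. system call; 2. interrupt; 3. do_fork (fork is also a system call, but because of its special handling the return to user space goes through a dedicated branch)

Finally, in get_signal, if the conditions are met, do_coredump is called to generate the core file.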

III. do_coredump processing flow

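Since our interest here is core_pattern, we only look at the parts of do_coredump that deal with it.

The kernel takes one of two branches depending on the format of core_pattern.

If the first character is |, the pattern is in pipe form: the core data is delivered through a pipe to a user-specified script, and the script can add its own processing.

For example, the script can clean up historical core files and keep only the 20 most recent ones. That way the old core files are pruned every time a new one is generated.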
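The "keep only the newest 20" policy mentioned above could look roughly like the following. This is a sketch rather than the script used in this article: the function name prune_keep_newest is made up, core files are assumed to have names that start with core and contain no newlines, and age is judged by modification time.

```bash
#!/bin/bash
# Hypothetical helper: keep the KEEP newest core files in DIR, delete the rest.
# Usage: prune_keep_newest DIR KEEP
prune_keep_newest() {
    local dir=$1 keep=$2
    # ls -1t lists newest first; tail skips the first KEEP names.
    ls -1t "$dir" | grep '^core' | tail -n +"$((keep + 1))" |
    while IFS= read -r name; do
        rm -f -- "$dir/$name"
    done
}

# Example call from a handler script:
# prune_keep_newest /home/coresave 20
```

None of the commands in the pipeline read the handler's own standard input, so the incoming core image is still available afterwards.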

1) Plain file mode

Only the path and file name of the core file are defined:


kernel.core_pattern=/var/coresave/core_%e_%p_sigwq_%s_time_%t

2) Pipe mode

A script is specified, which allows more complex processing:

kernel.core_pattern=|/home/script/core_file_handle.sh core_%e_%p_sigwq_%s_time_%t

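The kernel passes the core file to this script through a pipe, on the script's standard input, so the script must contain a cat > corefile step to actually write the core file.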
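Reduced to its essentials, a pipe handler only has to copy its standard input to a file. The sketch below wraps that step in a function so it can be tried by hand with fake data instead of a real crash; the name save_core and its arguments are made up for this illustration.

```bash
#!/bin/bash
# Hypothetical minimal handler body: write whatever arrives on stdin
# (the core image, when invoked by the kernel) to DIR/NAME.
# Usage: save_core DIR NAME
save_core() {
    local dir=$1 name=$2
    mkdir -p "$dir" || return 1
    cat > "$dir/$name"
}

# Dry run without a real crash: pipe some bytes in, as the kernel would.
tmpdir=$(mktemp -d)
printf 'fake core data' | save_core "$tmpdir" core_demo_1_sigwq_11_time_0
ls -l "$tmpdir"
rm -rf "$tmpdir"
```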

A sample handler script follows (it deletes core files older than 7 days and keeps the total size between two watermarks):

#!/bin/bash
# Core dump handler invoked by the kernel through kernel.core_pattern.
# $1 is the core file name; the core image itself arrives on stdin.

COREDIR=/home/coresave
# Assumed example limits in bytes: the original script uses these two
# variables without defining them, so adjust the values to your disk budget.
COREFILE_HIGH_WATERMARK=$((10 * 1024 * 1024 * 1024))
COREFILE_LOW_WATERMARK=$((5 * 1024 * 1024 * 1024))

# Epoch seconds of the moment 7 days ago.
before_7_day=$(date -d "-7 day" "+%s")

# List existing core files, oldest first (by modification time), one per line.
list_cores() {
    ls -1rt "$COREDIR" 2>/dev/null | grep core
}

# first step
# check the generation time of each core and delete core files older than 7 days.
total_size=0
mapfile -t cores < <(list_cores)
for filename in "${cores[@]}"; do
    size=$(stat -c %s "$COREDIR/$filename")
    # Prefer the %t timestamp embedded in the name; fall back to the mtime.
    time=${filename##*time_}
    [[ $time =~ ^[0-9]+$ ]] || time=$(stat -c %Y "$COREDIR/$filename")
    if [[ $time -lt $before_7_day ]]; then
        rm -f "$COREDIR/$filename"
    else
        ((total_size += size))
    fi
done

# second step
# check the size of all core files. if it exceeds the high watermark, delete
# the oldest core files until the total drops to the low watermark or below.
if [[ $total_size -gt $COREFILE_HIGH_WATERMARK ]]; then
    mapfile -t cores < <(list_cores)
    for filename in "${cores[@]}"; do
        size=$(stat -c %s "$COREDIR/$filename")
        rm -f "$COREDIR/$filename"
        ((total_size -= size))
        if [[ $total_size -le $COREFILE_LOW_WATERMARK ]]; then
            break
        fi
    done
fi

# third step
# generate the core file from stdin
if [[ -n $1 ]]; then
    cat > "$COREDIR/$1"
fi

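The kernel's do_coredump first checks whether the first character of core_pattern is '|' to decide whether pipe mode applies, and then takes the corresponding branch. In simplified form:

if (ispipe) {        // pipe mode
   1. Parse core_pattern (format_corename) and split it into helper_argv
   2. Set up the invocation of the user-space helper
		sub_info = call_usermodehelper_setup(helper_argv[0],
						helper_argv, NULL, GFP_KERNEL,
						umh_pipe_setup, NULL, &cprm);
   3. Launch the user-space helper
		if (sub_info)
			retval = call_usermodehelper_exec(sub_info,
							  UMH_WAIT_EXEC);
} else {        // plain file mode
   filp_open opens the core file
   do_truncate truncates it to length 0
}
// in both modes the core image is then written by binfmt->core_dump(&cprm)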
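The first-character test that selects between the two branches can be mirrored in user space, for example to check what the running kernel will do with its current setting. A small sketch; the function name core_pattern_mode is made up for this illustration:

```bash
#!/bin/bash
# Hypothetical helper: report which branch a core_pattern value selects.
# Usage: core_pattern_mode PATTERN
core_pattern_mode() {
    case $1 in
        '|'*) echo pipe ;;   # leading '|': hand the core to a user-space helper
        *)    echo file ;;   # anything else: treat the pattern as a file name
    esac
}

# Check the running system:
# core_pattern_mode "$(cat /proc/sys/kernel/core_pattern)"
```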

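call_usermodehelper_exec is defined in kernel/umh.c. It uses queue_work to schedule the work that launches the helper onto system_unbound_wq. Because UMH_WAIT_EXEC is passed, it waits only until the helper has been exec'ed, not until the helper exits, and then returns; the kernel then writes the core image into the pipe.

I have written up the basics of workqueues before, and here system_unbound_wq shows up again.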