Debugging an intermittent segmentation fault, and the SIGSEGV signal

longyu_wlz

Article link: https://blog.csdn.net/Longyu_wlz/article/details/106677213

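Problem description

I recently ran into a bug that causes a segmentation fault only intermittently. The first thing to establish is where the segmentation fault occurs, but because the bug is intermittent, the usual methods could not pin it down.

From experience, there are two approaches to this kind of problem.

1. Generate a core dump file.
2. Make the program stop when the segmentation fault occurs, then attach gdb to it and debug.

The first approach was not usable in my case: the core dump file would have been too large to store.

So how do we carry out the second approach?

How do we make the program stop when a segmentation fault occurs?

Some investigation shows that when a program hits a segmentation fault, the kernel sends it the SIGSEGV signal. Programs usually do not register a handler for this signal, and its default action is to terminate the process.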

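We can use the SIGSEGV signal to make the program stop at the moment of the segmentation fault. Concretely, call signal to register a handler for SIGSEGV and write that handler as an infinite loop. When the program then hits a segmentation fault, the kernel sends it SIGSEGV, the registered handler is called, and the program stays put inside the handler.

I used the following code to check whether this approach works.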

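#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

void segv_handler(int signo)
{
    printf("in segv_handler\n"); /* printf is not async-signal-safe; tolerable in this demo only */

    while (signo) { /* signo is SIGSEGV (nonzero), so this loops forever */
        sleep(1);
    }
}

int main(int argc, char *argv[])
{
    char *pointer = NULL;

    signal(SIGSEGV, segv_handler); /* install the handler before faulting */

    *pointer = 'c'; /* write through a NULL pointer to trigger SIGSEGV */

    return 0;
}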

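The program first registers segv_handler as the handler for SIGSEGV; the handler is implemented as an infinite loop. After registering it, the program writes to address 0 to trigger a segmentation fault.

Compile a version with debug information using gcc -g, naming the output file segmentation_fault_test.

Running the test program in a terminal prints the following: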

longyu@debian:/tmp$ ./segmentation_fault_test
in segv_handler

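The output shows that the program is now held inside the SIGSEGV handler. At this point, press Control+Z to suspend it.

That frees the terminal for further use. Now attach gdb to the segmentation_fault_test process and inspect the stack frames to see where it crashed.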

longyu@debian:/tmp$ ps aux |grep 'seg'
longyu 1218 0.0 0.0 2276 684 pts/0 T 19:07 0:00 ./segmentation_fault_test
longyu 1228 0.0 0.0 15548 884 pts/0 S+ 19:10 0:00 grep seg
longyu@debian:/tmp$ gdb -p 1218
.......

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1218
Reading symbols from /tmp/segmentation_fault_test...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug/.build-id/18/b9a9a8c523e5cfe5b5d946d605d09242f09798.debug...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/.build-id/f2/5dfd7b95be4ba386fd71080accae8c0732b711.debug...done.
done.

Program received signal SIGTSTP, Stopped (user).
0x00007f153e59f6f4 in __GI___nanosleep (requested_time=requested_time@entry=0x7fff71925520, remaining=remaining@entry=0x7fff71925520)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) bt
#0 0x00007f153e59f6f4 in __GI___nanosleep (requested_time=requested_time@entry=0x7fff71925520, remaining=remaining@entry=0x7fff71925520)
at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1 0x00007f153e59f62a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x000055f0ec640178 in segv_handler (signo=11) at segmentation_fault_test.c:11
#3 <signal handler called>
#4 0x000055f0ec6401ad in main (argc=1, argv=0x7fff71925d28) at segmentation_fault_test.c:21

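The backtrace (frame #4) shows that line 21 of segmentation_fault_test.c triggered the segmentation fault.

Looking at the source, line 21 is indeed the faulting statement:

21 *pointer = 'c';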

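With this method we can make the program stop when a segmentation fault occurs, which lets us use gdb to collect information about the state at the time of the fault.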

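Having the kernel generate a core dump file

The Linux kernel provides a mechanism that writes a core dump file when a program is terminated by certain signals. The file can be loaded into gdb, which lets us quickly locate the point where the segmentation fault occurred.

man core gives the following description: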

The default action of certain signals is to cause a process to
terminate and produce a core dump file, a disk file containing an
image of the process’s memory at the time of termination. This image
can be used in a debugger (e.g., gdb(1)) to inspect the state of the
program at the time that it terminated. A list of the signals which
cause a process to dump core can be found in signal(7).

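The core dump file here is a disk file containing an image of the process's memory at the time it terminated abnormally. It can be loaded into a debugger to inspect the state of the program at termination. The list of signals that cause a process to dump core can be viewed with man 7 signal.

Running man 7 signal on my system shows the following list: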

Signal      Value      Action   Comment
──────────────────────────────────────────────────────────────────────
SIGHUP         1       Term     Hangup detected on controlling terminal
                                or death of controlling process
SIGINT         2       Term     Interrupt from keyboard
SIGQUIT        3       Core     Quit from keyboard
SIGILL         4       Core     Illegal Instruction
SIGABRT        6       Core     Abort signal from abort(3)
SIGFPE         8       Core     Floating-point exception
SIGKILL        9       Term     Kill signal
SIGSEGV       11       Core     Invalid memory reference
SIGPIPE       13       Term     Broken pipe: write to pipe with no
                                readers; see pipe(7)
SIGALRM       14       Term     Timer signal from alarm(2)
SIGTERM       15       Term     Termination signal
SIGUSR1    30,10,16    Term     User-defined signal 1
SIGUSR2    31,12,17    Term     User-defined signal 2
SIGCHLD    20,17,18    Ign      Child stopped or terminated
SIGCONT    19,18,25    Cont     Continue if stopped
SIGSTOP    17,19,23    Stop     Stop process
SIGTSTP    18,20,24    Stop     Stop typed at terminal
SIGTTIN    21,21,26    Stop     Terminal input for background process
SIGTTOU    22,22,27    Stop     Terminal output for background process

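The list shows that the signals whose default action is Core (SIGSEGV, SIGFPE, SIGABRT, SIGILL and SIGQUIT here) can all cause a program to produce a core dump file.

ulimit configuration

To actually get a core file, though, we also have to adjust the core file size resource limit managed by ulimit. On my system this limit defaults to 0, which disables core files. An unprivileged user can lower the limit freely, but can raise the soft limit only up to the hard limit; raising the hard limit requires privilege.

Running ulimit -a on my system to view the core file size limit gives the following output: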

longyu@debian:/tmp$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7767
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7767
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

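The first line of the output shows a core file size of 0. I raised it by running ulimit -c 10240, after which core files of up to 10240 blocks can be generated (outside POSIX mode, bash counts these blocks in 1024-byte units, so roughly 10 MiB).

For example: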

longyu@debian:~$ ulimit -c 10240
longyu@debian:~$ ulimit -c
10240

With the core file size limit set, I modified the test program above and ran it again. The modified program is as follows:

#include <stdio.h>

int main(int argc, char *argv[])
{
    char *pointer = NULL;
    *pointer = 'c';

    return 0;
}

Save the code above as test.c and compile it with gcc -g, naming the output test. Running test then produces a core dump file.

An example session:

longyu@debian:/tmp$ ./test
Segmentation fault (core dumped)
longyu@debian:/tmp$ ls
core systemd-private-b48e55e2134a4af3b93dc8d9a5815601-systemd-timesyncd.service-DCgRqT test test.c

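A file named core has appeared in the current directory; this is the core dump file. (Its name and location are controlled by the kernel's core_pattern setting; on my system it is written as core in the working directory.) With this file we can use gdb to find where the segmentation fault occurred.

The command format is: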

gdb [options] [executable-file [core-file or process-id]]

For example:

longyu@debian:/tmp$ gdb test core
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test...done.
[New LWP 5779]
Core was generated by `./test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055f377f4413c in main (argc=1, argv=0x7ffd3aa692c8) at test.c:6
6 *pointer = 'c';
(gdb)

The gdb output already reports the faulting location: line 6 of test.c.