I. Background
While coding today, I ran into a segmentation fault.
-> % ./a.out 9000000
the size is: 0x895440
[2] 10558 segmentation fault (core dumped) ./a.out 9000000
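Tracing the program with print statements showed that the segmentation fault occurs where the array is defined. The program looked fine to me, so I ran it under gdb, with the following result: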
(gdb) r 2304098328304234802342
Starting program: /home/signal/a.out 2304098328304234802342
the size is: 0x7fffffff
Program received signal SIGSEGV, Segmentation fault.
0x08048512 in main (argc=2, argv=0xbffff634) at sigsegv.c:15
15 bzero(test, sizeof(test));
(gdb) s
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) quit
So I decided to test this signal, SIGSEGV, specifically.
II. Locating the problem
1. Test program
Having a rough idea that the cause was an array requiring too much memory, I wrote a small test program:
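#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>
#include <strings.h>  /* bzero */
int main(int argc, char *argv[])
{
    int size;
    if (argc != 2) {
        printf("Usage: %s [size]\n", argv[0]);
        return -1;
    }
    size = atoi(argv[1]);
    printf("the size is: 0x%x\n", size);
    char test[size];              /* variable-length array, allocated on the stack */
    bzero(test, sizeof(test));    /* crashes here when the array does not fit */
    return 0;
}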
The results are as follows:
-> % ./a.out 9000000
the size is: 0x895440
[2] 10558 segmentation fault (core dumped) ./a.out 9000000
-> % ./a.out 8000000
the size is: 0x7a1200
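As shown, once the requested size exceeds a certain value (somewhere between 8000000 and 9000000 bytes here), a segmentation fault occurs.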
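2. Debugging the core file with gdb
Running the program under gdb prints the error shown earlier. If the core file size limit is raised with ulimit -c (for example ulimit -c unlimited), a core file is generated when the program crashes, and it can be examined with gdb as follows: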
-> % gdb -c core ./a.out
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...done.
[New LWP 11075]
Core was generated by `./a.out 9000000'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x08048512 in main (argc=2, argv=0xbf9a6584) at sigsegv.c:15
15 bzero(test, sizeof(test));
(gdb) s
The program is not being run.
(gdb) bt
#0 0x08048512 in main (argc=2, argv=0xbf9a6584) at sigsegv.c:15
(gdb)
3. Tracing system calls with strace
Tracing the system calls with strace prints the following:
-> % strace ./a.out 9000000
execve("./a.out", ["./a.out", "9000000"], [/* 63 vars */]) = 0
brk(0) = 0x8156000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7752000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=90693, ...}) = 0
mmap2(NULL, 90693, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb773b000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\233\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1754876, ...}) = 0
mmap2(NULL, 1759868, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb758d000
mmap2(0xb7735000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a8000) = 0xb7735000
mmap2(0xb7738000, 10876, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7738000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb758c000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb758c940, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7735000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0xb7778000, 4096, PROT_READ) = 0
munmap(0xb773b000, 90693) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 10), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7751000
write(1, "1the size is: 0x895440\n", 231the size is: 0x895440
) = 23
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xbf35fc54} ---
+++ killed by SIGSEGV (core dumped) +++
[2] 11100 segmentation fault (core dumped) strace ./a.out 9000000
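The si_code=SEGV_MAPERR in the trace means that the faulting address (si_addr) is not mapped into the process, which is consistent with the suspected cause: the array is too large for the stack.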
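III. Analysis and solution
SIGSEGV indicates that the process made an invalid memory reference (usually a sign of a bug in the program, such as dereferencing an uninitialized pointer). The name SEGV stands for "segmentation violation".
The default action for SIGSEGV is to terminate the process and generate a core file.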
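A program may receive SIGSEGV as a result of incorrect memory handling. An array defined inside a function is stored on the program stack, and every operating system limits the stack size; if the array needs more space than the stack allows, accessing it touches memory outside the stack and triggers an invalid memory access. The operating system can pass information about the nature of the error to the application through the signal mechanism. The default action on receiving SIGSEGV is abnormal termination, which ends the process and may produce a core file to help with debugging.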
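SIGSEGV can be caught. That is, an application can request an action of its own in place of the default one: ignoring the signal, calling a function, or restoring the default action. In some cases, ignoring SIGSEGV results in undefined behavior.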
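In future debugging, when a segmentation fault caused by SIGSEGV shows up again, check the program's memory usage carefully and avoid invalid memory references.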