Introduction
The following is taken from Wikipedia: Segmentation fault
A segmentation fault occurs when a program attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).
The term “segmentation” has various uses in computing; in the context of “segmentation fault”, a term used since the 1950s, it refers to the address space of a program.[6] With memory protection, only the program’s own address space is readable, and of this, only the stack and the read/write portion of the data segment of a program are writable, while read-only data and the code segment are not writable. Thus attempting to read outside of the program’s address space, or writing to a read-only segment of the address space, results in a segmentation fault, hence the name.
On systems using hardware memory segmentation to provide virtual memory, a segmentation fault occurs when the hardware detects an attempt to refer to a non-existent segment, or to refer to a location outside the bounds of a segment, or to refer to a location in a fashion not allowed by the permissions granted for that segment. On systems using only paging, an invalid page fault generally leads to a segmentation fault, and segmentation faults and page faults are both faults raised by the virtual memory management system. Segmentation faults can also occur independently of page faults: illegal access to a valid page is a segmentation fault, but not an invalid page fault, and segmentation faults can occur in the middle of a page (hence no page fault), for example in a buffer overflow that stays within a page but illegally overwrites memory.
At the hardware level, the fault is initially raised by the memory management unit (MMU) on illegal access (if the referenced memory exists), as part of its memory protection feature, or an invalid page fault (if the referenced memory does not exist). If the problem is not an invalid logical address but instead an invalid physical address, a bus error is raised instead, though these are not always distinguished.
At the operating system level, this fault is caught and a signal is passed on to the offending process, activating the process’s handler for that signal. Different operating systems have different signal names to indicate that a segmentation fault has occurred. On Unix-like operating systems, a signal called SIGSEGV (abbreviated from segmentation violation) is sent to the offending process. On Microsoft Windows, the offending process receives a STATUS_ACCESS_VIOLATION exception.
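To see the signal delivery described above in practice, here is a minimal sketch of my own (not part of the quoted article). It assumes a Unix-like system and CPython; the helper name `run_crashing_child` is made up for illustration. A child interpreter reads from address 0 through `ctypes`, and the parent inspects how the child died.

```python
import signal
import subprocess
import sys

# Child program: reading a C string from address 0 dereferences a null pointer.
CRASHING_SNIPPET = "import ctypes; ctypes.string_at(0)"


def run_crashing_child():
    """Run the snippet in a separate interpreter and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", CRASHING_SNIPPET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    status = run_crashing_child()
    if status < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        print("child killed by", signal.Signals(-status).name)
    else:
        print("child exited with status", status)
```

Running the crash in a child process keeps the parent alive, so the signal can be examined instead of just taking the whole script down.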
gdb
A very powerful tool that deserves an article of its own. The official documentation from GNU is here:
GDB: The GNU Project Debugger
One thing worth pointing out: since gdb 6.1, it can be started in an interactive TUI (Text User Interface) mode.
gdb -tui test   # the program being debugged is test
addr2line
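addr2line translates an address, such as the one held in ip (the instruction pointer), into a source file name and line number.
The following invocation is recommended, because it also prints the source function name:
addr2line -C -f -e <YourProgram> <address>   # YourProgram can also be a library file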
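When the address comes out of a script as a number, a small wrapper keeps the flags straight. This is a sketch under my own assumptions: the helper name `addr2line_command`, the path `./a.out`, and the address `0x1234` are placeholders, and the optional `-i` flag (report inlined callers) goes beyond the command shown above.

```python
import os
import shutil
import subprocess


def addr2line_command(binary, address, inlines=False):
    """Build the argv for: addr2line -C -f [-i] -e <binary> <hex address>."""
    argv = ["addr2line", "-C", "-f"]  # -C demangles C++ names, -f prints the function
    if inlines:
        argv.append("-i")  # also list the functions this address was inlined into
    argv += ["-e", binary, format(address, "x")]
    return argv


if __name__ == "__main__":
    # Hypothetical binary and address, used only to show the call.
    argv = addr2line_command("./a.out", 0x1234)
    print(" ".join(argv))
    # Run the tool only when it is installed and the binary actually exists.
    if shutil.which("addr2line") and os.path.exists("./a.out"):
        subprocess.run(argv, check=False)
```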
dmesg
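dmesg prints or controls the kernel ring buffer.
nm
The nm command lists the symbol table of a binary file, including symbol addresses, symbol types, and symbol names, which helps locate where a segmentation fault occurred.
objdump
Displays the internal information of a binary file, including its disassembly.
readelf
Reads the contents of ELF (Executable and Linkable Format) files on Linux directly, such as the symbol table, string table, section names, and relocations.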
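Possible causes
The conditions under which segmentation faults occur, and how they manifest themselves, are specific to the hardware and the operating system: different hardware raises different faults for given conditions, and different operating systems convert these faults into different signals that are passed on to processes. The proximate cause is a memory access violation, while the underlying cause is usually a software bug of some kind. Determining the root cause, that is, debugging the bug, can be simple in some cases, where the program consistently causes a segmentation fault (for example, dereferencing a null pointer). In other cases the bug can be hard to reproduce and depends on memory allocation on each run (for example, dereferencing a dangling pointer).
The following are some typical causes of a segmentation fault:
- attempting to access a nonexistent memory address (outside the process's address space);
- attempting to access memory the program has no rights to (such as kernel structures in process context);
- attempting to write to read-only memory (such as the code segment);
These in turn are usually caused by programming errors that result in invalid memory access:
- dereferencing a null pointer, which usually points to an address that is not part of the process's address space;
- dereferencing or assigning to an uninitialized pointer (a wild pointer, which points to a random memory address);
- dereferencing or assigning to a freed pointer (a dangling pointer, which points to memory that has been freed/deallocated/deleted);
- a buffer overflow;
- a stack overflow;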
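Open questions
With the help of the almighty internet, all I have learned is that core dump problems can be investigated, and the offending code located, with gdb, dmesg + addr2line + objdump, strace + addr2line, and so on.
But are these methods interchangeable, or does each case need its own approach? Only investigating it myself gives me the right to say.
Debugging case study
Method 1: gdb
Use bt (backtrace) or bt full to see what triggered the crash.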
[root@localhost ~]# gdb /mydir/test SegmentationFile
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mydir/test ...done.
[New LWP 15799]
[New LWP 15872]
[New LWP 15867]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/mydir/test'.
Program terminated with signal 11, Segmentation fault.
#0 size (this=<optimized out>, this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:646
646 /usr/include/c++/4.8.2/bits/stl_vector.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-28.el7.x86_64 boost-thread-1.53.0-28.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 glib2-2.56.1-2.el7.x86_64 glibc-2.17-324.el7_9.x86_64
(gdb) bt
#0 size (this=<optimized out>, this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:646
#1 vector (__x=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:312
#2 getAi (this=<optimized out>) at /mydir/test.c:58
#3 calculateAi (this=this@entry=0x25b9d81) at /mydir/test.cc:144
#4 0x00007f8565674f4f in handleNotification (
this=this@entry=0x7f856587d7a0 <getInstance()::instance_s>, callData_p=0x7f854402ee58, request_p=<optimized out>,
reply=<optimized out>) at /mydir/test.cc:36
Reading this alongside the code, the problem is easy to spot: it is caused by a null pointer.
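The same backtrace can also be collected without an interactive session, which is handy when there are many core files to go through. The sketch below only builds the command line using gdb's `-batch` and `-ex` options; the helper name `gdb_backtrace_command` is my own, and the paths are the ones from the session above.

```python
import os
import shutil
import subprocess


def gdb_backtrace_command(program, core, full=False):
    """Build the argv for a one-shot gdb run that prints a backtrace and exits."""
    backtrace = "bt full" if full else "bt"  # "bt full" also prints local variables
    return ["gdb", "-batch", "-ex", backtrace, program, core]


if __name__ == "__main__":
    argv = gdb_backtrace_command("/mydir/test", "SegmentationFile")
    print(" ".join(argv))
    # Run gdb only when it is installed and both files are present.
    if shutil.which("gdb") and all(os.path.exists(p) for p in argv[-2:]):
        subprocess.run(argv, check=False)
```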
But they say all roads lead to Rome. Is that really so?
Let's try the other methods and see whether they lead to the same result.
Method 2: dmesg + addr2line + objdump
[root@localhost ~]# dmesg
[2437974.769928] test[15799]: segfault at 488 ip 00007f855c3ef328 sp 00007ffc9d1a6320 error 4 in libcommon.so.1.0.0[7f855c3d8000+23000]
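From this we can see:
- the segmentation fault occurred in the libcommon.so.1.0.0 library;
- ip is the instruction address and sp is the stack pointer address, that is: ip 00007f855c3ef328, sp 00007ffc9d1a6320;
- libcommon.so.1.0.0[7f855c3d8000+23000] shows that libcommon.so.1.0.0 is mapped at base address 7f855c3d8000 in the test program;
- given the instruction address 00007f855c3ef328 and the library's base address 7f855c3d8000, the relative address of the instruction is 17328 (that is, 00007f855c3ef328 - 7f855c3d8000), and this relative address can be used to find the function that the code address belongs to.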
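The subtraction is easy to get wrong by hand, so here it is as a small sketch. The three constants are copied from the dmesg line; the function name `relative_address` is my own, and the sketch assumes the mapping reported by the kernel starts at the library's load base.

```python
# Values copied from the dmesg line above.
IP = 0x00007F855C3EF328    # address of the faulting instruction
MAP_BASE = 0x7F855C3D8000  # where libcommon.so.1.0.0 is mapped
MAP_SIZE = 0x23000         # size of that mapping


def relative_address(ip, base, size):
    """Return ip as an offset into the mapping, or raise if it lies outside."""
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip is not inside the given mapping")
    return offset


if __name__ == "__main__":
    print(format(relative_address(IP, MAP_BASE, MAP_SIZE), "x"))  # prints 17328
```

The bounds check is a cheap sanity test: if the offset falls outside the mapping, the wrong library or the wrong dmesg line was picked.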
Question: what does the 488 above stand for?
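As far as I understand the x86 kernel message, the number after "at" is the address whose access faulted (in hex), and "error" is the page-fault error code, whose low three bits mean: page present, write access, and user mode. The sketch below decodes the line on that assumption; the function name `decode_segfault` and the dictionary keys are my own.

```python
import re

# Line copied from the dmesg output above.
SAMPLE = (
    "[2437974.769928] test[15799]: segfault at 488 ip 00007f855c3ef328 "
    "sp 00007ffc9d1a6320 error 4 in libcommon.so.1.0.0[7f855c3d8000+23000]"
)

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Extract the faulting address and the decoded error-code bits."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("no segfault record found in line")
    err = int(match.group("err"), 16)
    return {
        "fault_address": int(match.group("addr"), 16),
        "page_present": bool(err & 0x1),  # 0: page not mapped, 1: protection violation
        "is_write": bool(err & 0x2),      # 0: read access, 1: write access
        "user_mode": bool(err & 0x4),     # 1: fault happened in user mode
    }


if __name__ == "__main__":
    print(decode_segfault(SAMPLE))
```

Read this way, 488 is the address 0x488, and error 4 is a user-mode read of an unmapped page. An address that small, so close to zero, is what a null pointer plus a member offset would look like.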
[root@localhost ~]# addr2line -C -f -e /mydir/libcommon.so.1.0.0 17328
std::string::_Rep::_M_dispose(std::allocator<char> const&)
/usr/include/c++/4.8.2/bits/basic_string.h:240
This does not actually reveal the problem, so the investigation continues by disassembling the libcommon.so.1.0.0 library file.
[root@localhost ~]# objdump -ld /mydir/libcommon.so.1.0.0 > tmp.Commonobjdump
Some information can be found in it, though it does not seem to be of much use:
_ZNKSt6vectorI11NasStudentSaIS0_EE4sizeEv():
/usr/include/c++/4.8.2/bits/stl_vector.h:646
17328: 48 8b 90 88 04 00 00 mov 0x488(%rax),%rdx
1732f: 48 2b 90 80 04 00 00 sub 0x480(%rax),%rdx
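Compared with the earlier gdb session, though, what is shown here is only the frame 0 information. It does not show that a null pointer caused the problem.
In short, this method did not clearly show where the problem lies.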
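One possible reading that ties the two outputs together, offered as a guess rather than a conclusion: the instruction at offset 17328 is `mov 0x488(%rax),%rdx`, which reads from the address rax + 0x488, and dmesg reported "segfault at 488". If that number is the faulting address, the value of rax can be worked backwards. The function name `implied_base` is my own.

```python
FAULT_ADDRESS = 0x488  # "segfault at 488" from the dmesg line
DISPLACEMENT = 0x488   # displacement in "mov 0x488(%rax),%rdx"


def implied_base(fault_address, displacement):
    """Return the base register value implied by base + displacement == fault."""
    return fault_address - displacement


if __name__ == "__main__":
    rax = implied_base(FAULT_ADDRESS, DISPLACEMENT)
    print(hex(rax))  # prints 0x0
```

A base of zero would mean the object whose vector member is being read sits at a null address, which agrees with what the gdb backtrace suggested.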
Method 3: strace + addr2line
strace -i /mydir/test > tmp.strace 2>&1
Opening the file, I could not see where the problem was, let alone carry on with addr2line.
Frustrating. Part of the output is pasted below:
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 161, MSG_NOSIGNAL, NULL, 0) = 161
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 207, MSG_NOSIGNAL, NULL, 0) = 207
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 199, MSG_NOSIGNAL, NULL, 0) = 199
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 193, MSG_NOSIGNAL, NULL, 0) = 193
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 198, MSG_NOSIGNAL, NULL, 0) = 198
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 192, MSG_NOSIGNAL, NULL, 0) = 192
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 165, MSG_NOSIGNAL, NULL, 0) = 165
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 138, MSG_NOSIGNAL, NULL, 0) = 138
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 143, MSG_NOSIGNAL, NULL, 0) = 143
[00007f6d813cfd19] gettid() = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul 4 13:20:37 test: [2"..., 129, MSG_NOSIGNAL, NULL, 0) = 129
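In a log this long, the interesting part is usually the signal record rather than the system calls around it. strace writes signal deliveries as lines of the form `--- SIGSEGV {...} ---`, so a small filter can pull them out. This is a sketch: the function name `find_signal_lines` is my own, and the sample lines in it are hypothetical, not taken from my tmp.strace.

```python
import re

# strace prints a delivered signal as: --- SIGNAME {siginfo fields} ---
SIGNAL_RE = re.compile(r"--- (SIG[A-Z0-9]+) \{(.*)\} ---")


def find_signal_lines(lines):
    """Return (line number, signal name, siginfo text) for each signal record."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SIGNAL_RE.search(line)
        if match:
            hits.append((number, match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample, made up to show the shape of the records.
    sample = [
        "[00007f6d813cfd19] gettid() = 20249",
        "--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x488} ---",
        "+++ killed by SIGSEGV (core dumped) +++",
    ]
    for number, name, info in find_signal_lines(sample):
        print(number, name, info)
```

To use it on the real log, pass the open file object instead of the sample list, for example `find_signal_lines(open("tmp.strace"))`.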
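Summary
Having tried these mainstream approaches, I can only admit that my understanding of programs is still too shallow. I need to study the fundamentals in more depth: program structure, memory layout, assembly, and so on.
Back to picking up seashells by the shore~~~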