Debugging Segmentation fault (core dumped) on Linux

Introduction

The following is quoted from the Wikipedia article "Segmentation fault":

A segmentation fault occurs when a program attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).

The term “segmentation” has various uses in computing; in the context of “segmentation fault”, a term used since the 1950s, it refers to the address space of a program.[6] With memory protection, only the program’s own address space is readable, and of this, only the stack and the read/write portion of the data segment of a program are writable, while read-only data and the code segment are not writable. Thus attempting to read outside of the program’s address space, or writing to a read-only segment of the address space, results in a segmentation fault, hence the name.


On systems using hardware memory segmentation to provide virtual memory, a segmentation fault occurs when the hardware detects an attempt to refer to a non-existent segment, or to refer to a location outside the bounds of a segment, or to refer to a location in a fashion not allowed by the permissions granted for that segment. On systems using only paging, an invalid page fault generally leads to a segmentation fault, and segmentation faults and page faults are both faults raised by the virtual memory management system. Segmentation faults can also occur independently of page faults: illegal access to a valid page is a segmentation fault, but not an invalid page fault, and segmentation faults can occur in the middle of a page (hence no page fault), for example in a buffer overflow that stays within a page but illegally overwrites memory.


At the hardware level, the fault is initially raised by the memory management unit (MMU) on illegal access (if the referenced memory exists), as part of its memory protection feature, or an invalid page fault (if the referenced memory does not exist). If the problem is not an invalid logical address but instead an invalid physical address, a bus error is raised instead, though these are not always distinguished.


At the operating system level, this fault is caught and a signal is passed on to the offending process, activating the process’s handler for that signal. Different operating systems have different signal names to indicate that a segmentation fault has occurred. On Unix-like operating systems, a signal called SIGSEGV (abbreviated from segmentation violation) is sent to the offending process. On Microsoft Windows, the offending process receives a STATUS_ACCESS_VIOLATION exception.

gdb

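gdb is powerful enough to deserve an article of its own. The official GNU documentation is:
GDB: The GNU Project Debugger
Worth pointing out: starting with gdb 6.1, gdb can be launched in an interactive TUI (Text User Interface) mode.

gdb test -tui              # the program being debugged is test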

addr2line

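addr2line converts an address, such as the one held in ip (the instruction pointer), into a source file name and line number.
The following form is recommended, since it also prints the function name:

addr2line -C -f -e <YourProgram> <address>  # <YourProgram> can also be a library file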

dmesg

dmesg prints or controls the kernel ring buffer.

nm

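nm lists the symbol table of a binary file, including symbol addresses, symbol types, and symbol names, which helps locate where a segmentation fault occurred.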
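
As a rough illustration of how a symbol table helps with locating a fault, the sketch below takes text in the layout printed by `nm -n` (address, type, name) and finds the last symbol that starts at or before a given address. The sample symbols and addresses are invented for illustration and do not come from the binaries discussed in this article; `parse_nm` and `find_symbol` are hypothetical helper names.

```python
import bisect

# Hypothetical excerpt in the layout printed by `nm -n` (address, type, name).
# These symbols and addresses are made up for illustration only.
SAMPLE_NM_OUTPUT = """\
0000000000001000 T init_module
0000000000001200 T parse_config
0000000000001480 T handle_request
                 U malloc
"""


def parse_nm(text):
    """Return a sorted list of (address, name) for symbols that have an address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        try:
            address = int(parts[0], 16)
        except ValueError:
            continue  # first field was a type letter such as U, not an address
        symbols.append((address, parts[2]))
    symbols.sort()
    return symbols


def find_symbol(symbols, address):
    """Return the name of the last symbol starting at or before address, or None."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


if __name__ == "__main__":
    table = parse_nm(SAMPLE_NM_OUTPUT)
    print(find_symbol(table, 0x1234))
```

The nearest preceding symbol is only a candidate: without symbol sizes there is no guarantee that the address really lies inside that function, so the result should be cross-checked with addr2line or objdump.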

objdump

objdump displays the internal information of a binary file, including its disassembly.

readelf

readelf reads the contents of ELF (Executable and Linkable Format) files on Linux directly, such as the symbol table, string table, section names, and relocations.

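Possible causes

The conditions under which segmentation violations occur, and how they manifest themselves, are specific to the hardware and the operating system: different hardware raises different faults for given conditions, and different operating systems convert these faults into different signals that are passed on to processes. The proximate cause is a memory access violation, while the underlying cause is usually a software bug of some sort. Determining the root cause, that is, debugging the bug, can be simple when the program consistently causes a segmentation fault (for example, dereferencing a null pointer), while in other cases the bug can be hard to reproduce and depends on memory allocation on each run (for example, dereferencing a dangling pointer).

The following are some typical causes of a segmentation fault:

  1. Attempting to access a nonexistent memory address (outside the process's address space);
  2. Attempting to access memory the program has no rights to (such as kernel structures in process context);
  3. Attempting to write to read-only memory (such as the code segment);

These in turn are usually caused by programming errors that result in invalid memory access:

  1. Dereferencing a null pointer, which usually points to an address that is not part of the process's address space;
  2. Dereferencing or assigning to an uninitialized pointer (a wild pointer, which points to a random memory address);
  3. Dereferencing or assigning to a freed pointer (a dangling pointer, which points to memory that has been freed/deallocated/deleted);
  4. A buffer overflow;
  5. A stack overflow;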
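
The first programming error in the list, dereferencing a null pointer, can be observed safely from a parent process. The sketch below is not part of the case studied in this article: it starts a child Python interpreter that reads from address 0 through `ctypes.string_at(0)`, then inspects the child's exit status. It assumes a POSIX system where address 0 is unmapped; `run_null_dereference` is a name chosen for this example.

```python
import signal
import subprocess
import sys

# Child program: read from address 0 through ctypes.
# Address 0 is normally unmapped, so the read is an invalid memory access.
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_null_dereference():
    """Run the faulting code in a child process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    status = run_null_dereference()
    if status < 0:
        # On POSIX, a negative return code means "killed by signal -status".
        print("child killed by", signal.Signals(-status).name)
    else:
        print("child exited with status", status)
```

Running the faulting access in a child keeps the parent alive, which is also how a shell is able to report "Segmentation fault (core dumped)" for a program it launched.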

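Open question

With the help of the all-knowing internet, all I have learned so far is that core dump problems can be investigated, and the offending code located, with gdb, dmesg + addr2line + objdump, strace + addr2line, and similar combinations.
But are these methods interchangeable, or does each case call for its own? Only hands-on investigation earns the right to an opinion.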

Debugging case study

Method 1: gdb

Use bt (backtrace) or bt full to see the call stack that led to the crash.

[root@localhost ~]# gdb /mydir/test SegmentationFile
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mydir/test ...done.
[New LWP 15799]

[New LWP 15872]
[New LWP 15867]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/mydir/test'.
Program terminated with signal 11, Segmentation fault.
#0  size (this=<optimized out>, this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:646
646     /usr/include/c++/4.8.2/bits/stl_vector.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-28.el7.x86_64 boost-thread-1.53.0-28.el7.x86_64  elfutils-libelf-0.172-2.el7.x86_64 glib2-2.56.1-2.el7.x86_64 glibc-2.17-324.el7_9.x86_64 
(gdb) bt
#0  size (this=<optimized out>, this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:646
#1  vector (__x=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:312
#2  getAi (this=<optimized out>) at /mydir/test.c:58
#3  calculateAi (this=this@entry=0x25b9d81) at /mydir/test.cc:144
#4  0x00007f8565674f4f in handleNotification (
    this=this@entry=0x7f856587d7a0 <getInstance()::instance_s>, callData_p=0x7f854402ee58, request_p=<optimized out>,
    reply=<optimized out>) at /mydir/test.cc:36

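Read together with the source code, this backtrace makes the problem easy to spot: it is caused by a null pointer.
But they say all roads lead to Rome. Is that really true here?
Let's try the other approaches and see whether they lead to the same conclusion.

Method 2: dmesg + addr2line + objdump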

[root@localhost ~]# dmesg
[2437974.769928] test[15799]: segfault at 488 ip 00007f855c3ef328 sp 00007ffc9d1a6320 error 4 in libcommon.so.1.0.0[7f855c3d8000+23000]

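The following can be read from this line:

  1. The segmentation fault occurred in the libcommon.so.1.0.0 library;
  2. ip is the instruction pointer and sp is the stack pointer, i.e. ip 00007f855c3ef328 and sp 00007ffc9d1a6320;
  3. libcommon.so.1.0.0[7f855c3d8000+23000] shows that libcommon.so.1.0.0 is mapped into the test process at base address 7f855c3d8000, and that the mapping is 23000 (hexadecimal) bytes long;
  4. Given the instruction address 00007f855c3ef328 and the library's base address 7f855c3d8000, the offset of the instruction within the library is 17328 (00007f855c3ef328 - 7f855c3d8000). This offset can be used to find the function that contains the instruction;
  5. As for the 488: it is the memory address (in hexadecimal) whose access triggered the fault.

Passing the offset to addr2line gives:
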
[root@localhost ~]# addr2line -C -f -e /mydir/libcommon.so.1.0.0 17328
std::string::_Rep::_M_dispose(std::allocator<char> const&)
/usr/include/c++/4.8.2/bits/basic_string.h:240
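
The arithmetic behind this lookup can be checked with a short script. The sketch below parses the dmesg line quoted above, subtracts the mapping base from ip, and decodes the low three bits of the `error` field as an x86 page-fault error code. It assumes the log format shown in this article and that the mapping base equals the library's load base, as it does here; `parse_segfault` and `describe_error_code` are names invented for this example.

```python
import re

# The kernel log line quoted above, without the timestamp prefix.
DMESG_LINE = (
    "test[15799]: segfault at 488 ip 00007f855c3ef328 sp 00007ffc9d1a6320 "
    "error 4 in libcommon.so.1.0.0[7f855c3d8000+23000]"
)

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<module>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the numeric fields (all hexadecimal) from a segfault log line."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line in the expected format")
    fields = match.groupdict()
    module = fields.pop("module")
    info = {name: int(value, 16) for name, value in fields.items()}
    info["module"] = module
    # Offset of the faulting instruction from the start of the mapping.
    info["offset"] = info["ip"] - info["base"]
    return info


def describe_error_code(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # False: no page mapped at the address
        "write": bool(code & 2),         # False: the access was a read
        "user_mode": bool(code & 4),     # True: the fault came from user space
    }


if __name__ == "__main__":
    info = parse_segfault(DMESG_LINE)
    print(f"offset within {info['module']}: {info['offset']:x}")
    print(f"addr2line -C -f -e /mydir/{info['module']} {info['offset']:x}")
    print(describe_error_code(info["error"]))
```

For this line, `error 4` decodes to a user-mode read of an address with no page mapped, which fits an access to a very low address such as 0x488.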

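But this does not actually reveal the problem, so the investigation continues by disassembling libcommon.so.1.0.0:

[root@localhost ~]# objdump -ld /mydir/libcommon.so.1.0.0 > tmp.Commonobjdump

The output contains some information, though it did not seem very useful: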

_ZNKSt6vectorI11NasStudentSaIS0_EE4sizeEv():
/usr/include/c++/4.8.2/bits/stl_vector.h:646
   17328:       48 8b 90 88 04 00 00    mov    0x488(%rax),%rdx
   1732f:       48 2b 90 80 04 00 00    sub    0x480(%rax),%rdx

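Compared with the gdb session above, this shows only the information for frame 0, and nothing here states outright that a null pointer caused the crash.
In short, this approach did not clearly show where the problem lies.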
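
One observation may still connect this listing to the gdb result. The faulting instruction reads from `0x488(%rax)`, and dmesg reported the faulting address as 488, so the value of the base register at the time of the crash can be inferred by subtraction. The sketch below does that arithmetic. It assumes the dmesg line and the disassembly describe the same crash, and it only handles the simple `disp(%reg)` operand form; `implied_base` is a name chosen for this example.

```python
import re

# The faulting instruction from the objdump listing, at offset 17328.
INSTRUCTION = "mov    0x488(%rax),%rdx"
# The faulting address from the dmesg line ("segfault at 488").
FAULT_ADDRESS = 0x488

# Matches a simple AT&T-syntax memory operand of the form disp(%reg).
OPERAND_RE = re.compile(r"(?P<disp>-?0x[0-9a-f]+)\(%(?P<reg>\w+)\)")


def implied_base(instruction, fault_address):
    """Return (register, value) such that value + displacement == fault_address."""
    match = OPERAND_RE.search(instruction)
    if match is None:
        raise ValueError("no disp(%reg) memory operand found")
    displacement = int(match.group("disp"), 16)
    return match.group("reg"), fault_address - displacement


if __name__ == "__main__":
    register, value = implied_base(INSTRUCTION, FAULT_ADDRESS)
    print(f"%{register} would have been {value:#x}")
```

A base register value of 0 means the code read a field at offset 0x488 from a null object pointer, which is consistent with the null pointer found through gdb.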

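Method 3: strace + addr2line

strace -i /mydir/test > tmp.strace 2>&1

Opening the file did not reveal the problem either, so there was nothing to hand over to addr2line.
Frustrating. Part of the output is pasted below: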

[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 161, MSG_NOSIGNAL, NULL, 0) = 161
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 207, MSG_NOSIGNAL, NULL, 0) = 207
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 199, MSG_NOSIGNAL, NULL, 0) = 199
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 193, MSG_NOSIGNAL, NULL, 0) = 193
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 198, MSG_NOSIGNAL, NULL, 0) = 198
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 192, MSG_NOSIGNAL, NULL, 0) = 192
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 165, MSG_NOSIGNAL, NULL, 0) = 165
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 138, MSG_NOSIGNAL, NULL, 0) = 138
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 143, MSG_NOSIGNAL, NULL, 0) = 143
[00007f6d813cfd19] gettid()             = 20249
[00007f6d813d6a1b] sendto(4, "<175>Jul  4 13:20:37 test: [2"..., 129, MSG_NOSIGNAL, NULL, 0) = 129

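Summary

Having tried these mainstream approaches, I can only lament how shallow my understanding of programs still is; I need to study program structure, memory layout, assembly, and other fundamentals in more depth.
Back to picking up seashells on the beach~~~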
