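I. Background
When a native crash occurs, the tombstone alone does not carry enough information to pinpoint the root cause of hard problems such as invalid memory accesses and memory corruption (memory stomping).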
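II. Introduction to coredump
2.1 What is a coredump
When a user program hits an exception at run time and exits abnormally, the Linux kernel saves the program's current state (memory contents, register state, stack pointer, call-stack data, and so on) to a core file. This process is called a coredump.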
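2.2 What coredump is used for
coredump is mainly used to debug NE (native exception) problems. When a user process crashes natively, the tombstone captures only a brief backtrace. For hard problems such as invalid memory accesses or memory corruption, that is not enough to pinpoint the cause, and a coredump is needed to analyze them.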
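2.3 When a coredump is triggered
By exception type: a coredump is triggered when a native process performs an out-of-bounds memory access, overflows its stack, dereferences an invalid pointer, or does something similar.
By signal type: a coredump is triggered when a native process receives a signal whose default action is to dump core, such as SIGQUIT, SIGABRT, SIGSEGV, or SIGTRAP, and the process does not handle it.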
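Both views can be checked on a Linux host with a small Python sketch. It is not part of the original article: the helper names crash_child, raise_signal, and null_read are made up for illustration, and the child sets its core size limit to 0 with the aim of leaving no core files behind. The sketch forks a child, lets it either send itself a signal or read address 0, and reports which signal terminated it.

```python
import ctypes
import os
import resource
import signal


def crash_child(trigger):
    """Run trigger() in a forked child; return the signal that killed it, or None."""
    pid = os.fork()
    if pid == 0:
        # Child: never return into the caller's code, whatever happens.
        try:
            # Assumption for this sketch: a core size limit of 0 avoids writing core files.
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            trigger()
        finally:
            os._exit(0)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status))
    return None


def raise_signal(sig):
    """Build a trigger that sends sig to the current process with the default action."""
    def trigger():
        signal.signal(sig, signal.SIG_DFL)
        os.kill(os.getpid(), sig)
    return trigger


def null_read():
    """Trigger a real invalid memory access: read a C string at address 0."""
    signal.signal(signal.SIGSEGV, signal.SIG_DFL)
    ctypes.string_at(0)


if __name__ == "__main__":
    for sig in (signal.SIGQUIT, signal.SIGABRT, signal.SIGSEGV, signal.SIGTRAP):
        print(sig.name, "->", crash_child(raise_signal(sig)))
    print("NULL read ->", crash_child(null_read))
```

The NULL read case corresponds to the exception-type view (the kernel turns the fault into SIGSEGV), while raise_signal corresponds to the signal-type view.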
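III. How to use coredump
coredump is disabled by default on Android and must be enabled manually.
1. Enable coredump
1) Check whether coredump is enabled
ulimit -c # prints 0 if coredump is disabled
2) Enable coredump
ulimit -c 1024 # cap the core file size at 1024 blocks (the unit is blocks, not bytes; the block size depends on the shell, e.g. 1024 bytes in bash)
or
ulimit -c unlimited # no size limit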
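2. Set the path where core files are generated
# If no path is set, the core file is written to the crashing process's current working directory.
# %e = executable name, %p = PID, %t = time of the dump (seconds since the epoch). The target directory must already exist.
echo "/data/corefile/core-%e-%p-%t" > /proc/sys/kernel/core_pattern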
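3. When a process exits abnormally, a core file (in ELF format) is generated under the configured path and can be analyzed with gdb
1) Put the executable (ideally the unstripped build with debug symbols) and the core file in the same directory
2) Run gdb {binary_name} {core_file_name} to load the core file and locate the problem
See the demo in Section V for details.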
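The two settings used in these steps can also be explored from Python on a Linux host. The sketch below is illustrative only, and the function names raise_core_limit and expand_core_pattern are hypothetical. The first function does for the current process what ulimit -c does for a shell, through the RLIMIT_CORE resource limit (expressed in bytes here, not in blocks). The second mimics how the specifiers in the pattern /data/corefile/core-%e-%p-%t are filled in; it handles only %e, %p, %t, and %%.

```python
import re
import resource


def raise_core_limit():
    """Raise the soft core-size limit to the hard limit; return the new (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def expand_core_pattern(pattern, exe_name, pid, timestamp):
    """Expand %e, %p, %t and %% in a core_pattern-style template."""
    values = {"e": exe_name, "p": str(pid), "t": str(int(timestamp)), "%": "%"}
    # Specifiers this sketch does not know, and a lone trailing %, are dropped.
    return re.sub(r"%(.?)", lambda m: values.get(m.group(1), ""), pattern)


if __name__ == "__main__":
    soft, hard = raise_core_limit()
    print("core limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
    print(expand_core_pattern("/data/corefile/core-%e-%p-%t",
                              "coredump-test-bin", 11966, 1720526041))
```

Like ulimit, raise_core_limit affects only the calling process and the children it starts afterwards.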
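IV. How coredump works
4.1 Basic principle
When a user program triggers certain errors or exceptions, the Linux kernel catches the exception and sends a signal to the process. The process handles pending signals just before it returns to user space; if the signal's action is to dump core, the kernel's coredump path runs, generates a core file in ELF format, and saves it to the configured path.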
4.2 Core code
do_coredump is called to generate the core file, as follows:
void do_coredump(const kernel_siginfo_t *siginfo)
{
    ......
    binfmt = mm->binfmt;
    if (!binfmt || !binfmt->core_dump)
        goto fail;
    if (!__get_dumpable(cprm.mm_flags))
        goto fail;
    ......
    // 1. Build the core file name from core_pattern
    ispipe = format_corename(&cn, &cprm, &argv, &argc);
    ......
    // 2. Create the core file
    cprm.file = file_open_root(&root, cn.corename, open_flags, 0600);
    ......
    // 3. Write the process's memory image to the core file
    core_dumped = binfmt->core_dump(&cprm);
    ......
}
elf_core_dump writes the process's memory state to the ELF-format core file for later debugging and analysis with gdb, as follows:
// kernel_platform/msm-kernel/fs/binfmt_elf.c
static int elf_core_dump(struct coredump_params *cprm)
{
    ......
    /*
     * Collect all the non-memory information about the process for the
     * notes. This also sets up the file header.
     */
    // 1. Fill in the ELF header and the notes
    if (!fill_note_info(&elf, e_phnum, &info, cprm))
        goto end_coredump;
    has_dumped = 1;
    // 2. Compute the sizes of the ELF header, program headers, and notes, and allocate the memory they need
    offset += sizeof(elf); /* Elf header */
    offset += segs * sizeof(struct elf_phdr); /* Program headers */
    ......
    // 3. Write the program headers: one PT_LOAD entry per dumped VMA
    /* Write program headers for segments dump */
    for (i = 0; i < cprm->vma_count; i++) {
        struct core_vma_metadata *meta = cprm->vma_meta + i;
        struct elf_phdr phdr;
        phdr.p_type = PT_LOAD;
        phdr.p_offset = offset;
        phdr.p_vaddr = meta->start;
        phdr.p_paddr = 0;
        phdr.p_filesz = meta->dump_size;
        phdr.p_memsz = meta->end - meta->start;
        offset += phdr.p_filesz;
        phdr.p_flags = 0;
        if (meta->flags & VM_READ)
            phdr.p_flags |= PF_R;
        if (meta->flags & VM_WRITE)
            phdr.p_flags |= PF_W;
        if (meta->flags & VM_EXEC)
            phdr.p_flags |= PF_X;
        phdr.p_align = ELF_EXEC_PAGESIZE;
        if (!dump_emit(cprm, &phdr, sizeof(phdr)))
            goto end_coredump;
    }
    // followed by any architecture-specific extra program headers
    if (!elf_core_write_extra_phdrs(cprm, offset))
        goto end_coredump;
    /* write out the notes section */
    // 4. Write the notes
    if (!write_note_info(&info, cprm))
        goto end_coredump;
    /* For cell spufs */
    if (elf_coredump_extra_notes_write(cprm))
        goto end_coredump;
    /* Align to page */
    // 5. Write the memory contents of each VMA (the data of the PT_LOAD segments)
    dump_skip_to(cprm, dataoff);
    for (i = 0; i < cprm->vma_count; i++) {
        struct core_vma_metadata *meta = cprm->vma_meta + i;
        if (!dump_user_range(cprm, meta->start, meta->dump_size))
            goto end_coredump;
    }
    // 6. Write any extra data, then the section header used for extended numbering
    if (!elf_core_write_extra_data(cprm))
        goto end_coredump;
    if (e_phnum == PN_XNUM) {
        if (!dump_emit(cprm, shdr4extnum, sizeof(*shdr4extnum)))
            goto end_coredump;
    }
end_coredump:
    free_note_info(&info);
    kfree(shdr4extnum);
    kfree(phdr4note);
    return has_dumped;
}
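The PT_LOAD loop in elf_core_dump can be restated as a short Python sketch. It assumes the standard flag values (VM_READ, VM_WRITE, VM_EXEC = 0x1, 0x2, 0x4 in the kernel; PF_X, PF_W, PF_R = 0x1, 0x2, 0x4 in ELF); the function name build_load_headers and the tuple layout are made up for illustration. It shows why a region with dump_size 0 still gets a program header but occupies no space in the file: p_offset advances only by p_filesz.

```python
VM_READ, VM_WRITE, VM_EXEC = 0x1, 0x2, 0x4   # kernel VMA permission bits
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4             # ELF segment permission bits


def build_load_headers(vmas, data_offset):
    """vmas: iterable of (start, end, dump_size, vm_flags); returns a list of header dicts."""
    headers = []
    offset = data_offset
    for start, end, dump_size, vm_flags in vmas:
        p_flags = 0
        if vm_flags & VM_READ:
            p_flags |= PF_R
        if vm_flags & VM_WRITE:
            p_flags |= PF_W
        if vm_flags & VM_EXEC:
            p_flags |= PF_X
        headers.append({
            "p_offset": offset,
            "p_vaddr": start,
            "p_filesz": dump_size,       # bytes actually stored in the core file
            "p_memsz": end - start,      # size of the region in the process
            "p_flags": p_flags,
        })
        offset += dump_size              # only dumped bytes advance the file offset
    return headers


if __name__ == "__main__":
    # The first three LOAD regions from the readelf listing in Section 4.4.
    vmas = [
        (0x560ca89000, 0x560ca8b000, 0x0, VM_READ),
        (0x560ca8b000, 0x560ca8e000, 0x0, VM_READ | VM_EXEC),
        (0x560ca8e000, 0x560ca8f000, 0x1000, VM_READ),
    ]
    for header in build_load_headers(vmas, 0x4000):
        print({key: hex(value) for key, value in header.items()})
```

With those inputs all three headers share the file offset 0x4000, which matches the repeated Offset column in the readelf listing.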
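4.3 Code sequence
The code sequence covering exception capture, signal handling, and core file generation is as follows:
4.4 Core file format and contents
The core file produced by coredump is in ELF format and can be debugged with gdb to locate the problem.
The contents of a core file, as printed by readelf, are as follows: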
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: CORE (Core file)
Machine: AArch64
Version: 0x1
Entry point address: 0x0
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 138
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x0000000000001e70 0x0000000000000000 0x0000000000000000
0x00000000000018a8 0x0000000000000000 0x0
LOAD 0x0000000000004000 0x000000560ca89000 0x0000000000000000
0x0000000000000000 0x0000000000002000 R 0x1000
LOAD 0x0000000000004000 0x000000560ca8b000 0x0000000000000000
0x0000000000000000 0x0000000000003000 R E 0x1000
LOAD 0x0000000000004000 0x000000560ca8e000 0x0000000000000000
0x0000000000001000 0x0000000000001000 R 0x1000
...
Displaying notes found at file offset 0x00001e70 with length 0x000018a8:
Owner Data size Description
CORE 0x00000188 NT_PRSTATUS (prstatus structure)
CORE 0x00000088 NT_PRPSINFO (prpsinfo structure)
CORE 0x00000080 NT_SIGINFO (siginfo_t data)
CORE 0x00000150 NT_AUXV (auxiliary vector)
CORE 0x00000f6e NT_FILE (mapped files)
Page size: 4096
Start End Page Offset
0x000000560ca89000 0x000000560ca8b000 0x0000000000000000
/system/bin/coredump-test-bin
0x000000560ca8b000 0x000000560ca8e000 0x0000000000000002
/system/bin/coredump-test-bin
...
CORE 0x00000210 NT_FPREGSET (floating point registers)
LINUX 0x00000010 NT_ARM_TLS (AArch TLS registers)
description data: 00 10 e4 45 7e 00 00 00 00 00 00 00 00 00 00 00
LINUX 0x00000108 NT_ARM_HW_BREAK (AArch hardware breakpoint registers)
description data: 06 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
LINUX 0x00000108 NT_ARM_HW_WATCH (AArch hardware watchpoint registers)
description data: 04 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
LINUX 0x00000004 Unknown note type: (0x00000404)
description data: ff ff ff ff
LINUX 0x00000010 Unknown note type: (0x00000406)
description data: 00 00 00 00 80 ff 7f 00 00 00 00 00 80 ff 7f 00
LINUX 0x00000008 Unknown note type: (0x0000040a)
description data: 0f 00 00 00 00 00 00 00
LINUX 0x00000008 Unknown note type: (0x00000409)
description data: 01 00 00 00 00 00 00 00
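A core file mainly consists of the ELF Header, the Program Headers, and the NOTE segment.
ELF Header: records the basic information and layout of the core file (here: type CORE, machine AArch64, 138 program headers, no section headers).
Program Headers: one NOTE entry plus one LOAD entry per dumped memory region, each recording the region's virtual address, its size in the file and in memory, and its permissions (R/W/E).
NOTE segment: records the process state at the moment of the crash: registers, signal information, the auxiliary vector, and the list of mapped files. With this information gdb can rebuild the memory layout at crash time, which helps developers analyze the cause of the crash and pinpoint the problem.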
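As a quick sanity check before handing a file to gdb, the ELF Header fields described here can be read with a few lines of Python. This is a minimal sketch that assumes a little-endian ELF64 file such as the AArch64 core shown in the listing; the function name parse_elf64_header is hypothetical.

```python
import struct

ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # little-endian Elf64_Ehdr, 64 bytes
ET_CORE = 4        # e_type value for core files
EM_AARCH64 = 183   # e_machine value for AArch64


def parse_elf64_header(data):
    """Decode the first 64 bytes of a little-endian ELF64 file into a dict."""
    (ident, e_type, e_machine, _version, _entry, e_phoff, _shoff, _flags,
     _ehsize, e_phentsize, e_phnum, _shentsize, e_shnum, _shstrndx) = \
        ELF64_HEADER.unpack(data[:ELF64_HEADER.size])
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    return {
        "is_core": e_type == ET_CORE,
        "machine": e_machine,
        "phoff": e_phoff,
        "phentsize": e_phentsize,
        "phnum": e_phnum,
        "shnum": e_shnum,
    }
```

A typical call would be parse_elf64_header(open(path, "rb").read(64)), where path points at the captured core file. For the file in the listing, the result should report a core file with 138 program headers of 56 bytes each, starting at offset 64.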
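V. Demo
1) Demo program
After the process crashes, capture both the tombstone and the core file.
2) The generated tombstone file
The tombstone shows only the approximate cause; it cannot pinpoint the root cause or the line of code that crashed the process. A core file captured through coredump is needed to locate this kind of problem precisely.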
Cmdline: ../../system/bin/coredump-test-bin use-after-free
pid: 11966, tid: 11966, name: coredump-test-b >>> ../../system/bin/coredump-test-bin <<<
uid: 0
...
backtrace:
#01 pc 0000000000090088 /system/lib64/libc.so (__vfprintf+10416) (BuildId: 567e41669f1cb528e72fe319cd09033b)
#02 pc 00000000000ac06c /system/lib64/libc.so (vsnprintf+192) (BuildId: 567e41669f1cb528e72fe319cd09033b)
#03 pc 0000000000006afc /system/lib64/liblog.so (__android_log_print+184) (BuildId: 87ba6a9314f00fab650fb8fad7913d58)
#04 pc 00000000000010a4 /system/bin/coredump-test-bin (main+80) (BuildId: c97bade065c198c12dcca74f107c513c)
#05 pc 0000000000048768 /system/lib64/libc.so (__libc_init+96) (BuildId: 567e41669f1cb
...
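3) The generated core file
Enable coredump and capture the core file. The core file is in ELF format and can be debugged with gdb.
Debug the demo program together with the generated core file by running gdb ./coredump-test-bin ./core-coredump-test-bin-11966-1720526041. gdb pinpoints the exact source line that failed, as follows: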
--->
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000040053c in square (a=1, b=2) at test.c:7
7 *p = 666; # the fault is at line 7 of test.c
# (gdb) backtrace // enter backtrace
--->
#0 0x000000000040053c in square (a=1, b=2) at test.c:7 // the fault is at line 7 of test.c
#1 0x0000000000400564 in doCalc (num1=1, num2=2) at test.c:14
#2 0x0000000000400591 in main () at test.c:22
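When many core files have to be triaged, the interactive session above can be scripted. The following Python sketch builds and runs a non-interactive gdb invocation using gdb's --batch and -ex options. The helper names are hypothetical, and the sketch assumes that a gdb able to read the target architecture is on PATH.

```python
import shutil
import subprocess


def gdb_backtrace_command(binary, core_file, gdb="gdb"):
    """Return the argv for a batch-mode gdb run that prints the backtrace."""
    return [gdb, "--batch", "-ex", "bt", binary, core_file]


def run_gdb_backtrace(binary, core_file, gdb="gdb"):
    """Run gdb in batch mode and return whatever it printed on stdout."""
    if shutil.which(gdb) is None:
        raise FileNotFoundError(f"{gdb} not found on PATH")
    result = subprocess.run(gdb_backtrace_command(binary, core_file, gdb),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

For the demo this would be called as run_gdb_backtrace("./coredump-test-bin", "./core-coredump-test-bin-11966-1720526041").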
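VI. Risks and mitigations
Enabling coredump carries the following risk:
1) If a native process in the system crashes and restarts repeatedly, which is common during development, core files are generated continuously and quickly fill the disk.
Mitigation: use the quota mechanism. Assign a project_id to the core file storage path and set a quota threshold (an upper bound on storage); once the threshold is exceeded, the oldest files are overwritten automatically.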
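The mitigation relies on filesystem project quotas, which are configured outside the application. As a stand-in, the oldest-first replacement policy can be sketched in user space with Python. This is a different technique from quota and not the article's implementation; the function name prune_core_files and the core- file name prefix are assumptions.

```python
import os


def prune_core_files(directory, max_total_bytes, prefix="core-"):
    """Delete the oldest core files until their total size fits max_total_bytes.

    Returns the names of the deleted files, oldest first.
    """
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.startswith(prefix) and os.path.isfile(path):
            st = os.stat(path)
            entries.append((st.st_mtime, name, st.st_size))
    entries.sort()                      # oldest modification time first
    total = sum(size for _, _, size in entries)
    deleted = []
    for _, name, size in entries:
        if total <= max_total_bytes:
            break
        os.remove(os.path.join(directory, name))
        total -= size
        deleted.append(name)
    return deleted
```

A periodic job could call prune_core_files("/data/corefile", limit_in_bytes) to keep the directory under a chosen size, at the cost of a window in which a burst of crashes can still exceed it.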