CoreDump: Usage and Implementation Principles

I. Background

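When a native crash occurs, a tombstone does not carry enough information to pinpoint hard problems such as invalid memory accesses or memory corruption (memory stomping). A coredump provides the additional detail needed to analyze them.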

II. Introduction to coredump

2.1 What is a coredump

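When a user program hits an exception at run time and exits abnormally, the Linux kernel saves the program's current memory state (run-time memory, register state, stack pointer, function call stacks, and so on) in a core file. This process is called a coredump.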

2.2 What coredump is for

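coredump is mainly used to solve NE (native exception) problems. When a user process suffers a native crash, the tombstone captures only a brief backtrace. For hard problems such as invalid memory accesses or memory corruption, that is not enough to pinpoint the cause, and a coredump is needed to analyze them.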

2.3 What triggers a coredump

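In terms of exception type, a coredump is triggered when a native process performs an illegal operation such as an out-of-bounds memory access, a stack overflow, or a dereference of an invalid pointer.

In terms of signal type, a coredump is triggered when a native process receives a signal whose default action is to terminate the process and dump core, for example SIGQUIT, SIGABRT, SIGSEGV, or SIGTRAP, and the process has not caught or ignored that signal.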
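The signal-based rule above can be sketched as a small lookup. This is only an illustration, assuming the default dispositions documented for Linux in signal(7); the function name default_action_dumps_core is hypothetical and not part of any library.

```c
#include <signal.h>
#include <stdbool.h>

/* Returns true if the default disposition of `sig` on Linux is
 * "terminate the process and dump core" (list taken from signal(7)).
 * A process that installs its own handler or ignores the signal
 * changes this behavior. */
bool default_action_dumps_core(int sig)
{
    switch (sig) {
    case SIGQUIT:
    case SIGILL:
    case SIGTRAP:
    case SIGABRT:
    case SIGBUS:
    case SIGFPE:
    case SIGSEGV:
    case SIGXCPU:
    case SIGXFSZ:
    case SIGSYS:
        return true;
    default:
        return false;
    }
}
```

Signals such as SIGTERM or SIGKILL terminate the process without producing a core file, so the function returns false for them.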

III. How to use coredump

coredump is disabled by default on Android and has to be enabled manually.

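1. Turn on the coredump switch

1) Check whether coredump is enabled
    ulimit -c  # a result of 0 means coredump is disabled
2) Enable coredump (the limit applies to the current shell and to processes started from it)
   ulimit -c 1024  # cap core files at 1024 blocks; the block size depends on the shell (typically 512 or 1024 bytes), so this is not 1024 bytes
   or
   ulimit -c unlimited  # no size cap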
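The same switch can be flipped from inside a program through the RLIMIT_CORE resource limit, which is what ulimit -c reads and writes. The following is a minimal sketch, assuming a Linux or Android libc that provides getrlimit and setrlimit; the helper names enable_core_dumps and core_dumps_disabled are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <sys/resource.h>

/* Raise the soft RLIMIT_CORE limit to the hard limit for the calling
 * process (children created afterwards inherit it).
 * Returns 0 on success, -1 on failure with errno set. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    /* An unprivileged process may not go above the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}

/* Returns 1 if the soft limit is 0 (the case where "ulimit -c" prints 0),
 * 0 if core dumps are allowed by the limit, -1 on error. */
int core_dumps_disabled(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    return lim.rlim_cur == 0;
}
```

If the hard limit itself is 0, enable_core_dumps succeeds but core dumps stay disabled; raising the hard limit requires privilege.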

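2. Set the path where core files are generated

# If no pattern is set, the kernel default is "core": the file is written to the crashing process's current working directory, not next to the executable.
# %e = executable name, %p = PID, %t = time of the dump in seconds since the Epoch. The target directory must already exist and be writable.
echo "/data/corefile/core-%e-%p-%t" > /proc/sys/kernel/core_pattern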
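To make the pattern concrete, here is a user-space sketch of how the three specifiers used above expand into a file name. It is not the kernel's implementation (that is format_corename, shown later in this article); the function name expand_core_pattern is hypothetical, and unlike the kernel this sketch rejects specifiers it does not know.

```c
#include <stdio.h>
#include <string.h>

/* Expand %e (executable name), %p (PID), %t (seconds since the Epoch)
 * and %% in `pattern`, writing the result to `out` (capacity `cap`).
 * Returns 0 on success, -1 if the result does not fit or the pattern
 * contains a specifier this sketch does not handle. */
int expand_core_pattern(const char *pattern, const char *exe, long pid,
                        long long epoch_secs, char *out, size_t cap)
{
    size_t len = 0;

    if (cap == 0)
        return -1;
    for (const char *p = pattern; *p != '\0'; p++) {
        char piece[64];
        const char *src = piece;

        if (*p != '%') {
            piece[0] = *p;
            piece[1] = '\0';
        } else {
            p++;
            switch (*p) {
            case 'e': src = exe; break;
            case 'p': snprintf(piece, sizeof piece, "%ld", pid); break;
            case 't': snprintf(piece, sizeof piece, "%lld", epoch_secs); break;
            case '%': piece[0] = '%'; piece[1] = '\0'; break;
            default:  return -1;  /* unknown specifier or trailing '%' */
            }
        }

        size_t n = strlen(src);
        if (len + n >= cap)       /* keep room for the terminating NUL */
            return -1;
        memcpy(out + len, src, n);
        len += n;
    }
    out[len] = '\0';
    return 0;
}
```

Calling it with the pattern above, an executable name, a PID, and a timestamp yields a path of the form /data/corefile/core-NAME-PID-TIME.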

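3. When a process exits abnormally, a core file (in ELF format) is generated under the configured path and can be analyzed with gdb

1) Put the executable (ideally an unstripped build with debug symbols) and the core file in the same directory

2) Run gdb {binary_name} {core_file_name} to load the core file and locate the problem

See the Demo case in Section V for details.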

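IV. How coredump is implemented

4.1 Basic principle

When a user program hits certain errors or exceptions, the Linux kernel catches the exception and sends a signal to the process. The pending signal is handled just before the process returns to user space; if its action is to dump core, the kernel runs its coredump routine, generates a core file in ELF format, and saves it to the configured path.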
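The flow described above can be observed from user space: a parent process can tell from the wait status which signal killed its child and whether the kernel reports that a core file was written. The following is a minimal sketch under the assumption of a Linux or Android libc; the struct crash_result and the function crash_child_with are hypothetical names introduced for illustration.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct crash_result {
    int  term_signal;  /* signal that killed the child, 0 if it exited normally */
    bool core_dumped;  /* true if the wait status reports a core dump */
};

/* Fork a child that restores the default disposition of `sig`, unblocks
 * it, and raises it. Returns 0 and fills `res` on success, -1 if fork
 * or waitpid failed. */
int crash_child_with(int sig, struct crash_result *res)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        sigset_t set;

        signal(sig, SIG_DFL);
        sigemptyset(&set);
        sigaddset(&set, sig);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
        raise(sig);
        _exit(0);  /* reached only if the signal did not terminate the child */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    res->term_signal = WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    res->core_dumped = false;
#ifdef WCOREDUMP
    if (WIFSIGNALED(status) && WCOREDUMP(status))
        res->core_dumped = true;
#endif
    return 0;
}
```

Whether core_dumped ends up true depends on the RLIMIT_CORE limit and on core_pattern at the time of the crash, so the sketch reports it rather than assuming it.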

4.2 Core code

The kernel calls do_coredump to generate the core file:

void do_coredump(const kernel_siginfo_t *siginfo)
{
        ......
        
        binfmt = mm->binfmt;
        if (!binfmt || !binfmt->core_dump)
                goto fail;
        if (!__get_dumpable(cprm.mm_flags))
                goto fail;
        ......

        // 1. Build the core file name
        ispipe = format_corename(&cn, &cprm, &argv, &argc);

        ......
              
        // 2. Create the core file
        cprm.file = file_open_root(&root, cn.corename, open_flags, 0600);
        ......
        
        // 3. Write the process's memory information to the core file
        core_dumped = binfmt->core_dump(&cprm);
       ......
}

elf_core_dump writes the process's memory state into an ELF-format core file for later debugging and analysis with gdb:

// kernel_platform/msm-kernel/fs/binfmt_elf.c

static int elf_core_dump(struct coredump_params *cprm)
{
        ......

        /*
         * Collect all the non-memory information about the process for the
         * notes.  This also sets up the file header.
         */
        // 1. Fill in the ELF header and collect the notes information
        if (!fill_note_info(&elf, e_phnum, &info, cprm))
                goto end_coredump;

        has_dumped = 1;
        // 2. Compute the sizes of the ELF header, program headers and notes, and allocate the corresponding memory
        offset += sizeof(elf);                                /* Elf header */
        offset += segs * sizeof(struct elf_phdr);        /* Program headers */

       ......

        /* Write program headers for segments dump */
        for (i = 0; i < cprm->vma_count; i++) {
                struct core_vma_metadata *meta = cprm->vma_meta + i;
                struct elf_phdr phdr;

                phdr.p_type = PT_LOAD;
                phdr.p_offset = offset;
                phdr.p_vaddr = meta->start;
                phdr.p_paddr = 0;
                phdr.p_filesz = meta->dump_size;
                phdr.p_memsz = meta->end - meta->start;
                offset += phdr.p_filesz;
                phdr.p_flags = 0;
                if (meta->flags & VM_READ)
                        phdr.p_flags |= PF_R;
                if (meta->flags & VM_WRITE)
                        phdr.p_flags |= PF_W;
                if (meta->flags & VM_EXEC)
                        phdr.p_flags |= PF_X;
                phdr.p_align = ELF_EXEC_PAGESIZE;

                if (!dump_emit(cprm, &phdr, sizeof(phdr)))
                        goto end_coredump;
        }
        // 3. Write any architecture-specific extra program headers (the PT_LOAD headers were written by the loop above)
        if (!elf_core_write_extra_phdrs(cprm, offset))
                goto end_coredump;

         /* write out the notes section */
         // 4. Write the notes information
        if (!write_note_info(&info, cprm))
                goto end_coredump;

        /* For cell spufs */
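        // 5. Write architecture-specific extra notes; the memory segments themselves are written by the dump_user_range loop below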
        if (elf_coredump_extra_notes_write(cprm))
                goto end_coredump;

        /* Align to page */
        dump_skip_to(cprm, dataoff);

        for (i = 0; i < cprm->vma_count; i++) {
                struct core_vma_metadata *meta = cprm->vma_meta + i;

                if (!dump_user_range(cprm, meta->start, meta->dump_size))
                        goto end_coredump;
        }
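        // 6. Write architecture-specific extra data, then the extended-numbering section header when e_phnum == PN_XNUM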
        if (!elf_core_write_extra_data(cprm))
                goto end_coredump;

        if (e_phnum == PN_XNUM) {
                if (!dump_emit(cprm, shdr4extnum, sizeof(*shdr4extnum)))
                        goto end_coredump;
        }

end_coredump:
        free_note_info(&info);
        kfree(shdr4extnum);
        kfree(phdr4note);
        return has_dumped;
}
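The offset bookkeeping in the PT_LOAD loop above (p_offset = offset; offset += p_filesz) can be isolated in a few lines. This is a user-space sketch of that arithmetic only, not kernel code; the function name layout_segments is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assign file offsets to consecutive segments, starting at `data_off`.
 * Segment i is placed at p_offset[i] and occupies dump_size[i] bytes in
 * the file, so a segment with dump_size 0 takes no file space and shares
 * its offset with the next one.
 * Returns the offset just past the last segment. */
uint64_t layout_segments(uint64_t data_off, const uint64_t *dump_size,
                         uint64_t *p_offset, size_t count)
{
    uint64_t offset = data_off;

    for (size_t i = 0; i < count; i++) {
        p_offset[i] = offset;
        offset += dump_size[i];
    }
    return offset;
}
```

This explains why several LOAD entries in a core file can show the same Offset value: segments whose FileSiz is 0 (memory that was not dumped) do not advance the offset.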

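4.3 Code sequence

The sequence covering exception capture, signal handling, and core file generation is as follows: the kernel catches the exception and sends a signal to the process; before returning to user space the process handles the pending signal (get_signal in the kernel) and, when the action is to dump core, calls do_coredump; do_coredump builds the file name with format_corename, creates the file with file_open_root, and calls binfmt->core_dump (elf_core_dump for ELF binaries) to write the contents.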

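4.4 Core file format and contents

The core file captured by coredump is in ELF format and can be loaded in gdb to locate and analyze the problem.

The contents of a core file look like this (ELF header, program headers, and notes, in the layout printed by readelf):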

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         138
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0
  
  Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000001e70 0x0000000000000000 0x0000000000000000
                 0x00000000000018a8 0x0000000000000000         0x0
  LOAD           0x0000000000004000 0x000000560ca89000 0x0000000000000000
                 0x0000000000000000 0x0000000000002000  R      0x1000
  LOAD           0x0000000000004000 0x000000560ca8b000 0x0000000000000000
                 0x0000000000000000 0x0000000000003000  R E    0x1000
  LOAD           0x0000000000004000 0x000000560ca8e000 0x0000000000000000
                 0x0000000000001000 0x0000000000001000  R      0x1000
...


Displaying notes found at file offset 0x00001e70 with length 0x000018a8:
  Owner                Data size         Description
  CORE                 0x00000188        NT_PRSTATUS (prstatus structure)
  CORE                 0x00000088        NT_PRPSINFO (prpsinfo structure)
  CORE                 0x00000080        NT_SIGINFO (siginfo_t data)
  CORE                 0x00000150        NT_AUXV (auxiliary vector)
  CORE                 0x00000f6e        NT_FILE (mapped files)
    Page size: 4096
                 Start                 End         Page Offset
    0x000000560ca89000  0x000000560ca8b000  0x0000000000000000
        /system/bin/coredump-test-bin
    0x000000560ca8b000  0x000000560ca8e000  0x0000000000000002
        /system/bin/coredump-test-bin
...
  CORE                 0x00000210        NT_FPREGSET (floating point registers)
  LINUX                0x00000010        NT_ARM_TLS (AArch TLS registers)
   description data: 00 10 e4 45 7e 00 00 00 00 00 00 00 00 00 00 00 
  LINUX                0x00000108        NT_ARM_HW_BREAK (AArch hardware breakpoint registers)
   description data: 06 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
  LINUX                0x00000108        NT_ARM_HW_WATCH (AArch hardware watchpoint registers)
   description data: 04 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
  LINUX                0x00000004        Unknown note type: (0x00000404)
   description data: ff ff ff ff 
  LINUX                0x00000010        Unknown note type: (0x00000406)
   description data: 00 00 00 00 80 ff 7f 00 00 00 00 00 80 ff 7f 00 
  LINUX                0x00000008        Unknown note type: (0x0000040a)
   description data: 0f 00 00 00 00 00 00 00 
  LINUX                0x00000008        Unknown note type: (0x00000409)
   description data: 01 00 00 00 00 00 00 00

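A core file consists mainly of the ELF Header, the Program Headers, and the NOTE segment.

ELF Header: records the basic information and layout of the core file, such as its type (CORE), the target machine, and the number of program headers.

Program Headers: describe each dumped memory segment of the process (virtual address, size in memory and in the file, file offset) together with its permissions and attributes.

NOTE segment: records the process status at the moment of the crash, the registers, signal information, the auxiliary vector, and details of the mapped files. With this information gdb can rebuild the memory layout at crash time, analyze the cause of the crash, and help developers pinpoint the problem.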
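As a small illustration of the ELF Header fields described above, the following sketch checks whether a buffer starts with a 64-bit ELF header of type CORE. It assumes a Linux or Android toolchain that ships elf.h and a host with the same byte order as the file; the function name is_elf64_core is hypothetical.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `buf` begins with an ELF64 header whose type is
 * ET_CORE, i.e. what readelf prints as "Type: CORE (Core file)".
 * Assumes the file's byte order matches the host's. */
bool is_elf64_core(const void *buf, size_t len)
{
    Elf64_Ehdr hdr;

    if (len < sizeof hdr)
        return false;
    memcpy(&hdr, buf, sizeof hdr);  /* copy to avoid unaligned access */

    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return false;               /* missing 7f 45 4c 46 magic */
    if (hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return false;
    return hdr.e_type == ET_CORE;
}
```

A caller would typically read the first sizeof(Elf64_Ehdr) bytes of the file (64 bytes, matching "Size of this header" in the listing) and pass them in.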

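V. Demo case

1) Demo program

After the process crashes, capture both the tombstone and the core file.

2) The generated tombstone file

The tombstone only shows the approximate cause; it cannot pinpoint the root cause or the line of code that crashed the process. coredump is therefore needed to capture a core file and locate this kind of problem precisely.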

Cmdline: ../../system/bin/coredump-test-bin use-after-free
pid: 11966, tid: 11966, name: coredump-test-b  >>> ../../system/bin/coredump-test-bin <<<
uid: 0
...
backtrace:
      #01 pc 0000000000090088  /system/lib64/libc.so (__vfprintf+10416) (BuildId: 567e41669f1cb528e72fe319cd09033b)
      #02 pc 00000000000ac06c  /system/lib64/libc.so (vsnprintf+192) (BuildId: 567e41669f1cb528e72fe319cd09033b)
      #03 pc 0000000000006afc  /system/lib64/liblog.so (__android_log_print+184) (BuildId: 87ba6a9314f00fab650fb8fad7913d58)
      #04 pc 00000000000010a4  /system/bin/coredump-test-bin (main+80) (BuildId: c97bade065c198c12dcca74f107c513c)
      #05 pc 0000000000048768  /system/lib64/libc.so (__libc_init+96) (BuildId: 567e41669f1cb
...

3) The generated core file

Enable coredump and capture the core file. The core file is in ELF format and can be debugged with gdb.

Debug the Demo program together with the generated core file by running gdb ./coredump-test-bin ./core-coredump-test-bin-11966-1720526041. gdb can then pinpoint the exact source line that failed, in output of the following form:

  ...
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x000000000040053c in square (a=1, b=2) at test.c:7
    7               *p = 666;    // the fault is at line 7 of test.c
    (gdb) backtrace
    #0  0x000000000040053c in square (a=1, b=2) at test.c:7    // the fault is at line 7 of test.c
    #1  0x0000000000400564 in doCalc (num1=1, num2=2) at test.c:14
    #2  0x0000000000400591 in main () at test.c:22

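VI. Risks and solutions

Enabling coredump carries the following risk:

1) If a native process keeps crashing and restarting, which is very common during development, core files are produced continuously and the disk fills up quickly.

Solution: use the quota mechanism. Assign a project_id to the storage under the core file path and set a quota threshold (an upper limit on storage); once the threshold is exceeded, the oldest files are overwritten automatically.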
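The "overwrite the oldest files once the threshold is exceeded" policy can be sketched as a pure calculation. This is not the filesystem quota mechanism itself, only an illustration of the eviction rule, assuming the caller has already listed the existing core files sorted oldest first; the function name files_to_evict is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Given the sizes (in bytes) of the existing core files, sorted oldest
 * first, return how many of the oldest files must be removed so that
 * the remaining files plus an incoming file of `incoming` bytes fit
 * within `cap` bytes. If even removing every file is not enough, the
 * function returns `count`. */
size_t files_to_evict(const uint64_t *sizes_oldest_first, size_t count,
                      uint64_t incoming, uint64_t cap)
{
    uint64_t total = incoming;
    size_t evict = 0;

    for (size_t i = 0; i < count; i++)
        total += sizes_oldest_first[i];

    while (evict < count && total > cap) {
        total -= sizes_oldest_first[evict];
        evict++;
    }
    return evict;
}
```

A cleanup job could call this before each dump and delete the first N entries of its sorted listing, where N is the returned value.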
