Server Kernel Crash Analysis with Kdump

1. Log in to the failed server and collect the vmcore file

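  • The failed server must have the kdump service installed, and kdump must have been running normally before the failure.
  • Each kernel crash creates a directory under /var/crash named 127.0.0.1-$(date +"%Y-%m-%d-%H:%M:%S") (the suffix is the time the directory was created), containing the vmcore and vmcore-dmesg.txt files.
# vmcore-dmesg.txt 	the kernel log at the time of the crash, i.e. the same information dmesg normally shows;
# vmcore 			the core dump of the operating system collected by kdump. It is essentially an image of physical memory (kdump can be configured to leave out some pages, in which case crash labels the file a partial dump), so it holds the most complete information and is a great help when analyzing why a server restarted.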
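Before copying a dump off the server, it helps to confirm which directory holds the newest vmcore. The following is a minimal sketch that relies only on the directory naming described above; latest_crash_dir is a hypothetical helper name, not part of kdump.

```shell
#!/bin/sh
# Print the newest dump directory under a crash root (default: /var/crash).
# Directory names end in a YYYY-MM-DD-HH:MM:SS timestamp, so for dumps that
# share the same address prefix the sorted glob order is also chronological.
latest_crash_dir() {
    crash_root="${1:-/var/crash}"
    newest=""
    for dir in "$crash_root"/*/; do
        # Skip directories that do not actually contain a vmcore.
        [ -f "${dir}vmcore" ] || continue
        newest="${dir%/}"
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Intended use on the failed server (not executed here):
#   systemctl is-active kdump   # confirm the service is running again
#   latest_crash_dir            # e.g. /var/crash/127.0.0.1-2022-06-09-19:15:16
```

The function returns a non-zero status when no directory contains a vmcore, which usually means kdump was not active when the crash happened.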

2. Prepare the crash tool and vmlinux

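  • crash is the analysis tool for kernel crash dump files. To avoid affecting production servers, I copy the collected vmcore file to a test environment every time and analyze it there.
1. Install the crash tool
# yum install crash
2. Install vmlinux (the kernel debuginfo packages). The kernel version requirement is strict: it must match the kernel of the failed server. The uname -r substitution below expands to the kernel of the machine running the command, so if the test machine runs a different kernel, substitute the failed server's release by hand.
# wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-`uname -r`.rpm
# wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-`uname -r`.rpm
# rpm -ivh kernel-debuginfo-common-x86_64-*.rpm kernel-debuginfo-[0-9]*.rpm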
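Because the debuginfo packages must match the failed server rather than the analysis host, the release can be read from the dump itself. The sketch below assumes that vmcore-dmesg.txt contains the usual "Linux version ..." boot banner; both function names are hypothetical, and the URL base is the one used in the commands above.

```shell
#!/bin/sh
# Extract the kernel release from the "Linux version ..." banner of a
# saved dmesg file such as vmcore-dmesg.txt.
kernel_release_from_dmesg() {
    sed -n 's/^.*Linux version \([^ ]*\) .*$/\1/p' "$1" | head -n 1
}

# Print the two debuginfo package URLs for a given kernel release.
debuginfo_urls() {
    base="http://debuginfo.centos.org/7/x86_64"
    printf '%s/kernel-debuginfo-common-x86_64-%s.rpm\n' "$base" "$1"
    printf '%s/kernel-debuginfo-%s.rpm\n' "$base" "$1"
}

# Intended use on the analysis host (not executed here):
#   release=$(kernel_release_from_dmesg vmcore-dmesg.txt)
#   debuginfo_urls "$release" | xargs -n 1 wget
```

Building the URLs from the dump's own banner avoids the mismatch that uname -r introduces when the test machine runs a different kernel.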

3. Analyze the vmcore file

  • Take the failure of one server (IP address omitted) on June 9, 2022 as an example:
[root@nginx ~]# cd /var/crash/127.0.0.1-2022-06-09-19\:15\:16/
[root@nginx ~]# crash /lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux vmcore
crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 192crash: seek error: kernel virtual address: ffffffffffffffff  type: "cpu_online_map"

        DATE: Thu Jun  9 19:15:10 2022
      UPTIME: 23 days, 09:19:34
LOAD AVERAGE: 3.94, 3.96, 4.00
       TASKS: 2332
    NODENAME: XXXdamaXXX
     RELEASE: 3.10.0-514.el7.x86_64
     VERSION: #1 SMP Wed Oct 19 11:24:13 EDT 2016
     MACHINE: x86_64  (2399 Mhz)
      MEMORY: 63.9 GB
       PANIC: "kernel BUG at fs/xfs/xfs_aops.c:1062!"
         PID: 505
     COMMAND: "kworker/u385:9"
        TASK: ffff88085a04af10  [THREAD_INFO: ffff88085a068000]
         CPU: 14
       STATE: TASK_RUNNING (PANIC)

crash> 
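The source file and line number in the PANIC string are handy search keywords. The following sketch pulls them out of the string; panic_location is a hypothetical helper, and the pattern only covers the "kernel BUG at file:line!" form seen in the output above, not other panic messages.

```shell
#!/bin/sh
# Print "<source file> <line>" for a "kernel BUG at <file>:<line>!" message.
# Prints nothing if the message has a different form.
panic_location() {
    printf '%s\n' "$1" |
        sed -n 's/^.*kernel BUG at \([^:]*\):\([0-9]*\)!.*$/\1 \2/p'
}

# Example call with the PANIC string from the summary above:
panic_location 'kernel BUG at fs/xfs/xfs_aops.c:1062!'
```

The file and line can then be looked up in the kernel source tree of the matching release, or used together with the kernel version when searching for known bugs.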
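  • The PANIC field above, combined with a web search, is usually enough to work out roughly why the server restarted unexpectedly.

  • If that still does not reveal the cause, the commonly used commands below can take the analysis further (they require some knowledge of assembly, which the author admittedly lacks).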

    1. log command

      The log command shows the kernel dmesg log.
      crash> log
      [    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      [    0.000000] Initializing cgroup subsys cpuacct
      [    0.000000] Linux version 3.10.0-514.el7.x86_64 (mockbuild@x86-039.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
      [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-514.el7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet LANG=en_US.UTF-8
      ............
      
    2. bt command

      Running bt with the PID field shown above prints the stack of the faulting process:
      crash> bt 505
      PID: 505    TASK: ffff88085a04af10  CPU: 14  COMMAND: "kworker/u385:9"
      bt: seek error: kernel virtual address: ffffffffffffffff  type: "cpu_online_map"
       #0 [ffff88085a06b5f0] machine_kexec at ffffffff81059cdb
       #1 [ffff88085a06b650] __crash_kexec at ffffffff81105182
       #2 [ffff88085a06b720] crash_kexec at ffffffff81105270
       #3 [ffff88085a06b738] oops_end at ffffffff8168ed88
       #4 [ffff88085a06b760] die at ffffffff8102e93b
       #5 [ffff88085a06b790] end_repeat_nmi at ffffffff8168e440
       #6 [ffff88085a06b7e0] do_invalid_op at ffffffff8102b144
       #7 [ffff88085a06b890] trace_irq_work_interrupt at ffffffff81697d5e
       #8 [ffff88085a06b9f8] __writepage at ffffffff8118b3b3
       #9 [ffff88085a06ba10] write_cache_pages at ffffffff8118bed1
      #10 [ffff88085a06bb28] generic_writepages at ffffffff8118c19d
      #11 [ffff88085a06bb88] xfs_vm_writepages at ffffffffa032a063 [xfs]
      #12 [ffff88085a06bbb8] do_writepages at ffffffff8118d24e
      #13 [ffff88085a06bbc8] __writeback_single_inode at ffffffff81228730
      #14 [ffff88085a06bc08] writeback_sb_inodes at ffffffff8122941e
      #15 [ffff88085a06bcb0] __writeback_inodes_wb at ffffffff8122967f
      #16 [ffff88085a06bcf8] wb_writeback at ffffffff81229ec3
      #17 [ffff88085a06bd70] bdi_writeback_workfn at ffffffff8122bebb
      #18 [ffff88085a06be20] process_one_work at ffffffff810a7f3b
      #19 [ffff88085a06be68] worker_thread at ffffffff810a8d76
      #20 [ffff88085a06bec8] kthread at ffffffff810b052f
      #21 [ffff88085a06bf50] update_deref_fetch_param at ffffffff81696418
      crash> 
      
    3. The dis and sym commands are generally used together with bt; see the official documentation for details.

    4. ps command

      The ps command shows task state. Use it as a reference for the threads' resource usage just before the restart.
      crash> ps
         PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      >     0      0   0  ffffffff819c1460  RU   0.0       0      0  [swapper/0]
      >     0      0   1  ffff8808fcd88000  RU   0.0       0      0  [swapper/1]
      >     0      0   2  ffff88017caeedd0  RU   0.0       0      0  [swapper/2]
      >     0      0   3  ffff8808fcd8edd0  RU   0.0       0      0  [swapper/3]
      >     0      0   4  ffff88017caeaf10  RU   0.0       0      0  [swapper/4]
      >     0      0   5  ffff8808fcd88fb0  RU   0.0       0      0  [swapper/5]
            0      0   6  ffff88017caede20  RU   0.0       0      0  [swapper/6]
      >     0      0   7  ffff8808fcd8de20  RU   0.0       0      0  [swapper/7]
      >     0      0   8  ffff88017caebec0  RU   0.0       0      0  [swapper/8]
      >     0      0   9  ffff8808fcd89f60  RU   0.0       0      0  [swapper/9]
      
    5. The remaining crash commands can be listed with help.
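The commands in this list can also be run without an interactive session. The following is a minimal sketch, assuming crash's -s (silent startup) and -i (read commands from a file) options behave as documented in its man page; the function names and the /tmp paths are hypothetical. The second helper filters saved bt output for frames that crash tags with a module name, such as [xfs] in the trace above.

```shell
#!/bin/sh
# Write a crash command file covering sys, log, bt <pid> and ps.
# $1 = command file to create, $2 = PID taken from the crash summary.
write_crash_cmds() {
    {
        echo "sys"
        echo "log"
        echo "bt $2"
        echo "ps"
        echo "exit"
    } > "$1"
}

# Read bt output on stdin and print "function [module]" for every stack
# frame whose last field is a bracketed module name.
module_frames() {
    awk '$1 ~ /^#[0-9]+$/ && NF >= 6 && $NF ~ /^\[.+\]$/ { print $3, $NF }'
}

# Intended use on the analysis host (not executed here):
#   write_crash_cmds /tmp/triage.cmds 505
#   crash -s -i /tmp/triage.cmds \
#       /lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux vmcore > /tmp/triage.txt
#   module_frames < /tmp/triage.txt
```

Saving the output to a file makes it easy to attach to a ticket or to compare against a later crash of the same server.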

References

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_monitoring_and_updating_the_kernel/analyzing-a-core-dump_managing-monitoring-and-updating-the-kernel