Section #3. Kexec and Kdump

Now that you have learned how to use Kprobes, let’s continue and look at more facets of Linux RAS. Kexec and kdump are serviceability features introduced in the 2.6 kernel.

Kexec uses the image overlay philosophy of the UNIX exec() system call to spawn a new kernel over a running kernel without the overhead of boot firmware. This can save several seconds of reboot time because boot firmware spends cycles walking buses and recognizing devices. The lower the reboot latency, the shorter the system downtime, so this was one of the main motivations for developing kexec. However, kexec’s most popular user is kdump. Capturing a dump after a kernel crash is inherently unreliable because kernel code that accesses the dump device might be in an unstable state. Kdump circumvents this problem by collecting the dump after booting into a healthy kernel via kexec.

Kexec

Before you can kexec a kernel, you need to do some preparations:

  1. Compile and boot into a kernel that has kexec support. For this, turn on CONFIG_KEXEC (Processor Type and Features → Kexec System Call) in the kernel configuration menu. This kernel is called the first kernel or the running kernel.
  2. Prepare the kernel that is to be kexec-ed. This second kernel can be the same as the first kernel.
  3. Download the kexec-tools package source tar ball from www.kernel.org/pub/linux/kernel/people/horms/kexec-tools/kexec-tools-testing.tar.gz. Build it to produce the user-space tool called kexec.

The kexec tool built in Step 3 is invoked in two stages. The first stage loads the second kernel image into the buffers of the running kernel, whereas the second stage actually overlays the running kernel:

  1. Load the second (overlay) kernel using the kexec command:

bash> kexec -l /path/to/kernelsources/arch/x86/boot/bzImage \
      --append="root=/dev/hdaX" --initrd=/boot/myinitrd.img

bzImage is the second kernel, hdaX is the root device, and myinitrd.img is the initial root filesystem. The kernel implementation of this stage is mostly architecture-independent. At the heart of this stage is the sys_kexec() system call. The kexec command loads the new kernel image into the running kernel’s buffers using the services of this system call.

  2. Boot into the second kernel:
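
The command that performs this step was lost from this copy; with the kexec-tools utility built earlier, executing the previously loaded kernel is done via the -e flag:

```shell
bash> kexec -e
```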

Kexec abruptly starts the new kernel without gracefully halting the operating system. To shut down cleanly before the reboot, invoke kexec from the bottom of the halt script (usually /etc/rc.d/rc0.d/S01halt) and run halt instead.

The implementation of the second stage is architecture-dependent. The crux of this stage is a reboot_code_buffer that contains assembly code to put the new kernel in place to boot.

Kexec bypasses the initial kernel code that invokes the services of boot firmware and directly jumps to the protected mode entry point (for x86 processors). An important challenge in implementing kexec is the interaction between the kernel and the boot firmware (BIOS on x86-based systems, Open Firmware on POWER-based machines, and so on). On x86 systems, information such as the e820 memory map passed to the kernel by the BIOS needs to be supplied to the kexec-ed kernel, too.

Kexec with Kdump

The kexec invocation semantics are somewhat special when it’s used in tandem with kdump. In this case, kexec is required to automatically boot a new kernel when it encounters a kernel panic. If the running kernel crashes, the new kernel (called the capture kernel) is booted to reliably collect the dump. A typical call syntax is this:
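
The command itself is missing from this copy. Pieced together from the options described next (-p, a vmlinux capture kernel, --args-linux, and irqpoll passed via --append), the invocation looks something like this, with illustrative paths:

```shell
bash> kexec -p /path/to/capture-kernel-sources/vmlinux \
      --args-linux --append="root=/dev/hdaX irqpoll"
```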

The -p option asks kexec to trigger a reboot when a kernel panic occurs. A vmlinux ELF kernel image is used as the capture kernel. Because vmlinux is a general ELF boot image and because kexec is theoretically OS-agnostic, you need to specify via the --args-linux option that the following arguments have to be interpreted in a Linux-specific manner. The capture kernel boots asynchronously during a kernel crash, so device drivers using shared interrupts may fatally express their unhappiness during boot. To be nice to such drivers, specify irqpoll in the command line passed to the capture kernel using --append.

To use kexec with kdump, you need some additional kernel configuration settings. The capture kernel requires access to kernel memory of the crashed kernel to generate a full dump, so the latter cannot just overwrite the former as was done by kexec in the non-kdump case. The running kernel needs to reserve a memory region to run the capture kernel. To mark this region:

• Boot the first kernel with the command-line argument crashkernel=64M@16M (or other suitable size@start values). Also include debug symbols in the kernel image by enabling CONFIG_DEBUG_INFO (Kernel Hacking → Compile the Kernel with Debug Info) in the configuration menu.

• While configuring the capture kernel, set CONFIG_PHYSICAL_START to the same start value assigned above (16M in this case). If you kexec into the capture kernel and peek inside /proc/meminfo, you will find that size (64M in this case) is the total amount of physical memory that this kernel can see.

Now that you’re comfortable with kexec and have mastered it from the perspective of a kdump user, let’s delve into kdump and use it to analyze some real-world kernel crashes.

Kdump

An image of system memory captured after a kernel crash or hang is called a crash dump. Analyzing a crash dump can give valuable clues for postmortem analyses of kernel problems. However, obtaining a dump after a kernel crash is inherently unreliable because the storage driver responsible for logging data onto the dump device might be in an undefined state.

Until the advent of kdump, Linux Kernel Crash Dump (LKCD) was the popular mechanism to obtain and analyze dumps. LKCD uses a temporary dump device (such as the swap partition) to capture the dump. It then warm reboots back to a healthy state and copies the dump from the temporary device to a permanent location. A tool called lcrash is used to analyze the dump. The disadvantages with LKCD include the following:

kdump出现之前,Linux Kernel Crash DumpLKCD)是获取和分析转储的流行机制。 LKCD使用临时转储设备(例如交换分区)来捕获转储。然后,它将热启动回到健康状态,并将转储从临时设备复制到永久位置。一个名为lcrash的工具用于分析转储。LKCD的缺点包括:

• Even copying the dump to a temporary device might be unreliable on a crashed kernel.

• Dump device configuration is nontrivial.

• The reboot might be slow because swap space can be activated only after the dump has been safely saved away to a permanent location.

• LKCD is not part of the mainline kernel, so installing the proper patches for your kernel version is a hurdle.

Kdump is not burdened with these shortfalls. It eliminates indeterminism by collecting the dump after booting into a healthy kernel via kexec. Also, because memory state is preserved after a kexec reboot, the memory image can be accurately accessed from the capture kernel.

Let’s first get the preliminary kdump setup out of the way:

  1. Ask the running kernel to kexec into a capture kernel if it encounters a panic. The capture kernel should additionally have CONFIG_HIGHMEM and CONFIG_CRASH_DUMP turned on. (Both these options sit inside Processor Type and Features in the kernel configuration menu.)
  2. After the capture kernel boots, copy the collected dump information from /proc/vmcore (obtained by enabling CONFIG_PROC_VMCORE in the kernel configuration menu) to a file on your hard disk:

bash> cp /proc/vmcore /dump/vmcore.dump

You can also save other information such as the raw memory snapshot of the crashed kernel, via /dev/oldmem.

  3. Boot back into the first kernel. You are now ready to start dump analysis.

Let’s use the collected dump file and the crash tool to analyze some example kernel crashes. Introduce this bug inside the interrupt handler of the RTC driver (drivers/char/rtc.c):
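
The bug listing has not survived in this copy. Consistent with the disassembly examined shortly (a move of EAX to address 0xff), the planted bug is presumably a stray write to a bad address inside the handler; the rest of the handler is elided:

```c
static irqreturn_t rtc_interrupt(int irq, void *dev_id)
{
        /* ... original handler code ... */
        *(int *)0xff = 1;   /* Planted bug: invalid dereference, crashes the kernel */
        /* ... */
}
```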

Trigger execution of the handler by enabling interrupts via the hwclock command:
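
The command listing is absent here; hwclock reads the RTC and so exercises the interrupt handler:

```shell
bash> hwclock
```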

Save /proc/vmcore to /dump/vmcore.dump, reboot back into the first (crashed) kernel, and start analysis using the crash tool. In a real-world situation, of course, the dump might be captured at a customer site, whereas the analysis is done at a support center:
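
The session listing is not preserved in this copy. crash takes the debug-enabled kernel image and the saved dump as arguments; the paths below are illustrative:

```shell
bash> crash /path/to/kernelsources/vmlinux /dump/vmcore.dump
```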

/proc/vmcore保存到/dump/vmcore.dump,重新引导回第一个(崩溃的)内核,然后使用crash工具开始分析。当然,在实际情况下,转储可能会在客户站点捕获,而分析则在支持中心完成:

Examine the stack trace to understand the cause of the crash: 检查堆栈跟踪以了解崩溃的原因:
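
The trace output is missing from this copy; inside the crash session, the stack trace of the panic task comes from the bt command:

```shell
crash> bt
```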

The stack trace points the needle of suspicion at rtc_interrupt(). Let’s disassemble the instructions near rtc_interrupt():
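
The disassembly listing is absent here; crash’s dis command disassembles a function from the dump:

```shell
crash> dis rtc_interrupt
```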

The instruction at address 0xf8a8c004 is attempting to move the contents of the EAX register to address 0xff, which is clearly the invalid dereference that caused the crash. Fix this and build a new kernel.

If you use the irq command, you can figure out the identity of the interrupt that was in progress during the time of the crash. In this case, the output confirms that the RTC interrupt was indeed active:
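
The output is not reproduced in this copy. On a PC the RTC traditionally uses IRQ 8, so the query would be along these lines:

```shell
crash> irq 8
```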

Let’s now shift gears and look at a case where the kernel freezes, rather than generate an “oops.” Consider the following buggy driver init() routine:
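
The listing itself is missing from this copy. Reconstructed from the discussion that follows (a spin_lock() on mydrv_wq’s lock before spin_lock_init()), the routine looks roughly like this; the structure layout is an assumption:

```c
static struct {
        spinlock_t lock;
        /* ... other driver state ... */
} mydrv_wq;

static int __init mydrv_init(void)
{
        spin_lock(&mydrv_wq.lock);       /* Bug: lock used before it's initialized */
        /* ... */
        spin_lock_init(&mydrv_wq.lock);  /* Initialization arrives too late */
        /* ... */
        return 0;
}
```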

The code is erroneously using a spinlock before initializing it. Effectively, the CPU spins forever trying to acquire the lock, and the kernel appears to hang. Let’s debug this problem using kdump. In this case, there will be no auto-trigger because there is no panic, so force a crash dump by pressing the magic Sysrq key combination, Alt-Sysrq-c. You may need to enable Sysrq by writing a 1 to /proc/sys/kernel/sysrq:

bash> echo 1 > /proc/sys/kernel/sysrq
bash> insmod mydrv.ko

This induces the kernel to hang inside mydrv_init(). Press the Alt-Sysrq-c key combination to trigger a crash dump:

Save the dump to disk after kexec boots the capture kernel, boot back to the original kernel, and run crash on the saved dump:

kexec引导捕获内核后,将转储保存到磁盘,引导回原始内核并在保存的转储上运行crash

Test the waters by checking the identity of the process that was running at the time of the crash. In this case, it was apparently insmod (of mydrv.ko):
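
The session output is lost in this copy; crash announces the panic context (PID and command name) in its startup banner, and the set command re-displays the current context:

```shell
crash> set
```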

The stack trace doesn’t yield much information other than telling you that a Sysrq key press was responsible for the dump:

Let’s next try peeking at the log messages generated by the crashed kernel. The log command reads the messages from the kernel printk ring buffer present on the dump file:
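
The listing does not survive in this copy; the command is simply:

```shell
crash> log
```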

The log offers two useful pieces of debug information. First, it lets you know that a soft lockup was detected on the crashed kernel. The kernel detects soft lockups as follows: A kernel watchdog thread runs once a second and touches a per-CPU timestamp variable. If the CPU sits in a tight loop, the watchdog thread cannot update this timestamp. An update check is carried out during timer interrupts using softlockup_tick() (defined in kernel/softlockup.c). If the watchdog timestamp is more than 10 seconds old, it concludes that a soft lockup has occurred and emits a kernel message to that effect.

Second, the log frowns upon mydrv_init()+0xd (or 0xf893d00d), so let’s look at the disassembly of the surrounding code region:
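
The listing is absent in this copy; as before, crash’s dis command does the disassembly:

```shell
crash> dis mydrv_init
```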

The return address in the stack is 0xf893d00d, so the kernel is hanging inside the previous instruction, which is a call to spin_lock(). If you correlate this with the earlier source snippet and look at it in the eye, you can see the error sequence, spin_lock()/spin_lock_init(), staring sorrowfully back at you. Fix the problem by swapping the sequence.

You can also use crash to peek at data structures of interest, but be aware that memory regions that were swapped out during the crash are not part of the dump. In the preceding example, you can examine mydrv_wq as follows:
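
The example is missing here; one way is crash’s p command, which hands the expression to the embedded gdb:

```shell
crash> p mydrv_wq
```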

Gdb is integrated with crash, so you can pass commands from crash to gdb for evaluation. For example, you can use gdb’s p command to print data structures.

Gdbcrash集成在一起,因此您可以将命令从crash传递给gdb进行评估。例如,您可以使用gdbp命令来打印数据结构。

Looking at the Sources

Architecture-dependent portions of kexec reside in arch/your-arch/kernel/machine_kexec.c and arch/your-arch/kernel/relocate_kernel.S. The generic parts live in kernel/kexec.c (and include/linux/kexec.h). Peek inside arch/your-arch/kernel/crash.c and arch/your-arch/kernel/crash_dump.c for the kdump implementation. Documentation/kdump/kdump.txt contains installation information.
