Linux: panic and oops

楓潇潇

Category: Linux crash case studies. Tags: linux, panic, oops

Article link: https://blog.csdn.net/u013836909/article/details/129894870

kernel panic/oops

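What is a kernel panic

"Panic" means sudden alarm, and a Linux kernel panic lives up to the name: the kernel has reached a state from which it cannot safely continue, so it prints as much information as it can still gather to help developers debug the failure.

There are two main types of kernel panic:

Hard panic (reported with an "Aieee" message)
Soft panic (reported with an "Oops" message)

What causes a Linux kernel panic

Only code running in kernel space can directly cause a kernel panic. That includes loadable driver modules (on a healthy system, run lsmod to see which modules are currently loaded) as well as components built into the kernel itself, such as the memory map code.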
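
On Linux, lsmod is essentially a formatted view of /proc/modules. The following is a minimal sketch of pulling the module name and size out of one such line, assuming the usual layout (name, size in bytes, reference count, then further fields). The helper name parse_module_line is made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: extract the module name and size from one line of
 * /proc/modules, assuming the layout "name size refcount ...".
 * Returns 0 on success, -1 if the line does not match that layout.
 */
int parse_module_line(const char *line, char name[64], unsigned long *size)
{
	assert(line != NULL && name != NULL && size != NULL);

	/* %63s bounds the copy so name[] cannot overflow. */
	if (sscanf(line, "%63s %lu", name, size) != 2)
		return -1;
	return 0;
}
```

Feeding it each line read with fgets() from /proc/modules yields roughly the first two columns that lsmod prints.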

Common Linux kernel panic messages

Kernel panic - not syncing: Fatal exception in interrupt
Kernel panic - not syncing: Attempted to kill the idle task!
Kernel panic - not syncing: Aiee, killing interrupt handler!
Kernel panic - not syncing: Attempted to kill init!

A kernel panic has generally occurred when:

The machine is completely locked up and unusable.
On a text console, the information dumped by the kernel is visible (including an "Aieee" or "Oops" message).

hard panic

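Cause

The most likely cause of a hard panic is a driver module's interrupt handler, typically one that dereferences a null pointer.

Once that happens, the driver can no longer service new interrupt requests, and the system eventually crashes.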

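Information to collect

Depending on the state of the panic, the kernel logs whatever it can before the system locks up. Because a kernel panic is a severe error, there is no guarantee how much will actually be recorded. The items below are the key pieces of information, so collect as many of them as possible. If the panic happens during boot, it is hard to predict how much useful information can be recovered.

1. /var/log/messages. With luck, the entire kernel panic stack trace is recorded here. To check whether the trace is complete enough, look for the line containing "EIP" ("RIP" on x86-64), which shows the function and module that were executing when the panic occurred.
2. Application and library logs. These may show what happened shortly before the panic.
3. Anything else known about the period before the panic, or a way to reproduce the state the system was in when it panicked.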
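
The search through the log described above can be sketched as a small classifier for individual log lines. This is only an illustration: the enum and function names are invented here, and plain substring matching is crude (for instance, "RIP" can also appear inside unrelated words).

```c
#include <assert.h>
#include <string.h>

enum crash_marker { MARK_NONE, MARK_PANIC, MARK_OOPS, MARK_IP };

/*
 * Classify one log line by the markers discussed above.
 * The enum and function names are illustrative only.
 */
enum crash_marker classify_log_line(const char *line)
{
	assert(line != NULL);

	if (strstr(line, "Kernel panic - not syncing"))
		return MARK_PANIC;
	if (strstr(line, "Oops"))
		return MARK_OOPS;
	/* 32-bit x86 prints "EIP"; x86-64 prints "RIP". */
	if (strstr(line, "EIP") || strstr(line, "RIP"))
		return MARK_IP;
	return MARK_NONE;
}
```

Running this over every line of /var/log/messages and keeping the lines around the first match is usually enough to find where the trace starts.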

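Troubleshooting

The stack trace is the most important piece of information when troubleshooting a kernel panic. Ideally it is in /var/log/messages, where the whole trace can be read. If it appeared only on the screen, the top of it may have scrolled away, leaving just part of the trace. A complete stack trace often contains enough information to locate the root cause of the panic. To check whether the trace is complete enough, look for the line containing "EIP", which shows the function and module that were executing when the panic occurred.

Using the kernel debugger (KDB)
If only part of the trace is available and it is not enough to locate the root cause, it is time to bring in the kernel debugger (KDB).

KDB is compiled into the kernel. When a panic occurs, it drops the kernel into a shell-like environment instead of locking up. From there it is possible to collect information related to the panic, which helps a great deal in finding the root cause.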

soft panic

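Symptoms

Less severe than a hard panic.
Usually results in a segmentation fault.
An Oops message is printed; searching /var/log/messages for "Oops" finds it.
The machine remains somewhat usable (but should be rebooted once the information has been collected).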

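Cause

Any module crash that does not originate in an interrupt handler results in a soft panic.

In this case the driver itself crashes, but the failure is not fatal to the system, because the interrupt handling path is not locked up. The same bugs that cause a hard panic can also cause a soft panic (for example, dereferencing a null pointer at run time).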

Information to collect

When a soft panic occurs, the kernel produces a dump that includes kernel symbol information, and this is recorded in /var/log/messages.

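Troubleshooting

To begin troubleshooting, the ksymoops tool can be used to turn the kernel symbol information into meaningful data.

To prepare the input file for ksymoops:

Save the stack trace text found in /var/log/messages to a new file. Make sure the timestamps are removed, otherwise ksymoops will fail.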
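
Removing the timestamps can be done by hand or with a small filter. The sketch below assumes the classic syslog layout "Mon DD HH:MM:SS host kernel: message", in which the prefix ends with the literal tag "kernel: ". The function name skip_syslog_prefix is hypothetical, and logs written in another format would need a different rule.

```c
#include <assert.h>
#include <string.h>

/*
 * Return a pointer to the message part of a syslog line, skipping the
 * "Mon DD HH:MM:SS host kernel: " prefix. Assumption: the prefix ends
 * with the literal tag "kernel: ". Lines without the tag are returned
 * unchanged.
 */
const char *skip_syslog_prefix(const char *line)
{
	static const char tag[] = "kernel: ";
	const char *p;

	assert(line != NULL);

	p = strstr(line, tag);
	return p ? p + strlen(tag) : line;
}
```

Applying it to each line of the saved trace and writing the results to a new file gives text without the timestamp and host columns.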

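An Oops can be thought of as a kernel-level segmentation fault.

When an application makes an illegal memory access or executes an illegal instruction, it receives a segmentation fault signal (SIGSEGV). The default behavior is to dump core, but the application can also catch the signal and handle it itself.

When the kernel itself makes the same kind of mistake, it prints an Oops message.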
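
The user-space half of this comparison can be sketched with POSIX signals. The example below installs a SIGSEGV handler and uses sigsetjmp()/siglongjmp() to recover from a faulting read. It is an illustration only: probe_read is a made-up name, reading through an invalid pointer is undefined behavior in ISO C, and the code relies on typical Linux behavior where such a read raises SIGSEGV.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf probe_env;

/* SIGSEGV handler: jump back to the probe instead of dumping core. */
static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);
}

/*
 * Try to read *addr. Returns 0 if the read succeeded, -1 if it raised
 * SIGSEGV (or if the handler could not be installed).
 */
int probe_read(const volatile int *addr)
{
	struct sigaction sa = {0}, old;
	volatile int sink;
	int rc = 0;

	assert(sizeof(sink) == sizeof(*addr));

	sa.sa_handler = on_segv;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return -1;

	/* A nonzero second argument saves the signal mask for the jump back. */
	if (sigsetjmp(probe_env, 1) == 0)
		sink = *addr;	/* may fault */
	else
		rc = -1;	/* arrived here from on_segv() */

	(void)sink;
	sigaction(SIGSEGV, &old, NULL);	/* restore the previous handler */
	return rc;
}
```

The kernel has no outer supervisor to deliver such a signal to it, which is why the equivalent mistake in kernel code ends in an Oops instead.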

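Almost every address the processor uses is a virtual address, mapped to a physical address through a complex structure of page tables (the exception being the physical addresses used by the memory management subsystem itself). When an invalid pointer is dereferenced, the paging mechanism cannot map it to a physical address, and the processor signals a page fault to the operating system. If the address is not valid, the kernel cannot page it in, and if this happens while the processor is in supervisor mode, the kernel generates an oops message.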

Analysis of the panic() function

// Source: kernel/panic.c

/**
 *	panic - halt the system
 *	@fmt: The text string to print
 *
 *	Display a message, then perform cleanups.
 *
 *	This function never returns.
 */
void panic(const char *fmt, ...)
{
	static char buf[1024];
	va_list args;
	long i, i_next = 0, len;
	int state = 0;
	int old_cpu, this_cpu;
	bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;

	/*
	 * Disable local interrupts. This will prevent panic_smp_self_stop
	 * from deadlocking the first cpu that invokes the panic, since
	 * there is nothing to prevent an interrupt handler (that runs
	 * after setting panic_cpu) from invoking panic() again.
	 */
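	// Disable local interrupts to avoid a deadlock: otherwise nothing would stop an
	// interrupt handler that runs after panic_cpu is set from calling panic() again.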
	local_irq_disable();
	preempt_disable_notrace();

	/*
	 * It's possible to come here directly from a panic-assertion and
	 * not have preempt disabled. Some functions called from here want
	 * preempt to be disabled. No point enabling it later though...
	 *
	 * Only one CPU is allowed to execute the panic code from here. For
	 * multiple parallel invocations of panic, all other CPUs either
	 * stop themself or will wait until they are stopped by the 1st CPU
	 * with smp_send_stop().
	 *
	 * `old_cpu == PANIC_CPU_INVALID' means this is the 1st CPU which
	 * comes here, so go ahead.
	 * `old_cpu == this_cpu' means we came from nmi_panic() which sets
	 * panic_cpu to this CPU.  In this case, this is also the 1st CPU.
	 */
	this_cpu = raw_smp_processor_id();
	old_cpu  = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);

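	// Only one CPU may execute the code below. On SMP, panic() can be called on
	// several CPUs in parallel; a CPU that loses the race parks itself in
	// panic_smp_self_stop() (architecture-specific) until the first CPU stops it.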
	if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)
		panic_smp_self_stop();

	console_verbose();
	bust_spinlocks(1);
	va_start(args, fmt);
	len = vscnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);

	if (len && buf[len - 1] == '\n')
		buf[len - 1] = '\0';

	pr_emerg("Kernel panic - not syncing: %s\n", buf);
#ifdef CONFIG_DEBUG_BUGVERBOSE
	/*
	 * Avoid nested stack-dumping if a panic occurs during oops processing
	 */
	if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
		dump_stack();
#endif

	/*
	 * If kgdb is enabled, give it a chance to run before we stop all
	 * the other CPUs or else we won't be able to debug processes left
	 * running on them.
	 */
	kgdb_panic(buf);

	/*
	 * If we have crashed and we have a crash kernel loaded let it handle
	 * everything else.
	 * If we want to run this after calling panic_notifiers, pass
	 * the "crash_kexec_post_notifiers" option to the kernel.
	 *
	 * Bypass the panic_cpu check and call __crash_kexec directly.
	 */
	if (!_crash_kexec_post_notifiers) {
		printk_safe_flush_on_panic();
		__crash_kexec(NULL);

		/*
		 * Note smp_send_stop is the usual smp shutdown function, which
		 * unfortunately means it may not be hardened to work in a
		 * panic situation.
		 */
		smp_send_stop(); // generic SMP shutdown function
	} else {
		/*
		 * If we want to do crash dump after notifier calls and
		 * kmsg_dump, we will need architecture dependent extra
		 * works in addition to stopping other CPUs.
		 */
		crash_smp_send_stop();
	}

	/*
	 * Run any panic handlers, including those that might need to
	 * add information to the kmsg dump output.
	 */
	// Invoke the registered panic notifiers; kmsg dumpers such as mtdoops
	// and ramoops are run by kmsg_dump() below.
	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);

	/* Call flush even twice. It tries harder with a single online CPU */
	printk_safe_flush_on_panic();
	kmsg_dump(KMSG_DUMP_PANIC);

	/*
	 * If you doubt kdump always works fine in any situation,
	 * "crash_kexec_post_notifiers" offers you a chance to run
	 * panic_notifiers and dumping kmsg before kdump.
	 * Note: since some panic_notifiers can make crashed kernel
	 * more unstable, it can increase risks of the kdump failure too.
	 *
	 * Bypass the panic_cpu check and call __crash_kexec directly.
	 */
	if (_crash_kexec_post_notifiers)
		__crash_kexec(NULL);

#ifdef CONFIG_VT
	unblank_screen();
#endif
	console_unblank();

	/*
	 * We may have ended up stopping the CPU holding the lock (in
	 * smp_send_stop()) while still having some valuable data in the console
	 * buffer.  Try to acquire the lock then release it regardless of the
	 * result.  The release will also print the buffers out.  Locks debug
	 * should be disabled to avoid reporting bad unlock balance when
	 * panic() is not being called from OOPS.
	 */
	debug_locks_off();
	console_flush_on_panic(CONSOLE_FLUSH_PENDING);

	panic_print_sys_info();

	if (!panic_blink)
		panic_blink = no_blink;

	// Check whether panic_timeout is greater than 0
	if (panic_timeout > 0) {
		/*
		 * Delay timeout seconds before rebooting the machine.
		 * We can't use the "normal" timers since we just panicked.
		 */
		pr_emerg("Rebooting in %d seconds..\n", panic_timeout);

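		// Busy-wait for panic_timeout seconds in PANIC_TIMER_STEP (ms) steps,
		// touching the NMI watchdog and toggling the blink indicator.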
		for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
			touch_nmi_watchdog();
			if (i >= i_next) {
				i += panic_blink(state ^= 1);
				i_next = i + 3600 / PANIC_BLINK_SPD;
			}
			mdelay(PANIC_TIMER_STEP);
		}
	}

	// If panic_timeout is nonzero (a negative value skips the delay above),
	// call emergency_restart() to perform an emergency reboot
	if (panic_timeout != 0) {
		/*
		 * This will not be a clean reboot, with everything
		 * shutting down.  But if there is a chance of
		 * rebooting the system it will be rebooted.
		 */
		if (panic_reboot_mode != REBOOT_UNDEFINED)
			reboot_mode = panic_reboot_mode;

		// emergency_restart() eventually reaches the architecture-specific
		// restart code (such as machine_restart()). It does not wait for an
		// orderly shutdown; the machine is restarted as directly as possible.
		emergency_restart();
	}
#ifdef __sparc__
	{
		extern int stop_a_enabled;
		/* Make sure the user can actually press Stop-A (L1-A) */
		stop_a_enabled = 1;
		pr_emerg("Press Stop-A (L1-A) from sun keyboard or send break\n"
			 "twice on console to return to the boot prom\n");
	}
#endif
#if defined(CONFIG_S390)
	disabled_wait();
#endif
	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);

	/* Do not scroll important messages printed above */
	suppress_printk = 1;
	local_irq_enable();
	for (i = 0; ; i += PANIC_TIMER_STEP) {
		touch_softlockup_watchdog();
		if (i >= i_next) {
			i += panic_blink(state ^= 1);
			i_next = i + 3600 / PANIC_BLINK_SPD;
		}
		mdelay(PANIC_TIMER_STEP);
	}
}
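
Two decisions in the function above are easy to try out in user space: the election of the single CPU that is allowed to run the panic path, and the way panic_timeout selects between hanging and rebooting. The sketch below models both with C11 atomics. It is an approximation rather than kernel code: may_run_panic, action_for_timeout and the enum are invented names, and the PANIC_CPU_INVALID value of -1 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

#define PANIC_CPU_INVALID (-1)	/* assumed sentinel for "no owner yet" */

/* User-space stand-in for the kernel's panic_cpu variable. */
static atomic_int panic_cpu = PANIC_CPU_INVALID;

/*
 * Returns 1 if this_cpu may run the panic path, 0 if it must stop itself.
 * Mirrors the check in panic(): proceed when nobody owned panic_cpu
 * before, or when this CPU already owned it (the nmi_panic() case).
 */
int may_run_panic(int this_cpu)
{
	int old_cpu = PANIC_CPU_INVALID;

	assert(this_cpu >= 0);

	/* On failure, old_cpu is overwritten with the current owner. */
	atomic_compare_exchange_strong(&panic_cpu, &old_cpu, this_cpu);
	return old_cpu == PANIC_CPU_INVALID || old_cpu == this_cpu;
}

enum panic_action { PANIC_HANG, PANIC_REBOOT_NOW, PANIC_REBOOT_AFTER_DELAY };

/* What panic() ends up doing for a given panic_timeout value. */
enum panic_action action_for_timeout(int panic_timeout)
{
	if (panic_timeout > 0)
		return PANIC_REBOOT_AFTER_DELAY;	/* delay loop, then restart */
	if (panic_timeout < 0)
		return PANIC_REBOOT_NOW;		/* restart without the delay */
	return PANIC_HANG;				/* final blink loop forever */
}
```

In this model the first caller wins the election, every later caller with a different CPU number is told to stop, and a repeated call from the winning CPU is still allowed through, which matches the two conditions tested against old_cpu in panic().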