linux内核线程死锁或死循环(soft lockup)之后如何让系统宕机重启

在开发内核模块或驱动时,如果处理失误,导致内核线程中出现死锁或者死循环,你会发现,除了重启之外,你没有任何可以做的。这时你的输入不起任何作用,终端(不是指远程的ssh工具)只会在那重复的输出类似“BUG: soft lockup - CPU#0 stuck for 67s! [fclustertool:2043]”,更无奈的是你重启之后导致系统挂起的堆栈信息也看不到,你所能做的就是一遍遍的加调试信息,一遍遍的重启机器(这是我的经历,现在想想很傻)。
  这种情况你肯定不是第一个遇到的,所以内核肯定会提供处理这种情况的一些机制。但是如何来找到这些机制在哪个地方,或者说根据什么信息去google呢?最有用的就是这句话“BUG: soft lockup - CPU#0 stuck for 67s! [fclustertool:2043]”,因为这句话提供你的信息量很大。首先,这条信息可以输出,说明即使发生死锁或者死循环,还是有代码可以执行。第二,可以通过这个日志信息,找到对应的处理函数,这个函数所在的模块就是用来处理CPU被过度使用时用到的。所以通过这个事情,可以看到内核打印出的只言片语都有可能成为你解决问题的关键,一定要从重视这些信息,从中找出有用的东西。
  我经常看的内核版本是官方的2.6.32内核,这个版本中我找到的函数是 softlockup_tick (),这个函数在时钟中断的处理函数run_local_timers()中调用。这个函数会首先检查watchdog线程是否被挂起,如果不是watchdog线程,会检查当前占有CPU的线程占有的时间是否超过系统配置的阈值,即softlockup_thresh。如果当前占有CPU的时间过长,则会在系统日志中输出我们上面看到的那条日志。接下来才是最关键的,就是输出模块信息、寄存器信息和堆栈信息,检查softlockup_panic的值是否为1。如果softlockup_panic为1,则调用panic()让内核挂起,输出OOPS信息。代码如下所示:
[cpp]  view plain  copy
  1. /* 
  2. * This callback runs from the timer interrupt, and checks 
  3. * whether the watchdog thread has hung or not: 
  4. */  
  5. void softlockup_tick(void)  
  6. {  
  7.     int this_cpu = smp_processor_id();  
  8.     unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);  
  9.     unsigned long print_timestamp;  
  10.     struct pt_regs *regs = get_irq_regs();  
  11.     unsigned long now;  
  12.     ......  
  13.   
  14.     /* Warn about unreasonable delays: */  
  15.     if (now <= (touch_timestamp + softlockup_thresh))  
  16.         return;  
  17.   
  18.     per_cpu(print_timestamp, this_cpu) = touch_timestamp;  
  19.     spin_lock(&print_lock);  
  20.     printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",  
  21.             this_cpu, now - touch_timestamp,  
  22.             current->comm, task_pid_nr(current));  
  23.     print_modules();  
  24.     print_irqtrace_events(current);  
  25.     if (regs)  
  26.         show_regs(regs);  
  27.     else  
  28.         dump_stack();  
  29.     spin_unlock(&print_lock);  
  30.   
  31.   
  32.     if (softlockup_panic)  
  33.         panic("softlockup: hung tasks");  
  34. }  
  但是softlockup_panic的值默认竟然是0,所以在出现死锁或者死循环的时候,会一直只输出日志信息,而不会宕机,这个真是好坑啊!所以你得手动修改/proc/sys/kernel/softlockup_panic的值,让内核可以在死锁或者死循环的时候可以宕机。如果你的机器中安装了kdump,在重启之后,你会得到一份内核的core文件,这时从core文件中查找问题就方便很多了,而且再也不用手动重启机器了。如果你的内核是标准内核的话,可以通过修改/proc/sys/kernel/softlockup_thresh来修改超时的阈值,如果是CentOS内核的话,对应的文件是/proc/sys/kernel/watchdog_thresh。CentOS内核和标准内核还有一个地方不一样,就是处理CPU占用时间过长的函数,CentOS下是 watchdog_timer_fn ()函数。
  这里介绍下lockup的概念。lockup分为soft lockup和hard lockup。 soft lockup是指内核中有BUG导致在内核模式下一直循环的时间超过10s(根据实现和配置有所不同),而其他进程得不到运行的机会。hard softlockup是指内核已经挂起,可以通过watchdog这样的机制来获取详细信息。这两个概念比较类似。如果你想了解更多关于lockup的信息,可以参考这篇文档: http://www.mjmwired.net/kernel/Documentation/lockup-watchdogs.txt
  注意上面说的这些,都是在内核线程中有效,对用户态的死循环没用。如果要监视用户态的死循环,或者内存不足等资源的情况,强烈推荐软件层面的watchdog。你可以自己开发基于soft watchdog的监控程序,也可以安装watchdog包通过修改配置来做,非常的方便。具体的操作可以参考下面的文章,都写的非常好,非常实用:
http://purplegrape.blog.51cto.com/1330104/1131910

http://www.ibm.com/developerworks/cn/linux/l-cn-watchdog/index.html#resources


附上一段bug信息:

[html]  view plain  copy
  在CODE上查看代码片 派生到我的代码片
  1. [2098 : 2098 CRITICAL][utils.c:664]Erase flash at 0x00000000                                                                                                        
  2.                                                                                                                                                                     
  3. [2098 : 2098 CRITICAL][utils.c:539]EraseFlashArea (0x0, 0x10000)                                                                                                    
  4.                                                                                                                                                                     
  5. [2098 : 2098 CRITICAL][utils.c:561]EI Start is 0, size is 10000                                                                                                     
  6.                                                                                                                                                                     
  7. [  356.270000] BUG: soft lockup - CPU#0 stuck for 22s! [hostflasher:2098]                                                                                           
  8. [  356.270000] Modules linked in: videocap(O) hid(O) hdisk(O) floppy(O) ncsi(O) cdrom(O) usb1_hw(O) iUSB(O) usbe(O) bt_hw(O) bt(O) host_spi_flash_hw(O) host_spi_)  
  9. [  356.270000]                                                                                                                                                      
  10. [  356.270000] CPU: 0 PID: 2098 Comm: hostflasher Tainted: P           O 3.14.17-ami #1                                                                             
  11. [  356.270000] task: c5fbbc20 ti: c4878000 task.ti: c4878000                                                                                                        
  12. [  356.270000] PC is at ast_host_spi_flash_transfer+0x34/0x1dc [host_spi_flash_hw]                                                                                  
  13. [  356.270000] LR is at ast_host_spi_flash_transfer+0x28/0x1dc [host_spi_flash_hw]                                                                                  
  14. [  356.270000] pc : [<bf0e60e4>]    lr : [<bf0e60d8>]    psr: 60000013                                                                                              
  15. [  356.270000] sp : c4879c60  ip : 00000004  fp : c4879c84                                                                                                          
  16. [  356.270000] r10: bf0e2348  r9 : 00000005  r8 : 00000000                                                                                                          
  17. [  356.270000] r7 : 00000001  r6 : c4879c97  r5 : 00000001  r4 : c4879c96                                                                                           
  18. [  356.270000] r3 : cd946000  r2 : 7100d2e6  r1 : bf0e6730  r0 : 00000000                                                                                           
  19. [  356.270000] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user                                                                                    
  20. [  356.270000] Control: 00c5387d  Table: 84808008  DAC: 00000015                                                                                                    
  21. [  356.270000] CPU: 0 PID: 2098 Comm: hostflasher Tainted: P           O 3.14.17-ami #1                                                                             
  22. [  356.270000] Backtrace:                                                                                                                                           
  23. [  356.270000] [<c000b494>] (dump_backtrace) from [<c000b83c>] (show_stack+0x18/0x1c)                                                                               
  24. [  356.270000]  r6:c4879c18 r5:00000135 r4:c4879c18 r3:c04d2008                                                                                                     
  25. [  356.270000] [<c000b824>] (show_stack) from [<c0386a84>] (dump_stack+0x20/0x28)                                                                                   
  26. [  356.270000] [<c0386a64>] (dump_stack) from [<c0009f9c>] (show_regs+0x24/0x2c)                                                                                    
  27. [  356.270000] [<c0009f78>] (show_regs) from [<c004dfc4>] (watchdog_timer_fn+0xe0/0x130)                                                                            
  28. [  356.270000]  r4:c04dc9f8 r3:a3febb35                                                                                                                             
  29. [  356.270000] [<c004dee4>] (<span style="color:#ff0000;">watchdog_timer_fn</span>) from [<c002da60>] (hrtimer_run_queues+0x120/0x1c0)                                                                  
  30. [  356.270000]  r9:1439d6a1 r8:3a814c28 r7:ffffffff r6:c04d9858 r5:c04d9858 r4:c04dca08                                                                             
  31. [  356.270000] [<c002d940>] (hrtimer_run_queues) from [<c001c614>] (run_local_timers+0x10/0x1c)                                                                     
  32. [  356.270000]  r10:bf0e2348 r9:00000005 r8:00000000 r7:00000000 r6:00000000 r5:c5fbbc20                                                                            
  33. [  356.270000]  r4:c4878000                                                                                                                                         
  34. [  356.270000] [<c001c604>] (run_local_timers) from [<c001c650>] (update_process_times+0x30/0x5c)                                                                   
  35. [  356.270000] [<c001c620>] (update_process_times) from [<c000b3d4>] (timer_tick+0x30/0x38)                                                                         
  36. [  356.270000]  r6:00000010 r5:c04dacfc r4:c04d7fd0 r3:60000013                                                                                                     
  37. [  356.270000] [<c000b3a4>] (timer_tick) from [<c0010d98>] (ast_timer_interrupt+0x20/0x2c)                                                                          
  38. [  356.270000] [<c0010d78>] (ast_timer_interrupt) from [<c003c8d4>] (handle_irq_event_percpu+0x34/0x19c)                                                            
  39. [  356.270000] [<c003c8a0>] (handle_irq_event_percpu) from [<c003ca6c>] (handle_irq_event+0x30/0x40)                                                                
  40. [  356.270000]  r10:bf0e2348 r9:00000005 r8:00000000 r7:c4879c4c r6:fe6c0000 r5:00000000                                                                            
  41. [  356.270000]  r4:c04dacfc                                                                                                                                         
  42. [  356.270000] [<c003ca3c>] (handle_irq_event) from [<c003f198>] (handle_level_irq+0xc8/0xd8)                                                                       
  43. [  356.270000]  r4:c04dacfc r3:00020000                                                                                                                             
  44. [  356.270000] [<c003f0d0>] (handle_level_irq) from [<c003c2f4>] (generic_handle_irq+0x28/0x38)                                                                     
  45. [  356.270000]  r4:00000010 r3:c003f0d0                                                                                                                             
  46. [  356.270000] [<c003c2cc>] (generic_handle_irq) from [<c0009b30>] (handle_IRQ+0x68/0x8c)                                                                           
  47. [  356.270000]  r4:00000010 r3:00000040                                                                                                                             
  48. [  356.270000] [<c0009ac8>] (handle_IRQ) from [<c0008210>] (asm_do_IRQ+0x10/0x14)                                                                                   
  49. [  356.270000]  r5:60000013 r4:bf0e60e4                                                                                                                             
  50. [  356.270000] [<c0008200>] (asm_do_IRQ) from [<c000c274>] (__irq_svc+0x34/0x104)                                                                                   
  51. [  356.270000] Exception stack(0xc4879c18 to 0xc4879c60)                                                                                                            
  52. [  356.270000] 9c00:                                                       00000000 bf0e6730                                                                        
  53. [  356.270000] 9c20: 7100d2e6 cd946000 c4879c96 00000001 c4879c97 00000001 00000000 00000005                                                                        
  54. [  356.270000] 9c40: bf0e2348 c4879c84 00000004 c4879c60 bf0e60d8 bf0e60e4 60000013 ffffffff                                                                        
  55. [  356.270000] [<bf0e60b0>] (ast_host_spi_flash_transfer [host_spi_flash_hw]) from [<bf0e1250>] (host_spi_flash_unregister+0x9c/0x2f8 [host_spi_flash])             
  56. [  356.270000]  r8:00000000 r7:00000000 r6:c4879c96 r5:bf0e65e0 r4:01827fe5 r3:00000001                                                                             
  57. [  356.270000] [<bf0e11ec>] (host_spi_flash_unregister [host_spi_flash]) from [<bf0e1668>] (host_spi_flash_erase+0x1bc/0x294 [host_spi_flash])                      
  58. [  356.270000]  r8:00010000 r7:bf0e65e4 r6:00010000 r5:bf0e65e0 r4:00000000                                                                                         
  59. [  356.270000] [<bf0e14ac>] (host_spi_flash_erase [host_spi_flash]) from [<c023a4c8>] (part_erase+0x38/0x88)                                                        
  60. [  356.270000]  r10:00000000 r9:00000000 r8:000001b8 r7:c34a68c0 r6:c89a5800 r5:00000000                                                                            
  61. [  356.270000]  r4:00000000                                                                                                                                         
  62. [  356.270000] [<c023a490>] (part_erase) from [<c0237d48>] (mtd_erase+0x78/0x8c)                                                                                    
  63. [  356.270000]  r8:00010000 r7:c4879e9c r6:c34a68c0 r5:00000000 r4:00000000 r3:00010000                                                                             
  64. [  356.270000] [<c0237cd0>] (mtd_erase) from [<c023c560>] (mtdchar_ioctl+0x46c/0xe94)                                                                               
  65. [  356.270000]  r9:c4878000 r8:c4878000 r6:c34a68c0 r5:c89a5800 r4:becd4b80 r3:00000000                                                                             
  66. [  356.270000] [<c023c0f4>] (mtdchar_ioctl) from [<c023cfbc>] (mtdchar_unlocked_ioctl+0x34/0x4c)                                                                    
  67. [  356.270000]  r9:c4878000 r8:00000006 r7:40084d02 r6:becd4b80 r5:c3435820 r4:c04eeaf0                                                                             
  68. [  356.270000] [<c023cf88>] (mtdchar_unlocked_ioctl) from [<c008a558>] (vfs_ioctl+0x20/0x3c)                                                                        
  69. [  356.270000]  r7:40084d02 r6:00000006 r5:c929eaa0 r4:becd4b80                                                                                                     
  70. [  356.270000] [<c008a538>] (vfs_ioctl) from [<c008aff4>] (do_vfs_ioctl+0x4f8/0x54c)                                                                                
  71. [  356.270000] [<c008aafc>] (do_vfs_ioctl) from [<c008b088>] (SyS_ioctl+0x40/0x5c)                                                                                  
  72. [  356.270000]  r9:c4878000 r8:00000006 r7:40084d02 r6:becd4b80 r5:c3435820 r4:c3435820                                                                             
  73. [  356.270000] [<c008b048>] (SyS_ioctl) from [<c0009220>] (ret_fast_syscall+0x0/0x30)                                                                               
  74. [  356.270000]  r8:c00093a4 r7:00000036 r6:00010000 r5:00000000 r4:00010000 r3:00000001       



from: http://blog.csdn.net/justlinux2010/article/details/8739395

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值