How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem

License: CC 4.0 BY-SA

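This article describes how to resolve the situation where a Linux system hangs: it cannot be reached over SSH but still responds to ping. The key is to adjust the vm.dirty_ratio and vm.dirty_background_ratio parameters and to optimize the processes that consume large amounts of resources.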

Author: Skate
Time: 2015/03/04

Symptom: the system hangs. It still responds to ping, but SSH does not respond.

Check the message log:
[1379100.801682] [<ffffffff81015019>] ? read_tsc+0x9/0x20
[1379100.801685] [<ffffffff81539c2e>] do_page_fault+0x3e/0xa0
[1379100.801689] [<ffffffff81536f95>] page_fault+0x25/0x30
[1379100.801693] INFO: task java:710923 blocked for more than 120 seconds.
[1379100.801766] Not tainted 2.6.32-042stab104.1 #1
[1379100.801835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1379100.801963] java D ffff8800372d7200 0 710923 709954 67084186 0x00000000
[1379100.801968] ffff880e57e71cf0 0000000000000082 ffffea00021a8fc0 ffff880e57e71c68
[1379100.801972] ffffffff81155c60 ffff8800372d7200 ffffea00021a8fc0 ffff88100c409638
[1379100.801976] 00000007fa23bffc ffff880e57e71c78 ffffffff81155cd1 ffff880e57e71ca8
[1379100.801980] Call Trace:
[1379100.801984] [<ffffffff81155c60>] ? __lru_cache_add+0x40/0x90
[1379100.801988] [<ffffffff81155cd1>] ? lru_cache_add_lru+0x21/0x40
[1379100.801992] [<ffffffff81172c9c>] ? handle_pte_fault+0x65c/0x1040
[1379100.801996] [<ffffffff81536705>] rwsem_down_failed_common+0x95/0x1d0
[1379100.802000] [<ffffffff81536896>] rwsem_down_read_failed+0x26/0x30
[1379100.802004] [<ffffffff812a6a34>] call_rwsem_down_read_failed+0x14/0x30
[1379100.802008] [<ffffffff81535d94>] ? down_read+0x24/0x30
[1379100.802011] [<ffffffff8104dffe>] __do_page_fault+0x18e/0x480
[1379100.802015] [<ffffffff8106f0c8>] ? finish_task_switch+0xc8/0x120
[1379100.802019] [<ffffffff81539c2e>] do_page_fault+0x3e/0xa0
[1379100.802022] [<ffffffff81536f95>] page_fault+0x25/0x30
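
When the log is long, it can help to pull out only the hung-task reports. The following is a minimal sketch, not part of the original article: hung_tasks is a hypothetical helper name, and it assumes the report lines have the same "INFO: task ... blocked for more than N seconds" wording as the log above.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<task>:<pid> <seconds>" for every hung-task report it contains.
hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2/p'
}

# Example: scan the kernel ring buffer (this may need root on some systems).
dmesg 2>/dev/null | hung_tasks
```

Fed the report line from the log above, the helper prints "java:710923 120".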

The load on the host machine reached about 460.

By default, Linux uses up to 40% of the available memory for file system caching
(the dirty-page limit set by vm.dirty_ratio; the exact default depends on the
kernel version). Once this mark is reached, the file system flushes all
outstanding data to disk, and all subsequent I/O becomes synchronous. The kernel
reports a task as hung when it stays blocked for more than 120 seconds, the
default value of hung_task_timeout_secs. In the case here, the I/O subsystem is
not fast enough to flush the data within 120 seconds. Because the I/O subsystem
responds slowly while more requests keep arriving, system memory fills up, which
produces the error above and stalls the processes that are serving requests.
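
To see where a given machine currently stands, the thresholds and the amount of dirty data can be read from /proc. This is a rough sketch rather than part of the original article: show_writeback_state is a hypothetical function name, and it assumes a Linux /proc layout; files that are missing are reported as not available.

```shell
# Hypothetical helper: print the dirty-page thresholds, the hung-task
# timeout, and the current amount of dirty and write-back memory.
show_writeback_state() {
    for f in /proc/sys/vm/dirty_ratio \
             /proc/sys/vm/dirty_background_ratio \
             /proc/sys/kernel/hung_task_timeout_secs; do
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$f")"
        else
            printf '%s = (not available)\n' "$f"
        fi
    done
    # Dirty: data waiting to be written; Writeback: data being written now.
    if [ -r /proc/meminfo ]; then
        grep -E '^(Dirty|Writeback):' /proc/meminfo
    fi
}

show_writeback_state
```

A Dirty value that stays large while the load climbs suggests that write-back is not keeping up with the disk.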

Solution:

1. Adjusting the vm.dirty_ratio and vm.dirty_background_ratio parameters can avoid this problem.

# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5

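The sysctl -w commands above take effect immediately, but the values are lost on reboot.

Permanent change (add the settings to /etc/sysctl.conf, which is read at boot):
# vi /etc/sysctl.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Apply the settings from /etc/sysctl.conf without rebooting:
# sysctl -p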
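
Before applying new values, a small sanity check can catch a swapped pair, because the background threshold is meant to be lower than the hard limit so that background write-back starts first. The sketch below is an illustration only: check_dirty_ratios is a hypothetical helper, and it prints the sysctl commands instead of running them, so it needs no root access.

```shell
# Hypothetical helper: validate the two ratios, then print (not run)
# the matching sysctl commands.
# Usage: check_dirty_ratios <dirty_ratio> <dirty_background_ratio>
check_dirty_ratios() {
    dirty=$1
    background=$2
    # Both arguments must be non-empty strings of digits.
    for v in "$dirty" "$background"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "ratios must be whole numbers" >&2
                return 1
                ;;
        esac
    done
    # The background threshold has to stay below the hard limit.
    if [ "$dirty" -gt 100 ] || [ "$background" -ge "$dirty" ]; then
        echo "expected background < dirty <= 100" >&2
        return 1
    fi
    echo "sysctl -w vm.dirty_background_ratio=$background"
    echo "sysctl -w vm.dirty_ratio=$dirty"
}

# The values used in this article.
check_dirty_ratios 10 5
```

The printed commands can be reviewed and then run as root.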

2. Find the processes that consume the most resources, then optimize them.
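
One possible starting point, not taken from the original article, is to list the tasks in uninterruptible sleep (state D), the state shown for the java task in the log. Such tasks are usually waiting on I/O and count toward the load average. filter_blocked is a hypothetical helper name, and the ps options assume the procps version of ps found on most Linux distributions.

```shell
# Hypothetical helper: keep only lines whose first column (the process
# state) starts with D, i.e. tasks in uninterruptible sleep.
filter_blocked() {
    awk '$1 ~ /^D/ { print }'
}

# List state, PID, parent PID, and command name for every process,
# then keep the blocked ones.
ps -eo state,pid,ppid,comm 2>/dev/null | filter_blocked
```

Processes that appear here repeatedly are candidates for a closer look at their I/O pattern.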

Reference: http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

-------end-------