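The hardware AHS log contains no alerts and shows nothing abnormal during the period when the server hung. It records only the manually triggered server reboot shortly after the hang occurred. In addition, the server BIOS and the P440ar controller firmware are somewhat out of date and are not the latest versions.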
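Analysis of the operating system's sosreport logs found OOM (Out Of Memory) records in the period before the server hung, and the cause was ultimately traced to the user's l3fw process. After the l3fw process was shut down, the fault did not recur. This confirms that the user's own process exhausted memory, which in turn left the server hung and unresponsive; the problem is unrelated to the server hardware.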
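The detailed log analysis is as follows:
1. The messages logs for the 13th, 14th, 15th, 16th, and 17th all record a large number of l3fw processes being killed because memory was exhausted, as follows: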
Mar 13 19:43:03 localhost kernel: Out of memory: Kill process 8676 (l3fw) score 982 or sacrifice child
Mar 13 19:43:03 localhost kernel: Killed process 8676 (l3fw) total-vm:97243004kB, anon-rss:63878444kB, file-rss:0kB
Mar 13 19:43:03 localhost kernel: l3fw: page allocation failure: order:0, mode:0x2015a
Mar 13 19:43:03 localhost kernel: CPU: 0 PID: 8676 Comm: l3fw Not tainted 3.10.0-123.el7.x86_64 #1
Mar 14 09:27:18 localhost kernel: Out of memory: Kill process 4748 (l3fw) score 982 or sacrifice child
Mar 14 09:27:18 localhost kernel: Killed process 4748 (l3fw) total-vm:97241980kB, anon-rss:63826664kB, file-rss:0kB
Mar 15 13:21:31 localhost kernel: Out of memory: Kill process 7628 (l3fw) score 981 or sacrifice child
Mar 15 13:21:31 localhost kernel: Killed process 7628 (l3fw) total-vm:97111932kB, anon-rss:63811384kB, file-rss:356kB
Mar 16 10:44:47 localhost kernel: Out of memory: Kill process 12456 (l3fw) score 980 or sacrifice child
Mar 16 10:44:47 localhost kernel: Killed process 12456 (l3fw) total-vm:97045372kB, anon-rss:63801988kB, file-rss:0kB
Mar 17 10:42:41 localhost kernel: Out of memory: Kill process 6881 (l3fw) score 980 or sacrifice child
Mar 17 10:42:41 localhost kernel: Killed process 6881 (l3fw) total-vm:96980860kB, anon-rss:63894712kB, file-rss:564kB
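The "Killed process" entries above can be summarized mechanically. The following is a minimal Python sketch, not the tool used for this analysis: the function name parse_oom_kills and the regular expression are illustrative assumptions, written to match the line format shown in the excerpts and to tolerate missing spaces around "kernel:" and after the process name. It assumes the kernel's "kB" values are units of 1024 bytes.

```python
import re

# Assumed pattern for the "Killed process" lines shown in the excerpts above.
# \s* is used where pasted log lines sometimes lose their spaces.
KILLED_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s*\S+\s+kernel:\s*"
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)\s*"
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024  # the kernel's "kB" figures are KiB


def parse_oom_kills(lines):
    """Return one dict per 'Killed process' line, with sizes converted to GiB."""
    kills = []
    for line in lines:
        m = KILLED_RE.match(line.strip())
        if m is None:
            continue  # not a "Killed process" line
        kills.append({
            "timestamp": m.group("ts"),
            "pid": int(m.group("pid")),
            "comm": m.group("comm"),
            "total_vm_gib": int(m.group("total_vm")) / KIB_PER_GIB,
            "anon_rss_gib": int(m.group("anon_rss")) / KIB_PER_GIB,
        })
    return kills


# Two lines copied from the Mar 17 excerpt above.
SAMPLE = """\
Mar 17 10:42:41 localhost kernel: Out of memory: Kill process 6881 (l3fw) score 980 or sacrifice child
Mar 17 10:42:41 localhost kernel: Killed process 6881 (l3fw) total-vm:96980860kB, anon-rss:63894712kB, file-rss:564kB
"""

if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE.splitlines()):
        print(
            f"{kill['timestamp']} pid={kill['pid']} comm={kill['comm']} "
            f"anon-rss={kill['anon_rss_gib']:.2f} GiB "
            f"total-vm={kill['total_vm_gib']:.2f} GiB"
        )
```

Applied to the five kills listed above, the anon-rss values all convert to roughly 61 GiB of resident anonymous memory held by l3fw at the moment it was killed, which is consistent with the conclusion that this one process is what exhausts memory.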
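2. The memory exhaustion caused the system to become unresponsive, but no hardware errors were reported. Taking the log from the 13th as an example: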
Mar 13 19:43:03 localhost kernel: l3fw invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Mar 13 19:43:03 localhost kernel: l3fw cpuset=/ mems_allowed=0-1
Mar 13 19:43:03 localhost kernel: CPU: 13 PID: 8696 Comm: l3fw Not tainted 3.10.0-123.el7.x86_64 #1
Mar 13 19:43:03 localhost kernel: active_anon:15173182 inactive_anon:979308 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:52843 slab_reclaimable:14200 slab_unreclaimable:25628
mapped:2296 shmem:2312 pagetables:50753 bounce:0
free_cma:0
Mar 13 19:43:03 localhost kernel: Node 0 DMA free:15748kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reserve[]: 0 1641 31847 31847
Mar 13 19:43:03 localhost kernel: Node 0 DMA32 free:121684kB min:2304kB low:2880kB high:3456kB active_anon:1150040kB inactive_anon:411036kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1948156kB managed:1681388kB mlocked:0kB dirty:0kB writeback:0kB mapped:44kB shmem:40kB slab_reclaimable:1212kB slab_unreclaimable:3268kB kernel_stack:40kB pagetables:4500kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11 all_unreclaimable? yes
Mar 13 19:43:03 localhost kernel: lowmem_reser