title: tomcat 生产环境crash tags:
- Tomcat
- crash
- 生产
- oom categories: 工作日志 date: 2016-10-25 18:18:54
最近生产系统erp不太稳定,时不时出现tomcat crash,并且log下面没有留下任何证据。
收到邮件如下
告警主机:f6_web0
告警时间:2016.11.03 09:27:00
告警等级:Average
告警信息: f6_web0 is not reachable
告警项目:jmx[“java.lang:type=Runtime”,Uptime]
分析存在几种可能:
-
jvm出现crash,此时一般情况下会在shell执行目录出现 hs_err_pid.log类似文件 可以通过
-XX:ErrorFile指定错误文件生成路径 复制代码
-
使用kill 命令发起关闭信号,一般这种情况会log中出现相关容器destory的日志比如
org.apache.coyote.AbstractProtocol pause
Pausing ProtocolHandler
org.apache.catalina.core.StandardService stopInternal
Stopping service Catalina
org.apache.coyote.AbstractProtocol stop
Stopping ProtocolHandler
org.apache.coyote.AbstractProtocol destroy
Destroying ProtocolHandler -
使用kill -9 不会存在任何证据,可以使用 history命令是否存在执行该指令的shell或者命令
从现象来看符合第三种表现
线索如下:
- 目前只有f6_web0出现tomcat crash
- f6_web0内存8G
- f6_web0机器上存放了比较多的web应用
- 一般出现f6_web0都是crash出现在保存盘点单类似大量数据提交
[root@iZ11036paqyZ ~]# free -m
total used free shared buffers cached
Mem: 7872 6865 1006 0 28 126
-/+ buffers/cache: 6710 1161
Swap: 0 0 0
[root@iZ11036paqyZ ~]# ps -ef|grep erp
root 22077 1 4 09:29 pts/5 00:02:01 /usr/java/jdk1.7.0_80/bin/java -Djava.util.logging.config.file=/mnt/apache-tomcat-7.0.70-erp/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Xms1024m -Xmx2048m -XX:PermSize=128m -XX:MaxNewSize=256m -XX:MaxPermSize=256m -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -Xloggc:../logs/gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+DisableExplicitGC -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=XXXX -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=XXXXX -Djava.endorsed.dirs=/mnt/apache-tomcat-7.0.70-erp/endorsed -classpath /mnt/apache-tomcat-7.0.70-erp/bin/bootstrap.jar:/mnt/apache-tomcat-7.0.70-erp/bin/tomcat-juli.jar -Dcatalina.base=/mnt/apache-tomcat-7.0.70-erp -Dcatalina.home=/mnt/apache-tomcat-7.0.70-erp -Djava.io.tmpdir=/mnt/apache-tomcat-7.0.70-erp/temp org.apache.catalina.startup.Bootstrap start
root 22748 22473 0 10:15 pts/6 00:00:00 grep erp
初步猜想java在接收到请求后出现年轻代大量晋升,因此jvm决定出现扩容。
-Xms1024m -Xmx2048m 当扩容后发现申请的内存本身机器无法提供时,直接被kernel作为无关进程杀掉,因此表现和kill -9 相似。
复制代码
寻找kernel杀掉oom进程日志如下/var/log/messages
Nov 3 09:22:16 iZ11036paqyZ kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 Nov 3 09:22:16 iZ11036paqyZ kernel: java cpuset=/ mems_allowed=0 Nov 3 09:22:16 iZ11036paqyZ kernel: Pid: 7159, comm: java Not tainted 2.6.32-642.1.1.el6.x86_64 #1 Nov 3 09:22:16 iZ11036paqyZ kernel: Call Trace: Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81131640>] ? dump_header+0x90/0x1b0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff8123c20c>] ? security_real_capable_noaudit+0x3c/0x70 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81131ac2>] ? oom_kill_process+0x82/0x2a0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81131a01>] ? select_bad_process+0xe1/0x120 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81131f00>] ? out_of_memory+0x220/0x3c0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff8113e8dc>] ? __alloc_pages_nodemask+0x93c/0x950 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81177c2a>] ? alloc_pages_vma+0x9a/0x150 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81159fad>] ? handle_pte_fault+0x73d/0xb20 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff810567c7>] ? pte_alloc_one+0x37/0x50 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81194199>] ? do_huge_pmd_anonymous_page+0xb9/0x3b0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff8115a629>] ? handle_mm_fault+0x299/0x3d0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81052156>] ? __do_page_fault+0x146/0x500 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff81160bf5>] ? do_mmap_pgoff+0x335/0x380 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff8154dc8e>] ? do_page_fault+0x3e/0xa0 Nov 3 09:22:16 iZ11036paqyZ kernel: [<ffffffff8154af95>] ? page_fault+0x25/0x30 Nov 3 09:22:16 iZ11036paqyZ kernel: Mem-Info: Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA per-cpu: Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 0: hi: 0, btch: 1 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 1: hi: 0, btch: 1 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 2: hi: 0, btch: 1 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 3: hi: 0, btch: 1 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA32 per-cpu: Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 0: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 1: hi: 186, btch: 31 usd: 8 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 2: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 3: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 Normal per-cpu: Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 0: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 1: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 2: hi: 186, btch: 31 usd: 30 Nov 3 09:22:16 iZ11036paqyZ kernel: CPU 3: hi: 186, btch: 31 usd: 0 Nov 3 09:22:16 iZ11036paqyZ kernel: active_anon:1952293 inactive_anon:633 isolated_anon:0 Nov 3 09:22:16 iZ11036paqyZ kernel: active_file:182 inactive_file:112 isolated_file:0 Nov 3 09:22:16 iZ11036paqyZ kernel: unevictable:0 dirty:69 writeback:0 unstable:0 Nov 3 09:22:16 iZ11036paqyZ kernel: free:25814 slab_reclaimable:4259 slab_unreclaimable:7076 Nov 3 09:22:16 iZ11036paqyZ kernel: mapped:225 shmem:132 pagetables:5934 bounce:0 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA free:15752kB min:124kB low:152kB high:184kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Nov 3 09:22:16 iZ11036paqyZ kernel: lowmem_reserve[]: 0 3000 8050 8050 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA32 free:45344kB min:25140kB low:31424kB high:37708kB active_anon:2749340kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:52kB writeback:0kB mapped:132kB shmem:0kB slab_reclaimable:1468kB slab_unreclaimable:888kB kernel_stack:1712kB pagetables:3396kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:89 all_unreclaimable? no Nov 3 09:22:16 iZ11036paqyZ kernel: lowmem_reserve[]: 0 0 5050 5050 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 Normal free:42160kB min:42316kB low:52892kB high:63472kB active_anon:5059832kB inactive_anon:2532kB active_file:644kB inactive_file:464kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:224kB writeback:0kB mapped:768kB shmem:528kB slab_reclaimable:15568kB slab_unreclaimable:27416kB kernel_stack:11360kB pagetables:20340kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1664 all_unreclaimable? yes Nov 3 09:22:16 iZ11036paqyZ kernel: lowmem_reserve[]: 0 0 0 0 Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15752kB Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 DMA32: 284*4kB 253*8kB 180*16kB 90*32kB 45*64kB 35*128kB 22*256kB 14*512kB 10*1024kB 3*2048kB 0*4096kB = 45464kB Nov 3 09:22:16 iZ11036paqyZ kernel: Node 0 Normal: 1203*4kB 811*8kB 471*16kB 277*32kB 116*64kB 30*128kB 7*256kB 3*512kB 0*1024kB 0*2048kB 0*4096kB = 42292kB Nov 3 09:22:16 iZ11036paqyZ kernel: 393 total pagecache pages Nov 3 09:22:16 iZ11036paqyZ kernel: 0 pages in swap cache Nov 3 09:22:16 iZ11036paqyZ kernel: Swap cache stats: add 0, delete 0, find 0/0 Nov 3 09:22:16 iZ11036paqyZ kernel: Free swap = 0kB Nov 3 09:22:16 iZ11036paqyZ kernel: Swap cache stats: add 0, delete 0, find 0/0 Nov 3 09:22:16 iZ11036paqyZ kernel: Free swap = 0kB Nov 3 09:22:16 iZ11036paqyZ kernel: Total swap = 0kB Nov 3 09:22:16 iZ11036paqyZ kernel: 2097151 pages RAM Nov 3 09:22:16 iZ11036paqyZ kernel: 81868 pages reserved Nov 3 09:22:16 iZ11036paqyZ kernel: 1287 pages shared Nov 3 09:22:16 iZ11036paqyZ kernel: 1983342 pages non-shared Nov 3 09:22:16 iZ11036paqyZ kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Nov 3 09:22:16 iZ11036paqyZ kernel: [ 556] 0 556 2721 164 0 -17 -1000 udevd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 953] 0 953 373 24 1 0 0 aliyun-service Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1209] 0 1209 62798 307 3 0 0 rsyslogd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1231] 28 1231 157561 128 0 0 0 nscd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1259] 0 1259 6130 139 0 0 0 AliYunDunUpdate Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1304] 0 1304 16552 575 1 0 0 AliYunDun Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1332] 0 1332 16559 183 2 -17 -1000 sshd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1344] 38 1344 6650 132 0 0 0 ntpd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1358] 0 1358 29218 157 1 0 0 crond Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1373] 500 1373 4477 47 3 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1375] 500 1375 4477 139 2 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1376] 500 1376 4477 78 0 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1377] 500 1377 4477 78 0 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1379] 500 1379 4477 78 0 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1380] 500 1380 4479 63 1 0 0 zabbix_agentd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1389] 0 1389 29891 1197 0 0 0 AliHids Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1462] 0 1462 1016 20 1 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1464] 0 1464 1016 20 0 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1466] 0 1466 1016 19 1 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1468] 0 1468 1016 19 2 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1470] 0 1470 1016 19 2 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1472] 0 1472 1016 20 3 0 0 mingetty Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1475] 0 1475 2720 162 3 -17 -1000 udevd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1476] 0 1476 2720 162 2 -17 -1000 udevd Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1636] 0 1636 1119815 196884 0 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 1707] 0 1707 1123366 141326 2 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 2051] 0 2051 749880 52004 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 2294] 0 2294 1119162 91797 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 2518] 0 2518 749399 44434 0 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [26904] 0 26904 1163149 271991 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 955] 0 955 1125460 245260 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [16735] 0 16735 736674 85586 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [17618] 0 17618 1187364 94599 3 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [18299] 0 18299 1122105 126932 0 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [28259] 0 28259 1122651 72567 0 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [ 7154] 0 7154 1226757 439871 0 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: [12281] 0 12281 1121307 86830 2 0 0 java Nov 3 09:22:16 iZ11036paqyZ kernel: Out of memory: Kill process 7154 (java) score 188 or sacrifice child Nov 3 09:22:16 iZ11036paqyZ kernel: Killed process 7154, UID 0, (java) total-vm:4907028kB, anon-rss:1759328kB, file-rss:156kB 复制代码
基本可以判断8G内存不足导致需要swap,而该服务器并没有swap文件可供使用,因此出现oom kill现象
解决方案:
1.将xmx xms设置大小相同,可以在运行阶段直接暴露此问题
2.创建swap文件或者更换大内存
3.停掉无关进程
复制代码