昨天项目上线,在完成验证之后,我深吸一口气,完美上线,未出任何问题。然而半个小时之后,收到运维团队发过来的告警,说是有一台服务器的几个端口未被探测到。当时听说之后,我心中一紧,不对呀,我们服务器之后都观察了系统日志,交易一切正常,如果系统启动异常,日志是不会打印的。于是,我赶紧去操作间登录服务器,检测端口。
netstat -anpl | grep 1099
netstat -anpl | grep 8900
输出显示,端口确实未启动。但系统对外服务的端口一切正常。于是我们找到系统启动日志,定位到如下异常:
于是在网上搜索关于Failed to write core dump的文章,很多,但跟我们这次遇到的问题并不一致。
于此同时,我们发现在启动脚本目录下多了一个日志文件hs_err_pid13893.log,这个文件就是JVM在启动时出现异常的线程堆栈信息,内容描述如下:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x00007fa0a94e14e2, pid=13893, tid=0x00007f9e8fefe700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.172-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libzip.so+0x124e2] newEntry+0x62
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------- T H R E A D ---------------
Current thread (0x00007f9ed4111000): JavaThread "pool-FramePools-thread2" [_thread_in_native, id=14081, stack(0x00007f9e8fdfe000,0x00007f9e8feff000)]
siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007f9efc0cb0fa
Registers:
RAX=0x00007f9ee830cc60, RBX=0x00007f9f6e0e1e00, RCX=0x0000000000000000, RDX=0x00007f9ee830cc60
RSP=0x00007f9e8fefb4b0, RBP=0x00007f9e8fefb500, RSI=0x0000000000000000, RDI=0x00007f9ee8000020
R8 =0x00007f9ee8000088, R9 =0x00007f9e8fefb580, R10=0x0000000000000048, R11=0x00000000c6bfe5f0
R12=0x00007f9efc0cb0dd, R13=0x00007f9ee830cc60, R14=0x000000001cb26b69, R15=0x00007f9f6e0de110
RIP=0x00007fa0a94e14e2, EFLAGS=0x0000000000010206, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
TRAPNO=0x000000000000000e
Top of Stack: (sp=0x00007f9e8fefb4b0)
0x00007f9e8fefb4b0: 00000000c6b13fe0 0000000100000000
0x00007f9e8fefb4c0: 00007f9f6e0de1e0 00007fa0b4c1a02e
0x00007f9e8fefb4d0: 00007f9ee830cc60 00007f9f6e0e1e00
0x00007f9e8fefb4e0: 00007f9f6e0e10b0 00007f9f6e0de110
0x00007f9e8fefb4f0: 000000001cb26b69 000000000000004f
0x00007f9e8fefb500: 00007f9e8fefb560 00007fa0a94e1c3f
0x00007f9e8fefb510: 00007f9ed4111000 0100000000000000
0x00007f9e8fefb520: 00007f9e8fefb580 00007f9e8fefb5d0
0x00007f9e8fefb530: 00007f9e8fefb5ce 00007f9e8fefb580
0x00007f9e8fefb540: 000000000000004f 00007f9ed41111f8
0x00007f9e8fefb550: 00007f9e8fefb9d8 0000000000000001
0x00007f9e8fefb560: 00007f9e8fefb9c0 00007fa0a94d2e90
0x00007f9e8fefb570: 00007fa09a4cf7a7 00007f9f6e0de110
0x00007f9e8fefb580: 672f63642f6d6f63 636e616e7265766f
0x00007f9e8fefb590: 61646174656d2f65 736c706d692f6174
0x00007f9e8fefb5a0: 656d69746e75722f 5f72656b6361502f
0x00007f9e8fefb5b0: 6465525f736e6f63 56626d4362625a6d
0x00007f9e8fefb5c0: 4f534a5f6f745f31 007373616c632e4e
0x00007f9e8fefb5d0: 00007f9e8fefb650 0000000000800000
0x00007f9e8fefb5e0: 00007f9ee8000078 00007f9ee8000020
0x00007f9e8fefb5f0: 00007f9ee8000020 0000000000000000
0x00007f9e8fefb600: 0000000000008010 00000038ad67a567
0x00007f9e8fefb610: 00007f9ed4111000 0000000000008000
0x00007f9e8fefb620: 00007f9e8fefb690 00007f9e8fefb868
0x00007f9e8fefb630: 000000768fefb690 0000000000000200
0x00007f9e8fefb640: 0000000000000040 0000000000000000
0x00007f9e8fefb650: 0000000000000001 0000000000000008
0x00007f9e8fefb660: 0000000000008030 0000023000000801
0x00007f9e8fefb670: 0000007c0000009b 0000007600000078
0x00007f9e8fefb680: 00007fa09a515f20 00007f9ee8000020
0x00007f9e8fefb690: 0000000000008000 0000000000000000
0x00007f9e8fefb6a0: 0000000000000019 00000000c6bfe409
Instructions: (pc=0x00007fa0a94e14e2)
0x00007fa0a94e14c2: 00 48 c7 40 28 00 00 00 00 41 80 7f 30 00 0f 84
0x00007fa0a94e14d2: 8a 02 00 00 4c 8b 63 08 4d 2b 67 28 4d 03 67 18
0x00007fa0a94e14e2: 41 0f b6 5c 24 1d 41 0f b6 44 24 1c c1 e3 08 09
0x00007fa0a94e14f2: c3 41 0f b6 44 24 1e 88 45 bd 41 0f b6 54 24 20
Register to memory mapping:
RAX=0x00007f9ee830cc60 is an unknown value
RBX=0x00007f9f6e0e1e00 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x00007f9ee830cc60 is an unknown value
RSP=0x00007f9e8fefb4b0 is pointing into the stack for thread: 0x00007f9ed4111000
RBP=0x00007f9e8fefb500 is pointing into the stack for thread: 0x00007f9ed4111000
RSI=0x0000000000000000 is an unknown value
RDI=0x00007f9ee8000020 is an unknown value
R8 =0x00007f9ee8000088 is an unknown value
R9 =0x00007f9e8fefb580 is pointing into the stack for thread: 0x00007f9ed4111000
R10=0x0000000000000048 is an unknown value
R11=
关键内容如下:
大致含义是java虚拟机在启动的时候发生了原始异常,产生问题的框架是:C [libzip.so+0x124e2] newEntry+0x62。于是又在网上搜索C [libzip.so+0x124e2] newEntry+0x62,最终搜到这篇文章:https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8214788,这是JVM的一个BUG,java在操作zip文件的时候会调用linux系统libzip.so库,如果在jvm运行的时候做了zip文件操作,而此时zip文件做了修改,就会有JVM崩溃的风险。在oracle的java bug database中搜索,发现有很多类似的问题:
当然,Oracle官方也给出了解决方案:
1、在JVM启动参数中加入:-Dsun.zip.disableMemoryMapping=true ,但该方案有一个风险,会导致jvm在磁盘读取jar文件的时候有巨大的性能风险,当然前提是你一遍又一遍的读。然而,我在网上搜索关于类似的问题的时候,发现有网友加了该启动参数却又遇到了其他问题。
2、升级到JDK9。该问题在这个版本得到了完美解决。
总结: 我们系统在启动的时候会加载zip文件,可能当时我们的操作人员未等其他服务器启动完毕,就立即重启了下一台服务器,碰巧两台服务器都在对同一个zip文件做操作,就导致JVM没有完全启动。