今日线上机器发生服务卡顿现象,根据运维人员给的报警记录知道是那台机器出现的故障
解决思路
- 是否内存溢出
- 是否线程跑满无法接收新请求
根据这2个思路进行进一步的排查
内存溢出排查
内存溢出的排查步骤
第一步:进程是否存活
通过命令看到,tomcat的进程存在,则说明进程是存活的,如果是内存溢出导致的宕机的话,可以在
cat /var/log/messages
可以看到机器是在什么时候溢出的,这样的话,我们只能从dump文件来分析内存的大对象信息了
第二步:查看JVM内存是否正常
通过命令查看
jmap -heap pid
$ jmap -heap 2723
Attaching to process ID 2723, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14
using thread-local object allocation.
Garbage-First (G1) GC with 4 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 2575302656 (2456.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 2097152 (2.0MB)
Heap Usage:
G1 Heap:
regions = 2048
capacity = 4294967296 (4096.0MB)
used = 3687839368 (3516.997688293457MB)
free = 607127928 (579.002311706543MB)
85.86420137435198% used
G1 Young Generation:
Eden Space:
regions = 885
capacity = 2006974464 (1914.0MB)
used = 1855979520 (1770.0MB)
free = 150994944 (144.0MB)
92.47648902821317% used
Survivor Space:
regions = 15
capacity = 31457280 (30.0MB)
used = 31457280 (30.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 907
capacity = 2256535552 (2152.0MB)
used = 1798305416 (1714.997688293457MB)
free = 458230136 (437.00231170654297MB)
79.69320112887812% used
200785 interned Strings occupying 21207448 bytes.
从堆信息来看,内存充足,说明不是内存的问题
第三步:查看对象信息
如果是内存出了问题,我们一定可以查到是那个类的实例太多导致了我们系统的问题
同样是使用堆分析命令
jmap -histo:live 2723 >aaa.log
由于堆内存对象信息特别多因此建议输出在文本中查看,一般看前50行足够,如果没有我们自己写的类,那一般可以肯定问题不是出现在内存上了
num #instances #bytes class name
----------------------------------------------
1: 1377882 721392080 [C
2: 1140506 100364528 java.lang.reflect.Method
3: 87973 52810296 [B
4: 1368121 32834904 java.lang.String
5: 980746 31383872 java.util.HashMap$Node
6: 652719 31330512 org.aspectj.weaver.reflect.ShadowMatchImpl
7: 762876 30515040 java.util.LinkedHashMap$Entry
8: 769210 24614720 java.util.concurrent.ConcurrentHashMap$Node
9: 260191 23543120 [Ljava.util.HashMap$Node;
10: 924804 23440400 [Ljava.lang.String;
11: 652719 20887008 org.aspectj.weaver.patterns.ExposedState
12: 826190 18004968 [Z
13: 293162 17204016 [Ljava.lang.Object;
14: 652719 14069760 [Lorg.aspectj.weaver.ast.Var;
15: 268954 12909792 java.util.HashMap
16: 175403 9822568 java.util.LinkedHashMap
17: 69779 9434712 [[Ljava.lang.String;
18: 220558 8822320 java.lang.ref.SoftReference
19: 298086 7154064 java.util.ArrayList
20: 204250 6536000 java.lang.ref.WeakReference
21: 52528 6363696 java.lang.Class
22: 390 6136272 [Ljava.util.concurrent.ConcurrentHashMap$Node;
23: 141958 5649848 [I
24: 59065 4725200 java.lang.reflect.Constructor
25: 227110 4592400 [Ljava.lang.Class;
26: 110306 4412240 java.util.WeakHashMap$Entry
27: 55838 3837608 [Ljava.util.Hashtable$Entry;
28: 57099 3654336 org.hibernate.mapping.Column
29: 225098 3601568 java.lang.Integer
30: 74124 3557952 org.apache.commons.collections.ReferenceMap$SoftRef
31: 57097 3197432 org.hibernate.mapping.Property
32: 57099 2740752 org.hibernate.mapping.SimpleValue
33: 107268 2574432 java.beans.MethodRef
34: 53218 2554464 java.util.Hashtable
35: 45524 2549344 java.beans.MethodDescriptor
36: 28447 2503336 org.apache.catalina.session.StandardSession
37: 55435 2217400 org.hibernate.annotations.common.reflection.java.JavaXProperty
38: 53884 2155360 org.hibernate.tuple.StandardProperty
39: 28081 2021832 java.lang.reflect.Field
40: 61638 1972416 org.apache.commons.collections.SequencedHashMap$Entry
41: 29913 1914432 java.util.concurrent.ConcurrentHashMap
42: 55406 1772992 org.hibernate.annotations.common.reflection.java.JavaXSimpleType
43: 24472 1761984 sun.reflect.DelegatingClassLoader
44: 72766 1746384 sun.reflect.annotation.AnnotationInvocationHandler
45: 103277 1652432 java.util.HashSet
46: 49103 1571296 java.util.Vector
47: 22653 1449792 java.net.URL
48: 43082 1378624 java.util.Hashtable$Entry
49: 57097 1370328 org.hibernate.property.BasicPropertyAccessor$BasicGetter
50: 57097 1370328 org.hibernate.property.BasicPropertyAccessor$BasicSetter
这么多实例中没有看到我们自己写的类,当然也有可能是框架故障导致,不过此类问题一般较少
第四步:排查线程信息
jstack pid
线程的信息也比较多,一般可以输出到文本中查看
jstack 2723 >a.log
下载文件之后我们看到,java.lang.Thread.State: WAITING (parking)
线程处于WAITING状态
"http-nio-8190-exec-69" #126 daemon prio=5 os_prio=0 tid=0x00007fe9800de800 nid=0xbc3 waiting on condition [0x00007fe8e9402000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006fc703a10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:104)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:32)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
"http-nio-8190-exec-68" #125 daemon prio=5 os_prio=0 tid=0x00007fe99502e800 nid=0xbc2 waiting on condition [0x00007fe8e9483000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006fc703a10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:104)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:32)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
明确告知我们锁对象a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
明确告知了ThreadPoolExecutor
中出现了锁,如果能看到我们自己写的代码,则可以直接断定是哪里出了问题,如果不是,需要我们排查系统中所有使用到线程池的地方
统计一下进程中有多少线程处于等待状态
grep WAITING |wc -l
104
发现有104个线程处于等待,一共只有151个线程
150个是tomcat线程 在server.xml中配置
1个是守护线程
稍后我会丰富一下WAITING 等内容的含义
第五步:排查代码
经过代码排查发现,我们系统中有一段代码
ExecutorService savePool = Executors.newSingleThreadExecutor();
这段代码中的savePool是一个成员变量,且受Spring管理
在使用这段代码时候
savePool.execute(new Runnable() {
@Override
public void run() {
try {
IgnoreLoginController.ignoreLogin();
// 2016-01-28 by cary 来自于ECM的销售出库单审核成功后,要将状态反馈给上游ECM
saveOtherInfo(orders, operate);
} catch (Exception e) {
log.error(e);
} finally {
IgnoreLoginController.cancelIgnore();
orders.clear();
}
}
});
很明显这段代码没有执行线程池的shutdown()
方法
shutdown() 的含义是,停止接收新任务,但是线程池中已有的任务人可以继续执行
shutdownNow() 的含义是,停止接收新任务,现有线程池中的任务也停止执行
因此只需要在这段代码最后加上 savePool.shutdown();
即可
注意:jmap jstack命令只能在root用户,或者系统项目的用户自己可以执行,非本用户启动的项目可能出现以下错误
[yianju@prd-server ~]$jmap -heap 20846
Attaching to process ID 20846, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: cannot open binary file
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: cannot open binary file
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: cannot open binary file
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)