问题现象
近期,一个新上线的项目线上出现了“启动后cpu就飙升到100%”的问题。内存状况基本维持正常且gc日志正常。
首先怀疑的情况是某个代码死循环了,如递归,while循环等,导致cpu空转。
之前其他项目部署采用的都是weblogic+jdk1.8进行部署的。
出现问题的环境:docker + tomcat + jdk1.8 + wlfullclient(为了支持原有的ejb接口调用)
针对此种场景,我们首先将问题定位(以下步骤不是生产真实结果,只是说明步骤):
排查步骤
问题发现
通过排查,有很多线程都卡在了一行代码处,如下:
"http-nio-8080-exec-81" #4474 daemon prio=5 os_prio=0 tid=0x00007f2a44284800 nid=0x119c runnable [0x00007f28fdcfa000]
java.lang.Thread.State: RUNNABLE
at java.util.WeakHashMap.get(WeakHashMap.java:403)
at weblogic.rmi.internal.ClientRuntimeDescriptor.getInterfaces(ClientRuntimeDescriptor.java:132)
at weblogic.rmi.internal.StubInfo.getInterfaces(StubInfo.java:78)
at weblogic.rmi.internal.StubGenerator.<init>(StubGenerator.java:87)
at weblogic.rmi.internal.StubGenerator.hotCodeGenClass(StubGenerator.java:775)
at weblogic.rmi.internal.StubGenerator.getStubClass(StubGenerator.java:759)
at weblogic.rmi.internal.StubGenerator.generateStub(StubGenerator.java:786)
at weblogic.rmi.internal.StubGenerator.generateStub(StubGenerator.java:779)
at weblogic.rmi.extensions.StubFactory.getStub(StubFactory.java:74)
at weblogic.rmi.internal.StubInfo.resolveObject(StubInfo.java:213)
at weblogic.rmi.internal.StubInfo.readResolve(StubInfo.java:207)
at sun.reflect.GeneratedMethodAccessor163.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2078)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at weblogic.ejb.container.internal.RemoteBusinessIntfProxy.readObject(RemoteBusinessIntfProxy.java:215)
at sun.reflect.GeneratedMethodAccessor160.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
at weblogic.rmi.internal.ObjectIO.readObject(ObjectIO.java:62)
at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:240)
at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:348)
at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:259)
at weblogic.jndi.internal.ServerNamingNode_1030_WLStub.lookup(Unknown Source)
at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:392)
at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:380)
at javax.naming.InitialContext.lookup(InitialContext.java:417)
at org.springframework.jndi.JndiTemplate$1.doInContext(JndiTemplate.java:155)
at org.springframework.jndi.JndiTemplate.execute(JndiTemplate.java:87)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:152)
at org.springframework.jndi.JndiTemplate.lookup(JndiTemplate.java:179)
at org.springframework.jndi.JndiLocatorSupport.lookup(JndiLocatorSupport.java:104)
at org.springframework.jndi.JndiObjectLocator.lookup(JndiObjectLocator.java:106)
at org.springframework.ejb.access.AbstractRemoteSlsbInvokerInterceptor.lookup(AbstractRemoteSlsbInvokerInterceptor.java:99)
<!--Ejb调用查找调用方-->
at org.springframework.ejb.access.AbstractSlsbInvokerInterceptor.getHome(AbstractSlsbInvokerInterceptor.java:159)
at org.springframework.ejb.access.AbstractSlsbInvokerInterceptor.create(AbstractSlsbInvokerInterceptor.java:215)
at org.springframework.ejb.access.AbstractRemoteSlsbInvokerInterceptor.newSessionBeanInstance(AbstractRemoteSlsbInvokerInterceptor.java:224)
at org.springframework.ejb.access.SimpleRemoteSlsbInvokerInterceptor.getSessionBeanInstance(SimpleRemoteSlsbInvokerInterceptor.java:141)
at org.springframework.ejb.access.SimpleRemoteSlsbInvokerInterceptor.doInvoke(SimpleRemoteSlsbInvokerInterceptor.java:97)
at org.springframework.ejb.access.AbstractRemoteSlsbInvokerInterceptor.invokeInContext(AbstractRemoteSlsbInvokerInterceptor.java:140)
at org.springframework.ejb.access.AbstractSlsbInvokerInterceptor.invoke(AbstractSlsbInvokerInterceptor.java:189)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
at com.sun.proxy.$Proxy59.receiveMessageXML(Unknown Source)
观察该线程栈,我们发现这是一个ejb方法调用的过程,一直在寻找调用方,并且卡在了:
at java.util.WeakHashMap.get(WeakHashMap.java:403)
at weblogic.rmi.internal.ClientRuntimeDescriptor.getInterfaces(ClientRuntimeDescriptor.java:132)
我们先看下WeakHashMap.java:403:
是在map中通过链表进行元素比对获取元素的过程,正常情况下不会一直待在这个地方呀,即使链表过长,也会因为扩容,hash函数优化解决,不会消耗过多的时间。这应该不是链表过长引起的。难道是链表成环了?
我们知道,在jdk1.8之前的hashMap因为头插法,在resize的工程中,当多线程同时访问时,在扩容后会出现链表成环的问题。
链表成环的原因:JDK1.8之前,HashMap#get操作死循环问题剖析
我们来看下ClientRuntimeDescriptor.java:132,看看这里的WeakHashMap是否存在链表成环的条件:
1、头插法:
我们看到在原来tab[i]赋值给了新节点的next节点,在1.8中,WeakHashMap仍然采用的是头插法。
2、线程是否安全
public ClientRuntimeDescriptor(String[] var1, String var2, Map var3, ClientMethodDescriptor var4, String var5, String var6) {
// cache使用的直接是new WeakHashMap();并非是线程安全的
this.cache = new WeakHashMap();
this.interfaceNames = var1;
this.applicationName = var2;
this.descriptorBySignature = var3;
this.stubName = var5;
this.defaultClientMD = var4;
this.codebase = var6;
this.computeHashCode();
}
Class[] getInterfaces() {
ClassLoader var1 = this.findLoader();
// 关注cache的初始化
WeakReference var2 = (WeakReference)this.cache.get(var1);
return var2 != null && var2.get() != null ? (Class[])((Class[])var2.get()) : this.computeInterfaces();
}
private Class[] computeInterfaces() {
Class[] var1 = new Class[this.interfaceNames.length];
// ...省略部分代码
// 此处会存在并发的put操作
this.cache.put(var2, new WeakReference(var1));
return var1;
}
可见,这里对WeakHashMap的使用不是线程安全的。
问题分析
基于前面说的,WeakHashMap在1.8采用的仍然是头插法,且存在并发访问问题,在并发高的时候,就会成环了。而本次事故刚好发生在接口调用位置,并发高,所以出现了链表成环的问题。
jdk1.8之前HashMap会产生死循环的原因:
可是我们查看的类是在wlfullclient.jar中的,是weblogic自己生成的,并不是自己开发,为啥会存在问题呢?
原因是我们用的这个jar,版本太老了,是10.3.5。在10.3.6版本,这个问题就修复了:
public ClientRuntimeDescriptor(String[] var1, String var2, Map var3, ClientMethodDescriptor var4, String var5, String var6) {
// Collections.synchronizedMap
this.cache = Collections.synchronizedMap(new WeakHashMap());
this.interfaceNames = var1;
this.applicationName = var2;
this.descriptorBySignature = var3;
this.stubName = var5;
this.defaultClientMD = var4;
this.codebase = var6;
this.computeHashCode();
}
对WeakHashMap采用了Collections.synchronizedMap()修饰,虽然性能没有很高,但至少线程安全了。