Production cases
(1) A slow MQ send filled up the Dubbo thread pool
16:54:34,026 WARN [New I/O server worker #1-4] - [DUBBO] Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-10.8.64.57:20880, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 5823 (completed: 5623), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://10.8.64.57:20880!, dubbo version: 2.5.3, current host: 10.8.64.57
16:54:34,027 WARN [New I/O server worker #1-4] - [DUBBO] Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-10.8.64.57:20880, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 5823 (completed: 5623), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://10.8.64.57:20880!, dubbo version: 2.5.3, current host: 10.8.64.57
16:54:34,027 WARN [New I/O server worker #1-4] - [DUBBO] An exception was thrown by a user handler while handling an exception event ([id: 0x7434600d, /10.8.64.57:59102 => /10.8.64.57:20880] EXCEPTION: com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process received event .), dubbo version: 2.5.3, current host: 10.8.64.57
com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process caught event .
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:67)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractPeer.caught(AbstractPeer.java:127)
at com.alibaba.dubbo.remoting.transport.netty.NettyHandler.exceptionCaught(NettyHandler.java:112)
The log says Dubbo's thread pool is EXHAUSTED — but which piece of business code is blocking it?
We found the process ID with ps, then dumped the thread stacks with jstack, and saw:
"DubboServerHandler-10.8.64.57:20880-thread-111" daemon prio=10 tid=0x00007fd574193800 nid=0x16cf1 waiting for monitor entry [0x00007fd691887000..0x00007fd691888810]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:40)
- waiting to lock <0x00007fd6c9fa4ba8> (a java.lang.Object)
at org.apache.activemq.transport.ResponseCorrelator.oneway(ResponseCorrelator.java:60)
at org.apache.activemq.ActiveMQConnection.doAsyncSendPacket(ActiveMQConnection.java:1265)
at org.apache.activemq.ActiveMQConnection.asyncSendPacket(ActiveMQConnection.java:1259)
at org.apache.activemq.ActiveMQSession.<init>(ActiveMQSession.java:250)
at org.apache.activemq.ActiveMQConnection.createSession(ActiveMQConnection.java:318)
at sun.reflect.GeneratedMethodAccessor222.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
Thread state: java.lang.Thread.State: BLOCKED
As analyzed above, the thread is blocked because it cannot acquire a synchronized monitor lock.
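The BLOCKED state is easy to reproduce: a thread that cannot enter a synchronized block whose monitor another thread holds reports Thread.State.BLOCKED, which is exactly what jstack printed above. A minimal sketch (the class name BlockedDemo is made up for illustration):

```java
public class BlockedDemo {
    static final Object monitor = new Object();

    public static Thread.State blockedState() throws InterruptedException {
        synchronized (monitor) {            // main thread holds the monitor
            Thread t = new Thread(() -> {
                synchronized (monitor) {    // second thread must wait for it
                }
            });
            t.start();
            // poll until the thread actually contends on the monitor
            while (t.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            return t.getState();            // BLOCKED (on object monitor), as in the dump
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(blockedState()); // prints BLOCKED
    }
}
```

Note that jstack reports this contention as "waiting for monitor entry", the exact phrase in the dump above.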
The frame
at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:40)
points to where the code is blocked: some other thread was holding the writeMutex monitor and never releasing it. Searching for writeMutex shows it is used only by MQ's message-send methods.
That pins down the cause: the synchronous MQ send was stuck, so every client thread calling the send method blocked behind it, and the Dubbo thread pool was exhausted.
The MQ broker itself had hit a bottleneck; we handed that off to the ops team.
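Beyond fixing the broker, a common defensive pattern is to keep slow MQ sends off the RPC worker threads entirely: push the send onto a small dedicated pool with a bounded queue and a timeout, so a backed-up broker makes sends fail fast instead of pinning 200 Dubbo threads. This is only a sketch of that idea, not what was actually deployed here; the class and method names are made up:

```java
import java.util.concurrent.*;

public class IsolatedSender {
    // small dedicated pool: slow sends pile up here instead of on Dubbo workers
    private final ExecutorService sendPool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),           // bounded queue
            new ThreadPoolExecutor.AbortPolicy());   // reject outright when MQ backs up

    public boolean send(Runnable mqSend, long timeoutMs) {
        try {
            sendPool.submit(mqSend).get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | RejectedExecutionException e) {
            return false;  // MQ is slow or saturated; the calling thread is released now
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;  // the send itself threw
        }
    }

    public static void main(String[] args) {
        IsolatedSender s = new IsolatedSender();
        System.out.println(s.send(() -> {}, 1000));  // fast send succeeds: true
    }
}
```

On timeout the submitted task keeps running in the background pool, so this bounds the caller's wait rather than cancelling the send; pairing it with a broker-side send timeout would be the complete mitigation.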
(2) The next case is a classic.
import java.util.Random;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Main {

    static final AtomicInteger threadNumber = new AtomicInteger(1);

    // name the threads the way Dubbo does, so the output mimics a real dump
    static ThreadFactory factory = new ThreadFactory() {
        @Override
        public Thread newThread(Runnable r) {
            return new Thread(r, "DubboServerHandler-10.0.105.140:29997-thread-" + threadNumber.getAndIncrement());
        }
    };

    private static Lock lock = new ReentrantLock();

    public static void main(String[] args) throws Exception {
        ExecutorService service = createDubboThreadPools();
        // first batch: 10 tasks that all call doService()
        for (int i = 0; i < 10; i++) {
            try {
                service.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            Thread.sleep(100);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        doService();
                    }
                });
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // second batch of tasks
        for (int i = 0; i < 10; i++) {
            service.execute(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                        // "... second batch running"
                        System.err.println(Thread.currentThread().getName() + " 第二波执行==========");
                    }
                }
            });
            Thread.sleep(100);
        }
    }

    public static void doService() {
        try {
            lock.lock();
            boolean ret = new Random().nextBoolean();
            if (ret || !ret) { // always true: simulate the business logic throwing
                // "... business exception thrown, no unlock"
                System.err.println(Thread.currentThread().getName() + " 开始执行业务异常,没有unlock");
                throw new RuntimeException("业务异常了"); // "business exception"
            }
            lock.unlock(); // never reached, so the lock is held forever
            System.err.println("业务执行正常 , unlock"); // "business ran normally, unlock"
        } catch (Exception e) {
            // exception swallowed; there is no finally, so unlock() is skipped
        }
    }

    public static ExecutorService createDubboThreadPools() {
        return new ThreadPoolExecutor(3, 4, 1, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(1000), factory);
    }
}
What the code above does:
1. Creates a simulated Dubbo thread pool with Dubbo-style thread names.
2. Submits a first batch of 10 tasks that call doService(). Note: doService() simulates business code throwing an exception, so unlock() never runs.
3. Then submits a second batch of 10 tasks.
The result is alarming:
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 开始执行业务异常,没有unlock
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
DubboServerHandler-10.0.105.140:29997-thread-2 第二波执行==========
Notice that throughout the whole run only DubboServerHandler-xx-thread-2 ever executes. The pool was configured with 3 core threads, yet only one is doing any work — the other threads are all blocked. In production, this is exactly how a Dubbo thread pool fills up.
A jstack dump of this program confirms it:
all 3 core threads have started, but only thread-2 is RUNNABLE; the rest are WAITING — parked on the lock.lock() call in doService().
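The fix is the standard ReentrantLock idiom: acquire the lock just before the try, and release it in a finally block so an exception in the business code can never leak the lock. A minimal corrected doService (the class name SafeService is for illustration; the RuntimeException simulates the business failure, as in the original):

```java
import java.util.concurrent.locks.ReentrantLock;

public class SafeService {
    static final ReentrantLock lock = new ReentrantLock();

    public static void doService() {
        lock.lock();   // acquire before try: if lock() itself fails, there is nothing to unlock
        try {
            // simulate the business code blowing up, as in the broken example
            throw new RuntimeException("simulated business exception");
        } finally {
            lock.unlock();  // always runs, even when the business code throws
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            try {
                doService();
            } catch (RuntimeException ignored) {
            }
        }
        // every caller acquired and released the lock; no thread is left holding it
        System.out.println("lock held after calls: " + lock.isLocked());  // prints false
    }
}
```

With this shape, thread-2's exception no longer strands threads 1 and 3 on lock.lock(), and the pool keeps all its workers available.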