Druid communications link failture排查过程

问题:

2023-01-16, 21:33:22  2023-01-16 21:33:22.661 ERROR 7 --- [167930-thread-1] c.a.druid.pool.DruidPooledStatement : CommunicationsException, druid version 1.2.5, jdbcUrl : jdbc:mysql://bds.haier.net:9035/di_fw?connectTimeout=300000&socketTimeout=300000, testWhileIdle true, idle millis 19862, minIdle 10, poolingCount 1, timeBetweenEvictionRunsMillis 60000, lastValidIdleMillis 49499, driver com.mysql.jdbc.Driver, exceptionSorter com.alibaba.druid.pool.vendor.MySqlExceptionSorter
2023-01-16, 21:33:22  2023-01-16 21:33:22.661 ERROR 7 --- [167930-thread-1] com.alibaba.druid.pool.DruidDataSource : discard connection
2023-01-16, 21:33:22  
2023-01-16, 21:33:22  com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
2023-01-16, 21:33:22  
2023-01-16, 21:33:22  The last packet successfully received from the server was 49,500 milliseconds ago. The last packet sent successfully to the server was 49,515 milliseconds ago. at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery$original$ICdb496O(ClientPreparedStatement.java:1003) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery$original$ICdb496O$accessor$Y4IPLmuo(ClientPreparedStatement.java) at com.mysql.cj.jdbc.ClientPreparedStatement$auxiliary$DOhLhsKX.call(Unknown Source) at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java) at com.alibaba.druid.pool.DruidPooledPreparedStatement.executeQuery(DruidPooledPreparedStatement.java:227) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
2023-01-16, 21:33:22  
2023-01-16, 21:33:22  The last packet successfully received from the server was 49,500 milliseconds ago. The last packet sent successfully to the server was 49,515 milliseconds ago. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151) at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167) at com.mysql.cj.protocol.a.NativeProtocol.clearInputStream(NativeProtocol.java:815) at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:630) at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:948) at com.mysql.cj.NativeSession.execSQL(NativeSession.java:1075) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930) ... 10 common frames omitted Caused by: java.io.IOException: Socket is closed at com.mysql.cj.protocol.AbstractSocketConnection.getMysqlInput(AbstractSocketConnection.java:72) at com.mysql.cj.protocol.a.NativeProtocol.clearInputStream(NativeProtocol.java:811) ... 14 common frames omitted

com.alibaba.druid.pool.DruidDataSource : discard connection

druid config:

testWhileIdle = true testOnBorrow = false timeBetweenEvictionRunsMillis = 60s minEvictableIdleTimeMillis = 5 min maxEvictableIdleTimeMillis = 50min

感觉 get connection出啥问题了:

public DruidPooledConnection getConnectionDirect(long maxWaitMillis) throws SQLException {
    if (testOnBorrow) {...}
    else {
           if (testWhileIdle) {
                    ...
                    if (idleMillis >= timeBetweenEvictionRunsMillis
                            || idleMillis < 0 // unexcepted branch
                            ) {
                        boolean validate = testConnectionInternal(poolableConnection.holder, poolableConnection.conn);
                        if (!validate) {
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("skip not validate connection.");
                            }

                            discardConnection(poolableConnection.holder);
                             continue;
                        }
                    }
                    ...      
           } 
    }
 }

如果最近active的时间大于的判断周期(也就是60s),或没有active过,就要检查下连接,如果连接有问题,比如已经close了,validate就是false,就要discardConnection,discard的逻辑也很简单,就是close connection,然后更新池的状态;这里报错应该就是close的报错,为什么会报错的呢,猜测服务端已经关闭了连接。但是druid不是有清理线程的吗,为什么没有把已经close的connection的关掉呢?

继续看 清理线程逻辑:

启动清理线程

protected void createAndStartDestroyThread() {
    destroyTask = new DestroyTask();

    if (destroyScheduler != null) {
        long period = timeBetweenEvictionRunsMillis;
       ...
        destroySchedulerFuture = destroyScheduler.scheduleAtFixedRate(destroyTask, period, period,
                                                                      TimeUnit.MILLISECONDS);
       ...
    }

...
}

可以看出来,timeBetweenEvictionRunsMillis参数确定了清理线程的执行频率

public class DestroyTask implements Runnable {
    public DestroyTask() {

    }

    @Override
    public void run() {
        shrink(true, keepAlive);

        if (isRemoveAbandoned()) {
            removeAbandoned();
        }
    }

}

可以看到,逻辑都在shrink里面。

这个方法主要干两件事情:

  1. 将要清理的连接加入evictConnections
  2. 关闭之,update池子状态

那什么情况下的连接会被认为是要清理的呢?

  1. 设置了phyTimeoutMillos,连接保持的时间大于phy timeout 加入
if (phyTimeoutMillis > 0) {
    long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
    if (phyConnectTimeMillis > phyTimeoutMillis) {
        evictConnections[evictCount++] = connection;
        continue;
    }
}

2. 空闲时间 > min idel (1. > min idel count 2. 空闲时间>max idel)

if (idleMillis >= minEvictableIdleTimeMillis) {
    if (checkTime && i < checkCount) {
        evictConnections[evictCount++] = connection;
        continue;
    } else if (idleMillis > maxEvictableIdleTimeMillis) {
        evictConnections[evictCount++] = connection;
        continue;
    }
}

3. 该场景 keep alive设置的false,不会触发这个逻辑

if (keepAlive && idleMillis >= keepAliveBetweenTimeMillis) {
    keepAliveConnections[keepAliveCount++] = connection;
}

什么情况下,会认为连接不需要处理呢?

当目前空闲的时间小于最小空闲保持时间(5min),并且小于keepAliveBetweenTimeMillis(该参数默认2分钟)

long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;

if (idleMillis < minEvictableIdleTimeMillis
        && idleMillis < keepAliveBetweenTimeMillis
) {
    break;
}

也就是说,两分钟内如果连接已经被关闭,或者已经失效,但这个连接不会被加入到清理的队列。但是正常数据库的超时时间是五分钟,是不会触发这个问题的,why?

后来从运维doris的同学口中得知,报错的那个时间点,正好重启了doris,那么整个原因明确了:因为重启了server端,所有的连接其实都close了,但是因为上述druid的逻辑,两分钟内连接没有加入清理队列,所有test idle的时候close就报错了。

服务告警从21:31持续到21:34后恢复正常,整个过程持续了3分钟左右,也从侧面证实了推断的正确

有什么办法能避免吗?

  1. testWhileIdle = false,拿连接的时候不检查,但是由于拿的连接有问题,会在执行的时候出问题,不解决根本问题
  2. 减小phyTimeoutMillis,加快连接的关闭速度,但在正确情况下,会拖慢响应速度

其实服务器都已经挂了,池中的连接已经都是无效的,唯一能做的就是加快清理连接的速度,但这样的调整,会拖慢平时的连接获取速度,所以其实啥都做不了.....

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值