最近线上的项目查询经常出现500,查询日志发现出错的原因是数据库链接失效的问题Communications link failure。
项目使用的是阿里巴巴druid连接池+proxysql代理+mysql(mgr集群)。
版本:
druid:1.1.16
proxysql:2.2.0
mysql:8.0.20
1、检查druid链接池的配置,该配置的都配置了。
<bean id="masterDataSource" class="com.alibaba.druid.pool.DruidDataSource" init-method="init" destroy-method="close">
<property name="driverClassName" value="${master.jdbc.driverClassName}" />
<property name="url" value="${master.jdbc.url}" />
<property name="username" value="${master.jdbc.username}" />
<property name="password" value="${master.jdbc.password}" />
<property name="initialSize" value="10" />
<property name="minIdle" value="50" />
<property name="maxActive" value="100" />
<property name="maxWait" value="60000" />
<property name="timeBetweenEvictionRunsMillis" value="60000" />
<property name="minEvictableIdleTimeMillis" value="300000" />
<property name="validationQuery" value="select 1" />
<property name="testWhileIdle" value="true" />
<property name="testOnBorrow" value="false" />
<property name="testOnReturn" value="false" />
<property name="filters" value="stat" />
</bean>
默认使用ping的方式来检查链接是否失效,源码:
MySqlValidConnectionChecker类的isValidConnection方法。
可以增加jvm参数控制usePingMedthod的值,参数:-Ddruid.mysql.usePingMethod=false
public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) throws Exception {
if (conn.isClosed()) {
return false;
}
if (usePingMethod) {
if (conn instanceof DruidPooledConnection) {
conn = ((DruidPooledConnection) conn).getConnection();
}
if (conn instanceof ConnectionProxy) {
conn = ((ConnectionProxy) conn).getRawObject();
}
if (clazz.isAssignableFrom(conn.getClass())) {
if (validationQueryTimeout < 0) {
validationQueryTimeout = DEFAULT_VALIDATION_QUERY_TIMEOUT;
}
try {
ping.invoke(conn, true, validationQueryTimeout * 1000);
} catch (InvocationTargetException e) {
Throwable cause = e.getCause();
if (cause instanceof SQLException) {
throw (SQLException) cause;
}
throw e;
}
return true;
}
}
String query = validateQuery;
if (validateQuery == null || validateQuery.isEmpty()) {
query = DEFAULT_VALIDATION_QUERY;
}
Statement stmt = null;
ResultSet rs = null;
try {
stmt = conn.createStatement();
if (validationQueryTimeout > 0) {
stmt.setQueryTimeout(validationQueryTimeout);
}
rs = stmt.executeQuery(query);
return true;
} finally {
JdbcUtils.close(rs);
JdbcUtils.close(stmt);
}
}
2、查询mysql链接有效期,本项目修改为8小时
show variables like '%timeout%';
set global WAIT_TIMEOUT=28800;
set global INTERACTIVE_TIMEOUT=28800;
set WAIT_TIMEOUT=28800;
set INTERACTIVE_TIMEOUT=28800;
set session interactive_timeout=28800;
set session wait_timeout=28800;
3、检查proxysql的参数配置
show variables like '%timeout%';
set mysql-wait_timeout=28000000;
检查proxysql的mysq server情况
select * from mysql_servers;
select * from runtime_mysql_servers ;
发现有状态为OFFLINE_HARD的server:
查询proxysql日志,为什么会OFFLINE_HARD
grep -i OFFLINE /var/lib/proxysql/proxysql.log | grep because | cut -b 124- | sort | uniq -c
4 setting host 192.168.1.38:3305 offline because: slave is lagging
6 setting host 192.168.1.41:3305 offline because: slave is lagging
原因是由于slave is lagging,从节点复制延迟导致proxysql更改状态,导致链接关闭。
配置参数:
具体含义:https://proxysql.com/Documentation/global-variables/mysql-monitor-variables/
set mysql-monitor_replication_lag_interval=300000;
set mysql-monitor_replication_lag_timeout=5000;
set mysql-monitor_slave_lag_when_null=300;
set mysql-monitor_groupreplication_max_transactions_behind_count=5;
如果组复制max_transactions_behind超过阈值一次,则该节点将从集群中驱逐.
查询当前的max_transactions_behind,从runtime_mysql_group_replication_hostgroups表中
select * from runtime_mysql_group_replication_hostgroups;
修改值,并加载到内存中,使之生效:
update mysql_group_replication_hostgroups set max_transactions_behind=2000;
LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;
备注
此修改只是放大了延迟事务及时间,治标不治本,本质问题还是mysql集群主从复制延迟的问题。