TiDB "Transaction is too large" exception triggered by an oversized transaction


Preface

In a version our project shipped this year, we recently found that one statistics chart was missing the most recent hour of data, and it happened every day. After the issue was reported we fixed it: once troubleshooting finally located the problem and the fix checked out locally and in the test environment, the patch went out, and I assumed that was the end of it. Unexpectedly, the patch triggered an earlier data-accumulation problem, and the customer soon reported that the data was now gone entirely. From hearing the report to remoting into the customer environment, I was rather baffled: how did perfectly healthy data-collection and statistics tasks just stop working?
Looking at the scene, the first step was to decide how to localize the problem. Following the data flow of this feature: the chart queries a statistics table, and that statistics table is rolled up from a raw collection table, so either the collection thread was broken or the statistics thread was. The logs plus arthas made it easy to confirm the collection thread was running normally, so the problem had to be in the statistics thread. Tracing the code then surfaced the issue described in my previous post: ScheduledExecutorService scheduled tasks silently stop running.

So it was confirmed: an exception thrown during one execution of the scheduled task suppressed all of its subsequent executions, which is why the page showed no statistics. Each run of this task does only two things: first, delete expired rows from the raw collection table so data does not pile up; second, aggregate the recent data up one time dimension for the charts. The problem had to be in one of the two, but since no exceptions were being caught, the exact cause could not be confirmed. My hunch was that the delete was failing, but there was no evidence. The plan was simply to add exception handling and hand the patch to testing; fortunately, by end of day the patch-verification environment reproduced the problem and logged it:
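
Incidentally, the suppression itself is standard ScheduledExecutorService behavior: if a run throws, scheduleAtFixedRate silently cancels all subsequent runs. A minimal defensive sketch (the scheduler setup and period here are illustrative, not the project's actual wiring):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StatisticTaskGuard {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    public static void start(Runnable statisticTask) {
        // Catch (and log) everything so one failed run cannot cancel
        // every subsequent execution of the fixed-rate task.
        SCHEDULER.scheduleAtFixedRate(() -> {
            try {
                statisticTask.run();
            } catch (Exception e) {
                e.printStackTrace(); // replace with real logging
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}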

2023-01-05 18:31:37:902 [pool-12-thread-1] ERROR [xxx-monitor,xxx-monitor-59b7547c9f-7lcrr,,,,13] com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.aggregateOneMinuteData:296 - aggregate minute statistic error:
org.springframework.jdbc.UncategorizedSQLException:
### Error updating database.  Cause: java.sql.SQLException: Transaction is too large, size: 104857600
### The error may exist in com/xxx/xxx/monitor/port/service/statistic/dao/IInterfaceCountInfoDao.java (best guess)
### The error may involve com.xxx.xxx.monitor.port.service.statistic.dao.IInterfaceCountInfoDao.clearFixTimeAndTDBTableData-Inline
### The error occurred while setting parameters
### SQL: delete from xxx_interface_count_info where time_millisecond between 1641376915106 and 1672912915105
### Cause: java.sql.SQLException: Transaction is too large, size: 104857600
; uncategorized SQLException; SQL state [HY000]; error code [8004]; Transaction is too large, size: 104857600; nested exception is java.sql.SQLException: Transaction is too large, size: 104857600
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:89)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
        at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:88)
        at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
        at com.sun.proxy.$Proxy135.delete(Unknown Source)
        at org.mybatis.spring.SqlSessionTemplate.delete(SqlSessionTemplate.java:303)
        at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:72)
        at org.apache.ibatis.binding.MapperProxy$PlaixxxethodInvoker.invoke(MapperProxy.java:152)
        at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:85)
        at com.sun.proxy.$Proxy146.clearFixTimeAndTDBTableData(Unknown Source)
        at com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.clearFixTimeAndTDBTableData(InfFlowStatistic.java:308)
        at com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.aggregateOneMinuteData(InfFlowStatistic.java:269)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:855)
Caused by: java.sql.SQLException: Transaction is too large, size: 104857600
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
        at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
        at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)
        at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java:370)
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
        at org.apache.ibatis.executor.statement.PreparedStatementHandler.update(PreparedStatementHandler.java:47)
        at org.apache.ibatis.executor.statement.RoutingStatementHandler.update(RoutingStatementHandler.java:74)
        at org.apache.ibatis.executor.SimpleExecutor.doUpdate(SimpleExecutor.java:50)
        at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:117)
        at org.apache.ibatis.executor.CachingExecutor.update(CachingExecutor.java:76)
        at sun.reflect.GeneratedMethodAccessor295.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63)
        at com.sun.proxy.$Proxy183.update(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor295.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49)
        at com.xxx.xxx.common.db.AutoIdInterceptor.intercept(AutoIdInterceptor.java:83)
        at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
        at com.sun.proxy.$Proxy183.update(Unknown Source)
        at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:197)
        at org.apache.ibatis.session.defaults.DefaultSqlSession.delete(DefaultSqlSession.java:212)
        at sun.reflect.GeneratedMethodAccessor298.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
        ... 15 common frames omitted


The root cause is java.sql.SQLException: Transaction is too large, size: 104857600: the task failed because the delete covered too many rows and the transaction exceeded what the TiDB configuration allows. How we got here: a delete bug in the released version caused data to accumulate; to be able to clean up the expired data, the previous patch widened the delete range, and the resulting oversized transaction made the task fail. I had not anticipated this, which deserves reflection. One more point to reflect on: although I vaguely suspected the delete was failing because of the data volume, I never actually ran the delete SQL in the performance environment to verify it. If the problem had not, by luck, reproduced in the patch-verification environment, the root cause would still be open. Hypothesize boldly, verify carefully.
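
For context, 104857600 bytes is exactly 100 MiB, TiDB's default transaction size cap (txn-total-size-limit). Raising it is possible, though it only postpones the problem; a sketch below, assuming the standard tidb.toml layout (changing it requires a tidb-server restart, and TiDB 4.0's large-transaction support allows raising it to roughly 10 GiB):

# tidb.toml (sketch)
[performance]
# default is 104857600 (100 MiB) -- the exact size reported in the error above
txn-total-size-limit = 1073741824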

The next questions, then: how to restore the affected environment, and how to fix the problem for good.


I. Data volume in the affected environment

[Screenshots: row counts of the affected tables]
As the screenshots show, the tables were already at the tens-of-millions-of-rows level.

II. Reproducing the problem by manually running the SQL

[Screenshot: the manual delete failing with "Transaction is too large"]
The statement here deletes roughly the most recent year of data.

III. Solutions

Based on blog posts and the official TiDB documentation, there are two common solutions.
References:
TiDB删除大数据的处理
TiDB Best Practice
从一个简单的Delete删数据场景谈TiDB数据库开发规范的重要性

1. Enable batch commit: a non-transactional, unsafe execution mode

However, even with batch-delete mode enabled as suggested online, the delete still failed:

set tidb_batch_delete=1;
delete from nm_interface_count_info where time_millisecond between 1643087626208 and 1674623626208;
set tidb_batch_delete=0;

[Screenshot: the batch-mode delete failing with the same error]
(A likely reason: batch DML only takes effect under specific conditions, such as autocommit being enabled, and the feature was never considered safe; it was later deprecated in favor of non-transactional DML.)

2. Looking for a way to turn transactions off

Unfortunately, the official documentation shows that the TiDB version our project runs does not have the non-transactional DML feature; it is only available from v6.1 onward. See the official batch-delete example:
[Screenshot: official documentation for non-transactional DML, available from TiDB v6.1]

Syntax for deleting data non-transactionally:
BATCH ON id LIMIT 2 DELETE FROM t where v < 6;
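
Per the v6.1+ documentation, such a statement can also be previewed before running it by adding DRY RUN, which reports how it would be split without executing anything:

BATCH ON id LIMIT 2 DRY RUN DELETE FROM t where v < 6;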

The project's TiDB version:

mysql> select tidb_version();
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v4.0.0
Edition: Community
Git Commit Hash: 689a6b6439ae7835947fcaccf329a3fc303986cb
Git Branch: heads/refs/tags/v4.0.0
UTC Build Time: 2020-05-28 01:37:40
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

With that ruled out, the only option left was to follow the recommendation and clean up with batched deletes.

3. Batched deletes

This implementation follows TiDB's recommended best practice (TiDB Best Practice). One caveat: deleting too much in one pass leaves behind a large number of MVCC delete marks that subsequent deletes must still scan past, so each round gets slower and slower; narrowing the deleted time range is the way to avoid this (see the sketch after the code below).
[Excerpt from TiDB Best Practice on batched deletes]

// Implementation; the deleted time range is not yet narrowed here
private void clearXxxData(String table) {
    long currentTimeMillis = System.currentTimeMillis();
    int count = 1;
    int effectRowsTotal = 0;
    // Each DAO call deletes at most BATCH_DEL_NUM rows (delete ... limit BATCH_DEL_NUM)
    int effectRows = this.xxxDao.clearXxxTableData(table, BATCH_DEL_NUM);
    effectRowsTotal += effectRows;
    logger.info("cleanup round 1, table:{}, effectRows:{}", table, effectRows);

    // A full batch means more expired rows may remain; loop until a short batch
    while (effectRows >= BATCH_DEL_NUM) {
        effectRows = this.xxxDao.clearXxxTableData(table, BATCH_DEL_NUM);
        effectRowsTotal += effectRows;
        ++count;
        logger.info("cleanup round {}, table:{}, effectRows:{}", count, table, effectRows);
    }

    logger.info("cleanup done, table:{} effectRowsTotal:{} timeUse:{} ms", table, effectRowsTotal, System.currentTimeMillis() - currentTimeMillis);
}
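
A sketch of the range narrowing mentioned above. Hypothetical: it assumes a DAO overload clearXxxTableData(table, sliceStart, sliceEnd, limit) that adds a time_millisecond range to the delete; the one-hour slice width is arbitrary:

// Walk [oldest, expireBefore) in fixed time slices so each delete scans a
// narrow index range instead of wading through every accumulated delete mark.
private void clearXxxDataBySlice(String table, long oldest, long expireBefore) {
    final long sliceMillis = java.util.concurrent.TimeUnit.HOURS.toMillis(1);
    for (long sliceStart = oldest; sliceStart < expireBefore; sliceStart += sliceMillis) {
        long sliceEnd = Math.min(sliceStart + sliceMillis, expireBefore);
        int effectRows;
        do { // still batched within the slice, BATCH_DEL_NUM rows per round
            effectRows = this.xxxDao.clearXxxTableData(table, sliceStart, sliceEnd, BATCH_DEL_NUM);
        } while (effectRows >= BATCH_DEL_NUM);
    }
}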

After these changes, verification in the performance environment showed the deletes were indeed on the slow side, but the functional problem looked solved. I then tried to speed things up at the SQL level. The original statement bounded the range with between; based on the explain output below I assumed that adding an order by ... limit clause would make the delete more efficient, but in practice it did not.

mysql> explain delete from xxx_table where time_millisecond between 1643087626208 and 1674623626208;
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| id                        | estRows   | task      | access object                 | operator info                                                                                                                  |
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Delete_4                  | N/A       | root      |                               | N/A                                                                                                                            |
| └─TableReader_8           | 428191.00 | root      |                               | data:Selection_7                                                                                                               |
|   └─Selection_7           | 428191.00 | cop[tikv] |                               | ge(nm.xxx_table.time_millisecond, 1643087626208), le(nm.xxx_table.time_millisecond, 1674623626208) |
|     └─TableFullScan_6     | 428191.00 | cop[tikv] | table:xxx_table | keep order:false                                                                                                               |
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond between 1643087626208 and 1674623626208 order by time_millisecond asc limit 10000;
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
| id                               | estRows  | task      | access object                                                              | operator info                                        |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
| Delete_6                         | N/A      | root      |                                                                            | N/A                                                  |
| └─IndexLookUp_32                 | 10000.00 | root      |                                                                            | limit embedded(offset:0, count:10000)                |
|   ├─Limit_31(Build)              | 10000.00 | cop[tikv] |                                                                            | offset:0, count:10000                                |
|   │ └─IndexRangeScan_29          | 10000.00 | cop[tikv] | table:xxx_table, index:idx_timemillisecond(time_millisecond) | range:[1643087626208,1674623626208], keep order:true |
|   └─TableRowIDScan_30(Probe)     | 10000.00 | cop[tikv] | table:xxx_table                                              | keep order:false, stats:pseudo                       |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
5 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond < 1674623626208;
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
| id                        | estRows   | task      | access object                 | operator info                                                  |
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
| Delete_4                  | N/A       | root      |                               | N/A                                                            |
| └─TableReader_8           | 441570.00 | root      |                               | data:Selection_7                                               |
|   └─Selection_7           | 441570.00 | cop[tikv] |                               | lt(nm.xxx_table.time_millisecond, 1674623626208) |
|     └─TableFullScan_6     | 441570.00 | cop[tikv] | table:xxx_table | keep order:false                                               |
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
4 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond < 1674623626208 order by time_millisecond asc limit 10000;
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
| id                               | estRows  | task      | access object                                                              | operator info                               |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
| Delete_6                         | N/A      | root      |                                                                            | N/A                                         |
| └─IndexLookUp_32                 | 10000.00 | root      |                                                                            | limit embedded(offset:0, count:10000)       |
|   ├─Limit_31(Build)              | 10000.00 | cop[tikv] |                                                                            | offset:0, count:10000                       |
|   │ └─IndexRangeScan_29          | 10000.00 | cop[tikv] | table:xxx_table, index:idx_timemillisecond(time_millisecond) | range:[-inf,1674623626208), keep order:true |
|   └─TableRowIDScan_30(Probe)     | 10000.00 | cop[tikv] | table:xxx_table                                              | keep order:false, stats:pseudo              |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
5 rows in set (0.00 sec)

The plans show that with order by ... limit the delete goes through the index and scans far fewer rows than the plain range delete, yet the actual execution times were still unsatisfying, as the trace screenshots showed:
Range bounded with between:
[Screenshot: trace of the delete using between]
Range bounded with <:
[Screenshot: trace of the delete using <]
Neither variant made a noticeable difference, so this still needs exploring. The current plan is to try a smaller batch size: it is 10000 today, so try 5000. For the customer environment the tables simply had to be cleaned with truncate, since all the data currently in them is expired and useless.

Strictly following the best practice means first determining the range to delete, and the implementation varies with the business: the bounds can, for instance, come from a query, but queries have their own cost, so the right balance has to be evaluated and verified against the actual workload.
The approach I took was to wrap the cleanup in a method:

private int clearData(String table, long endTime) {
    // Query the oldest timestamp first so the delete gets an explicit lower bound
    Long startTime = this.interfaceCountInfoDao.getOldestTime(table);
    if (startTime != null && startTime <= endTime) {
        return this.interfaceCountInfoDao.clearXxxTableData(table, startTime, endTime);
    }
    logger.info("nothing to clear, oldest time is missing or past the deadline: {} > {}", startTime, endTime);
    return 0;
}
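
For reference, a hypothetical sketch of the two DAO methods as MyBatis annotation mappers (the project's real mappers are XML-based, per the stack trace; the SQL and names below are guesses from the calls above):

import org.apache.ibatis.annotations.Delete;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

public interface IInterfaceCountInfoDao {
    // ${table} is string-substituted, since table names cannot be bound as parameters
    @Select("select min(time_millisecond) from ${table}")
    Long getOldestTime(@Param("table") String table);

    // Range-bounded delete; the explicit lower bound keeps the scanned range narrow
    @Delete("delete from ${table} where time_millisecond between #{startTime} and #{endTime}")
    int clearXxxTableData(@Param("table") String table,
                          @Param("startTime") long startTime,
                          @Param("endTime") long endTime);
}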


Summary

  1. During development, evaluate performance bottlenecks up front.
  2. When solving a problem, think through the angle of attack, then get hands-on and verify.