com.pingcap.tikv.exception.TiClientInternalException: Error reading region

While running the ad-hoc job FillAppInfo, loading data from TiDB repeatedly failed with com.pingcap.tikv.exception.TiClientInternalException. The root cause turned out to be excessive CPU and disk pressure on the TiDB cluster. Adjusting the SQL statements and the submit parameters mitigated the errors but did not fix them; the final solution was discovering an IO bottleneck on two TiKV nodes and migrating them to new physical machines.

While the ad-hoc job FillAppInfo was loading data from TiDB, it repeatedly failed with errors such as:

20/02/25 05:41:59 INFO TaskSetManager: Starting task 9.1 in stage 3.0 (TID 3771, n45-14.fn.ams.osa, partition 9, PROCESS_LOCAL, 5706 bytes)
20/02/25 05:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 3771 on executor id: 4 hostname: n45-14.fn.ams.osa.
20/02/25 05:41:59 WARN TaskSetManager: Lost task 6.0 in stage 3.0 (TID 3765, n45-14.fn.ams.osa): java.sql.BatchUpdateException: (conn=3979867) unexpected end of stream, read 0 bytes from 4 (socket was closed by server)
        at org.mariadb.jdbc.MariaDbStatement.executeBatchExceptionEpilogue(MariaDbStatement.java:288)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeBatch(ClientSidePreparedStatement.java:301)
        at com.opera.adx.infra.JDBCTool$$anonfun$exportWith$1.apply(ExportTool.scala:50)
        at com.opera.adx.infra.JDBCTool$$anonfun$exportWith$1.apply(ExportTool.scala:41)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLNonTransientConnectionException: (conn=3979867) unexpected end of stream, read 0 bytes from 4 (socket was closed by server)
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:234)
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:165)
        at org.mariadb.jdbc.MariaDbStatement.executeBatchExceptionEpilogue(MariaDbStatement.java:285)
        ... 13 more
Caused by: java.sql.SQLNonTransientConnectionException: unexpected end of stream, read 0 bytes from 4 (socket was closed by server)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.handleIoException(AbstractQueryProtocol.java:1894)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.readPacket(AbstractQueryProtocol.java:1437)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.getResult(AbstractQueryProtocol.java:1415)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeBatchRewrite(AbstractQueryProtocol.java:908)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeBatchClient(AbstractQueryProtocol.java:364)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeInternalBatch(ClientSidePreparedStatement.java:360)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeBatch(ClientSidePreparedStatement.java:296)
        ... 12 more
Caused by: java.io.EOFException: unexpected end of stream, read 0 bytes from 4 (socket was closed by server)
        at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacketArray(StandardPacketInputStream.java:246)
        at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacket(StandardPacketInputStream.java:215)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.readPacket(AbstractQueryProtocol.java:1435)
        ... 17 more

20/02/25 05:42:15 INFO TaskSetManager: Starting task 6.1 in stage 3.0 (TID 3772, n05-17.fn.ams.osa, partition 6, PROCESS_LOCAL, 5706 bytes)
20/02/25 05:42:15 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 3772 on executor id: 10 hostname: n05-17.fn.ams.osa.
20/02/25 05:42:15 WARN TaskSetManager: Lost task 8.0 in stage 3.0 (TID 3767, n05-17.fn.ams.osa): java.sql.BatchUpdateException: (conn=3979864) TiKV server is busy[try again later]
        at org.mariadb.jdbc.MariaDbStatement.executeBatchExceptionEpilogue(MariaDbStatement.java:288)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeBatch(ClientSidePreparedStatement.java:301)
        at com.opera.adx.infra.JDBCTool$$anonfun$exportWith$1.apply(ExportTool.scala:50)
        at com.opera.adx.infra.JDBCTool$$anonfun$exportWith$1.apply(ExportTool.scala:41)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: (conn=3979864) TiKV server is busy[try again later]
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:255)
        at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:165)
        at org.mariadb.jdbc.MariaDbStatement.executeBatchExceptionEpilogue(MariaDbStatement.java:285)
        ... 13 more
Caused by: java.sql.SQLException: TiKV server is busy[try again later]
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.readErrorPacket(AbstractQueryProtocol.java:1594)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.readPacket(AbstractQueryProtocol.java:1453)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.getResult(AbstractQueryProtocol.java:1415)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeBatchRewrite(AbstractQueryProtocol.java:908)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeBatchClient(AbstractQueryProtocol.java:364)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeInternalBatch(ClientSidePreparedStatement.java:360)
        at org.mariadb.jdbc.ClientSidePreparedStatement.executeBatch(ClientSidePreparedStatement.java:296)
        ... 12 more
20/02/24 09:14:22 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,0,ShuffleMapTask,ExceptionFailure(com.pingcap.tikv.exception.TiClientInternalException,Error reading region:,[Ljava.lang.StackTraceElement;@2bf6e421,com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
        at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:153)
        at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:92)
        at org.apache.spark.sql.tispark.TiRDD$$anon$2.hasNext(TiRDD.scala:89)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:148)
        ... 13 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
        at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:186)
        at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:66)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: request outdated.
        at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:258)
        at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:236)
        at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:177)

The job failed during the load phase and never reached the insert stage.
Neither web searches nor the official documentation turned up an applicable solution.

Resolution:
1. Suspected that TiDB CPU and disk pressure was too high because TiContext was reading too much data at once. Rewrote the SQL in the code to push the query work down to Spark; the change was verified to take effect, but the errors persisted.
2. Changed the submit parameters:
Set num-executors and executor-cores to 1 and parallelism to 1 to reduce the concurrent query pressure on TiDB. This did make the errors stop, but the job was still running after 16 hours — far too slow to be practical.
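The reduced-concurrency submit from step 2 would look roughly like the following. This is a sketch: the main class name and jar path are placeholders, not taken from the original job.

```shell
# Hypothetical spark-submit for step 2: concurrency reduced to a single
# executor with a single core, and default parallelism forced to 1, so
# only one JDBC/TiSpark task hits TiDB at a time.
# --class and the jar path are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 1 \
  --executor-cores 1 \
  --conf spark.default.parallelism=1 \
  --class com.opera.adx.job.FillAppInfo \
  fill-app-info.jar
```

This trades throughput for stability, which is why the job then ran for over 16 hours without finishing.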

Finally we escalated to the ops team, who found that two TiKV nodes had hit their IO limits. After replacing the physical machines for those two nodes and migrating the data, the problem was resolved.
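An IO bottleneck like the one the ops team found can usually be confirmed directly on the TiKV hosts with standard Linux tools, for example:

```shell
# Sample extended device statistics 5 times at 1-second intervals.
# Sustained %util near 100 on the TiKV data disk indicates saturation.
iostat -x 1 5

# Cross-check which processes generate the IO (requires root);
# -o shows only active processes, -P aggregates per process.
iotop -oP
```

If `%util` stays pinned while `await` climbs, the disk, not TiKV itself, is the bottleneck, which matches the "TiKV server is busy" symptoms seen in the logs above.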
