The error was as follows:
2021-02-01 17:11:13 ERROR TaskSetManager:73 - Task 0 in stage 4.0 failed 1 times; aborting job
2021-02-01 17:11:13 INFO TaskSchedulerImpl:57 - Removed TaskSet 4.0, whose tasks have all completed, from pool
2021-02-01 17:11:13 INFO TaskSchedulerImpl:57 - Cancelling stage 4
2021-02-01 17:11:13 INFO TaskSchedulerImpl:57 - Killing all running tasks in stage 4: Stage cancelled
2021-02-01 17:11:13 INFO DAGScheduler:57 - ResultStage 4 (foreachPartition at KuduContext.scala:350) failed in 141.587 s due to Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=256, tablet="4a02e65bac264694b14faeee409987d1" [0x00000002, 0x00000003), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=4a02e65bac264694b14faeee409987d1, attempt=23, TimeoutTracker(timeout=30000, elapsed=28936), Trace Summary(28936 ms): Sent(23), Received(23), Delayed(23), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (49d6bbea2c404778a65f3738a72c6ae9, [ Write, 23 ])
Received: (49d6bbea2c404778a65f3738a72c6ae9, [ SERVICE_UNAVAILABLE, 23 ])
Delayed: (UNKNOWN, [ Write, 23 ]))}
(the same "Timed out: cannot complete before timeout: Batch{…}" sample error repeats four more times and is omitted here)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:362)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:350)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2021-02-01 17:11:13 INFO DAGScheduler:57 - Job 3 failed: foreachPartition at KuduContext.scala:350, took 141.590192 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=256, tablet="4a02e65bac264694b14faeee409987d1" [0x00000002, 0x00000003), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=4a02e65bac264694b14faeee409987d1, attempt=23, TimeoutTracker(timeout=30000, elapsed=28936), Trace Summary(28936 ms): Sent(23), Received(23), Delayed(23), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (49d6bbea2c404778a65f3738a72c6ae9, [ Write, 23 ])
Received: (49d6bbea2c404778a65f3738a72c6ae9, [ SERVICE_UNAVAILABLE, 23 ])
Delayed: (UNKNOWN, [ Write, 23 ]))}
(the same "Timed out: cannot complete before timeout: Batch{…}" sample error repeats four more times and is omitted here)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:362)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:350)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:929)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2081)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2102)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2121)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at org.apache.kudu.spark.kudu.KuduContext.writeRows(KuduContext.scala:350)
at org.apache.kudu.spark.kudu.KuduContext.upsertRows(KuduContext.scala:291)
at com.sparkel.rw.writer.kudu.KuduWriter.writerDatas(KuduWriter.java:100)
at SparkEL.syncData(SparkEL.java:219)
at SparkEL.main(SparkEL.java:255)
Caused by: java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=256, tablet="4a02e65bac264694b14faeee409987d1" [0x00000002, 0x00000003), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=4a02e65bac264694b14faeee409987d1, attempt=23, TimeoutTracker(timeout=30000, elapsed=28936), Trace Summary(28936 ms): Sent(23), Received(23), Delayed(23), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (49d6bbea2c404778a65f3738a72c6ae9, [ Write, 23 ])
Received: (49d6bbea2c404778a65f3738a72c6ae9, [ SERVICE_UNAVAILABLE, 23 ])
Delayed: (UNKNOWN, [ Write, 23 ]))}
(the same "Timed out: cannot complete before timeout: Batch{…}" sample error repeats four more times and is omitted here)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:362)
at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:350)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The error above contains a few key pieces of information: "tablet", "SERVICE_UNAVAILABLE", and the UUID-like string "49d6bbea2c404778a65f3738a72c6ae9".
That string raised a question: could it be the ID of one of this table's tablets, or of the server hosting it? The next step was to look up the table's description in Kudu.
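A quick way to map these IDs to actual machines is the `kudu` command-line tool. The sketch below assumes a hypothetical master address `kudu-master:7051` (substitute your own): `kudu tserver list` shows which host a server UUID such as `49d6bbea…` belongs to, and `kudu cluster ksck` reports the health of every tablet, including `4a02e65bac264694b14faeee409987d1` from the trace above.

```shell
# List tablet servers: maps the UUID from the RPC trace (e.g. 49d6bbea...)
# to a host. "kudu-master:7051" is a placeholder for your master address(es).
kudu tserver list kudu-master:7051

# Run a full cluster health check; unhealthy tablets (such as
# 4a02e65bac264694b14faeee409987d1) and their replicas are flagged here.
kudu cluster ksck kudu-master:7051
```

Recent Kudu versions also accept a `-tables=<name>` flag on `ksck` to limit the check to a single table.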
From that description it was clear that one of the replica nodes backing this Kudu table's tablet had a problem. Simply waiting for it to recover is enough, or the affected Kudu node can be restarted.
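If the node does not recover on its own, restarting the Kudu tablet server on the affected host is one option. The exact commands depend on how Kudu was installed; the sketch below assumes a package-based install managed by systemd (Cloudera Manager users would restart the role from the CM UI instead):

```shell
# On the affected host (assumes a package install with systemd units):
sudo systemctl restart kudu-tserver

# Verify the service came back up before rerunning the job.
sudo systemctl status kudu-tserver
```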
After that, the job was rerun and the data was written successfully:
2021-02-03 09:19:05 INFO KuduContext:458 - applied 60036 upserts to table 'db.data_20210201' in 5332ms
2021-02-03 09:19:05 INFO Executor:57 - Finished task 0.0 in stage 4.0 (TID 4). 37312 bytes result sent to driver
2021-02-03 09:19:05 INFO TaskSetManager:57 - Finished task 0.0 in stage 4.0 (TID 4) in 7801 ms on localhost (executor driver) (1/1)
2021-02-03 09:19:05 INFO TaskSchedulerImpl:57 - Removed TaskSet 4.0, whose tasks have all completed, from pool
2021-02-03 09:19:05 INFO DAGScheduler:57 - ResultStage 4 (foreachPartition at KuduContext.scala:350) finished in 7.807 s
2021-02-03 09:19:05 INFO DAGScheduler:57 - Job 3 finished: foreachPartition at KuduContext.scala:350, took 7.808699 s
2021-02-03 09:19:05 INFO KuduContext:371 - completed upsert ops: duration histogram: 5332ms