Problem description: A Structured Streaming job consuming from Tencent Cloud CKafka crashes whenever the consumption traffic exceeds the purchased peak consumer bandwidth. The error log is as follows:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 126.0 failed 4 times, most recent failure: Lost task 6.3 in stage 126.0 (TID 1354, 10.123.42.47, executor 3):
org.apache.kafka.common.errors.RecordTooLargeException: There are some messages at [Partition=Offset]: {test_A12-2=149689780}
whose size is larger than the fetch size 1024000 and hence cannot be ever returned.
Increase the fetch size on the client (using max.partition.fetch.bytes),
or decrease the maximum message size the broker will allow (using message.max.bytes).
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
Problem analysis:
step1: Going by the exception message alone, the job appeared to contain messages larger than max.partition.fetch.bytes, and every search result for this exception says the same thing: lower message.max.bytes on the broker or raise max.partition.fetch.bytes on the client. After adjusting those parameters, however, the error persisted. It was worst for batch-style jobs: each batch pulls its data in a burst, the instantaneous traffic spikes far above the bandwidth cap, and the job fails every time — it simply cannot run.
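For reference, client-side Kafka settings are passed to the Structured Streaming Kafka source with a `kafka.` prefix. A sketch of the tuning attempted in step1 (broker address and sizes are placeholders; as described above, this did not fix the crash, because the root cause turned out to be server-side truncation rather than a genuinely oversized message):

```scala
// Sketch of the step1 parameter tuning; values are illustrative only.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "ckafka-host:9092") // placeholder address
  .option("subscribe", "test_A12")
  // raise the per-partition fetch limit well above the largest expected message
  .option("kafka.max.partition.fetch.bytes", (10 * 1024 * 1024).toString)
  .option("kafka.fetch.max.bytes", (50 * 1024 * 1024).toString)
  .load()
```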
step2: After step1 led nowhere, we opened a ticket with Tencent Cloud, and they quickly set up a group chat to work on the issue. Their initial explanation was that throttling shrinks fetch.request.max.bytes for each fetch, and they suggested buying more bandwidth. But a smaller fetch.request.max.bytes by itself would not make the job fail. Reading the Spark and Kafka source code revealed what actually happens: while the consumer client parses the fetched data it hits an EOFException; Kafka's parsing logic treats an EOFException while reading the buffer as "all data consumed", and then checks whether buffer.limit > 0 — if so, it concludes it has hit a message too large to fit and throws RecordTooLargeException (code screenshots at the bottom if you want to follow along). The conclusion: under throttling, Tencent Cloud truncates the message payload, and the EOFException occurs because the truncated stream ends mid-record, with no record boundary.
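To see why a truncated fetch surfaces as RecordTooLargeException, here is a self-contained, simplified sketch of the decision described above. This illustrates the mechanism only — `FetchParseSketch`, `RecordTooLarge`, and the length-prefixed wire format are my simplifications, not Kafka's actual classes:

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException}
import java.nio.ByteBuffer

object FetchParseSketch {
  // Stand-in for org.apache.kafka.common.errors.RecordTooLargeException
  case class RecordTooLarge(msg: String) extends RuntimeException(msg)

  // Wire format in this sketch: each record is a 4-byte length prefix
  // followed by the payload. Returns the number of complete records read.
  def parse(buffer: ByteBuffer): Int = {
    val in = new DataInputStream(new ByteArrayInputStream(buffer.array()))
    var complete = 0
    try {
      while (true) {
        val len = in.readInt()        // EOFException when the buffer runs out
        val payload = new Array[Byte](len)
        in.readFully(payload)         // EOFException on a truncated record
        complete += 1
      }
    } catch {
      case _: EOFException =>
        // The Kafka-style conclusion: bytes were fetched (buffer.limit > 0)
        // but not one complete record could be parsed, so the client assumes
        // the record is bigger than the fetch size and gives up. A record
        // truncated server-side triggers exactly the same branch.
        if (complete == 0 && buffer.limit() > 0)
          throw RecordTooLarge("record larger than fetch size (or truncated)")
    }
    complete
  }
}
```

A buffer holding one complete record parses fine; a buffer whose length prefix promises more bytes than are present takes the "too large" branch even though the record itself is small.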
step3: Having confirmed that the message payload gets truncated and the Kafka client therefore fails to read the fetched data, my workaround was to catch the exception inside Spark's Kafka source implementation and retry the fetch. I won't go into Spark's implementation in detail; my changes are as follows:
The file to change is org.apache.spark.sql.kafka010.KafkaDataConsumer. Create an org.apache.spark.sql.kafka010 package in your own project, copy KafkaDataConsumer into it (so your copy shadows the one in the Spark jar), and modify this method:
    def get(
        offset: Long,
        untilOffset: Long,
        pollTimeoutMs: Long,
        failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]]
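The shape of the change is a retry loop around the fetch. Below is a self-contained sketch of that loop — `withRetry`, `maxRetries`, and the catch on RuntimeException are my stand-ins to keep the sketch runnable; in the real patch the wrapped call is the fetch inside get, and the caught type is Kafka's RecordTooLargeException:

```scala
object RetrySketch {
  // Generic retry wrapper standing in for the change made inside
  // KafkaDataConsumer.get: catch the parse failure and fetch again.
  // A truncation caused by throttling is transient, so a later fetch
  // usually returns the complete record.
  def withRetry[T](maxRetries: Int)(fetch: () => T): T = {
    var attempt = 0
    while (true) {
      try {
        return fetch()
      } catch {
        case e: RuntimeException =>
          attempt += 1
          if (attempt > maxRetries) throw e // give up: preserve old behavior
      }
    }
    sys.error("unreachable")
  }
}
```

With this in place, a transient truncation no longer kills the whole stage; the consumer simply fetches again, and only a persistent failure is rethrown.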
Attachments:
Screenshots for step2: