Environment: CDH 6.3.2, Spark on YARN, running in client/cluster mode
Error: Spark Streaming consuming Kafka data and writing it to Kudu fails with the error below.
Each batch contains roughly 4,000 records.
21/03/12 10:13:24 ERROR executor.Executor: Exception in task 1.1 in stage 0.0 (TID 4)
org.apache.kudu.client.NonRecoverableException: MANUAL_FLUSH is enabled but the buffer is too big
at org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
at org.apache.kudu.client.KuduSession.apply(KuduSession.java:93)
at com.zfdz.ProcessRDD$$anonfun$UpsertToSourceLayer$1.apply(Kafka2LogicServer.scala:198)
at com.zfdz.ProcessRDD$$anonfun$UpsertToSourceLayer$1.apply(Kafka2LogicServer.scala:190)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.zfdz.ProcessRDD.UpsertToSourceLayer(Kafka2LogicServer.scala:190)
at com.zfdz.ProcessRDD$$anonfun$upsertKudu$1$$anonfun$apply$1$$anonfun$apply$2.apply(Kafka2LogicServer.scala:127)
at com.zfdz.ProcessRDD$$anonfun$upsertKudu$1$$anonfun$apply$1$$anonfun$apply$2.apply(Kafka2LogicServer.scala:119)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.streaming.kafka010.KafkaRDDIterator.foreach(KafkaRDD.scala:229)
at com.zfdz.ProcessRDD$$anonfun$upsertKudu$1$$anonfun$apply$1.apply(Kafka2LogicServer.scala:119)
at com.zfdz.ProcessRDD$$anonfun$upsertKudu$1$$anonfun$apply$1.apply(Kafka2LogicServer.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: org.apache.kudu.client.KuduException$OriginalException: Original asynchronous stack trace
at org.apache.kudu.client.AsyncKuduSession.apply(AsyncKuduSession.java:596)
at org.apache.kudu.client.KuduSession.apply(KuduSession.java:79)
... 25 more
Solution 1:
The cause of the error is easy to understand: the Kudu session uses the MANUAL_FLUSH mode, but the number of buffered operations exceeded the upper limit of 4000 before a flush was issued.
Modify the code as follows:
// Specify the buffer size when creating the Kudu session
val kuduSession = kuduClient.newSession()
kuduSession.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)
kuduSession.setMutationBufferSpace(4000)
// With manual flushing, the buffer must be flushed before it fills up;
// here we flush once it is half full
var uncommit = 0
// inside the per-record loop, after kuduSession.apply(...):
uncommit = uncommit + 1
if (uncommit > 4000 / 2) {
  kuduSession.flush()
  uncommit = 0
}
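The counter-based flush pattern above can be sketched end to end with a mock session. `MockSession` and `writeBatch` below are illustrative stand-ins, not the real `org.apache.kudu.client.KuduSession` API; the mock's `apply` only mimics the "buffer is too big" check so the flushing logic can be exercised without a Kudu cluster:

```scala
// MockSession stands in for a MANUAL_FLUSH KuduSession: apply() fails
// if the buffer is already full, flush() records each flush's size.
class MockSession(bufferSpace: Int) {
  var buffered = 0
  var flushSizes = List.empty[Int]
  def apply(row: Int): Unit = {
    require(buffered < bufferSpace, "buffer is too big") // mirrors the Kudu error
    buffered += 1
  }
  def flush(): Unit = { flushSizes :+= buffered; buffered = 0 }
}

// Flush at half the buffer capacity, plus a final flush for the tail.
def writeBatch(rows: Seq[Int], session: MockSession, bufferSpace: Int): Unit = {
  var uncommit = 0
  rows.foreach { row =>
    session.apply(row)
    uncommit += 1
    if (uncommit > bufferSpace / 2) {
      session.flush()
      uncommit = 0
    }
  }
  if (uncommit > 0) session.flush()
}

val session = new MockSession(bufferSpace = 4000)
writeBatch(1 to 4000, session, bufferSpace = 4000)
println(session.flushSizes) // List(2001, 1999): no flush exceeds the buffer
```

With a 4000-row batch the counter triggers one flush at 2001 rows and a tail flush of 1999, so the buffer never reaches its limit.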
Enlarging the buffer with setMutationBufferSpace when creating the Kudu session solves the problem in most cases. It did not fit my business logic, however, because my Spark job consumes Kafka with the direct approach using the subscribe strategy, i.e. one Kafka topic corresponds to one consumer.
Solution 2:
With the size set via setMutationBufferSpace, flush once the uncommitted count exceeds the buffer size divided by (number of Kafka topic partitions * 2).
This solved the problem. My KuduClient is created before foreachRDD, so a single client serves the buffered data of multiple partitions at once.
For example, with a buffer size of 4000 and a topic with 8 partitions, up to 32,000 rows could accumulate in the buffer. Dividing by 2 alone still allows up to 16,000 buffered rows, far beyond the 4000-row limit, so the error recurs. With the approach below, at most 2,000 rows are buffered, so the error does not occur.
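The arithmetic above can be checked directly. The names below (`bufferSpace`, `numPartitions`, the threshold values) are illustrative, matching the example's 4000-row buffer and 8 partitions:

```scala
// Worst-case rows buffered through one shared session when every
// partition task only flushes at its own threshold.
val bufferSpace   = 4000
val numPartitions = 8

val naiveThreshold = bufferSpace / 2                   // 2000 rows per task
val safeThreshold  = bufferSpace / (numPartitions * 2) // 250 rows per task

// All partition tasks write through the same client/session:
val naiveWorstCase = naiveThreshold * numPartitions // 16000 rows > 4000: error
val safeWorstCase  = safeThreshold * numPartitions  //  2000 rows < 4000: safe
println(s"naive: $naiveWorstCase, safe: $safeWorstCase")
```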
// Specify the buffer size when creating the Kudu session
val kuduSession = kuduClient.newSession()
kuduSession.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)
kuduSession.setMutationBufferSpace(4000)
// With manual flushing, the buffer must be flushed before it fills up.
// Since the KuduClient is created before foreachPartition, the divisor
// must be (number of Kafka partitions * 2).
if (kuduSession.getFlushMode == SessionConfiguration.FlushMode.MANUAL_FLUSH) {
  uncommit = uncommit + 1
  if (uncommit > 4000 / (conf.getKfkpartition * 2)) {
    kuduSession.flush()
    uncommit = 0
  }
}