1. Dependencies
Add the spark-streaming-kafka integration dependency to your pom.xml, as follows:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
Note: the Spark version (and the Scala version suffix, here 2.11) must match the versions running on your cluster.
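If you build with sbt rather than Maven, the equivalent dependency (assuming the same Spark 2.4.3 / Scala 2.11 versions as above) would look like this in build.sbt:

```scala
// build.sbt — same coordinates as the Maven dependency above;
// %% appends the Scala version suffix (_2.11) automatically
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.3"
```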
2. The Spark Streaming program
A word count over messages in Kafka serves as the example:
package org.apache.spark.examples.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Description: Word count over a Kafka topic with Spark Streaming.
 * @Date: 2019/11/15
 */
object KafkaWordCount {

  // Update function for updateStateByKey: for each key, add this batch's
  // counts to the previously accumulated total (None for a new key).
  val func = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(count => (x, count)) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Kafka Streaming")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey requires a checkpoint directory for the running state
    ssc.checkpoint("E:\\software\\checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka_spark_streaming",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("testKafka")
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    // Split each message into words, count per batch, then maintain a
    // running total across batches via updateStateByKey
    val result = kafkaStream.flatMap(_.value().split(" "))
      .map((_, 1))
      .updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
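To see what the update function `func` above computes, here is a plain Scala sketch of a single updateStateByKey step for one key, independent of Spark (the sample word and counts are made up for illustration):

```scala
// One key's view inside updateStateByKey:
//   (key, new counts from this batch, previous running total)
val batch = Iterator(("spark", Seq(1, 1, 1), Option(2)))

// Same logic as func: add this batch's counts onto the previous state
val updated = batch.flatMap { case (word, counts, prev) =>
  Some(counts.sum + prev.getOrElse(0)).map(total => (word, total))
}.toList

// "spark" was seen 3 times this batch, on top of a previous total of 2
println(updated)  // List((spark,5))
```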
3. Key points
3.1 LocationStrategies
- LocationStrategies.PreferConsistent: distribute partitions evenly across the available executors;
- LocationStrategies.PreferBrokers: use this if your executors run on the same hosts as the Kafka brokers; it prefers to schedule each partition on the host of that partition's leader;
- LocationStrategies.PreferFixed: use this if the data load is heavily skewed across partitions. It lets you specify an explicit mapping from partitions to hosts (any partition not in the mapping falls back to a consistent location);
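As a sketch, an explicit partition-to-host mapping for PreferFixed might look like the following (the partition numbers and host names here are hypothetical; the topic name reuses the example above):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Hypothetical skew: pin the hot partitions to specific executor hosts
val hostMap = Map(
  new TopicPartition("testKafka", 0) -> "executor-host-1",
  new TopicPartition("testKafka", 1) -> "executor-host-2"
)
// Partitions not listed in hostMap fall back to consistent placement
val locationStrategy = LocationStrategies.PreferFixed(hostMap)
```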
3.2 ConsumerStrategies
- ConsumerStrategies.Subscribe: subscribe to a fixed collection of topics; multiple topics may be given, but their data formats should be consistent;
- ConsumerStrategies.SubscribePattern: subscribe to topics matched by a regular expression;
- ConsumerStrategies.Assign: consume a fixed collection of partitions.
If none of the above strategies meets your needs, you can extend the public ConsumerStrategy class and implement a custom consumer strategy.
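For example, SubscribePattern could pick up every topic whose name starts with a given prefix; the pattern below is illustrative, and kafkaParams is the map defined in the example above:

```scala
import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Matches testKafka, testKafka2, test_events, ... — any topic starting with "test"
val pattern = Pattern.compile("test.*")
val strategy = ConsumerStrategies.SubscribePattern[String, String](pattern, kafkaParams)
```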
3.3 Managing offsets when integrating with Kafka
When writing a Spark Streaming job against Kafka, you generally set "enable.auto.commit" to false, i.e. disable automatic offset commits.
Offsets can then be managed in the following ways:
- Store offsets in a checkpoint: if Spark checkpointing is enabled, offsets are stored in the checkpoint. The drawback is that offsets may be lost when the application code changes.
- Store offsets in Kafka: commit offsets manually with Kafka's commitAsync API. Compared with checkpointing, the benefit is that Kafka stores the offsets durably (in a separate internal topic) no matter how you change or upgrade your application code. However, Kafka commits are not transactional, so your outputs must still be idempotent.
// requires org.apache.spark.streaming.kafka010.{HasOffsetRanges, CanCommitOffsets}
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed:
  // commitAsync is best called only once this batch's output has finished
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
- Manage offsets yourself: store the offsets and the computation results transactionally, e.g. in a database, so that on failure you can roll both back together.
// The details depend on your data store, but the general idea looks like this
// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)
  // begin your transaction
  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly
  // end your transaction
}
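The continuity check in the comment above (the stored end offset must match the start of this batch) can be expressed as a plain predicate; the function name is illustrative:

```scala
// The fromOffset of the incoming batch must continue exactly where the
// stored offset left off; otherwise another writer advanced the state
// or data was skipped, and the transaction should be rolled back.
def offsetsContinuous(storedEnd: Long, batchStart: Long): Boolean =
  storedEnd == batchStart

println(offsetsContinuous(100L, 100L))  // true: safe to commit this batch
println(offsetsContinuous(100L, 120L))  // false: gap detected, roll back
```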