Spark Streaming integrates with Kafka through KafkaUtils, which ships in two versions:
| | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
| --- | --- | --- |
| Kafka version | 0.8.2.1 or higher | 0.10.0 or higher |
| Offset Commit API | × | √ |
The 0.8 integration has been deprecated and is not recommended.
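To use the 0-10 integration, add the corresponding dependency to your build. A minimal sbt sketch; the Spark version here is an assumption, so match it to your cluster:

```scala
// build.sbt — version numbers are assumptions; align them with your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"
)
```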
Consuming from Kafka offers three delivery semantics:
1. At most once: each record is consumed at most once (records may be lost, but are never reprocessed)
2. At least once: each record is consumed at least once (records may be reprocessed, but are never lost)
3. Exactly once: each record is consumed exactly once
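These semantics come down to when offsets are committed relative to processing: commit before (or independently of) processing and you get at most once; commit after processing completes and you get at least once; commit atomically together with the output and you get exactly once. For contrast with the at-least-once code below, a rough at-most-once setup simply re-enables Kafka's auto-commit; a sketch, where the broker addresses and group id are illustrative placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer

// Rough at-most-once: the consumer auto-commits offsets on a timer, so an
// offset can be committed before the corresponding records are processed.
val atMostOnceParams = Map[String, Object](
  "bootstrap.servers" -> "c1:9092,c2:9092,c3:9092",       // placeholder brokers
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "at_most_once_demo",                      // placeholder group id
  "enable.auto.commit" -> (true: java.lang.Boolean),      // auto-commit on
  "auto.commit.interval.ms" -> (5000: java.lang.Integer)  // commit every 5s
)
```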
For at-least-once semantics, the recommended approach is to maintain Kafka offsets through the official commit API, as in the following code:
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SSApp02 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("hehe")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    ssc.sparkContext.setLogLevel("ERROR")

    // Disable auto-commit so offsets are committed only after each batch's
    // output has completed (at-least-once).
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "c1:9092,c2:9092,c3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("testtopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.map(record => (record.key, record.value)).print()

    stream.foreachRDD { rdd =>
      // RDDs produced by createDirectStream carry their Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        // Each RDD partition maps 1:1 to a Kafka topic-partition.
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"partition=${o.partition}, fromOffset=${o.fromOffset}, untilOffset=${o.untilOffset}")
      }
      // Commit offsets back to Kafka after the batch's output has run.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
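Note that the casts to HasOffsetRanges and CanCommitOffsets only succeed on the stream returned directly by createDirectStream, not on a transformed stream. commitAsync is deliberately invoked after the batch's output: if the job fails before the commit, the batch is simply reprocessed, which is exactly what makes this at-least-once. The committed offsets are stored in Kafka's internal __consumer_offsets topic under the configured group.id.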
For an exactly-once implementation, see my other post: *Kafka + Spark Streaming: offset management that guarantees no data loss and no duplicate consumption*.