1: Reference material before coding
Spark official docs: the Kafka dependency and Kafka configuration:
http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html
2: Example from the official docs
Dependency:
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0
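In an sbt build the same dependency would look roughly like the line below (a sketch; match the Scala suffix and version to your own Spark installation):
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"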
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,anotherhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
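Note that map on its own is lazy: the snippet above does nothing until an output operation is attached and the context is started. A minimal way to complete it (assuming the streamingContext variable from the docs):
stream.map(record => (record.key, record.value)).print()
streamingContext.start()
streamingContext.awaitTermination()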
3: Programming example
Requirement:
Consume data from Kafka in real time, run a word count over each batch, and print the results.
package com.wonderland.demo01
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object demo05 {
def main(args: Array[String]): Unit = {
//1. Create the entry points
val sparkContext = new SparkContext(new SparkConf().setAppName("kafka").setMaster("local[6]"))
val ssc = new StreamingContext(sparkContext, Seconds(3))
//2. Raise the log level so the real-time output is easier to follow
sparkContext.setLogLevel("WARN")
//3. Consume the Kafka data -- check the createDirectStream source for the type parameters and arguments, and see the Spark docs: http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html
//3.1 Configure the topics as an array; the names must match the topics that exist in Kafka
val topic = Array("spark01")
//3.2 Configure the Kafka parameters (kafkaParams)
//These can be copied straight from the Spark docs and then adapted
val kafkaParams = Map[String, Object](
//Param 1: the Kafka broker hosts and ports
"bootstrap.servers" -> "node01:9092,node02:9092,node03:9092",
//Param 2: deserializer for the message key
"key.deserializer" -> classOf[StringDeserializer],
//Param 3: deserializer for the message value
"value.deserializer" -> classOf[StringDeserializer],
//Param 4: consumer group id; choose any name
"group.id" -> "sparkStreaming_kafka_spark01",
//Param 5: where to start when there is no committed offset: latest = newest offset, earliest = oldest offset
"auto.offset.reset" -> "latest",
//Param 6: disable auto-commit so offsets are committed manually; the type ascription after the colon must stay
"enable.auto.commit" -> (false: java.lang.Boolean)
)
//3.3 Use KafkaUtils to connect to the source and pass in the parameters
/**
 * The type parameters [K, V] are the key and value types of the Kafka messages.
 * Arg 1: the StreamingContext
 * Arg 2: LocationStrategies.PreferConsistent -- explained in the source comments
 * Arg 3: ConsumerStrategies.Subscribe[K, V] (same type parameters as above; arg 1 is the topics to consume, arg 2 the Kafka parameters) -- explained in the source comments
 */
val source: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topic, kafkaParams)
)
//4. Process the data and commit the offsets in the same step
//Process each batch RDD
source.foreachRDD(
//Take the value of each record: in the [String, String] pair, the value carries the actual payload
rdd => {
//Extract the data and run the word count
val sourceRDD: RDD[String] = rdd.map(_.value())
val resultRDD: RDD[(String, Int)] = sourceRDD.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
resultRDD.foreach(println(_))
//Get the offset ranges consumed by this batch
val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//Commit the offsets back to Kafka
source.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
)
//5. Start the streaming computation
ssc.start()
//6. Block the main thread so the job keeps receiving data
ssc.awaitTermination()
}
}
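One design note on step 4: commitAsync only queues the offset ranges, and the single-argument overload gives no way to observe commit failures. CanCommitOffsets also has an overload that takes an org.apache.kafka.clients.consumer.OffsetCommitCallback; a sketch of using it in place of the commit call inside the foreachRDD block above:
source.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
  new org.apache.kafka.clients.consumer.OffsetCommitCallback {
    override def onComplete(offsets: java.util.Map[org.apache.kafka.common.TopicPartition, org.apache.kafka.clients.consumer.OffsetAndMetadata],
                            exception: Exception): Unit = {
      //exception is null when the commit succeeded
      if (exception != null) println(s"offset commit failed: ${exception.getMessage}")
    }
  })
Because enable.auto.commit is false and offsets are committed only after each batch is processed, restarting the job with the same group.id resumes from the last committed offset, giving at-least-once semantics.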
Note:
If the Kafka producer reports: ERROR Error when sending message to topic spark01 with key: null, value: 10 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback) org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Cause:
The 9092 port number was mistyped when starting the Kafka console producer.
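For reference, a console producer pointed at the brokers configured above would be started like this (a sketch; Kafka 0.10/1.x syntax, and the script location depends on your install):
kafka-console-producer.sh --broker-list node01:9092 --topic spark01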