- Spark Streaming has no built-in, official interface for writing data to Kafka; you have to use the producer API that Kafka itself provides.
- The first approach, shown below, throws an exception:
nameAddrPhoneStream.foreachRDD(rdd => {
  // Runs on the Driver
  // Initialize the producer configuration
  val props = new Properties()
  props.setProperty("bootstrap.servers", "master:9092")
  props.setProperty("client.id", "kafkaGenerator")
  props.setProperty("key.serializer", classOf[StringSerializer].getName)
  props.setProperty("value.serializer", classOf[StringSerializer].getName)
  // Create the producer
  val producer = new KafkaProducer[String, String](props)
  rdd.foreach(record => {
    // Runs on the Worker node that holds this record of the RDD
    // Send the message to Kafka
    producer.send(new ProducerRecord[String, String]("kafkaProducer", record))
  })
})
Exception: the KafkaProducer fails to serialize: Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
The reasons are:
1. The function passed to ds.foreachRDD(func) runs on the Driver;
2. The function passed to rdd.foreach(func) runs on the Workers;
3. The producer object therefore has to be serialized on the Driver and shipped to the Workers, but KafkaProducer is not serializable (a minimal illustration follows the list).
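A minimal illustration of this rule, not taken from the original text (sc is assumed to be an existing SparkContext): anything captured by the function passed to rdd.foreach must be serializable, because Spark serializes the closure on the Driver and ships it to the executors, failing fast otherwise.
// Illustrative sketch only; "lock" and the sample RDD are made up for this example
val lock = new Object()                        // java.lang.Object is not Serializable
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.foreach(record => println(s"$lock -> $record"))
// => org.apache.spark.SparkException: Task not serializable
//    Caused by: java.io.NotSerializableException: java.lang.Object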
- The second approach, shown below, runs but creates a new producer object for every single record, which adds a lot of overhead (a per-partition variant is sketched after the code):
nameAddrPhoneStream.foreachRDD(rdd => {
  rdd.foreach(record => {
    // Initialize the producer configuration
    val props = new Properties()
    props.setProperty("bootstrap.servers", "master:9092")
    props.setProperty("client.id", "kafkaGenerator")
    props.setProperty("key.serializer", classOf[StringSerializer].getName)
    props.setProperty("value.serializer", classOf[StringSerializer].getName)
    // Create a producer for this single record
    val producer = new KafkaProducer[String, String](props)
    // Send the message to Kafka
    producer.send(new ProducerRecord[String, String]("kafkaProducer", record))
  })
})
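A commonly used middle ground, not part of the original text, is rdd.foreachPartition: the producer is created inside the partition function, so nothing non-serializable is shipped from the Driver, and only one producer is created per partition instead of one per record. The topic name "kafkaProducer" and the String-typed records are carried over from the examples above as assumptions.
nameAddrPhoneStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Runs once per partition on the executor, so KafkaProducer is never serialized
    val props = new Properties()
    props.setProperty("bootstrap.servers", "master:9092")
    props.setProperty("client.id", "kafkaGenerator")
    props.setProperty("key.serializer", classOf[StringSerializer].getName)
    props.setProperty("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(record =>
      producer.send(new ProducerRecord[String, String]("kafkaProducer", record)))
    // Flush any buffered messages before the task finishes
    producer.close()
  })
})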
- The third approach, shown below, wraps KafkaProducer in a serializable class and broadcasts the wrapper to every Executor:
1) The wrapper class for KafkaProducer:
package sparkstreaming_action.kafka.operation

import java.util.Properties
import java.util.concurrent.Future

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.clients.producer.RecordMetadata

class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
  // The producer is created lazily on the executor, which avoids the
  // NotSerializableException at runtime
  lazy val producer = createProducer()

  def send(topic: String, key: K, value: V): Future[RecordMetadata] = {
    // Write to Kafka with an explicit key
    producer.send(new ProducerRecord[K, V](topic, key, value))
  }

  def send(topic: String, value: V): Future[RecordMetadata] = {
    // Write to Kafka without a key
    producer.send(new ProducerRecord[K, V](topic, value))
  }
}
object KafkaSink {
  // Implicit conversions between Scala and Java collection types
  import scala.collection.JavaConversions._

  // config here is a scala.collection.immutable.Map
  def apply[K, V](config: Map[String, String]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      // Create the KafkaProducer; the implicit conversion turns the
      // scala.collection.Map into the java.util.Map the constructor expects
      val producer = new KafkaProducer[K, V](config)
      // Register a hook that runs when the JVM exits
      sys.addShutdownHook({
        // Make sure the producer flushes everything still buffered to Kafka
        // before the Executor JVM shuts down; close() blocks until all
        // previously sent requests have completed
        producer.close()
      })
      producer
    }
    new KafkaSink[K, V](createProducerFunc)
  }

  // java.util.Properties is implicitly converted to scala.collection.mutable.Map[String, String],
  // and toMap then yields the scala.collection.immutable.Map expected above
  def apply[K, V](config: Properties): KafkaSink[K, V] = apply(config.toMap)
}
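Why this works: when the KafkaSink is broadcast, only the createProducer function (and the configuration it captures) is serialized; the KafkaProducer itself is built lazily on the executor the first time send() is called. A minimal sketch of that behaviour, not part of the original text, reusing the configuration values shown above:
import java.io.{ ByteArrayOutputStream, ObjectOutputStream }
import java.util.Properties

import sparkstreaming_action.kafka.operation.KafkaSink

object KafkaSinkSerializationCheck extends App {
  val props = new Properties()
  props.setProperty("bootstrap.servers", "master:9092")
  props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val sink = KafkaSink[String, String](props)

  // Serializing the sink succeeds because the lazy producer has not been
  // created yet; only the factory function and its captured config are written
  new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(sink)
  println("KafkaSink serialized without ever creating a KafkaProducer")
}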
2) A lazy singleton implementation for KafkaSink, so the broadcast producer is not created over and over again:
import java.util.Properties

import org.apache.log4j.Logger
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.kafka.common.serialization.StringSerializer

// Lazily initialized singleton for the broadcast Kafka producer
// (KafkaSink is the wrapper class defined in step 1, assumed to be in the same package)
object KafkaProducerSingle {
  @volatile private var instance: Broadcast[KafkaSink[String, String]] = null

  def getInstance(sc: SparkContext): Broadcast[KafkaSink[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          val kafkaProducerConfig: Properties = {
            // Build the producer configuration
            val props = new Properties()
            // Kafka brokers
            props.setProperty("bootstrap.servers", "master:9092")
            // Client name
            props.setProperty("client.id", "kafkaGenerator")
            // Key/value serializer classes
            props.setProperty("key.serializer", classOf[StringSerializer].getName)
            props.setProperty("value.serializer", classOf[StringSerializer].getName)
            props
          }
          // Broadcast the wrapped producer to the executors
          instance = sc.broadcast(KafkaSink[String, String](kafkaProducerConfig))
          val log = Logger.getLogger(KafkaProducerSingle.getClass)
          log.warn("kafka producer init done!")
        }
      }
    }
    instance
  }
}
3) Adding a step that writes the analysis results of the previous example (Spark Streaming analyzing Kafka data) back to Kafka:
// Write the results back to Kafka
nameAddrPhoneStream.foreachRDD(rdd => {
  // Get the serializable, broadcast producer
  val producer = KafkaProducerSingle.getInstance(rdd.sparkContext).value
  rdd.foreach(record => {
    // Send the message to the "kafkaSink" topic
    producer.send("kafkaSink", record)
  })
})
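The examples above assume the Spark Streaming and Kafka client libraries are on the classpath. A possible build.sbt fragment; the version numbers are illustrative assumptions, not values from the original text:
// build.sbt (versions are illustrative; match them to your cluster)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.4.8" % "provided",
  "org.apache.kafka" %  "kafka-clients"   % "2.0.0"
)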
References:
- 《Spark Streaming 实时流式大数据处理实战》, Chapter 5: Spark Streaming and Kafka
- Original post (reposted under CC 4.0 BY-SA): https://blog.csdn.net/weixin_39469127/article/details/92649312