1. Spark Streaming provides no official sink for writing data to Kafka; you have to use Kafka's own producer API directly.
2. First approach, shown below, which throws an exception:
nameAddrPhoneStream.foreachRDD(rdd => {
  // Runs on the Driver
  // Initialize the producer configuration
  val props = new Properties()
  props.setProperty("bootstrap.servers", "master:9092")
  props.setProperty("client.id", "kafkaGenerator")
  props.setProperty("key.serializer", classOf[StringSerializer].getName)
  props.setProperty("value.serializer", classOf[StringSerializer].getName)
  // Create the producer
  val producer = new KafkaProducer[String, String](props)
  rdd.foreach(record => {
    // Runs on the Worker node that holds each record's partition
    // Send the message to Kafka
    producer.send(new ProducerRecord[String, String]("kafkaProducer", record))
  })
})
Exception: KafkaProducer cannot be serialized:
Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
The reason:
1. The function passed to ds.foreachRDD(func) runs on the Driver;
2. The function passed to rdd.foreach(func) runs on the Workers;
3. The producer object must therefore be serialized on the Driver and shipped to the Workers, but KafkaProducer is not serializable.
3. Second approach, shown below: it runs, but creates a new producer object for every single record, which adds huge overhead:
nameAddrPhoneStream.foreachRDD(rdd => {
  rdd.foreach(record => {
    // Initialize the producer configuration -- once per record!
    val props = new Properties()
    props.setProperty("bootstrap.servers", "master:9092")
    props.setProperty("client.id", "kafkaGenerator")
    props.setProperty("key.serializer", classOf[StringSerializer].getName)
    props.setProperty("value.serializer", classOf[StringSerializer].getName)
    // Create the producer
    val producer = new KafkaProducer[String, String](props)
    // Send the message to Kafka
    producer.send(new ProducerRecord[String, String]("kafkaProducer", record))
  })
})
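Between these two extremes, a common pattern (a sketch, not from the original text) is rdd.foreachPartition, which creates one producer per partition per batch instead of one per record; the broadcast approach below goes one step further and reuses a single producer per Executor JVM:

```scala
// Sketch only: one KafkaProducer per partition per batch, created on the Worker.
// Assumes the same stream, broker address, and topic names as the examples above.
nameAddrPhoneStream.foreachRDD(rdd => {
  rdd.foreachPartition(records => {
    // Runs once per partition on the Worker, so nothing has to be serialized
    val props = new Properties()
    props.setProperty("bootstrap.servers", "master:9092")
    props.setProperty("client.id", "kafkaGenerator")
    props.setProperty("key.serializer", classOf[StringSerializer].getName)
    props.setProperty("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)
    records.foreach(record =>
      producer.send(new ProducerRecord[String, String]("kafkaProducer", record)))
    // Flush buffered messages and release the connection before the task ends
    producer.close()
  })
})
```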
4. Third approach, shown below: wrap KafkaProducer in a serializable class, then broadcast the wrapper to every Executor:
1) The wrapper class for KafkaProducer:
package sparkstreaming_action.kafka.operation

import java.util.Properties
import java.util.concurrent.Future

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.clients.producer.RecordMetadata
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.broadcast.Broadcast

class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
  // lazy: the producer is only created on first use, after the wrapper has been
  // deserialized on the Executor, which avoids the NotSerializableException at runtime
  lazy val producer = createProducer()

  def send(topic: String, key: K, value: V): Future[RecordMetadata] = {
    // Write to Kafka
    producer.send(new ProducerRecord[K, V](topic, key, value))
  }

  def send(topic: String, value: V): Future[RecordMetadata] = {
    // Write to Kafka
    producer.send(new ProducerRecord[K, V](topic, value))
  }
}
object KafkaSink {
  // Implicit Scala <-> Java collection conversions
  import scala.collection.JavaConversions._

  // Map here is scala.collection.immutable.Map
  def apply[K, V](config: Map[String, String]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      // Create the KafkaProducer; its constructor requires a java.util.Map,
      // so JavaConversions implicitly converts scala.collection.Map => java.util.Map
      val producer = new KafkaProducer[K, V](config)
      // Runs when the JVM exits
      sys.addShutdownHook({
        // Before the Executor's JVM shuts down, make sure the producer flushes
        // everything in its buffer to Kafka; close() blocks until all
        // previously sent requests complete
        producer.close()
      })
      producer
    }
    new KafkaSink[K, V](createProducerFunc)
  }

  // JavaConversions implicitly turns java.util.Properties into
  // scala.collection.mutable.Map[String, String]; toMap then produces the
  // scala.collection.immutable.Map expected by the overload above
  def apply[K, V](config: Properties): KafkaSink[K, V] = apply(config.toMap)
}
2) A lazy singleton holding the broadcast KafkaSink, to avoid recreating it repeatedly on the Workers:
import java.util.Properties

import org.apache.kafka.common.serialization.StringSerializer
import org.apache.log4j.Logger
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily initialized Kafka producer singleton
object KafkaProducerSingle {
  @volatile private var instance: Broadcast[KafkaSink[String, String]] = null

  def getInstance(sc: SparkContext): Broadcast[KafkaSink[String, String]] = {
    // Double-checked locking: only the first caller pays for initialization
    if (instance == null) {
      synchronized {
        if (instance == null) {
          val kafkaProducerConfig: Properties = {
            // Build the configuration
            val props = new Properties()
            // Broker address
            props.setProperty("bootstrap.servers", "master:9092")
            // Client name
            props.setProperty("client.id", "kafkaGenerator")
            // Serializer classes
            props.setProperty("key.serializer", classOf[StringSerializer].getName)
            props.setProperty("value.serializer", classOf[StringSerializer].getName)
            props
          }
          // Broadcast the wrapped producer to every Executor
          instance = sc.broadcast(KafkaSink[String, String](kafkaProducerConfig))
          val log = Logger.getLogger(KafkaProducerSingle.getClass)
          log.warn("kafka producer init done!")
        }
      }
    }
    instance
  }
}
3) Add a write-to-Kafka step to the analysis results of the previous example (Spark Streaming analyzing Kafka data):
// Write the results to Kafka
nameAddrPhoneStream.foreachRDD(rdd => {
  // Get the serializable producer wrapper
  val producer = KafkaProducerSingle.getInstance(rdd.sparkContext).value
  rdd.foreach(record => {
    // Send the message
    producer.send("kafkaSink", record)
  })
})
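Once the job is running, the console consumer that ships with Kafka can be used to confirm that messages actually reach the target topic (a sketch; the script location depends on your Kafka installation):

```shell
# Consume the "kafkaSink" topic from the beginning, using the broker
# address configured above. Run from the Kafka installation directory.
bin/kafka-console-consumer.sh \
  --bootstrap-server master:9092 \
  --topic kafkaSink \
  --from-beginning
```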
*Note: the pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId><!-- Organization name -->
<artifactId>kafkaSparkStreaming</artifactId><!-- Project name -->
<version>0.1</version><!-- Version -->
<properties>
<spark.version>2.4.3</spark.version><!-- Property holding the Spark version -->
</properties>
<dependencies>
<dependency><!-- Spark core dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope><!-- Provided at runtime by the Spark cluster; not packaged -->
</dependency>
<dependency><!-- Spark Streaming dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope><!-- Provided at runtime by the Spark cluster; not packaged -->
</dependency>
<dependency><!-- Spark Streaming Kafka integration -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency><!-- log4j logging -->
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency><!-- SLF4J logging facade binding -->
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.12</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Mixed Scala/Java compilation -->
<plugin><!-- Scala compiler plugin -->
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<id>compile</id>
<goals>
<goal>compile</goal>
</goals>
<phase>compile</phase>
</execution>
<execution>
<id>test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
<phase>test-compile</phase>
</execution>
<execution>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source><!-- Java source level -->
<target>1.8</target>
</configuration>
</plugin>
<!-- for fatjar -->
<plugin><!-- Bundle all dependencies into a single jar -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>assemble-all</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- Add the classpath -->
<addClasspath>true</addClasspath>
<!-- Set the program's main class -->
<mainClass>sparkstreaming_action.kafka.operation.KafkaOperation</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
</project>
References:
1. 《Spark Streaming 实时流式大数据处理实战》, Chapter 5: Spark Streaming and Kafka