Preface
This article covers integrating Spark Streaming with Kafka: it demonstrates how Spark Streaming reads messages from Kafka and, after processing them, writes the results back to Kafka through a producer connection pool.
To use the integration you need to pull the corresponding Maven artifact into your project. The KafkaUtils object it provides creates a DStream from your Kafka messages on a StreamingContext or JavaStreamingContext. Because KafkaUtils can subscribe to multiple topics, the DStream it creates consists of (topic, message) pairs. To create the stream, call createStream() with the StreamingContext instance, a comma-separated string of ZooKeeper hosts, the consumer group name (a unique name), and a map from each topic to the number of receiver threads for that topic. (This is the older receiver-based API; the project below uses the newer direct approach, createDirectStream, from the 0-10 integration.)
import org.apache.spark.streaming.kafka._
...
// Map from each topic to the number of receiver threads for it
val topics = List(("pandas", 1), ("logs", 1)).toMap
val topicLines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
topicLines.map(_._2)
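The snippet above assumes that ssc, zkQuorum, and group are already in scope. An illustrative sketch of those definitions (the values are only examples and mirror the cluster set up later in this article):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only; they reuse the cluster addresses from later in this article
val conf = new SparkConf().setMaster("local[4]").setAppName("Kafka Receiver Demo")
val ssc = new StreamingContext(conf, Seconds(1))
val zkQuorum = "192.168.10.30:2181,192.168.10.31:2181,192.168.10.32:2181"
val group = "con-consumer-group"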
I. Writing the Code in IDEA
1. On top of the original Spark Streaming project, create a Kafka-integration sub-project and add the following Maven dependencies:
<!-- Provides an object-pool implementation -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-pool2</artifactId>
    <version>2.4.2</version>
</dependency>
<!-- Kafka client library -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.2.1</version>
</dependency>
<!-- Spark Streaming integration for Kafka 0.10+ -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
2. The complete code is as follows.
The Kafka producer connection-pool class:
package com.m.jd.streaming
import org.apache.commons.pool2.impl.{GenericObjectPool, GenericObjectPoolConfig}
// Singleton object
object createKafkaProducerPool {
  // Returns the actual object pool (GenericObjectPool)
  def apply(brokerList: String, topic: String): GenericObjectPool[KafkaProducerProxy] = {
    val producerFactory = new BaseKafkaProducerFactory(brokerList, defaultTopic = Option(topic))
    val pooledProducerFactory = new PooledKafkaProducerAppFactory(producerFactory)
    // Configure the size of the producer pool
    val poolConfig = {
      val c = new GenericObjectPoolConfig
      val maxNumProducers = 10
      c.setMaxTotal(maxNumProducers)
      c.setMaxIdle(maxNumProducers)
      c
    }
    // Return an object pool backed by the pooled factory
    new GenericObjectPool[KafkaProducerProxy](pooledProducerFactory, poolConfig)
  }
}
The Kafka producer proxy and its factory classes:
package com.m.jd.streaming
import java.util.Properties
import org.apache.commons.pool2.impl.DefaultPooledObject
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
case class KafkaProducerProxy(brokerList: String,
                              producerConfig: Properties = new Properties,
                              defaultTopic: Option[String] = None,
                              producer: Option[KafkaProducer[String, String]] = None) {

  type Key = String
  type Val = String

  require(brokerList != null && !brokerList.isEmpty, "Must set broker list")

  // Lazily create the underlying KafkaProducer unless one was supplied
  private val p = producer getOrElse {
    val props = new Properties()
    props.put("bootstrap.servers", brokerList)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
  // Wrap a message in a ProducerRecord
  private def toMessage(value: Val, key: Option[Key] = None, topic: Option[String] = None): ProducerRecord[Key, Val] = {
    val t = topic.getOrElse(defaultTopic.getOrElse(throw new IllegalArgumentException("Must provide topic or default topic")))
    require(!t.isEmpty, "Topic must not be empty")
    key match {
      case Some(k) => new ProducerRecord(t, k, value)
      case _ => new ProducerRecord(t, value)
    }
  }
  def send(key: Key, value: Val, topic: Option[String] = None) {
    // Delegate to KafkaProducer.send
    p.send(toMessage(value, Option(key), topic))
  }

  def send(value: Val, topic: Option[String]) {
    send(null, value, topic)
  }

  def send(value: Val, topic: String) {
    send(null, value, Option(topic))
  }

  def send(value: Val) {
    send(null, value, None)
  }

  def shutdown(): Unit = p.close()
}
abstract class KafkaProducerFactory(brokerList: String, config: Properties, topic: Option[String] = None) extends Serializable {
  def newInstance(): KafkaProducerProxy
}

class BaseKafkaProducerFactory(brokerList: String,
                               config: Properties = new Properties,
                               defaultTopic: Option[String] = None)
  extends KafkaProducerFactory(brokerList, config, defaultTopic) {

  override def newInstance() = new KafkaProducerProxy(brokerList, config, defaultTopic)
}

// Extends the commons-pool2 base factory; the type parameter is the pooled object type
class PooledKafkaProducerAppFactory(val factory: KafkaProducerFactory)
  extends BasePooledObjectFactory[KafkaProducerProxy] with Serializable {

  // Called by the pool to create a new object
  override def create(): KafkaProducerProxy = factory.newInstance()

  // Called by the pool to wrap an object
  override def wrap(obj: KafkaProducerProxy): PooledObject[KafkaProducerProxy] = new DefaultPooledObject(obj)

  // Called by the pool to destroy an object
  override def destroyObject(p: PooledObject[KafkaProducerProxy]): Unit = {
    p.getObject.shutdown()
    super.destroyObject(p)
  }
}
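For reference, here is a minimal usage sketch of the pool and the proxy. It is not part of the project itself; the broker and topic values are illustrative and mirror the way the streaming job below borrows and returns producers:

// Illustrative values only; mirrors how the streaming job below uses the pool
val pool = createKafkaProducerPool("192.168.10.30:9092", "target")
val producer = pool.borrowObject()
try {
  producer.send("hello from the pool", Option("target"))
} finally {
  // Always return the producer so other tasks can reuse it
  pool.returnObject(producer)
}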
The Spark Streaming Kafka integration class:
package com.m.jd.streaming
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object KafkaStreaming {
  def main(args: Array[String]) {
    // Set up the SparkConf
    val conf = new SparkConf().setMaster("local[4]").setAppName("Spark Streaming Kafka")
    // Create the StreamingContext with a 1-second batch interval
    val ssc = new StreamingContext(conf, Seconds(1))
    // Kafka broker addresses
    val brobrokers = "192.168.10.30:9092,192.168.10.31:9092,192.168.10.32:9092"
    // Source topic name
    val sourcetopic = "source"
    // Target topic name
    val targettopic = "target"
    // Consumer group name
    val group = "con-consumer-group"
    // Kafka consumer configuration
    val kafkaParam = Map(
      // Addresses used to bootstrap the connection to the cluster
      "bootstrap.servers" -> brobrokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // Identifies which consumer group this consumer belongs to
      "group.id" -> group,
      // Used when there is no initial offset, or the current offset no longer exists
      // on the server; "latest" resets the offset to the latest available offset
      "auto.offset.reset" -> "latest",
      // If true, the consumer's offsets are committed automatically in the background
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // Create the DStream that receives input data from Kafka
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array(sourcetopic), kafkaParam))

    // Array holding the offset ranges of the current batch
    var offsetRanges = Array[OffsetRange]()

    stream.transform { rdd =>
      // Capture the offset information of this batch
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(
      // Each element of the stream is a ConsumerRecord
      s => ("id:" + s.key(), ">>>>:" + s.value())
    ).foreachRDD(
      rdd => {
        // Run one operation per RDD partition
        rdd.foreachPartition(partitionOfRecords => {
          // Kafka producer connection pool
          val pool = createKafkaProducerPool(brobrokers, targettopic)
          // Borrow a producer from the pool
          val p = pool.borrowObject()
          // Send every record in this partition
          partitionOfRecords.foreach {
            message =>
              System.out.println(message._2)
              p.send(message._2, Option(targettopic))
          }
          // Return the producer to the pool when done
          pool.returnObject(p)
          for (o <- offsetRanges) {
            // Print the offset information
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
        })
      })

    ssc.start()
    ssc.awaitTermination()
  }
}
II. Deployment and Testing
1. Start ZooKeeper and Kafka.
2. Create two topics, one named source and one named target:
bin/kafka-topics.sh --create --zookeeper 192.168.10.30:2181,192.168.10.31:2181,192.168.10.32:2181 --replication-factor 2 --partitions 2 --topic source
bin/kafka-topics.sh --create --zookeeper 192.168.10.30:2181,192.168.10.31:2181,192.168.10.32:2181 --replication-factor 2 --partitions 2 --topic target
3. Start a Kafka console producer to write to the source topic:
bin/kafka-console-producer.sh --broker-list 192.168.10.30:9092,192.168.10.31:9092,192.168.10.32:9092 --topic source
4. Start a Kafka console consumer to watch the target topic:
bin/kafka-console-consumer.sh --bootstrap-server 192.168.10.30:9092,192.168.10.31:9092,192.168.10.32:9092 --topic target
5. Launch the KafkaStreaming program:
[root@hadoop0 spark-2.1.1-bin-hadoop2.7]# bin/spark-submit --class com.m.jd.streaming.KafkaStreaming /opt/spark-jar/kafkastreaming-jar-with-dependencies.jar
Partial output while the program is running:
[root@hadoop0 kafka]# bin/kafka-console-producer.sh --broker-list 192.168.10.30:9092,192.168.10.31:9092,192.168.10.32:9092 --topic source
>hello
>hello spark
>hello spark
>haha
>shuishui
>ss
>
[root@hadoop0 kafka]# bin/kafka-console-consumer.sh --bootstrap-server 192.168.10.30:9092,192.168.10.31:9092,192.168.10.32:9092 --topic target
>>>>:hello
>>>>:hello spark
>>>>:hello spark
>>>>:haha
>>>>:shuishui
>>>>:ss
Partial log output printed by the KafkaStreaming program:
>>>>:ss
source 0 3 3
source 1 2 3
source 0 3 3
source 1 3 3
source 0 3 3
source 1 3 3
Each of these log lines follows the format:
${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}
so "source 1 2 3" means that in that batch, partition 1 of topic source was consumed from offset 2 (inclusive) up to offset 3 (exclusive).
Further questions to consider:
1. What is the difference between KafkaUtils.createDirectStream() and KafkaUtils.createStream(), and how is each implemented under the hood?
2. Under each of these two approaches, how can Spark Streaming integrated with Kafka guarantee that no data is lost and nothing is consumed twice?
3. Under each of these two approaches, how do Spark partitions relate to Kafka partitions?
Note:
In Spark Streaming, the currently recommended approach is createDirectStream, but it requires us to manage offsets ourselves: to avoid losing or re-processing data, the output operation and the offset-saving operation should be wrapped into a single atomic operation. The two consumer settings below are what make manual offset management possible; a minimal commit sketch follows the excerpt.
// Used when there is no initial offset, or the current offset no longer exists
// on the server; "latest" resets the offset to the latest available offset
"auto.offset.reset" -> "latest",
// If true, the consumer's offsets are committed automatically in the background
"enable.auto.commit" -> (false: java.lang.Boolean)
Reference article:
Spark Streaming consuming Kafka data based on offsets (Chinese-language article)
Official integration guide:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html