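Before Spark 1.3, Spark Streaming received Kafka data through the Receiver-based approach by default. Spark 1.3 introduced the Direct Approach. This post summarizes the similarities and differences between the two.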
1. Receiver-based Approach
Example code:
<code class="hljs css" style="font-family:Menlo,Monaco,Consolas,"Courier New",monospace;font-size:undefined; padding:0.5em; display:block; width:auto; word-wrap:normal; overflow-x:auto">import org.apache.spark.streaming.kafka._

val kafkaStream = KafkaUtils.createStream(streamingContext,
    [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])
</code>
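For reference, here is a minimal sketch of a complete Receiver-based job against the Spark 1.x `spark-streaming-kafka` (Kafka 0.8) API. The ZooKeeper address `localhost:2181`, the group id `example-group`, the topic name `events`, and the object name `ReceiverExample` are placeholder assumptions, not values from this post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverExample {
  // Topic name -> number of consumer threads inside the receiver (assumed values).
  val topics: Map[String, Int] = Map("events" -> 1)

  def main(args: Array[String]): Unit = {
    // Use at least two local threads: the receiver permanently occupies one.
    val conf = new SparkConf().setAppName("ReceiverExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Returns a ReceiverInputDStream[(String, String)] of (key, message) pairs.
    val kafkaStream =
      KafkaUtils.createStream(ssc, "localhost:2181", "example-group", topics)

    // Print a sample of the message values in every batch.
    kafkaStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the per-topic number only controls how many threads consume inside the single receiver; it does not change the number of partitions of the resulting RDDs.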
2. Direct Approach (No Receivers)
Example code:
<code class="hljs cpp" style="font-family:Menlo,Monaco,Consolas,"Courier New",monospace;font-size:undefined; padding:0.5em; display:block; width:auto; word-wrap:normal; overflow-x:auto">import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class] ](
    streamingContext, [map of Kafka parameters], [set of topics to consume])
</code>
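With the placeholders filled in, a Direct Approach job might look like the sketch below, again for the Spark 1.x / Kafka 0.8 integration. The broker address `localhost:9092`, the topic name `events`, and the object name `DirectExample` are assumptions made for illustration.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectExample {
  // The direct API talks to the brokers, not to ZooKeeper (assumed address).
  val kafkaParams: Map[String, String] =
    Map("metadata.broker.list" -> "localhost:9092")

  // Topics to consume (assumed name).
  val topics: Set[String] = Set("events")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Type parameters: key class, value class, key decoder, value decoder.
    val directKafkaStream =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

    // Each record is a (key, message) pair; print a sample of the values.
    directKafkaStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```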
Source Code Implementation
1. KafkaUtils.createStream
Looking at the source code first, the main call stack is:
<code class="hljs php" style="font-family:Menlo,Monaco,Consolas,"Courier New",monospace;font-size:undefined; padding:0.5em; display:block; width:auto; word-wrap:normal; overflow-x:auto">KafkaUtils.createStream--->createStream--->new KafkaInputDStream--->new KafkaReceiver
</code>
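KafkaReceiver extends Receiver. When the Receiver is started, its onStart() method runs, and MessageHandler stores the received data. The execution flow is as follows:
- createStream is called and the Receiver is started
- The Receiver connects to ZooKeeper and reads the relevant Consumer and Topic configuration
- It connects to the Kafka cluster through consumerConnector and receives data from the specified topics
- It creates a pool of KafkaMessageHandler threads to process the data, which the methods of ReceiverInputDStream turn into BlockRDD for subsequent computation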
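The Receiver lifecycle described above can be sketched with Spark's public `Receiver` API. This is not the actual `KafkaReceiver` source: `SketchReceiver` is a hypothetical class that reads from a fixed in-memory sequence instead of Kafka, purely to show where `onStart()` and `store()` fit.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: emits items from an in-memory source, not from Kafka.
class SketchReceiver(source: Seq[String])
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  // Called by Spark once the receiver has been launched; it must return quickly,
  // so the actual reading happens on a separate thread.
  override def onStart(): Unit = {
    val worker = new Thread("sketch-message-handler") {
      override def run(): Unit = receive()
    }
    worker.setDaemon(true)
    worker.start()
  }

  // Nothing to clean up: the worker loop exits once isStopped() returns true.
  override def onStop(): Unit = ()

  // Plays the role of the message handler: hands every item to Spark via store(),
  // which buffers it into blocks that later back the RDD of each batch.
  private def receive(): Unit = {
    val items = source.iterator
    while (!isStopped() && items.hasNext) {
      store(items.next())
    }
  }
}
```

Such a receiver would be attached to a job with `ssc.receiverStream(new SketchReceiver(Seq("a", "b")))`, which returns a `ReceiverInputDStream[String]`.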
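2. KafkaUtils.createDirectStream
The main call stack is:
KafkaUtils.createDirectStream--->new DirectKafkaInputDStream
The execution flow is as follows:
- Instantiate KafkaCluster and connect to the Kafka cluster using the Kafka parameters supplied by the user
- Use the Kafka API to read the last consumed Offset of each Partition in the Topic
- Convert the successfully received data directly into KafkaRDD for subsequent computation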
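To make the offset bookkeeping concrete, here is a toy model in plain Scala with no Spark or Kafka dependency. `TopicPartition`, `OffsetSpan`, `planBatch`, and `nextOffsets` are hypothetical names; the real logic lives in `DirectKafkaInputDStream` and differs in detail. The sketch only illustrates the idea that each batch covers, per partition, a half-open range from the last consumed offset up to the latest available offset, optionally capped.

```scala
object OffsetPlanSketch {
  // Hypothetical stand-ins for a Kafka topic/partition pair and a per-batch offset range.
  final case class TopicPartition(topic: String, partition: Int)
  final case class OffsetSpan(tp: TopicPartition, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // Plan one batch: for each partition, read [current, min(latest, current + cap)).
  def planBatch(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long],
      maxPerPartition: Option[Long] = None): Seq[OffsetSpan] =
    current.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, from) =>
        val available = latest.getOrElse(tp, from)
        val capped = maxPerPartition.fold(available)(m => math.min(available, from + m))
        // Never plan a range that ends before it starts.
        OffsetSpan(tp, from, math.max(from, capped))
      }

  // After a batch succeeds, the next batch starts where this one ended.
  def nextOffsets(spans: Seq[OffsetSpan]): Map[TopicPartition, Long] =
    spans.map(s => s.tp -> s.untilOffset).toMap
}
```

In this model each `OffsetSpan` corresponds to one partition of the batch's RDD, which mirrors the one-to-one mapping between Kafka partitions and RDD partitions in the Direct Approach.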
Architecture
Two diagrams show the architecture of each approach.
1、 Receiver-based Approach
2、 Direct Approach (No Receivers)
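Pros and Cons
The official documentation already covers the trade-offs clearly. If efficiency and data accuracy matter, use the Direct approach, but you then have to manage Offsets yourself.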
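One common way to manage offsets yourself with the Direct approach is to read them from each batch through `HasOffsetRanges` and persist them once the batch has been processed. The sketch below assumes the Spark 1.x / Kafka 0.8 integration; the broker address, topic name, and the helpers `describe` and `saveOffsets` are hypothetical, and `saveOffsets` only prints where a real job would write to ZooKeeper, a database, or similar.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetTrackingExample {
  // Human-readable form of one half-open offset range.
  def describe(r: OffsetRange): String =
    s"${r.topic}-${r.partition}: [${r.fromOffset}, ${r.untilOffset})"

  // Hypothetical persistence hook: replace println with a durable write.
  def saveOffsets(ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r => println(describe(r)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OffsetTrackingExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
    val stream =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events")) // assumed topic

    stream.foreachRDD { rdd =>
      // The cast is only valid on the RDD produced directly by the stream,
      // before any transformation that changes its partitioning.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Process the batch first ...
      println(s"records in batch: ${rdd.count()}")

      // ... and record the offsets afterwards, so a failure causes a replay
      // instead of data loss (at-least-once semantics).
      saveOffsets(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```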
References: