Spark Streaming + Kafka Integration Guide (Kafka broker version 0.8.2.1 or higher)
Here we explain how to configure Spark Streaming to receive data from Kafka. There are two approaches to this - the old approach using Receivers and Kafka’s high-level API, and a new approach (introduced in Spark 1.3) without using Receivers.
Approach 1: Receiver-based Approach
This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API.
As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and the jobs launched by Spark Streaming then process the data.
However, under the default configuration, this approach can lose data under failures (see receiver reliability in the streaming programming guide). To ensure zero data loss, you have to additionally enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g., HDFS), so that all the data can be recovered on failure. See the Deploying section in the streaming programming guide for more details on Write Ahead Logs.
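The following is a minimal sketch of enabling the write ahead log; the application name, batch interval and checkpoint directory are assumptions chosen for illustration, not values prescribed by this guide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the write ahead log for receivers; received Kafka data is then logged
// under the checkpoint directory before the jobs process it.
val conf = new SparkConf()
  .setAppName("KafkaReceiverWithWAL")                            // assumed app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))                // assumed batch interval
ssc.checkpoint("hdfs:///checkpoints/kafka-receiver")             // assumed HDFS path
```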
Next, we discuss how to use this approach in your streaming application.
- Linking: For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact.

  groupId = org.apache.spark
  artifactId = spark-streaming-kafka-0-8_2.11
  version = 2.1.1

- Programming: In the streaming application code, import KafkaUtils and create an input DStream (a minimal sketch is shown after this list). Points to remember:
- The topic partitions in Kafka are not correlated with the partitions of the RDDs generated in Spark Streaming. Increasing the number of topic partitions only increases the number of threads the single receiver uses to consume the topics; it does not increase Spark's parallelism in processing the data.
- Multiple Kafka input DStreams can be created with different consumer groups and topics, so that data is received in parallel using multiple receivers.
- When Write Ahead Logs are enabled, set the storage level of the input stream to StorageLevel.MEMORY_AND_DISK_SER (that is, use KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER)).
- Deploying: As with any Spark applications, spark-submit is used to launch your application. However, the details are slightly different for Scala/Java applications and Python applications.

  For Scala and Java applications, if you are using SBT or Maven for project management, then package spark-streaming-kafka-0-8_2.11 and its dependencies into the application JAR. Make sure spark-core_2.11 and spark-streaming_2.11 are marked as provided dependencies, as those are already present in a Spark installation. Then use spark-submit to launch your application (see the Deploying section in the main programming guide).

  Alternatively, you can also download the JAR of the Maven artifact spark-streaming-kafka-0-8-assembly from the Maven repository and add it to spark-submit with --jars.
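For reference, here is a minimal sketch of the Programming step above: creating a receiver-based input DStream with KafkaUtils.createStream. The ZooKeeper quorum, consumer group id, topic name and thread count are placeholder assumptions, and streamingContext is assumed to be an already-created StreamingContext.

```scala
import org.apache.spark.streaming.kafka._

// Receiver-based stream: each record is a (message key, message payload) pair.
// All concrete values below are assumptions; replace them with your own.
val kafkaStream = KafkaUtils.createStream(
  streamingContext,                  // an existing StreamingContext (assumed)
  "zk-host1:2181,zk-host2:2181",     // ZooKeeper quorum (assumed addresses)
  "my-consumer-group",               // consumer group id (assumed)
  Map("my-topic" -> 1))              // topic -> number of receiver threads (assumed)
```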
Approach 2: Direct Approach (No Receivers)
This new receiver-less “direct” approach was introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to reading files from a file system). Note that this feature was introduced in Spark 1.3 for the Scala and Java API, and in Spark 1.4 for the Python API.
This approach has the following advantages over the receiver-based approach (i.e. Approach 1).
- Simplified Parallelism: Instead of creating multiple input Kafka streams and unioning them, the direct stream establishes a one-to-one mapping between Kafka partitions and RDD partitions (be aware that this one-to-one mapping does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()).
- Efficiency: To avoid data loss, the receiver-based approach requires the received data to be stored in a Write Ahead Log, so data that has already left Kafka is duplicated once more in the log. The direct approach only requires that a sufficiently long retention period be configured in Kafka, since data can be re-read from Kafka itself.
- Exactly-once semantics: The receiver-based approach consumes data from Kafka in the traditional way, with consumed offsets tracked in Zookeeper. Because of inconsistencies between the data reliably received by Spark Streaming and the offsets tracked by Zookeeper, there is a small chance that some records get consumed twice under some failures (for example, when a first, failed attempt to process a message is retried).

  Hence, in this second approach, we use a simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints, so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent or an atomic transaction that saves results and offsets together (see Semantics of output operations in the main programming guide for further information).
Note that one disadvantage of this approach is that it does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update Zookeeper yourself (see below).
Next, we discuss how to use this approach in your streaming application.
- Linking: Link against the following artifact (same as in the first approach).

  groupId = org.apache.spark
  artifactId = spark-streaming-kafka-0-8_2.11
  version = 2.1.1
- Programming: Import KafkaUtils and create an input DStream as follows.

```scala
import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class] ](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
```
You can also pass a messageHandler to createDirectStream to access MessageAndMetadata, which contains metadata about the current message, and transform it to any desired type (a sketch of this variant is shown after this list). See the API docs and the example.

In the Kafka parameters, you must specify either metadata.broker.list or bootstrap.servers. By default, it will start consuming from the latest offset of each Kafka partition. If you set the configuration auto.offset.reset in the Kafka parameters to smallest, then it will start consuming from the smallest offset.

You can also start consuming from any arbitrary offset using other variations of KafkaUtils.createDirectStream. Furthermore, if you want to access the Kafka offsets consumed in each batch, you can do the following.

```scala
// Hold a reference to the current offset ranges, so it can be used downstream
var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {
  ...
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  ...
}
```
You can use this to update Zookeeper yourself if you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application.
Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the directKafkaStream, not later down a chain of methods. You can use transform() instead of foreachRDD() as your first method call in order to access offsets, then call further Spark methods. However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().
Another thing to note is that since this approach does not use Receivers, the standard receiver-related configurations (that is, configurations of the form spark.streaming.receiver.*) will not apply to the input DStreams created by this approach (they will apply to other input DStreams though). Instead, use the configurations spark.streaming.kafka.*. An important one is spark.streaming.kafka.maxRatePerPartition, which is the maximum rate (in messages per second) at which each Kafka partition will be read by this direct API (a configuration sketch is shown after this list).
- Deploying: This is the same as the first approach.
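As referenced in the Programming step above, here is a hedged sketch of the messageHandler variant of KafkaUtils.createDirectStream. The broker address, topic, starting offsets and the tuple type returned by the handler are all assumptions chosen for illustration, and streamingContext is assumed to be an existing StreamingContext.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

// Assumed Kafka parameters and starting offsets (topic "my-topic", partition 0, offset 0).
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 0L)

// Turn each MessageAndMetadata into (topic, partition, offset, payload).
val messageHandler =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.partition, mmd.offset, mmd.message)

val streamWithMetadata = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  streamingContext, kafkaParams, fromOffsets, messageHandler)
```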
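As a follow-up to the note on spark.streaming.kafka.maxRatePerPartition above, here is a small sketch of setting it; the limit of 1000 messages per second per partition is only an assumed example value.

```scala
import org.apache.spark.SparkConf

// Cap the direct API at an assumed 1000 messages/second per Kafka partition.
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```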