Spark Streaming + Kafka Integration Guide (Kafka broker version 0.8.2.1 or higher)
Here we explain how to configure Spark Streaming to receive data from Kafka. There are two approaches to this - the old approach using Receivers and Kafka’s high-level API, and a new approach (introduced in Spark 1.3) without using Receivers.
Approach 1: Receiver-based Approach
This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API.
As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and the jobs launched by Spark Streaming then process the data.
However, under the default configuration, this approach can lose data under failures (see receiver reliability in the streaming programming guide). To ensure zero data loss, you have to additionally enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g., HDFS), so that all the data can be recovered on failure. See the Deploying section in the streaming programming guide for more details on Write Ahead Logs.
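The following is a minimal sketch of enabling the write ahead log; the application name, batch interval and checkpoint directory are assumptions chosen for illustration, not values prescribed by this guide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the write ahead log for receivers; received Kafka data is then logged
// under the checkpoint directory before the jobs process it.
val conf = new SparkConf()
  .setAppName("KafkaReceiverWithWAL")                            // assumed app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))                // assumed batch interval
ssc.checkpoint("hdfs:///checkpoints/kafka-receiver")             // assumed HDFS path
```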
Next, we discuss how to use this approach in your streaming application.
- Linking: For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact.

  groupId = org.apache.spark
  artifactId = spark-streaming-kafka-0-8_2.11
  version = 2.1.1

- Programming: In the streaming application code, import KafkaUtils and create an input DStream (a minimal sketch is shown after this list). Points to remember:
- The topic partitions in Kafka are not correlated with the partitions of the RDDs generated in Spark Streaming. Increasing the number of topic partitions only increases the number of threads the single receiver uses to consume the topics; it does not increase Spark's parallelism in processing the data.
- Multiple Kafka input DStreams can be created with different consumer groups and topics, so that data is received in parallel using multiple receivers.
- When Write Ahead Logs are enabled, set the storage level of the input stream to StorageLevel.MEMORY_AND_DISK_SER (that is, use KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER)).
- Deploying: As with any Spark applications, spark-submit is used to launch your application. However, the details are slightly different for Scala/Java applications and Python applications.

  For Scala and Java applications, if you are using SBT or Maven for project management, then package spark-streaming-kafka-0-8_2.11 and its dependencies into the application JAR. Make sure spark-core_2.11 and spark-streaming_2.11 are marked as provided dependencies, as those are already present in a Spark installation. Then use spark-submit to launch your application (see the Deploying section in the main programming guide).

  Alternatively, you can also download the JAR of the Maven artifact spark-streaming-kafka-0-8-assembly from the Maven repository and add it to spark-submit with --jars.
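For reference, here is a minimal sketch of the Programming step above: creating a receiver-based input DStream with KafkaUtils.createStream. The ZooKeeper quorum, consumer group id, topic name and thread count are placeholder assumptions, and streamingContext is assumed to be an already-created StreamingContext.

```scala
import org.apache.spark.streaming.kafka._

// Receiver-based stream: each record is a (message key, message payload) pair.
// All concrete values below are assumptions; replace them with your own.
val kafkaStream = KafkaUtils.createStream(
  streamingContext,                  // an existing StreamingContext (assumed)
  "zk-host1:2181,zk-host2:2181",     // ZooKeeper quorum (assumed addresses)
  "my-consumer-group",               // consumer group id (assumed)
  Map("my-topic" -> 1))              // topic -> number of receiver threads (assumed)
```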
Approach 2: Direct Approach (No Receivers)
This new receiver-less “direct” approach was introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to reading files from a file system). Note that this feature was introduced in Spark 1.3 for the Scala and Java API, and in Spark 1.4 for the Python API.
This approach has the following advantages over the receiver-based approach (i.e. Approach 1).
- Simplified Parallelism: Instead of creating multiple input Kafka streams and unioning them, the direct stream establishes a one-to-one mapping between Kafka partitions and RDD partitions (be aware that this one-to-one mapping does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()).
- Efficiency: To avoid data loss, the receiver-based approach requires the received data to be stored in a Write Ahead Log, so data that has already left Kafka is duplicated once more in the log. The direct approach only requires that a sufficiently long retention period be configured in Kafka, since data can be re-read from Kafka itself.
- Exactly-once semantics: The receiver-based approach consumes data from Kafka in the traditional way, with consumed offsets tracked in Zookeeper. Because of inconsistencies between the data reliably received by Spark Streaming and the offsets tracked by Zookeeper, there is a small chance that some records get consumed twice under some failures (for example, when a first, failed attempt to process a message is retried).

  Hence, in this second approach, we use a simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints, so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent or an atomic transaction that saves results and offsets together (see Semantics of output operations in the main programming guide for further information).
Note that one disadvantage of this approach is that it does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update Zookeeper yourself (see below).
Next, we discuss how to use this approach in your streaming application.
- Linking: Link against the following artifact (same as in the first approach).

  groupId = org.apache.spark
  artifactId = spark-streaming-kafka-0-8_2.11
  version = 2.1.1
- Programming: Import KafkaUtils and create an input DStream as follows.

```scala
import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class] ](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
```
You can also pass a messageHandler to createDirectStream to access MessageAndMetadata, which contains metadata about the current message, and transform it to any desired type (a sketch of this variant is shown after this list). See the API docs and the example.

In the Kafka parameters, you must specify either metadata.broker.list or bootstrap.servers. By default, it will start consuming from the latest offset of each Kafka partition. If you set the configuration auto.offset.reset in the Kafka parameters to smallest, then it will start consuming from the smallest offset.

You can also start consuming from any arbitrary offset using other variations of KafkaUtils.createDirectStream. Furthermore, if you want to access the Kafka offsets consumed in each batch, you can do the following.

```scala
// Hold a reference to the current offset ranges, so it can be used downstream
var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {
  ...
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  ...
}
```
You can use this to update Zookeeper yourself if you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application.
Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the directKafkaStream, not later down a chain of methods. You can use transform() instead of foreachRDD() as your first method call in order to access offsets, then call further Spark methods. However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().
Another thing to note is that since this approach does not use Receivers, the standard receiver-related configurations (that is, configurations of the form spark.streaming.receiver.*) will not apply to the input DStreams created by this approach (they will apply to other input DStreams though). Instead, use the configurations spark.streaming.kafka.*. An important one is spark.streaming.kafka.maxRatePerPartition, which is the maximum rate (in messages per second) at which each Kafka partition will be read by this direct API (a configuration sketch is shown after this list).
- Deploying: This is the same as the first approach.
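As referenced in the Programming step above, here is a hedged sketch of the messageHandler variant of KafkaUtils.createDirectStream. The broker address, topic, starting offsets and the tuple type returned by the handler are all assumptions chosen for illustration, and streamingContext is assumed to be an existing StreamingContext.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

// Assumed Kafka parameters and starting offsets (topic "my-topic", partition 0, offset 0).
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 0L)

// Turn each MessageAndMetadata into (topic, partition, offset, payload).
val messageHandler =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.partition, mmd.offset, mmd.message)

val streamWithMetadata = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  streamingContext, kafkaParams, fromOffsets, messageHandler)
```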
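As a follow-up to the note on spark.streaming.kafka.maxRatePerPartition above, here is a small sketch of setting it; the limit of 1000 messages per second per partition is only an assumed example value.

```scala
import org.apache.spark.SparkConf

// Cap the direct API at an assumed 1000 messages/second per Kafka partition.
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```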