SparkStreaming03

I. Review of Key Points

1. transform

Whenever you want to combine a DStream with an RDD, transform is the only operator that can do it.
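A minimal self-contained sketch (the blacklist contents and the socket source are hypothetical) of what this means: transform exposes each batch's RDD, so the DStream can be joined with a static RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformRecap {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformRecap")
    val ssc = new StreamingContext(conf, Seconds(10))
    //static RDD acting as a blacklist (hypothetical content)
    val blacklist = ssc.sparkContext.parallelize(Seq("spam")).map((_, true))
    //words arriving from a socket source, e.g. started with: nc -lk 9999
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    //transform hands us the RDD of each batch, so ordinary RDD operations (joins etc.) become possible
    val cleaned = words.map((_, 1)).transform { rdd =>
      rdd.leftOuterJoin(blacklist)   //join the batch RDD with the static RDD
        .filter(_._2._2.isEmpty)     //keep only the words that are NOT blacklisted
        .map(_._1)
    }
    cleaned.print()
    ssc.start()
    ssc.awaitTermination()
  }
}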

2. foreachRDD

(1) Writing computed results out to external systems can only be done through it

(2) Connection pooling (a sketch follows below)
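A minimal sketch of the foreachRDD output pattern, opening one connection per partition rather than per record; the JDBC URL, table, and credentials are hypothetical placeholders (a real job would use a proper connection pool such as HikariCP):

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

object ForeachRDDRecap {
  //write (word, count) pairs to MySQL, opening one connection per partition rather than per record
  def saveToMySQL(result: DStream[(String, Int)]): Unit = {
    result.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val conn = DriverManager.getConnection("jdbc:mysql://hadoop002:3306/spark", "user", "password") //hypothetical URL and credentials
        val stmt = conn.prepareStatement("insert into wordcount(word, cnt) values(?, ?)")
        partition.foreach { case (word, cnt) =>
          stmt.setString(1, word)
          stmt.setInt(2, cnt)
          stmt.executeUpdate()
        }
        stmt.close()
        conn.close()
      }
    }
  }
}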

3. Window

II. Input DStreams and Receivers (Advanced Sources)

Basic sources: file systems and sockets

Advanced sources: Kafka, Flume, Kinesis, etc.

1. Flume

Flume support is marked as deprecated as of Spark 2.3.0.

The first way to wire up Flume and Spark: Flume-style Push-based Approach

Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, to which Flume can push the data. Here are the configuration steps.

Flume pushes the data to Spark Streaming for processing (Flume is started second).

(1) General requirements

1) When the Flume and Spark Streaming applications start, one of the Spark executors must be running on the chosen machine.

2) Flume is configured to push data to a port on that machine.

The Avro sink must therefore be configured with a host and port, and Spark Streaming must be started before Flume.

(2) Configure Flume

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>

1) Write the Flume conf file flume_push_streaming.conf:

avro-sink-agent.sources = netcat-source
avro-sink-agent.channels = netcat-memory-channel
avro-sink-agent.sinks = avro-sink

avro-sink-agent.sources.netcat-source.type = netcat
avro-sink-agent.sources.netcat-source.bind = localhost
avro-sink-agent.sources.netcat-source.port = 44444
avro-sink-agent.sources.netcat-source.channels = netcat-memory-channel

avro-sink-agent.channels.netcat-memory-channel.type = memory

avro-sink-agent.sinks.avro-sink.type = avro
avro-sink-agent.sinks.avro-sink.channel = netcat-memory-channel
avro-sink-agent.sinks.avro-sink.hostname = localhost
avro-sink-agent.sinks.avro-sink.port = 41414

Script to start Flume:

flume-ng agent --name avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console &

(3) Configure the Spark Streaming application

Maven:

 groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.3.2

(4) pom.xml:

<!--Add the Spark Streaming + Flume integration dependency-->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>

(5) Development

Import FlumeUtils and create the input DStream:

 import org.apache.spark.streaming.flume._

 val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])

1) Look into the FlumeUtils.scala class

Based on this method's parameters, fill in the input DStream (the port is the Avro sink's port, 41414):

val lines = FlumeUtils.createStream(scc, "hadoop002", 41414)

(6) Code

package com.HBinz.spark.streaming.day03

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFlumeApp01 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[2]").setAppName("StreamingFlumeApp01")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    //Create the input DStream that receives the data Flume pushes over Avro
    val lines = FlumeUtils.createStream(scc, "hadoop002", 41414)
    //An output operation is required before start(); printing directly shows raw SparkFlumeEvent objects (see below)
    lines.print()

    scc.start()
    scc.awaitTermination()
  }
}

(7) Execution

1) Package the jar in IDEA, upload it to the server, and submit it with spark-submit.

The spark-streaming-flume dependency has to be pulled in (via --packages); otherwise Spark would have to be rebuilt to include it:

spark-submit --master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.2 \
--class com.HBinz.spark.streaming.day03.StreamingFlumeApp01 \
/opt/lib/spark-train-1.0.jar

It succeeded.

2) Run the Flume script

flume-ng agent --name avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console &

Spark Streaming's printed output looks like this:

The records Flume sends over are of type SparkFlumeEvent by default, and printing them directly shows event objects (essentially hash-like renderings of the payload) rather than readable text, which is why the output looks like this; the code needs to be improved.

(8) Optimization

1)telnet hadoop002 44444

2) Spark Streaming output:

Success.

Note:

When Flume sends data, make sure to strip the whitespace around each record by calling x.trim; otherwise the format will be off.
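A sketch of the improved mapping (the same transformation used later in the pull-based example), assuming lines is the Flume DStream created above: each SparkFlumeEvent body is decoded into a trimmed String before the word count.

lines.map(x => new String(x.event.getBody.array()).trim) //decode the event body and strip surrounding whitespace
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()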

2. Building a fat jar in IDEA

From the Flume + Spark Streaming official docs:

For Scala and Java applications, if you are using SBT or Maven for project management, then package spark-streaming-flume_2.11 and its dependencies into the application JAR. Make sure spark-core_2.11 and spark-streaming_2.11 are marked as provided dependencies as those are already present in a Spark installation. Then use spark-submit to launch your application (see Deploying section in the main programming guide).


(1) Fat jar

Normally the build produces a thin jar; the assembly plugin can be used to build a fat jar.

But pom.xml needs to be adjusted a little.

(2) pom.xml

For example, mark spark-core/spark-streaming as provided so that the fat jar does not package those dependencies, avoiding duplication:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>

The second way to wire up Flume and Spark: Pull-based Approach using a Custom Sink (start Flume first)

Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.

Flume pushes data into the sink, and the data stays buffered.

Spark Streaming uses a reliable Flume receiver and transactions to pull data from the sink. Transactions succeed only after data is received and replicated by Spark Streaming.

This ensures stronger reliability and fault-tolerance guarantees than the previous approach. However, this requires configuring Flume to run a custom sink. Here are the configuration steps.


1. pom.xml configuration

2. conf configuration

 agent.sinks = spark
 agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
 agent.sinks.spark.hostname = <hostname of the local machine>
 agent.sinks.spark.port = <port to listen on for connection from Spark>
 agent.sinks.spark.channel = memoryChannel

flume_pull_streaming.conf:

avro-sink-agent.sources = netcat-source
avro-sink-agent.channels = netcat-memory-channel
avro-sink-agent.sinks = spark-sink

avro-sink-agent.sources.netcat-source.type = netcat
avro-sink-agent.sources.netcat-source.bind = localhost
avro-sink-agent.sources.netcat-source.port = 44444
avro-sink-agent.sources.netcat-source.channels = netcat-memory-channel

avro-sink-agent.channels.netcat-memory-channel.type = memory

avro-sink-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
avro-sink-agent.sinks.spark-sink.channel = netcat-memory-channel
avro-sink-agent.sinks.spark-sink.hostname = localhost
avro-sink-agent.sinks.spark-sink.port = 41414

Startup script:

flume-ng agent --name avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console &

3. Code

package com.HBinz.spark.streaming.day03

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFlumeApp02 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[2]").setAppName("StreamingFlumeApp02")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    //Create the input DStream by pulling data from Flume's SparkSink (pull-based approach)
    val lines = FlumeUtils.createPollingStream(scc, "hadoop002", 41414)
    //Data pulled from Flume is of type SparkFlumeEvent; convert it to a String to read the actual payload
    lines.map(x => new String(x.event.getBody.array()).trim)
        .flatMap(_.split(",")).map((_,1)).reduceByKey(_ + _).print()
    scc.start()
    scc.awaitTermination()
  }
}

4. Execution

(1) Start Flume

flume-ng agent --name avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file /opt/script/flume/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console &

(2) telnet localhost 44444

telnet localhost 44444

(3) spark-submit:

spark-submit --master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.2 \
--class com.HBinz.spark.streaming.day03.StreamingFlumeApp02 \
/opt/lib/spark-train-1.0.jar

(4) Output:

3. Kafka (commonly used in real work)

The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are 2 separate corresponding Spark Streaming packages available. Please choose the correct package for your brokers and desired features; note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers.


So in production the 0.8 integration is generally used.

1. Overview

Here we explain how to configure Spark Streaming to receive data from Kafka. There are two approaches: an older one using receivers and Kafka's high-level API, and a newer approach (introduced in Spark 1.3) that does not use receivers. They have different programming models, performance characteristics, and semantic guarantees, so read on for more details. Both approaches are considered stable APIs in the current version of Spark.

Recap:

The file system source does not need a receiver. Having a receiver means there is one job that keeps running permanently, visible in the UI.

Approach 1: Receiver-based Approach

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and then jobs launched by Spark Streaming processes the data.

However, under default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero-data loss, you have to additionally enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure. See Deploying section in the streaming programming guide for more details on Write Ahead Logs.

Next, we discuss how to use this approach in your streaming application.


(1)pom.xml

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.5</version>
</dependency>

(2) Code

package com.HBinz.spark.streaming.day03

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafkaApp01 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[2]").setAppName("StreamingKafkaApp01")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    //Define the topics: turn the String into a Map whose Int value is the number of consumer threads per topic
    val topics = "HB_Kafka_Streaming".split(",").map((_,1)).toMap
    val lines = KafkaUtils.createStream(scc,"hadoop002:2181","HB_groups",topics)
    lines.print()

    scc.start()
    scc.awaitTermination()
  }
}
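As noted above, zero-data loss with this receiver-based approach additionally requires the write-ahead log; a minimal sketch of the extra setup (the HDFS checkpoint path is an assumption):

//enable the WAL for receivers and give Spark Streaming a reliable directory to write it to
val sparkconf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("StreamingKafkaApp01")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val scc = new StreamingContext(sparkconf, Seconds(10))
scc.checkpoint("hdfs://hadoop002:9000/spark/checkpoint") //hypothetical path; received data is logged under it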

(3) Start ZooKeeper

./zkServer.sh start

(4) Start Kafka

./kafka-server-start.sh -daemon /opt/app/kafka/config/server.properties

(5) List the topics in Kafka

 ./kafka-topics.sh --list --zookeeper localhost:2181

(6) Create the topic HB_Kafka_Streaming

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic HB_Kafka_Streaming

(7) Create a producer and a consumer

Producer:

./kafka-console-producer.sh --broker-list localhost:9092 --topic HB_Kafka_Streaming

Consumer, in another console:

./kafka-console-consumer.sh --zookeeper localhost:2181 --topic HB_Kafka_Streaming

(8) Run the code from IDEA

Enter test data in the producer:

HB,HB,HB,zidong,zidong

dazhu,dazhu,dazhu,dazhu

(9) Analyze the results

The Kafka DStream returns records like (null, HB,HB,HB,zidong,zidong), i.e. (key, value) tuples, where the key is null here.

So it can be improved a little:

lines.map(x=>(x._2)).flatMap(_.split(",")).map((_,1)).reduceByKey(_ + _).print()

(10) Notes

Points to remember:

Topic partitions in Kafka does not correlate to partitions of RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in the KafkaUtils.createStream() only increases the number of threads using which topics that are consumed within a single receiver. It does not increase the parallelism of Spark in processing the data. Refer to the main document for more information on that.

Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.

If you have enabled Write Ahead Logs with a replicated file system like HDFS, the received data is already being replicated in the log. Hence, set the storage level for the input stream to StorageLevel.MEMORY_AND_DISK_SER (that is, use KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER)).

Note on the last point: since HDFS already keeps three replicas of the write-ahead log, the input stream's storage level can be lowered to StorageLevel.MEMORY_AND_DISK_SER (a single copy inside Spark).
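As a sketch of the second point above (multiple input DStreams for parallel receiving), following the pattern from the official guide and reusing the scc from StreamingKafkaApp01:

import org.apache.spark.streaming.kafka.KafkaUtils

val topics = "HB_Kafka_Streaming".split(",").map((_, 1)).toMap
val numStreams = 3 //three receivers, i.e. three always-running receiver jobs
val kafkaStreams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(scc, "hadoop002:2181", "HB_groups", topics)
}
val unified = scc.union(kafkaStreams)
unified.map(_._2).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).print()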

(11) Deployment

Same as with Flume.

Approach 2: Direct Approach (No Receivers)

This new receiver-less “direct” approach has been introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system). Note that this feature was introduced in Spark 1.3 for the Scala and Java API, in Spark 1.4 for the Python API.

In other words, the data is read directly from Kafka, much like reading files from a file system.

(1) Official documentation

(2) The createDirectStream method

@param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
*   configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
*   to be set with Kafka broker(s) (NOT zookeeper servers), specified in
*   host1:port1,host2:port2 form.
*   If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest"
*   to determine where the stream starts (defaults to "largest")

So the kafkaParams key -> value is:

"metadata.broker.list" -> "hadoop002:9092"

(3) The code is as follows:

package com.HBinz.spark.streaming.day03

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafkaApp02 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[2]").setAppName("StreamingKafkaApp02")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    //A Map literal is written as Map("key" -> "value")
    val kafkaParams = Map[String,String](
    "metadata.broker.list"->"hadoop002:9092"
    )
    val topics = "HB_Kafka_Streaming".split(",").toSet
    val lines = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](scc,kafkaParams,topics)
    lines.map(x=>(x._2)).flatMap(_.split(",")).map((_,1)).reduceByKey(_ + _).print()
    scc.start()
    scc.awaitTermination()
  }
}

(4) Output

Producer:

HB,HB,RUOZE,RUOZE,RUOZE,HAHA

(5) No receiver

There is no always-running job in the UI.

2. Differences between the two approaches

This approach has the following advantages over the receiver-based approach (i.e. Approach 1).

Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.

Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).

Note: Kafka's default log retention is 168 hours (7 days); as long as retention is sufficient, messages can be recovered from Kafka itself, so no write-ahead log is needed.

Idempotent: no matter how many times you execute it, the result is the same.

3. Output operations

Output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs***Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches.

Idempotent updates: Multiple attempts always write the same data. For example, saveAs***Files always writes the same data to the generated files.

Transactional updates: All updates are made transactionally so that updates are made exactly once atomically. One way to do this would be the following.

Use the batch time (available in foreachRDD) and the partition index of the RDD to create an identifier. This identifier uniquely identifies a blob data in the streaming application.

Update external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update.


dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId)
    // use this uniqueId to transactionally commit the data in partitionIterator
  }
}
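generateUniqueId is not defined in the guide; a minimal hypothetical stand-in, consistent with the description above (batch time plus partition index), could be:

//hypothetical helper: batch time + partition index uniquely identify one partition of one batch
def generateUniqueId(batchTimeMs: Long, partitionId: Int): String =
  s"$batchTimeMs-$partitionId"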

4. Drawbacks

Note that one disadvantage of this approach is that it does not update offsets in Zookeeper, hence Zookeeper-based Kafka monitoring tools will not show progress. However, you can access the offsets processed by this approach in each batch and update Zookeeper yourself (see below).

Next, we discuss how to use this approach in your streaming application.


// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array.empty[OffsetRange]

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
           ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
     println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }

(1) OffsetRange

Represents a range of offsets within a single Kafka topic partition.

Always save the offsets.

(2) HasOffsetRanges

A trait implemented by the RDDs produced by the Kafka direct stream; casting an RDD to HasOffsetRanges (as in the snippet above) exposes its offsetRanges.
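A minimal sketch (the offset values are hypothetical, and scc / kafkaParams are reused from StreamingKafkaApp02 above) showing that an OffsetRange pins down [fromOffset, untilOffset) of one partition and that a saved range can later be replayed as a batch RDD:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

//[fromOffset, untilOffset) for partition 0 of the topic; the numbers are hypothetical
val ranges = Array(OffsetRange.create("HB_Kafka_Streaming", 0, 100L, 200L))
//re-read exactly those messages as a normal batch RDD
val replay = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](scc.sparkContext, kafkaParams, ranges)
println(replay.count())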

III. spark-streaming-kafka-0-10 (the newer integration)

1. pom.xml

groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.3.2

2. Code (nothing needs to be hand-rolled; offset management is handled under the hood)

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))
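Continuing the documentation example above, the same comma-separated word count used earlier can be expressed on the record values:

stream.map(record => record.value)
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()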

3. Kafka itself (committing offsets back to Kafka)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

IV. Fault-tolerance Semantics (http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics)

1. Definitions

  1. At most once: Each record will be either processed once or not processed at all.
  2. At least once: Each record will be processed one or more times. This is stronger than at-most once as it ensures that no data will be lost. But there may be duplicates.
  3. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.


2. Fault tolerance of received data

  1. Data received and replicated - This data survives failure of a single worker node as a copy of it exists on one of the other nodes.
  2. Data received but buffered for replication - Since this is not replicated, the only way to recover this data is to get it again from the source.


3. Two failure scenarios to consider

Failure of a Worker Node - Any of the worker nodes running executors can fail, and all in-memory data on those nodes will be lost. If any receivers were running on failed nodes, then their buffered data will be lost.

Failure of the Driver Node - If the driver node running the Spark Streaming application fails, then obviously the SparkContext is lost, and all executors with their in-memory data are lost.


4. With Receiver-based Sources

The table in the official guide summarizes the semantics under failure scenarios.

5. What are reliable receivers?

Depending on their reliability, there can be two kinds of data sources. Sources such as Kafka and Flume allow the transferred data to be acknowledged. If the system receiving data from these reliable sources acknowledges the received data correctly, it can be ensured that no data is lost due to any kind of failure. This leads to two kinds of receivers:

Reliable receiver: a reliable receiver correctly sends an acknowledgment to a reliable source once the data has been received and stored in Spark with replication.

Unreliable receiver: an unreliable receiver does not send acknowledgments to the source. This can be used for sources that do not support acknowledgment, or even for reliable sources when one does not want or need to go into the complexity of acknowledgment.

6. To avoid losing data with Kafka + Spark Streaming, the auto.offset.reset parameter can be set so that, after a failure, data is consumed again from the earliest offset. This guarantees zero data loss, though at the cost of re-reading and re-processing data (a sketch follows the quote below).

In the Kafka parameters, you must specify either metadata.broker.list or bootstrap.servers. By default, it will start consuming from the latest offset of each Kafka partition. If you set configuration auto.offset.reset in Kafka parameters to smallest, then it will start consuming from the smallest offset.
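A minimal sketch of that setting for the 0.8 direct approach used earlier (broker address as above):

val kafkaParams = Map[String,String](
  "metadata.broker.list" -> "hadoop002:9092",
  //start consuming from the earliest available offset instead of the latest
  "auto.offset.reset" -> "smallest"
)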
