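Spark Streaming Backpressure Mechanism
Background
Backpressure addresses a common problem in stream processing: business traffic surges over a short period, producing a sharp spike in which data arrives much faster than it can be processed, and the system comes under heavy load.
If the system cannot absorb such spikes, or a sustained input rate that is too high, executors may run out of memory (OOM) or the job may crash.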
Legacy mechanism (before Spark 1.5)
Legacy architecture diagram
receiver-based
With a receiver-based input, set spark.streaming.receiver.maxRate to cap the number of records each receiver may accept per second.
direct-approach
With the direct approach, set spark.streaming.kafka.maxRatePerPartition to cap the number of records read per second from each Kafka partition.
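As a minimal sketch, both static limits could be set on a SparkConf before the StreamingContext is created. The application name and the numeric values are placeholders chosen for illustration, not recommendations, and the example assumes Spark is on the classpath.

```scala
import org.apache.spark.SparkConf

object StaticRateLimits {
  // Builds a configuration carrying both static caps, in records per second.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("static-rate-limit-example")                 // hypothetical application name
      .set("spark.streaming.receiver.maxRate", "1000")         // cap per receiver (placeholder value)
      .set("spark.streaming.kafka.maxRatePerPartition", "500") // cap per Kafka partition (placeholder value)

  def main(args: Array[String]): Unit = {
    val conf = build()
    println(conf.get("spark.streaming.receiver.maxRate"))
    println(conf.get("spark.streaming.kafka.maxRatePerPartition"))
  }
}
```

Only the limit that matches the input type in use takes effect; setting both here simply shows the two property names side by side.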
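Drawbacks
- The maximum rate has to be found by load testing in advance, and the value must be chosen carefully: if the cluster can process more than the configured rate, resources are wasted.
- The parameters are set by hand, and changing them requires restarting the streaming application.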
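To see why a static cap needs tuning, it helps to translate it into a per-batch ceiling. The helper below is a hypothetical illustration with made-up numbers: for the direct approach, one batch can hold at most maxRatePerPartition × number of partitions × batch interval in seconds.

```scala
object StaticRateCap {
  // Upper bound on the records one batch can contain under a static per-partition cap.
  def maxRecordsPerBatch(maxRatePerPartition: Long, numPartitions: Int, batchIntervalSec: Long): Long =
    maxRatePerPartition * numPartitions * batchIntervalSec

  def main(args: Array[String]): Unit = {
    // Hypothetical setup: 1000 records/s per partition, 8 partitions, 5-second batches.
    println(maxRecordsPerBatch(1000L, 8, 5L)) // prints 40000
  }
}
```

If the cluster could comfortably process more than this ceiling in one batch interval, the surplus capacity sits idle until someone changes the setting and restarts the job.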
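New backpressure mechanism (Spark 1.5 and later)
The new mechanism needs no manual tuning: Spark Streaming estimates the best ingestion rate for the next batch from the current data volume and the state of the cluster.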
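The dynamic estimation is switched off by default. A minimal sketch of turning it on through the spark.streaming.backpressure.enabled property follows; the application name and the cap value are placeholders, and the example assumes Spark is on the classpath.

```scala
import org.apache.spark.SparkConf

object BackpressureConf {
  // Enables dynamic rate estimation and keeps a static cap as a safety ceiling.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("backpressure-example")                      // hypothetical application name
      .set("spark.streaming.backpressure.enabled", "true")     // turn on rate estimation
      .set("spark.streaming.kafka.maxRatePerPartition", "500") // optional upper bound (placeholder value)

  def main(args: Array[String]): Unit = {
    val conf = build()
    println(conf.getBoolean("spark.streaming.backpressure.enabled", false))
  }
}
```

When a static limit is set alongside backpressure, it continues to act as an upper bound on the estimated rate.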
New architecture diagram
The detailed flow is as follows
The new backpressure mechanism is implemented mainly by the RateController component. RateController extends the StreamingListener trait and implements its onBatchCompleted method.
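The listener pattern described here can be sketched in plain Scala. The types below (BatchStats, BatchListener, SimpleRateController) are simplified stand-ins invented for illustration, not Spark's actual classes. The update rule, which takes the observed throughput of the batch that just finished, is also an assumption; Spark's own estimator is more elaborate.

```scala
// Hypothetical summary of a finished batch (not Spark's BatchInfo).
final case class BatchStats(numRecords: Long, processingDelayMs: Long)

// Hypothetical stand-in for the listener interface.
trait BatchListener {
  def onBatchCompleted(stats: BatchStats): Unit
}

// Keeps the most recent rate estimate, in records per second.
class SimpleRateController(initialRate: Double) extends BatchListener {
  @volatile private var rateLimit: Double = initialRate

  def latestRate: Double = rateLimit

  override def onBatchCompleted(stats: BatchStats): Unit = {
    // Update only when the batch carried data and took measurable time.
    if (stats.numRecords > 0 && stats.processingDelayMs > 0) {
      rateLimit = stats.numRecords.toDouble * 1000.0 / stats.processingDelayMs
    }
  }
}

object RateControllerDemo {
  def main(args: Array[String]): Unit = {
    val controller = new SimpleRateController(initialRate = 1000.0)
    // 5000 records processed in 2.5 seconds.
    controller.onBatchCompleted(BatchStats(numRecords = 5000, processingDelayMs = 2500))
    println(controller.latestRate) // prints 2000.0
  }
}
```

The point of the design is that the rate is recomputed every time a batch completes, so the limit follows the cluster's measured capacity and no restart is needed.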
Walking through the source code of the direct approach
- First, create a Kafka stream.
val kafkaDStream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    streamingContext,
    kafkaParams,
    getOffsets(topics, kc, kafkaParams),
    messageHandler)
- The createDirectStream method creates and returns a DirectKafkaInputDStream object.
/**
* Create an input stream that directly pulls messages from Kafka Brokers
* without using any receiver. This stream can guarantee that each message
* from Kafka is included in transformations exactly once (see points below).
*
* Points to note:
* - No receivers: This stream does not use any receiver. It directly queries Kafka
* - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
* by the stream itself. For interoperability with Kafka monitoring tools that depend on
* Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
* You can access the offsets used in each batch from the generated RDDs (see
* [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
* - Failure Recovery: To recover from driver failures, you have to enable checkpointing
* in the `StreamingContext`. The information on consumed offset can be
* recovered from the checkpoint. See the programming guide for details (constraints, etc.).
* - End-to-end semantics: This stream ensures that every records is effectively received and
* transformed exactly once, bu