1. Set a reasonable consumer parallelism
A good rule of thumb: number of Kafka partitions = number of brokers * 3 (or 6, or 9). In the Direct approach the number of Spark partitions, and therefore the consumer parallelism, equals the number of Kafka partitions.
Can the number of Kafka partitions be increased or decreased?
The number of Kafka partitions can be increased, but it cannot be decreased.
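If a topic later needs more parallelism, partitions can only be added, never removed. A minimal sketch using the Kafka AdminClient; the broker address and the order_info topic name below are placeholders:
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder broker address
val admin = AdminClient.create(props)
// Grow order_info to 9 partitions; asking for fewer than the current count fails.
admin.createPartitions(Map("order_info" -> NewPartitions.increaseTo(9)).asJava).all().get()
admin.close()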
2. Serialization
Java serialization is heavyweight: it serializes a lot of irrelevant metadata and takes a long time.
In our big data production jobs we use Kryo serialization instead:
sparkConf.registerKryoClasses( // use Kryo here; it is far faster than Java serialization
  Array( // register our order, alliance-business, driver and registered-user classes
    classOf[OrderInfo],
    classOf[Opt_alliance_business],
    classOf[DriverInfo],
    classOf[RegisterUsers]
  ) // use Kryo instead of the native Java serializer; this is an optimization point
)
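One optional extra, not in the original code: registerKryoClasses already switches spark.serializer to KryoSerializer under the hood, and the flag below makes the job fail fast if an unregistered class slips through instead of silently writing full class names with every object:
sparkConf.set("spark.kryo.registrationRequired", "true") // optional: error out on unregistered classes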
3. Rate limiting and backpressure
1. When rate limiting is needed
When the time needed to process a batch is larger than the batch interval (batch processing time > batch interval), data piles up in the Receiver on the Executor side, and in the extreme case this leads to an OOM. Because of this, before Spark Streaming 1.5 the intake rate could only be capped with a parameter (set directly on SparkConf):
spark.streaming.receiver.maxRate
In the Direct approach the corresponding parameter is:
spark.streaming.kafka.maxRatePerPartition
which limits the maximum number of records the job reads from each Kafka partition per second.
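As a rough sizing example (the numbers are made up): with 9 partitions, a 5-second batch interval and maxRatePerPartition = 1000, each batch is capped at 9 * 1000 * 5 = 45,000 records:
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "1000") // records per second per partition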
From the analysis above it looks as if the ingest rate can be handled with these parameters. The problem is that after we upgrade the cluster or add machines, its throughput rises a lot and we would have to retune the parameters by hand to avoid wasting resources. Is there a way to adjust the rate automatically?
Since Spark Streaming 1.5 there is a backpressure mechanism that dynamically and automatically adjusts the processing rate:
sparkConf.set("spark.streaming.backpressure.initialRate","500")
sparkConf.set("spark.streaming.backpressure.enabled","true") //压背:他帮你自动调整你的消费速度
Backpressure and rate limiting are implemented mainly by these three classes:
RateController: computes the rate the job should currently run at; it extends StreamingListener.
RateEstimator (implemented by PIDRateEstimator): estimates the new rate from the metrics of the batch that just completed.
StreamingListener: every streaming program in the Spark source goes through this trait; it exposes a series of callback functions that are invoked when a batch finishes, and the backpressure logic runs inside those callbacks.
The quantities involved are:
processingEndTime: the time at which the current batch finished processing
processingDelay: how long the current batch took to process
schedulingDelay: processingStartTime - submissionTime, i.e. processing start time minus submission time = scheduling delay
numRecords: the number of records received in the current batch
newRate: the newly computed processing rate
A guard is applied first: the end time of the current batch must be later than that of the previous batch, the number of records must be greater than zero, and the processing time of the current batch must be greater than zero.
End time of this batch - end time of the previous batch = the interval between the two batches finishing.
The relevant Spark source, PIDRateEstimator, follows:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.streaming.scheduler.rate
import org.apache.spark.internal.Logging
/**
* Implements a proportional-integral-derivative (PID) controller which acts on
* the speed of ingestion of elements into Spark Streaming. A PID controller works
* by calculating an '''error''' between a measured output and a desired value. In the
* case of Spark Streaming the error is the difference between the measured processing
* rate (number of elements/processing delay) and the previous rate.
*
* @see https://en.wikipedia.org/wiki/PID_controller
*
* @param batchIntervalMillis the batch duration, in milliseconds
* @param proportional how much the correction should depend on the current
* error. This term usually provides the bulk of correction and should be positive or zero.
* A value too large would make the controller overshoot the setpoint, while a small value
* would make the controller too insensitive. The default value is 1.
* @param integral how much the correction should depend on the accumulation
* of past errors. This value should be positive or 0. This term accelerates the movement
* towards the desired value, but a large value may lead to overshooting. The default value
* is 0.2.
* @param derivative how much the correction should depend on a prediction
* of future errors, based on current rate of change. This value should be positive or 0.
* This term is not used very often, as it impacts stability of the system. The default
* value is 0.
* @param minRate what is the minimum rate that can be estimated.
* This must be greater than zero, so that the system always receives some data for rate
* estimation to work.
*/
private[streaming] class PIDRateEstimator(
batchIntervalMillis: Long,
proportional: Double,
integral: Double,
derivative: Double,
minRate: Double
) extends RateEstimator with Logging {
private var firstRun: Boolean = true
private var latestTime: Long = -1L
private var latestRate: Double = -1D
private var latestError: Double = -1L
require(
batchIntervalMillis > 0,
s"Specified batch interval $batchIntervalMillis in PIDRateEstimator is invalid.")
require(
proportional >= 0,
s"Proportional term $proportional in PIDRateEstimator should be >= 0.")
require(
integral >= 0,
s"Integral term $integral in PIDRateEstimator should be >= 0.")
require(
derivative >= 0,
s"Derivative term $derivative in PIDRateEstimator should be >= 0.")
require(
minRate > 0,
s"Minimum rate in PIDRateEstimator should be > 0")
logInfo(s"Created PIDRateEstimator with proportional = $proportional, integral = $integral, " +
s"derivative = $derivative, min rate = $minRate")
/*
 * time: the end time of the current batch
 * numElements: the number of records in the batch
 * processingDelay: `processingEndTime` - `processingStartTime`, the time the batch took to process
 * schedulingDelay: `processingStartTime` - `submissionTime`, the time the batch waited before it started
 * */
def compute(
time: Long, // end time of the current batch
numElements: Long, // number of records in the batch
processingDelay: Long, // `processingEndTime` - `processingStartTime`, processing time of this batch
schedulingDelay: Long // `processingStartTime` - `submissionTime`, scheduling delay of this batch
): Option[Double] = {
logTrace(s"\ntime = $time, # records = $numElements, " +
s"processing time = $processingDelay, scheduling delay = $schedulingDelay")
this.synchronized { // take the lock first
/*
 * First a sanity check: the end time of this batch must be later than the end time of the previous batch,
 * the number of records received must be > 0, and the processing time of this batch must be > 0.
 * Only when all of that holds does the estimation below run.
 * */
if (time > latestTime && numElements > 0 && processingDelay > 0) {
/*
 * end time of this batch - end time of the previous batch = interval between the two batches finishing (in seconds)
 * */
val delaySinceUpdate = (time - latestTime).toDouble / 1000
/*
 * records in this batch / processing time of this batch = processing rate of this batch (records per second)
 * */
val processingRate = numElements.toDouble / processingDelay * 1000
/*
 * rate of the previous batch - rate of this batch = rate error
 * */
val error = latestRate - processingRate
/*
 * scheduling delay of this batch * processing rate of this batch / batch interval
 * = the backlog built up while this batch waited, expressed as a rate per batch interval (the historical error)
 * */
val historicalError = schedulingDelay.toDouble * processingRate / batchIntervalMillis
// (error of this batch - error of the previous batch) / interval between the two batches = derivative of the error
val dError = (error - latestError) / delaySinceUpdate
/*
 * previous rate - proportional (1) * error - integral (0.2) * historical error - derivative (0) * error derivative
 * = the new consumption rate, floored at minRate
 * */
val newRate = (latestRate - proportional * error -
integral * historicalError -
derivative * dError).max(minRate)
logTrace(s"""
| latestRate = $latestRate, error = $error
| latestError = $latestError, historicalError = $historicalError
| delaySinceUpdate = $delaySinceUpdate, dError = $dError
""".stripMargin) //打印一下新的速率
latestTime = time
/*
 * On the first run, just record the current rate and do not tell the executors to change speed; firstRun = false.
 * */
if (firstRun) {
latestRate = processingRate
latestError = 0D
firstRun = false
logTrace("First run, rate estimation skipped")
None
} else {
latestRate = newRate
latestError = error
logTrace(s"New rate = $newRate")
Some(newRate)
}
} else {
logTrace("Rate estimation skipped")
None
}
}
}
}
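To make the formula concrete, here is a small walk-through of compute() with made-up numbers and the default weights (proportional = 1, integral = 0.2, derivative = 0):
val batchIntervalMillis = 1000L
val latestRate = 1000.0     // previous estimate: 1000 records/s
val numElements = 1000L     // records in this batch
val processingDelay = 1250L // the batch took 1.25 s to process, longer than the 1 s interval
val schedulingDelay = 250L  // and it waited 0.25 s before it could start

val processingRate  = numElements.toDouble / processingDelay * 1000                   // 800 records/s
val error           = latestRate - processingRate                                     // 200
val historicalError = schedulingDelay.toDouble * processingRate / batchIntervalMillis // 200
val newRate         = latestRate - 1.0 * error - 0.2 * historicalError - 0.0          // 1000 - 200 - 40 = 760 records/s
Because the batch both ran long and had to wait, the controller backs the rate off from 1000 to 760 records/s.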
4. CPU idle time
A streaming job is split into tasks; if a task gets no data it still runs, so it just spins, and even an idle task consumes resources because it still goes through serialization, deserialization, compression and so on.
spark.locality.wait=1s // the default is 3s
By default the scheduler waits 3 seconds for a data-local slot; lowering the wait (to 1 second here) makes it give up on locality sooner, reducing the time tasks sit idle and the resources they waste.
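If finer control is wanted, the wait can also be set per locality level (these keys default to the value of spark.locality.wait; the 1s values here are illustrative):
sparkConf.set("spark.locality.wait", "1s")          // overall default, down from 3s
sparkConf.set("spark.locality.wait.process", "1s")  // wait before falling back from PROCESS_LOCAL
sparkConf.set("spark.locality.wait.node", "1s")     // wait before falling back from NODE_LOCAL
sparkConf.set("spark.locality.wait.rack", "1s")     // wait before falling back from RACK_LOCAL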
5. Do not check in code whether the table exists
When we are about to insert a batch of data, the table must already exist; if it did not, what would we be inserting into?
Checking whether the table exists before every insert is extremely wasteful; the check costs roughly 1 second.
So create the tables ahead of time. In production we create all tables from the command line, because creating them from code wastes time.
Creating the table from the command line (pre-split into regions):
create 'order_info', {NAME => 'MM', COMPRESSION => 'SNAPPY'}, SPLITS => ['0000|','0001|','0002|','0003|','0004|','0005|','0006|','0007|']
6. Speculative execution
Once the streaming job is run(), the following can appear:
tasks stuck in a failed state ----> which causes retry after retry
tasks in a pending state (waiting tasks) ----> which sit there blocked
Summary: the failure of one task makes the program retry over and over, the remaining tasks are blocked behind it, and efficiency suffers.
To solve this we enable speculative execution:
sparkConf.set("spark.speculation", "true") // enable speculative execution
.set("spark.speculation.interval", "300") // how often (ms) to check for tasks to speculate
.set("spark.speculation.quantile", "0.9") // once 90% of the tasks in a stage have finished, launch speculative copies of the remaining slow tasks on other machines
Why 90%? Because in a batch of tasks there are usually only one or two that lag behind or fail.
7. Enable dynamic resource allocation (mainly for Spark SQL; it should not be enabled for Spark Streaming, because there it can lead to data loss)
1. Why enable dynamic resource allocation?
When a user submits a Spark application to YARN, the number of executors can be specified explicitly with spark-submit's num-executors parameter. The ApplicationMaster then requests resources for these executors, and each executor runs as a Container on YARN. The Spark scheduler assigns tasks to the executors according to its placement strategy. Only after all tasks have finished are the executors killed and the application ends. While the job is running, an executor holds on to its resources whether or not it has been given any tasks. Clearly, when the workload is small but a large number of executors was explicitly requested, this easily wastes resources.
Say you asked for 50 executors but only 30 are actually doing work: the other 20 just sit idle. With dynamic resource allocation enabled you still submit with 50 and still only 30 are busy, so 20 are idle at first, but that is fine: after the configured idle timeout (15 s in our setup; spark.dynamicAllocation.executorIdleTimeout, default 60 s) the idle executors are detected and reclaimed.
provider class path
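A minimal sketch of the relevant settings (the executor counts and the 15 s timeout are illustrative; dynamic allocation on YARN also requires the external shuffle service to be running on the NodeManagers):
sparkConf.set("spark.dynamicAllocation.enabled", "true")
sparkConf.set("spark.shuffle.service.enabled", "true")              // external shuffle service keeps shuffle data when executors are removed
sparkConf.set("spark.dynamicAllocation.minExecutors", "10")
sparkConf.set("spark.dynamicAllocation.maxExecutors", "50")
sparkConf.set("spark.dynamicAllocation.executorIdleTimeout", "15s") // reclaim executors idle for 15 s (default 60 s)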