1 updateStateByKey
The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps:
- Define the state. The state can be of any data type.
- Define the state update function. Specify a function that uses the previous state and the new values to compute the updated state.
In every batch, Spark applies the state update function to all existing keys, whether or not a key received new data in that batch.
The two parameters of the update function are (a small sketch of this contract follows the method signature below):
Seq[V]: the sequence of new values received for the key in the current batch
Option[S]: the state accumulated up to the previous batch
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
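To make the update function's contract concrete, here is a minimal sketch in plain Scala (no Spark involved; the values are made up for illustration):

object UpdateFuncDemo extends App {
  // Same shape as the updateFunc used in the full example below.
  def updateFunc(newValues: Seq[Int], runningState: Option[Int]): Option[Int] =
    Some(newValues.sum + runningState.getOrElse(0))

  println(updateFunc(Seq(1, 1, 1), None))  // Some(3) -- first batch, no previous state
  println(updateFunc(Seq(1, 1), Some(3)))  // Some(5) -- state carried over from the previous batch
  println(updateFunc(Seq(), Some(5)))      // Some(5) -- no new data for the key, state is kept
}

The full example below consumes from Kafka and keeps a running word count per word.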
package com.gc.sparkStreaming.day01.HaveStatusTransform

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Stateful transformation: caches the intermediate results so they can be accumulated across batches.
 */
object upDateStateByKey {
  // Requirement: consume from Kafka and keep a running sum over the input
  // Example: word count
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("upDateStateByKey")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val dsStream: DStream[(String, Int)] = wordOne.updateStateByKey(updateFunc)
    streamingContext.checkpoint("./checkpoint") // checkpoint directory where the intermediate state is stored
    dsStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
  }

  // State update function: the first parameter is the sequence of new values for the key,
  // the second is the result accumulated in previous batches.
  def updateFunc(newValue: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the sum of the new values to the previous state
    val sum: Int = newValue.sum
    val value: Int = runningCount.getOrElse(0)
    Some[Int](sum + value)
  }
}
2 Window operations
Spark Streaming also provides windowed computations, which let you apply a transformation to the data inside a sliding window.
By default a computation only covers the RDDs of a single batch; with a window, it is applied to all the RDDs that fall inside the specified window, and one window can span multiple batches. A window-based operation therefore combines the results of several batches and produces a result over a period longer than the StreamingContext's batch interval.
Each time the window slides over the source DStream, the RDDs that fall within the window are combined, the operation is applied to them, and the results form the RDDs of the windowed DStream.
For example, a window might cover the last 3 time units of data and slide by 2 time units at a time. A window operation therefore needs two parameters:
• window length – the duration of the window (3 in this example)
• slide interval – the interval at which the window operation is performed (2 in this example)
Note: both parameters must be multiples of the source DStream's batch interval.
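The arithmetic below is a minimal sketch of these two constraints; the 4-second batch interval and the 12-second/8-second window are the values assumed in the Kafka examples that follow, not anything fixed by the API:

object WindowArithmetic extends App {
  val batchIntervalSec = 4   // StreamingContext batch interval
  val windowLengthSec  = 12  // must be a multiple of the batch interval
  val slideIntervalSec = 8   // must be a multiple of the batch interval

  require(windowLengthSec % batchIntervalSec == 0, "window length must be a multiple of the batch interval")
  require(slideIntervalSec % batchIntervalSec == 0, "slide interval must be a multiple of the batch interval")

  println(s"each window covers ${windowLengthSec / batchIntervalSec} batches")       // 3
  println(s"the window slides every ${slideIntervalSec / batchIntervalSec} batches") // 2
}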
2.1 reduceByKeyAndWindow
/**
* Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
* `DStream.reduceByKey()` but applies it over a sliding window. Hash partitioning is used to
* generate the RDDs with Spark's default number of partitions.
* @param reduceFunc associative and commutative reduce function
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
*/
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}
The example below reads a stream from Kafka and does a windowed word count with a window length of 12 seconds and a slide interval of 8 seconds.
package com.gc.sparkStreaming.day02.window

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Windowed word count over a Kafka stream using reduceByKeyAndWindow.
 */
object ReduceByKeyAndWindow {
  // Requirement: consume from Kafka and sum the count of each word within a window
  // Example: word count
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val resDStream: DStream[(String, Int)] = wordOne.reduceByKeyAndWindow((x: Int, y: Int) => {
      x + y
    }, Seconds(12), Seconds(8))
    // Window length is Seconds(12), slide interval is Seconds(8).
    // (x: Int, y: Int) => {x + y} is the reduce function; the parameter types are given explicitly.
    // Both the window length and the slide interval must be multiples of the batch interval.
    resDStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
2.2 reduceByKeyAndWindow (with invReduceFunc)
This overload takes an additional invReduceFunc and is more efficient than the version without it, because it reuses the previous window's result instead of recomputing the whole window.
invReduceFunc: (V, V) => V — when the window slides, the old window and the new window overlap, so the overlapping part does not need to be reduced again. The first parameter is the previously reduced window value and the second is the old value that has just slid out of the window and must be "subtracted" from it.
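Before looking at the signature and the full example, here is a minimal sketch in plain Scala (no Spark) of how the incremental update works, assuming the reduce function is addition and the inverse is subtraction; the batch values are made up for illustration:

object InvReduceSketch extends App {
  val reduceFunc    = (x: Int, y: Int) => x + y
  val invReduceFunc = (acc: Int, old: Int) => acc - old // "undo" a previously reduced value

  val oldWindowSum  = 10 // reduced value of the previous window, covering batches [b1, b2, b3]
  val leavingValue  = 4  // contribution of b1, which slides out of the window
  val enteringValue = 7  // contribution of b4, which slides in

  // The new window [b2, b3, b4] is derived from the old result instead of re-reducing every batch:
  val newWindowSum = reduceFunc(invReduceFunc(oldWindowSum, leavingValue), enteringValue)
  println(newWindowSum) // 13
}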
Method signature:
/**
* Return a new DStream by applying incremental `reduceByKey` over a sliding window.
* The reduced value of over a new window is calculated using the old window's reduced value :
* 1. reduce the new values that entered the window (e.g., adding new counts)
*
* 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
*
* This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
* However, it is applicable to only "invertible reduce functions".
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param reduceFunc associative and commutative reduce function
* @param invReduceFunc inverse reduce function; such that for all y, invertible x:
* `invReduceFunc(reduceFunc(x, y), x) = y`
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
* @param filterFunc Optional function to filter expired key-value pairs;
* only pairs that satisfy the function are retained
*/
def reduceByKeyAndWindow(
reduceFunc: (V, V) => V,
invReduceFunc: (V, V) => V,
windowDuration: Duration,
slideDuration: Duration = self.slideDuration,
numPartitions: Int = ssc.sc.defaultParallelism,
filterFunc: ((K, V)) => Boolean = null
): DStream[(K, V)] = ssc.withScope {
reduceByKeyAndWindow(
reduceFunc, invReduceFunc, windowDuration,
slideDuration, defaultPartitioner(numPartitions), filterFunc
)
}
The example below implements the same word count, this time supplying an invReduceFunc.
package com.gc.sparkStreaming.day02.window

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Windowed word count over a Kafka stream using reduceByKeyAndWindow with an inverse reduce function.
 */
object ReduceByKeyAndWindow2 {
  // Requirement: consume from Kafka and sum the count of each word within a window
  // Example: word count
  // Without a checkpoint this fails with "The checkpoint directory has not been set.
  // Please set it by StreamingContext.checkpoint()", so a checkpoint directory is required.
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ReduceByKeyAndWindow2")
    val streamingContext = new StreamingContext(conf, Seconds(4)) // one batch every 4 seconds
    val group: String = "guochao" // consumer group
    val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092" // Kafka broker list
    val topic: String = "first" // topic
    val kafkaParams = Map(
      ConsumerConfig.GROUP_ID_CONFIG -> group,
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers
    )
    val kafkaInputStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext,
      kafkaParams,
      Set(topic)
    )
    val wordOne: DStream[(String, Int)] = kafkaInputStream.map(_._2).flatMap(_.split("\\W+")).map((_, 1))
    val resDStream: DStream[(String, Int)] = wordOne.reduceByKeyAndWindow(
      (x: Int, y: Int) => {x + y},        // reduce function; the parameter types are given explicitly
      (newValue: Int, oldValue: Int) => { // invReduceFunc: newValue is the previous window's reduced value,
        println(newValue)                 // oldValue is the value that has just slid out of the window
        println(oldValue)
        newValue - oldValue
      },
      Seconds(12),                        // window length
      Seconds(8),                         // slide interval; both must be multiples of the batch interval
      filterFunc = _._2 > 0               // named argument: drop keys whose windowed count has fallen to 0
    )
    streamingContext.checkpoint("checkpoint1")
    resDStream.print(100)
    streamingContext.start()
    streamingContext.awaitTermination()
    // Once the stream has been running longer than the window, keys that received no new data keep
    // showing up with a count of 0, e.g.:
    // (d,0)
    // (b,0)
    // (f,0)
    // (s,0)
    // (gh,0)
    // (a,0)
    // (g,0)
    // The filterFunc passed above removes those entries.
  }
}
2.3 window(windowLength, slideInterval)
Applies the computation to windowed batches of the source DStream and returns a new DStream.
package com.gc.sparkStreaming.day02.window

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Feed input through a network socket, e.g. start a listener with: nc -lk 9999
object window2 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("window2").setMaster("local[2]")
    val sc: StreamingContext = new StreamingContext(conf, Seconds(3))
    val dsStream: DStream[String] = sc.socketTextStream("hadoop102", 9999).window(Seconds(9), Seconds(6))
    val rsDstream: DStream[(String, Int)] = dsStream.flatMap(_.split("\\W+")).map((_, 1)).reduceByKey(_ + _)
    rsDstream.print(100)
    sc.start()
    sc.awaitTermination()
    /* Sample output:
    -------------------------------------------
    Time: 1569327660000 ms
    -------------------------------------------
    -------------------------------------------
    Time: 1569327666000 ms
    -------------------------------------------
    (wangwu,1)
    (lisi,1)
    -------------------------------------------
    Time: 1569327672000 ms
    -------------------------------------------
    (zhangsan,1)
    (wangwu,1)
    (wangermazi,1)
    -------------------------------------------
    Time: 1569327678000 ms
    -------------------------------------------
    (wangermazi,1)
    -------------------------------------------
    Time: 1569327684000 ms
    -------------------------------------------
    */
  }
}
2.4 countByWindow(windowLength, slideInterval)
Returns a DStream in which each element is the number of elements in a sliding window over the source DStream. As the signature below shows, it is implemented with map and an inverse-reducing reduceByWindow, so a checkpoint directory must be set.
/**
* Return a new DStream in which each RDD has a single element generated by counting the number
* of elements in a sliding window over this DStream. Hash partitioning is used to generate
* the RDDs with Spark's default number of partitions.
* @param windowDuration width of the window; must be a multiple of this DStream's
* batching interval
* @param slideDuration sliding interval of the window (i.e., the interval after which
* the new DStream will generate RDDs); must be a multiple of this
* DStream's batching interval
*/
def countByWindow(
windowDuration: Duration,
slideDuration: Duration): DStream[Long] = ssc.withScope {
this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
package com.gc.sparkStreaming.day02.window

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object countByWindow {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("countByWindow").setMaster("local[2]")
    val sc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // a checkpoint directory is required, because countByWindow uses an inverse reduce internally
    sc.checkpoint("./window")
    val dsStream: DStream[String] = sc.socketTextStream("hadoop102", 9999)
    val rsDstream: DStream[(String, Int)] = dsStream.flatMap(_.split("\\W+")).map((_, 1)).reduceByKey(_ + _)
    rsDstream.print(100)
    val countDstream: DStream[Long] = rsDstream.countByWindow(Seconds(9), Seconds(6))
    countDstream.print(100)
    sc.start()
    sc.awaitTermination()
  }
}