Basic Usage of Spark Streaming

啊帅和和。

Article link: https://blog.csdn.net/l_dsj/article/details/121322255

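Understanding micro-batching

Spark's micro-batch processing is roughly equivalent to wrapping our code in a scheduled task that fires once per batch interval. Once the data source is switched from HDFS to a streaming source, new data keeps arriving and each batch is processed as soon as it has been collected.

The data source here is no longer HDFS (which is relatively slow and adds latency) but a message queue, where records are queued and consumed first in, first out. Producers write data into the queue and consumers read it out. The queue does not store data permanently; retention is typically about seven days.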
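
To make the "scheduled task" picture concrete, here is a minimal plain-Scala sketch (no Spark required) that assigns timestamped records to fixed 5-second batches. The `Event` type and the `toBatches` helper are hypothetical names invented for this illustration; they are not part of the Spark API, and real Spark Streaming forms batches from data as it arrives rather than from a pre-collected list.

```scala
// Plain-Scala sketch: group timestamped records into fixed-length batches.
object MicroBatchSketch {
  // Hypothetical record type: arrival time in milliseconds plus a text payload.
  final case class Event(timestampMs: Long, line: String)

  // Assign each event to the batch whose interval contains its arrival time.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs) // batch index = time / interval
      .map { case (batchIndex, es) => batchIndex -> es.map(_.line) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(1000L, "a,b"),
      Event(4900L, "a"),
      Event(5000L, "c"), // first record of the second batch
      Event(12000L, "b,c")
    )
    val batches = toBatches(events, batchIntervalMs = 5000L)
    // Handle the batches in time order, one after another.
    batches.toSeq.sortBy(_._1).foreach { case (index, lines) =>
      println(s"batch $index: ${lines.mkString(" | ")}")
    }
  }
}
```

Each batch is then handed to the same processing code, which is why the word count logic in the next section looks like an ordinary batch job.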

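A simple word count example (results are discarded after each batch)

At least two cores are required here: the receiver permanently occupies one to receive data, and at least one more is needed to process the received batches.

All we need as a data source is a listening port. On Linux this can be done with the nc command, which can be installed with yum install nc and started with nc -lk 8888.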

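    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Durations, StreamingContext}
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

    // Create the Spark Streaming environment
    val conf = new SparkConf()
    conf.setAppName("Demo1WordCountStreaming")
    // At least two threads are needed (e.g. local[2]): the receiver occupies one,
    // and batch processing needs at least one more. local[*] uses all available cores.
    conf.setMaster("local[*]")

    val sc = new SparkContext(conf)

    // Batch interval: every 5 seconds of received data is packaged into one batch
    val ssc = new StreamingContext(sc, Durations.seconds(5))

    /**
      * Use the nc tool to simulate a message queue.
      * Install on Linux: yum install nc
      */
    // Connect to the nc service over a socket; each element of the DStream is one line of text
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("master", 8888)

    // Split each line into comma-separated words
    val flatMapDS: DStream[String] = lines.flatMap(_.split(","))

    // Pair each word with an initial count of 1
    val mapDS: DStream[(String, Int)] = flatMapDS.map(word => (word, 1))

    // Sum the counts per word within the current batch only
    val resultDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)

    resultDS.print()

    // A Spark Streaming program must be started explicitly
    ssc.start()
    // Block until the computation is stopped or fails
    ssc.awaitTermination()
    // Release resources once the context has terminated
    ssc.stop()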

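
The "discarded after each batch" behaviour can be reproduced without Spark. The following is a minimal sketch on plain Scala collections; `countBatch` is a hypothetical helper written for this illustration, mirroring the flatMap, map and reduceByKey steps above on a single batch. Filtering out empty strings is an extra assumption added here and is not in the Spark example.

```scala
// Plain-Scala sketch: a stateless word count applied to one batch at a time.
object StatelessWordCountSketch {
  // Count the words in one batch of comma-separated lines.
  // Nothing is carried over from one call to the next.
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(",").toSeq)
      .filter(_.nonEmpty) // assumption: skip empty tokens such as those from ",,"
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("a,b", "a")
    val batch2 = Seq("a,c")
    // The count of "a" in the second batch starts from zero again.
    println(countBatch(batch1))
    println(countBatch(batch2))
  }
}
```

Because the second call knows nothing about the first, the word "a" is counted as 2 in the first batch and as 1 in the second, never as 3.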

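Stateful operators

A stateful operator keeps the results computed from earlier batches, so a checkpoint directory must be set.

The stateful operator updateStateByKey takes an update function as its argument. In that function, seq holds all values of the current batch for one key after grouping by key, and option holds the previous result for that key (it is empty if the key has not been seen before).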

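import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("Demo2UpdateStateByKey")

    val sc = new SparkContext(conf)

    val ssc: StreamingContext = new StreamingContext(sc, Durations.seconds(5))

    // Stateful operators need a checkpoint directory to persist the state
    ssc.checkpoint("spark/data/checkpoint")

    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("master", 8888)

    val wordsDS: DStream[(String, Int)] = lines.flatMap(_.split(","))
      .map(word => (word, 1))

    /**
      * reduceByKey: computed once every five seconds; earlier results are not kept
      */
    // To keep earlier results, use the stateful operator updateStateByKey,
    // which updates the state per key. The state is the result computed so far.

    // seq: all values of the current batch for one key, after grouping by key
    // option: the previous result for that key (Some if it exists, None otherwise)
    // Values are grouped by key automatically, e.g. (a, [1, 1, 1, 1, 1])
    def updateFunc(seq: Seq[Int], option: Option[Int]): Option[Int] = {

      val seq_sum: Int = seq.sum
      val last_res: Int = option.getOrElse(0) // previous result, 0 for a new key
      Some(seq_sum + last_res)

    }

    wordsDS.updateStateByKey(updateFunc).print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
}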
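
How the update function accumulates state across batches can be traced by hand on plain collections. The sketch below reuses the same `updateFunc` signature; `applyBatch` is a hypothetical helper written for this illustration that imitates what the operator does once per batch, and it is not how Spark implements it internally.

```scala
// Plain-Scala sketch: apply an updateStateByKey-style function batch by batch.
object UpdateStateSketch {
  // Same shape as the function passed to updateStateByKey:
  // the current batch's values for one key, plus that key's previous state.
  def updateFunc(seq: Seq[Int], option: Option[Int]): Option[Int] =
    Some(seq.sum + option.getOrElse(0))

  // Hypothetical helper: group one batch by key, then update every key's state.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    // Keys that already have state are updated even when the batch holds no values for them.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("a" -> 1, "b" -> 1, "a" -> 1), // first 5-second batch
      Seq("a" -> 1, "c" -> 1)            // second 5-second batch
    )
    // Start from empty state and feed the batches in order.
    val finalState = batches.foldLeft(Map.empty[String, Int])(applyBatch)
    println(finalState)
  }
}
```

After the two batches the state holds a total of 3 for "a" and 1 each for "b" and "c", whereas reduceByKey alone would have reported each batch separately.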

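Sliding windows

Consecutive windows overlap when the slide interval is shorter than the window length. When the two are equal there is no overlap, and the window is called a tumbling window.
Here we compute over the last 10 s of data every 5 s.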

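import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

def main(args: Array[String]): Unit = {

    // Every 5 s, count the words received during the last 10 s
    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("Demo3Window")

    val sc = new SparkContext(conf)

    val ssc: StreamingContext = new StreamingContext(sc, Durations.seconds(5))

    ssc.checkpoint("spark/data/checkpoint")

    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("master", 8888)

    lines
      .flatMap(_.split(","))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(
        (x: Int, y: Int) => x + y // aggregation function applied within each window
        , Durations.seconds(10) // window length
        , Durations.seconds(5) // slide interval; defaults to the batch interval (5 s here) if omitted
      ).print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()

}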
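
The difference between a sliding and a tumbling window can be checked on plain collections. The sketch below treats each map as the word counts of one 5-second batch and combines them with the same `(x, y) => x + y` function; `merge` and `windowed` are hypothetical helpers written for this illustration. It is a simplification that only emits full windows, so it does not reproduce Spark's exact behaviour for the first, partially filled window.

```scala
// Plain-Scala sketch: sliding vs. tumbling windows over per-batch word counts.
object WindowSketch {
  // Combine two count maps with the same aggregation used above: (x, y) => x + y.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) => acc.updated(word, acc.getOrElse(word, 0) + n) }

  // windowBatches = window length / batch interval
  // slideBatches  = slide interval / batch interval
  def windowed(batches: Seq[Map[String, Int]],
               windowBatches: Int,
               slideBatches: Int): Seq[Map[String, Int]] =
    batches
      .sliding(windowBatches, slideBatches)
      .map(_.reduce(merge))
      .toList

  def main(args: Array[String]): Unit = {
    // Four consecutive 5-second batches.
    val batches = Seq(
      Map("a" -> 1),
      Map("a" -> 2, "b" -> 1),
      Map("b" -> 1),
      Map("c" -> 1)
    )
    // 10 s window, 5 s slide: neighbouring windows share one batch.
    windowed(batches, windowBatches = 2, slideBatches = 1).foreach(println)
    // 10 s window, 10 s slide: a tumbling window with no overlap.
    windowed(batches, windowBatches = 2, slideBatches = 2).foreach(println)
  }
}
```

With a 5 s slide the second batch contributes to two windows, which is the overlap described above; with a 10 s slide every batch is counted exactly once.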