Spark Streaming Programming Guide: Window Operations
1. Overview of Window Operations
As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters (see the sketch after this list).
- window length - The duration of the window (3 in the figure).
- sliding interval - The interval at which the window operation is performed (2 in the figure).
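To make these two parameters concrete, here is a minimal sketch using the basic window operation, assuming a StreamingContext ssc with a batch interval of 1 second and an illustrative hostname and port. Note that both parameters must be multiples of the source DStream's batch interval.
// A sketch of window(windowLength, slideInterval); `ssc`, the hostname, and the port are assumptions
val lines = ssc.socketTextStream("localhost", 9999)
// Combine the RDDs of the last 3 seconds, recomputing every 2 seconds (as in the figure)
val windowedLines = lines.window(Seconds(3), Seconds(2))
windowedLines.print()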
Example:
Suppose you want to generate word counts over the last 30 seconds of data, every 10 seconds. To do this, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
2. Window Operation Functions
Some of the common window operations are as follows:
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
A worked example of the incremental (invFunc) variant from the table above: to compute the word counts for the 5-second window ending at time t+4, take the counts for the 5-second window ending at t+3, add the counts for [t+3, t+4], and subtract the counts for [t-2, t-1]. The counts for the seconds shared by the two windows are reused, which is what makes this version more efficient.
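A minimal sketch of this incremental form, assuming pairs is the (word, 1) DStream from the example above and ssc is the surrounding StreamingContext. Checkpointing must be enabled for the inverse-reduce variant; the directory name here is an assumption.
// Checkpointing is required for the invFunc variant; the directory is an assumed example
ssc.checkpoint("reduce_checkpoint")
val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce the new data entering the window
  (a: Int, b: Int) => a - b,  // "inverse reduce" the old data leaving the window
  Seconds(30), Seconds(10))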
====================================================================================================================================
3. Examples
3.1 countByWindow
(windowLength, slideInterval)
Return a sliding window count of elements in the stream.
The contents of TransfromData.txt are as follows (one word per line):
hello hell0 hello hello hjw hjw hjw hjw hello hello hello
Program arguments: ./srcFile/TransfromData.txt 9999 1000
package com.dt.spark.main.Streaming

import java.io.PrintWriter
import java.net.ServerSocket
import java.util.Date

import scala.io.Source

/**
  * Created by hjw on 17/5/1.
  */
object StreamingSimulation {
  /*
  Return a random index in [0, length)
  */
  def index(length: Int) = {
    import java.util.Random
    val rdm = new Random()
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }

    val filename = args(0)
    val lines = Source.fromFile(filename).getLines().toList
    val fileRow = lines.length

    val listener = new ServerSocket(args(1).toInt)
    // Listen on the given port and open a connection for each incoming request
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connect from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(args(2).toLong)
            // Send a randomly chosen line to the client
            val content = lines(index(fileRow))
            val now = new Date()
            println("time: " + now.getTime + " " + content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
CountByWindow
The batch interval is 1 second:
val ssc = new StreamingContext(conf, Seconds(1))
val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))
With one word arriving per second, the 3-second window holds 3 elements, so windowedWordCounts should be 3.
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._

object CountByWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Pair each word with a count of 1
    val pairs = words.map(word => (word, 1))
    // Count the pairs that fall in the last 3 seconds, every 2 seconds
    val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))

    // Print the first ten elements of each RDD generated in this DStream to the console
    windowedWordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
Run the StreamingSimulation simulator first, then run CountByWindow.
StreamingSimulation output:
Got client connect from: /127.0.0.1
time: 1493700495475 hello
time: 1493700496476 hello
time: 1493700497480 hell0
time: 1493700498482 hello
time: 1493700499485 hello
time: 1493700500486 hello
time: 1493700501490 hell0
time: 1493700502491 hjw
time: 1493700503496 hello
time: 1493700504501 hjw
17/05/02 12:48:13 INFO Slf4jLogger: Slf4jLogger started
17/05/02 12:48:13 INFO Remoting: Starting remoting
17/05/02 12:48:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.18.136.133:49188]
17/05/02 12:48:14 INFO WriteAheadLogManager for Thread: Recovered 1 write ahead log files from file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata
-------------------------------------------
Time: 1493700496000 ms
-------------------------------------------
1
17/05/02 12:48:16 INFO WriteAheadLogManager for Thread: Attempting to clear 1 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000: file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata/log-1493700373007-1493700433007
17/05/02 12:48:16 INFO WriteAheadLogManager for Thread: Cleared log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000
-------------------------------------------
Time: 1493700498000 ms
-------------------------------------------
3
17/05/02 12:48:18 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700475000:
-------------------------------------------
Time: 1493700500000 ms
-------------------------------------------
3
3.2 reduceByKeyAndWindow
(func, windowLength, slideInterval, [numTasks])
The contents of TransfromData.txt are as follows, ensuring that each send delivers exactly one "hello":
hello hello hello hello
The ReduceByKeyAndWindow program is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object ReduceByKeyAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Pair each word with a count of 1
    val pairs = words.map(word => (word, 1))
    // Reduce by key over the last 10 seconds of data, every 2 seconds, with 2 reduce tasks
    val reduceWordCounts = pairs.reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(10), Seconds(2), 2)

    reduceWordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
Simulator output:
Got client connect from: /127.0.0.1
time: 1493704449107 hello
time: 1493704450110 hello
time: 1493704451110 hello
time: 1493704452114 hello
time: 1493704453116 hello
time: 1493704454119 hello
time: 1493704455122 hello
time: 1493704456127 hello
time: 1493704457128 hello
time: 1493704458128 hello
time: 1493704459134 hello
time: 1493704460138 hello
time: 1493704461138 hello
time: 1493704462142 hello
time: 1493704463144 hello
The steady-state output of ReduceByKeyAndWindow:
17/05/02 13:54:17 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493704445000:
-------------------------------------------
Time: 1493704459000 ms
-------------------------------------------
(hello,10)
====================================================================================================================================
3.3 CountByValueAndWindow
Here countByValueAndWindow uses a window length of 10 seconds, and each count reports "hello" with a frequency of 20 (each line sent evidently contains two "hello"s, so a 10-second window at one line per second sees 20 of them). It can also be understood as a reduceByKeyAndWindow over ("hello", 1) pairs in disguise.
val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)
The simulator's TransfromData.txt now contains:
hello hello hello hello hello hello hello hello hello hello hello hello hello hello
The CountByValueAndWindow program is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object CountByValueAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByValueAndWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each distinct value's frequency over the last 10 seconds, every 2 seconds, with 2 reduce tasks
    val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)

    countByValueAndWindow.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
The CountByValueAndWindow output is as follows (the (,10) entries count the empty tokens that split(" ") produces when a line contains extra whitespace):
17/05/02 17:47:57 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718447000:
-------------------------------------------
Time: 1493718479000 ms
-------------------------------------------
(,10)
(hello,20)
17/05/02 17:47:59 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718449000:
-------------------------------------------
Time: 1493718481000 ms
-------------------------------------------
(,10)
(hello,20)
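As noted in 3.3, countByValueAndWindow can be viewed as a reduceByKeyAndWindow over (word, 1) pairs in disguise. A minimal sketch of that equivalent formulation, reusing the words DStream from the program above; Long is used because countByValueAndWindow returns a DStream[(String, Long)]:
// Equivalent formulation: pair each word with 1L, then reduce over the same window
val equivalentCounts = words.map(word => (word, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2), 2)
equivalentCounts.print()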