spark-streaming-[4]-Window Operations

Spark Streaming Programming Guide excerpts, with translation and examples


1. Window Operations Overview


As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters.

  1. window length - The duration of the window (3 in the figure).
  2. sliding interval - The interval at which the window operation is performed (2 in the figure). 

Example:

Say we want to generate word counts over the last 30 seconds of data, every 10 seconds. To do this, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
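
Both parameters must be multiples of the batch interval of the source DStream. Below is a minimal sketch of the generic window transformation itself, assuming an existing StreamingContext ssc (created exactly as in the full examples later in this post):

// Generic window(): keep the last 30 seconds of lines, recomputed every 10 seconds.
// Both durations must be multiples of the batch interval passed to the StreamingContext.
val lines = ssc.socketTextStream("localhost", 9999)
val windowedLines = lines.window(Seconds(30), Seconds(10))
windowedLines.count().print() // number of lines in each 30-second window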

2. Window Operations Functions

Some of the common window operations are as follows:

Transformation / Meaning

window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
countByValueAndWindow(windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

The same table, translated:

window(windowLength, slideInterval): Returns a new DStream computed from windowed batches of the source DStream.
countByWindow(windowLength, slideInterval): Returns the number of elements in the DStream over a sliding window.
reduceByWindow(func, windowLength, slideInterval): Aggregates the elements of the source DStream over a sliding window using func and returns a new single-element DStream.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): For a (K, V) DStream, aggregates the values of each key K with the reduce function func over a sliding window and returns a new DStream.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient implementation of reduceByKeyAndWindow(): the data that newly enters the window is reduced incrementally, while the contribution of the data that leaves the window is "inverse reduced" (subtracted) instead of being recomputed. For example, to compute a WordCount over the 5-second window ending at t+4, take the count of the 5-second window ending at t+3, add the count for (t+3, t+4], and subtract the count for (t-2, t-1]; the counts for the overlap between the two windows are reused, which makes the computation more efficient. See the sketch after this table.
countByValueAndWindow(windowLength, slideInterval, [numTasks]): Computes the frequency of each element of the source DStream over a sliding window and returns a DStream[(K, Long)], where K is the element type and Long is its frequency. As with reduceByKeyAndWindow, the number of reduce tasks can be set with an optional argument.
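
The incremental variant deserves a concrete sketch. It needs a checkpoint directory because the reduced values of the previous window are kept as state (the directory name below is only a placeholder), and it is assumed to run against the same pairs DStream of (word, 1) tuples as the word-count example above:

// Incremental window word count: add the counts of batches entering the window and
// subtract the counts of batches leaving it, instead of recomputing the whole window.
ssc.checkpoint("reduceByKeyAndWindow_checkpoint") // placeholder checkpoint path
val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce: fold new values into the window
  (a: Int, b: Int) => a - b, // inverse reduce: remove values that have left the window
  Seconds(30), Seconds(10))
incrementalCounts.print()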


====================================================================================================================================

3. Examples

3.1 countByWindow

(windowLength, slideInterval)

Return a sliding window count of elements in the stream. 

This simulator listens on the given port; once a client connects, it sends the client one line of data every <millisecond> milliseconds.
Run the simulator below first, then run the CountByWindow program that follows.

The contents of TransfromData.txt are as follows:
hello
hell0
hello
hello
hjw
hjw
hjw
hjw
hello
hello
hello
Program arguments: ./srcFile/TransfromData.txt 9999 1000
package com.dt.spark.main.Streaming

import java.io.PrintWriter
import java.net.ServerSocket
import java.util.Date

import scala.io.Source

/**
  * Created by hjw on 17/5/1.
  */
object StreamingSimulation {
  /*
  Returns a random index in [0, length)
   */
  def index(length:Int) ={
    import java.util.Random
    val rdm = new Random()
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3){
      System.err.println("Usage: <filename><port><millisecond>")
      System.exit(1)
    }

    val filename = args(0)
    val lines = Source.fromFile(filename).getLines().toList
    val fileRow = lines.length

    val listener = new ServerSocket(args(1).toInt)

    // Listen on the given port and accept connections as they arrive
    while(true){
      val socket = listener.accept()
      new Thread(){
        override def run() = {
          println("Got client connect from: " + socket.getInetAddress)
          val out =  new PrintWriter(socket.getOutputStream,true)
          while(true){
            Thread.sleep(args(2).toLong)
            // Send a randomly chosen line to the client
            val content = lines(index(fileRow))

            val now = new Date()

            println("time: " + now.getTime + "    " + content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
 
 
 
 
CountByWindow
The batch interval is 1 second: val ssc = new StreamingContext(conf, Seconds(1))
The window length is 3 seconds and the slide interval is 2 seconds: val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))
Since the simulator emits one word per second, windowedWordCounts should be 3 once the window is full. countByWindow keeps its count incrementally (it uses an inverse reduce under the hood), which is why the code below sets a checkpoint directory with ssc.checkpoint.

package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._

object CountByWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit ={
    // Create a local StreamingContext with two working thread and batch interval of 1 second.
    // The master requires 2 cores to prevent from a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))


    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    ssc.checkpoint("CountByWindow_checkpoint")
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _) // per-batch counts (not used further in this example)


    val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))


    // Print the first ten elements of each RDD generated in this DStream to the console
    windowedWordCounts.print()
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}

Run the simulator first, then run CountByWindow.

StreamingSimulation output:
Got client connect from: /127.0.0.1
time: 1493700495475    hello
time: 1493700496476    hello
time: 1493700497480    hell0
time: 1493700498482    hello
time: 1493700499485    hello
time: 1493700500486    hello
time: 1493700501490    hell0
time: 1493700502491    hjw
time: 1493700503496    hello
time: 1493700504501    hjw

CountByWindow output; the first count is 1 because the window has only just started to fill, after which it stabilizes at 3:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/05/02 12:48:13 INFO Slf4jLogger: Slf4jLogger started
17/05/02 12:48:13 INFO Remoting: Starting remoting
17/05/02 12:48:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.18.136.133:49188]
17/05/02 12:48:14 INFO WriteAheadLogManager  for Thread: Recovered 1 write ahead log files from file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata
-------------------------------------------
Time: 1493700496000 ms
-------------------------------------------
1
17/05/02 12:48:16 INFO WriteAheadLogManager  for Thread: Attempting to clear 1 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000: file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata/log-1493700373007-1493700433007
17/05/02 12:48:16 INFO WriteAheadLogManager  for Thread: Cleared log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000
-------------------------------------------
Time: 1493700498000 ms
-------------------------------------------
3
17/05/02 12:48:18 INFO WriteAheadLogManager  for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700475000: 
-------------------------------------------
Time: 1493700500000 ms
-------------------------------------------
3

====================================================================================================================================

3.2 reduceByKeyAndWindow

(func, windowLength, slideInterval, [numTasks])


Run the simulator first, then run the ReduceByKeyAndWindow program below.

The contents of TransfromData.txt are as follows, so that every line sent is a single "hello":
hello
hello
hello
hello
ReduceByKeyAndWindow is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object ReduceByKeyAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)


  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working thread and batch interval of 1 second.
    // The master requires 2 cores to prevent from a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))


    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    ssc.checkpoint("CountByWindow_checkpoint")
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _) // per-batch counts (not used further in this example)

    val reduceWordCounts = pairs.reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2, Seconds(10), Seconds(2), 2)

    reduceWordCounts.print()


    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
Simulator output:
Got client connect from: /127.0.0.1
time: 1493704449107    hello
time: 1493704450110    hello
time: 1493704451110    hello
time: 1493704452114    hello
time: 1493704453116    hello
time: 1493704454119    hello
time: 1493704455122    hello
time: 1493704456127    hello
time: 1493704457128    hello
time: 1493704458128    hello
time: 1493704459134    hello
time: 1493704460138    hello
time: 1493704461138    hello
time: 1493704462142    hello
time: 1493704463144    hello
ReduceByKeyAndWindow output once it has stabilized:
17/05/02 13:54:17 INFO WriteAheadLogManager  for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493704445000:
-------------------------------------------
Time: 1493704459000 ms
-------------------------------------------
(hello,10)
====================================================================================================================================

3.3  CountByValueAndWindow

Here the simulator sends two "hello"s every second.
The window length of countByValueAndWindow is 10 seconds, so each count reports "hello" 20 times (you can also think of countByValueAndWindow as a reduceByKeyAndWindow over ("hello", 1) pairs in disguise; see the sketch below).
val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)
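
A minimal sketch of that equivalence, assuming the same words DStream as in the full code below (Long is used because countByValueAndWindow returns (value, Long) pairs):

// Roughly what words.countByValueAndWindow(Seconds(10), Seconds(2)) computes:
// pair each word with 1L and sum the pairs over the same window.
val handRolledCounts = words.map(word => (word, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2))
handRolledCounts.print()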
The simulator's TransfromData.txt contains the following (note the double space between the two words):
hello  hello
hello  hello
hello  hello
hello  hello
hello  hello
hello  hello
hello  hello
CountByValueAndWindow is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object CountByValueAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)


  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working thread and batch interval of 1 second.
    // The master requires 2 cores to prevent from a starvation scenario.

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))


    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    ssc.checkpoint("CountByValueAndWindow_checkpoint")
    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)

    countByValueAndWindow.print()


    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
The CountByValueAndWindow output is shown below. The (,10) entry is the empty token created when "hello  hello" is split on a single space (the double space yields an empty string between the two words), which appears 10 times in a 10-second window:
17/05/02 17:47:57 INFO WriteAheadLogManager  for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718447000: 
-------------------------------------------
Time: 1493718479000 ms
-------------------------------------------
(,10)
(hello,20)
17/05/02 17:47:59 INFO WriteAheadLogManager  for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718449000: 
-------------------------------------------
Time: 1493718481000 ms
-------------------------------------------
(,10)
(hello,20)



