Using Time Windows in Spark DataFrames


Starting with Spark 2.0.0, the Spark SQL package has built-in time windows similar to those in Spark Streaming, which make it convenient to reason about data along the time axis.

The Window API in the Spark SQL package

Tumbling window
window(timeColumn: Column, windowDuration: String): Column
Sliding window
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
Notes
  1. timeColumn: the time column must be of timestamp type.

  2. The window intervals (windowDuration, slideDuration) are interval strings, e.g. "1 week", "1 day", "1 hour", "1 minute", "1 second", "1 millisecond", "1 microsecond"; units can be combined, as in "1 hour 30 minutes".

  3. startTime: the offset at which windows start. For example, to start windows at the 15th minute of each hour, set startTime to "15 minutes".
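The rule behind these signatures: a timestamp t falls into every half-open window [s, s + windowDuration) whose start s is aligned to startTime modulo slideDuration (for a tumbling window, slideDuration equals windowDuration). A minimal pure-Python sketch of that rule, illustrative only and not Spark's actual code path:

```python
from datetime import datetime, timedelta

def assign_windows(ts, window, slide, start=timedelta(0)):
    """Return all [start, end) windows that contain ts.

    Window starts are aligned to `start` modulo `slide`; a timestamp
    falls into at most ceil(window / slide) overlapping windows.
    """
    epoch = datetime(1970, 1, 1)
    elapsed = (ts - epoch - start) % slide   # offset into the current slide
    last_start = ts - elapsed                # latest window start <= ts
    windows = []
    s = last_start
    while s + window > ts:                   # walk back through overlaps
        windows.append((s, s + window))
        s -= slide
    return sorted(windows)
```

For example, with a 10-minute window sliding every 5 minutes and startTime "3 minutes", the timestamp 2018-01-01 00:01:03 lands in the two windows [23:53, 00:03) and [23:58, 00:08), matching what the sliding-window query below produces.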

Test data

data/cpu_memory_disk_monitor.csv: each row is a CPU, memory, and disk monitoring sample taken at 5-minute intervals. For example:

eventTime,cpu,memory,disk
2017-12-31 23:21:01,2.87,28.23,58
2017-12-31 23:26:01,4.32,28.47,58
2017-12-31 23:31:02,3.15,28.72,58
2017-12-31 23:36:02,3.62,28.65,58
2017-12-31 23:41:02,3.25,28.70,59
2017-12-31 23:46:02,3.63,28.85,59
2017-12-31 23:51:03,2.76,28.96,59
2017-12-31 23:56:03,3.44,29.07,59
2018-01-01 00:01:03,6.14,41.54,60
2018-01-01 00:06:03,14.84,35.44,59
2018-01-01 00:11:04,20.68,39.99,59
2018-01-01 00:16:04,7.53,33.55,61
2018-01-01 00:21:05,9.27,36.83,59
2018-01-01 00:26:05,4.78,35.79,59
2018-01-01 00:31:05,12.02,36.55,59
2018-01-01 00:36:06,2.23,34.89,59
2018-01-01 00:41:06,4.44,35.29,59
2018-01-01 00:46:06,3.76,62.45,59
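To run the Scala program below as-is, the sample rows above need to exist at data/cpu_memory_disk_monitor.csv. A throwaway script (Python here, used only to materialize the file) can write them out:

```python
import os

# Sample rows exactly as listed above
csv_text = """eventTime,cpu,memory,disk
2017-12-31 23:21:01,2.87,28.23,58
2017-12-31 23:26:01,4.32,28.47,58
2017-12-31 23:31:02,3.15,28.72,58
2017-12-31 23:36:02,3.62,28.65,58
2017-12-31 23:41:02,3.25,28.70,59
2017-12-31 23:46:02,3.63,28.85,59
2017-12-31 23:51:03,2.76,28.96,59
2017-12-31 23:56:03,3.44,29.07,59
2018-01-01 00:01:03,6.14,41.54,60
2018-01-01 00:06:03,14.84,35.44,59
2018-01-01 00:11:04,20.68,39.99,59
2018-01-01 00:16:04,7.53,33.55,61
2018-01-01 00:21:05,9.27,36.83,59
2018-01-01 00:26:05,4.78,35.79,59
2018-01-01 00:31:05,12.02,36.55,59
2018-01-01 00:36:06,2.23,34.89,59
2018-01-01 00:41:06,4.44,35.29,59
2018-01-01 00:46:06,3.76,62.45,59
"""

# Write the file where the Scala program expects it
os.makedirs("data", exist_ok=True)
with open("data/cpu_memory_disk_monitor.csv", "w") as f:
    f.write(csv_text)
```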

Using time windows in a Spark DataFrame

package com.bigData.spark

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SparkSession, functions}

/**
  * Author: Wang Pei
  * License: Copyright(c) Pei.Wang
  * Summary:
  *
  * spark 2.2.2
  *
  */
object SparkDataFrameTimeWindow {
  def main(args: Array[String]): Unit = {

    // Set the log level
    Logger.getLogger("org").setLevel(Level.WARN)

    // Spark session
    val spark = SparkSession.builder().master("local[3]").appName(this.getClass.getSimpleName.replace("$","")).getOrCreate()
    import spark.implicits._

    // Read the time-series data
    val data = spark.read.option("header","true").option("inferSchema","true").csv("data/cpu_memory_disk_monitor.csv")
    //data.printSchema()

    /** 1) Tumbling window */
    /** Compute the average CPU, memory, and disk usage per 10 minutes, rounded to two decimals */
    data
      .filter(functions.year($"eventTime").between(2017, 2018))
      .groupBy(functions.window($"eventTime", "10 minute")) // Time window
      .agg(
        functions.round(functions.avg($"cpu"), 2).as("avgCpu"),
        functions.round(functions.avg($"memory"), 2).as("avgMemory"),
        functions.round(functions.avg($"disk"), 2).as("avgDisk"))
      .sort($"window.start")
      .select($"window.start", $"window.end", $"avgCpu", $"avgMemory", $"avgDisk")
      .limit(5)
      .show(false)


    /** 2) Sliding window */
    /** Starting at the 3rd minute, every 5 minutes compute the average CPU, memory, and disk usage over the last 10 minutes, rounded to two decimals */
    data
      .filter(functions.year($"eventTime").between(2017, 2018))
      .groupBy(functions.window($"eventTime", "10 minute", "5 minute", "3 minute")) // Time window
      .agg(
        functions.round(functions.avg($"cpu"), 2).as("avgCpu"),
        functions.round(functions.avg($"memory"), 2).as("avgMemory"),
        functions.round(functions.avg($"disk"), 2).as("avgDisk"))
      .sort($"window.start")
      .select($"window.start", $"window.end", $"avgCpu", $"avgMemory", $"avgDisk")
      .limit(5)
      .show(false)
  }
}
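As a quick cross-check of the tumbling query, the same 10-minute bucketing can be reproduced over a subset of the sample rows in plain Python. One caveat: Spark SQL's round rounds half up, while Python's round on binary floats does not, so a value such as 3.595 can print as 3.6 in Spark but 3.59 here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# A subset of the sample rows: (eventTime, cpu, memory, disk)
rows = [
    ("2017-12-31 23:21:01", 2.87, 28.23, 58),
    ("2017-12-31 23:26:01", 4.32, 28.47, 58),
    ("2018-01-01 00:01:03", 6.14, 41.54, 60),
    ("2018-01-01 00:06:03", 14.84, 35.44, 59),
]

# Bucket each row by the 10-minute floor of its eventTime
buckets = defaultdict(list)
for ts_str, cpu, mem, disk in rows:
    ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
    start = ts.replace(second=0) - timedelta(minutes=ts.minute % 10)
    buckets[start].append((cpu, mem, disk))

# Average each metric per window, rounded to two decimals
averages = {
    start: tuple(round(sum(col) / len(col), 2) for col in zip(*vals))
    for start, vals in buckets.items()
}
for start in sorted(averages):
    print(start, averages[start])
```

The two rows at 00:01:03 and 00:06:03 fall into the [00:00, 00:10) window, giving an average CPU of (6.14 + 14.84) / 2 = 10.49, which is what the Spark query reports for that window.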

(Output screenshot: sparkDataFrame_useWindow.png)
