Starting with Spark 2.0.0, the Spark SQL package ships with a built-in Time Window similar to the one in Spark Streaming, making it easy to analyze data along the time dimension.
Window API in the Spark SQL package

Tumbling window:

window(timeColumn: Column, windowDuration: String): Column

Sliding window:

window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
Notes

- timeColumn: the time column's schema must be of timestamp type.
- The window intervals (windowDuration, slideDuration) are strings, e.g. "0 years", "0 months", "1 week", "0 days", "0 hours", "0 minutes", "0 seconds", "1 milliseconds", "0 microseconds".
- startTime: the offset at which windows start. For example, to start windows at the 15th minute of each hour, set startTime to "15 minutes".
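The bucketing behind these parameters is simple arithmetic: a timestamp t falls into the window beginning at floor((t - startTime) / slideDuration) * slideDuration + startTime. A minimal pure-Scala sketch of that rule (the function name `windowStart` is illustrative, not part of the Spark API; timestamps are assumed to be non-negative seconds since epoch):

```scala
object WindowBoundaries {
  // Start of the time window (in seconds) that a timestamp falls into,
  // mirroring the bucketing arithmetic described above.
  // ts, slide and startTime are all in seconds; startTime defaults to no offset.
  def windowStart(ts: Long, slide: Long, startTime: Long = 0L): Long =
    ((ts - startTime) / slide) * slide + startTime

  def main(args: Array[String]): Unit = {
    // 10-minute tumbling window: 00:07:30 falls into [00:00, 00:10)
    println(windowStart(450, 600))      // 0
    // With startTime = 3 minutes, windows are [00:03, 00:13), [00:13, 00:23), ...
    println(windowStart(450, 600, 180)) // 180
  }
}
```

This is also why a startTime of "15 minutes" makes hourly windows begin at hh:15 rather than hh:00.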
Test Data

data/cpu_memory_disk_monitor.csv: each row is a CPU, memory and disk measurement taken every 5 minutes. It looks like this:
eventTime,cpu,memory,disk
2017-12-31 23:21:01,2.87,28.23,58
2017-12-31 23:26:01,4.32,28.47,58
2017-12-31 23:31:02,3.15,28.72,58
2017-12-31 23:36:02,3.62,28.65,58
2017-12-31 23:41:02,3.25,28.70,59
2017-12-31 23:46:02,3.63,28.85,59
2017-12-31 23:51:03,2.76,28.96,59
2017-12-31 23:56:03,3.44,29.07,59
2018-01-01 00:01:03,6.14,41.54,60
2018-01-01 00:06:03,14.84,35.44,59
2018-01-01 00:11:04,20.68,39.99,59
2018-01-01 00:16:04,7.53,33.55,61
2018-01-01 00:21:05,9.27,36.83,59
2018-01-01 00:26:05,4.78,35.79,59
2018-01-01 00:31:05,12.02,36.55,59
2018-01-01 00:36:06,2.23,34.89,59
2018-01-01 00:41:06,4.44,35.29,59
2018-01-01 00:46:06,3.76,62.45,59
Using Time Window on a Spark DataFrame
package com.bigData.spark

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SparkSession, functions}

/**
  * Author: Wang Pei
  * License: Copyright(c) Pei.Wang
  * Summary: time windows on a Spark DataFrame
  *
  * spark 2.2.2
  */
object SparkDataFrameTimeWindow {

  def main(args: Array[String]): Unit = {

    // Set the log level
    Logger.getLogger("org").setLevel(Level.WARN)

    // Spark session
    val spark = SparkSession.builder()
      .master("local[3]")
      .appName(this.getClass.getSimpleName.replace("$", ""))
      .getOrCreate()
    import spark.implicits._

    // Read the time-series data
    val data = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/cpu_memory_disk_monitor.csv")
    //data.printSchema()

    /** 1) Tumbling window */
    /** Compute: average CPU, memory and disk per 10 minutes, rounded to two decimals */
    data
      .filter(functions.year($"eventTime").between(2017, 2018))
      .groupBy(functions.window($"eventTime", "10 minutes")) // Time Window
      .agg(
        functions.round(functions.avg($"cpu"), 2).as("avgCpu"),
        functions.round(functions.avg($"memory"), 2).as("avgMemory"),
        functions.round(functions.avg($"disk"), 2).as("avgDisk"))
      .sort($"window.start")
      .select($"window.start", $"window.end", $"avgCpu", $"avgMemory", $"avgDisk")
      .limit(5)
      .show(false)

    /** 2) Sliding window */
    /** Compute: starting at minute 3, every 5 minutes compute the average CPU, memory and disk over the last 10 minutes, rounded to two decimals */
    data
      .filter(functions.year($"eventTime").between(2017, 2018))
      .groupBy(functions.window($"eventTime", "10 minutes", "5 minutes", "3 minutes")) // Time Window
      .agg(
        functions.round(functions.avg($"cpu"), 2).as("avgCpu"),
        functions.round(functions.avg($"memory"), 2).as("avgMemory"),
        functions.round(functions.avg($"disk"), 2).as("avgDisk"))
      .sort($"window.start")
      .select($"window.start", $"window.end", $"avgCpu", $"avgMemory", $"avgDisk")
      .limit(5)
      .show(false)
  }
}
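Note that with a sliding window, a single row is counted in more than one group: with a 10-minute window sliding every 5 minutes, every row falls into windowDuration / slideDuration = 2 overlapping windows. A pure-Scala sketch of that assignment, using second-granularity epoch timestamps (the helper name `windowsFor` is hypothetical, not a Spark API):

```scala
object SlidingWindows {
  // List the [start, end) intervals of every sliding window that contains ts.
  // window, slide and startTime are in seconds; each ts lands in window/slide windows.
  def windowsFor(ts: Long, window: Long, slide: Long, startTime: Long = 0L): Seq[(Long, Long)] = {
    // Most recent slide boundary at or before ts
    val lastStart = ((ts - startTime) / slide) * slide + startTime
    // Walk back over earlier slide boundaries whose window still covers ts
    (0L until window by slide)
      .map(offset => lastStart - offset)
      .filter(start => ts >= start && ts < start + window)
      .sorted
      .map(start => (start, start + window))
  }

  def main(args: Array[String]): Unit = {
    // ts = 00:07:00 with 10-minute windows sliding every 5 minutes is
    // covered by [00:00, 00:10) and [00:05, 00:15)
    println(windowsFor(420, 600, 300)) // Vector((0,600), (300,900))
  }
}
```

This is why the sliding-window averages above overlap: each 5-minute sample contributes to two consecutive 10-minute windows.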