PysparkNote103---window滑窗

最新推荐文章于 2023-05-22 17:18:05 发布

维格堂406小队

最新推荐文章于 2023-05-22 17:18:05 发布

阅读量650

点赞数

分类专栏： ★★★PySpark # ★★Pyspark基础文章标签： spark 大数据

本文链接：https://blog.csdn.net/wendaomudong_l2d4/article/details/126145262

版权

★★★PySpark 同时被 2 个专栏收录

13 篇文章 0 订阅

订阅专栏

★★Pyspark基础

13 篇文章 0 订阅

订阅专栏

Intro

window滑窗函数的使用，主要是应用在一些统计场景。举例：统计下面每个人每15分钟点击次数

数据构造

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def get_or_create(app_name):
    spark = SparkSession \
        .builder \
        .appName(app_name) \
        .config("spark.driver.maxResultSize", "10g") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .config("spark.dynamicAllocation.enabled", "false") \
        .config("spark.sql.crossJoin.enabled", "true") \
        .config("spark.kryoserializer.buffer.max", "512m") \
        .getOrCreate()
    spark.sparkContext.setLogLevel('ERROR')
    return spark

import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'A', 'B', 'B'], 'click': [1, 1, 1, 1, 1],
                  'click_time': ['2022-08-01 11:59:59', '2022-08-01 12:02:00', '2022-08-01 12:03:59',
                                 '2022-08-01 12:00:10', '2022-08-01 12:02:10']})
df

	name	click	click_time
0	A	1	2022-08-01 11:59:59
1	A	1	2022-08-01 12:02:00
2	A	1	2022-08-01 12:03:59
3	B	1	2022-08-01 12:00:10
4	B	1	2022-08-01 12:02:10

spark = get_or_create('spark')
df_spark = spark.createDataFrame(df)
df_spark.show(truncate=False)

D:\code\spark\python\pyspark\sql\session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)


+----+-----+-------------------+
|name|click|click_time         |
+----+-----+-------------------+
|A   |1    |2022-08-01 11:59:59|
|A   |1    |2022-08-01 12:02:00|
|A   |1    |2022-08-01 12:03:59|
|B   |1    |2022-08-01 12:00:10|
|B   |1    |2022-08-01 12:02:10|
+----+-----+-------------------+

df_spark.printSchema()

root
 |-- name: string (nullable = true)
 |-- click: long (nullable = true)
 |-- click_time: string (nullable = true)

滑窗统计

滑窗左闭右开
参数介绍：

timeColumn: The time column must be of :class:pyspark.sql.types.TimestampType 滑窗统计的时间字段有类型要求
windowDuration: Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid
interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’.窗宽
slideDuration: If the slideDuration is not provided, the windows will be tumbling windows.滑窗长度，如果None就是滚动滑窗
startTime: The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start
window intervals. For example, in order to have hourly tumbling windows that start 15 minutes
past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as 15 minutes.

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time'))).show()

+----+-----+-------------------+-------------------+
|name|click|         click_time|    click_timestamp|
+----+-----+-------------------+-------------------+
|   A|    1|2022-08-01 11:59:59|2022-08-01 11:59:59|
|   A|    1|2022-08-01 12:02:00|2022-08-01 12:02:00|
|   A|    1|2022-08-01 12:03:59|2022-08-01 12:03:59|
|   B|    1|2022-08-01 12:00:10|2022-08-01 12:00:10|
|   B|    1|2022-08-01 12:02:10|2022-08-01 12:02:10|
+----+-----+-------------------+-------------------+

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time'))).printSchema()

root
 |-- name: string (nullable = true)
 |-- click: long (nullable = true)
 |-- click_time: string (nullable = true)
 |-- click_timestamp: timestamp (nullable = true)

5分钟窗宽，每分钟滑一次

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time')))\
.groupby(['name',F.window(timeColumn='click_timestamp', windowDuration='5 minute', slideDuration='1 minute', startTime=None)])\
.agg(F.count('click').alias('cnt'),F.countDistinct('click').alias('dnt'))\
.withColumn('window_start',F.col('window.start'))\
.withColumn('window_end',F.col('window.end'))\
.orderBy(F.col('name'),F.col('window_start').asc()).show(truncate=False)

+----+------------------------------------------+---+---+-------------------+-------------------+
|name|window                                    |cnt|dnt|window_start       |window_end         |
+----+------------------------------------------+---+---+-------------------+-------------------+
|A   |[2022-08-01 11:55:00, 2022-08-01 12:00:00]|1  |1  |2022-08-01 11:55:00|2022-08-01 12:00:00|
|A   |[2022-08-01 11:56:00, 2022-08-01 12:01:00]|1  |1  |2022-08-01 11:56:00|2022-08-01 12:01:00|
|A   |[2022-08-01 11:57:00, 2022-08-01 12:02:00]|1  |1  |2022-08-01 11:57:00|2022-08-01 12:02:00|
|A   |[2022-08-01 11:58:00, 2022-08-01 12:03:00]|2  |1  |2022-08-01 11:58:00|2022-08-01 12:03:00|
|A   |[2022-08-01 11:59:00, 2022-08-01 12:04:00]|3  |1  |2022-08-01 11:59:00|2022-08-01 12:04:00|
|A   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
|A   |[2022-08-01 12:01:00, 2022-08-01 12:06:00]|2  |1  |2022-08-01 12:01:00|2022-08-01 12:06:00|
|A   |[2022-08-01 12:02:00, 2022-08-01 12:07:00]|2  |1  |2022-08-01 12:02:00|2022-08-01 12:07:00|
|A   |[2022-08-01 12:03:00, 2022-08-01 12:08:00]|1  |1  |2022-08-01 12:03:00|2022-08-01 12:08:00|
|B   |[2022-08-01 11:56:00, 2022-08-01 12:01:00]|1  |1  |2022-08-01 11:56:00|2022-08-01 12:01:00|
|B   |[2022-08-01 11:57:00, 2022-08-01 12:02:00]|1  |1  |2022-08-01 11:57:00|2022-08-01 12:02:00|
|B   |[2022-08-01 11:58:00, 2022-08-01 12:03:00]|2  |1  |2022-08-01 11:58:00|2022-08-01 12:03:00|
|B   |[2022-08-01 11:59:00, 2022-08-01 12:04:00]|2  |1  |2022-08-01 11:59:00|2022-08-01 12:04:00|
|B   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
|B   |[2022-08-01 12:01:00, 2022-08-01 12:06:00]|1  |1  |2022-08-01 12:01:00|2022-08-01 12:06:00|
|B   |[2022-08-01 12:02:00, 2022-08-01 12:07:00]|1  |1  |2022-08-01 12:02:00|2022-08-01 12:07:00|
+----+------------------------------------------+---+---+-------------------+-------------------+

5分钟窗宽，滚动滑窗

这里的滚动滑窗相当于滑动距离等于与窗宽的滑动滑窗

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time')))\
.groupby(['name',F.window(timeColumn='click_timestamp', windowDuration='5 minute', slideDuration='5 minute', startTime=None)])\
.agg(F.count('click').alias('cnt'),F.countDistinct('click').alias('dnt'))\
.withColumn('window_start',F.col('window.start'))\
.withColumn('window_end',F.col('window.end'))\
.orderBy(F.col('name'),F.col('window_start').asc()).show(truncate=False)

+----+------------------------------------------+---+---+-------------------+-------------------+
|name|window                                    |cnt|dnt|window_start       |window_end         |
+----+------------------------------------------+---+---+-------------------+-------------------+
|A   |[2022-08-01 11:55:00, 2022-08-01 12:00:00]|1  |1  |2022-08-01 11:55:00|2022-08-01 12:00:00|
|A   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
|B   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
+----+------------------------------------------+---+---+-------------------+-------------------+

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time')))\
.groupby(['name',F.window(timeColumn='click_timestamp', windowDuration='5 minute', slideDuration=None, startTime=None)])\
.agg(F.count('click').alias('cnt'),F.countDistinct('click').alias('dnt'))\
.withColumn('window_start',F.col('window.start'))\
.withColumn('window_end',F.col('window.end'))\
.orderBy(F.col('name'),F.col('window_start').asc()).show(truncate=False)

+----+------------------------------------------+---+---+-------------------+-------------------+
|name|window                                    |cnt|dnt|window_start       |window_end         |
+----+------------------------------------------+---+---+-------------------+-------------------+
|A   |[2022-08-01 11:55:00, 2022-08-01 12:00:00]|1  |1  |2022-08-01 11:55:00|2022-08-01 12:00:00|
|A   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
|B   |[2022-08-01 12:00:00, 2022-08-01 12:05:00]|2  |1  |2022-08-01 12:00:00|2022-08-01 12:05:00|
+----+------------------------------------------+---+---+-------------------+-------------------+

startTime偏移量的使用

这里的窗宽是从0分0秒开始计算，如果头铁，起点从1分开始算，e.g.统计1-6,6-11这样的窗宽咋整

df_spark.withColumn('click_timestamp',F.to_timestamp(F.col('click_time')))\
.groupby(['name',F.window(timeColumn='click_timestamp', windowDuration='5 minute', slideDuration='5 minute', startTime='1 minutes')])\
.agg(F.count('click').alias('cnt'),F.countDistinct('click').alias('dnt'))\
.withColumn('window_start',F.col('window.start'))\
.withColumn('window_end',F.col('window.end'))\
.orderBy(F.col('name'),F.col('window_start').asc()).show(truncate=False)

+----+------------------------------------------+---+---+-------------------+-------------------+
|name|window                                    |cnt|dnt|window_start       |window_end         |
+----+------------------------------------------+---+---+-------------------+-------------------+
|A   |[2022-08-01 11:56:00, 2022-08-01 12:01:00]|1  |1  |2022-08-01 11:56:00|2022-08-01 12:01:00|
|A   |[2022-08-01 12:01:00, 2022-08-01 12:06:00]|2  |1  |2022-08-01 12:01:00|2022-08-01 12:06:00|
|B   |[2022-08-01 11:56:00, 2022-08-01 12:01:00]|1  |1  |2022-08-01 11:56:00|2022-08-01 12:01:00|
|B   |[2022-08-01 12:01:00, 2022-08-01 12:06:00]|1  |1  |2022-08-01 12:01:00|2022-08-01 12:06:00|
+----+------------------------------------------+---+---+-------------------+-------------------+

简单的用法介绍完毕

2022-08-03 于南京市江宁区九龙湖