PySpark Window Functions

NoOne-csdn

Article link: https://blog.csdn.net/weixin_40161254/article/details/107182263

Reference: Introducing Window Functions in Spark SQL

Window functions

At its core, a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to express various data processing tasks that are hard (if not impossible) to be expressed without window functions in a concise way.
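My understanding: a window function partitions the rows by one or more columns,
then applies a function over a specified range of rows (the frame) within each partition. Because every row gets its own frame, this is very powerful.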

Key concepts

Partitioning: partitionBy
Ordering: orderBy
Frame selection: rangeBetween, rowsBetween
Demo

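from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
df = spark.createDataFrame(tup, ["id", "category"])
df.show()
# Frame: per category, ordered by id descending, a value range starting at the current row
window = Window.partitionBy("category").orderBy(df.id.desc()).rangeBetween(Window.currentRow, 1)
df.withColumn("sum", F.sum("id").over(window)).show()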
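
To make the demo's frame concrete, here is a plain-Python sketch that needs no Spark session. It assumes that under a descending sort Spark treats "1 following" as "current id minus 1", so each frame holds the ids in [current id - 1, current id]. The helper name range_sum_desc is hypothetical and is not part of PySpark.

```python
from collections import defaultdict

rows = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]

def range_sum_desc(rows, following):
    """Sum the ids whose value lies in [current id - following, current id], per category."""
    by_category = defaultdict(list)
    for id_, category in rows:
        by_category[category].append(id_)

    result = []
    for category in sorted(by_category):
        # Mirrors orderBy(df.id.desc()) inside one partition.
        ids = sorted(by_category[category], reverse=True)
        for current in ids:
            # Assumption: with descending order the "following" side has smaller values.
            frame = [v for v in ids if current - following <= v <= current]
            result.append((current, category, sum(frame)))
    return result

demo = range_sum_desc(rows, following=1)
for row in demo:
    print(row)
```

Under that assumption, the two rows with id 1 in category "a" share the same frame and therefore the same sum, which is the main difference between a range frame and a row frame.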

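Frame selection

Frame boundaries are relative to the current row.

Row-based frames
rowsBetween(x, y)
Window.unboundedPreceding: every row before the current row, back to the start of the partition
Window.currentRow: the current row
Window.unboundedFollowing: every row after the current row, up to the end of the partition

rowsBetween(-1, 1)
The function is applied from the row before the current row through the row after it.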
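
As a rough illustration of rowsBetween(-1, 1), the sketch below emulates a sum over that frame on a single, already ordered partition in plain Python. The helper rows_between_sum is hypothetical. It only covers frames that always contain at least one row.

```python
def rows_between_sum(values, start, end):
    """Sum values[i + start .. i + end] for each row i, clamped to the partition edges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)              # the first row has no previous row
        hi = min(len(values) - 1, i + end)  # the last row has no next row
        out.append(sum(values[lo:hi + 1]))
    return out

values = [10, 20, 30, 40]
rolling = rows_between_sum(values, -1, 1)
print(rolling)
```

The first and last rows get a shorter frame because the partition boundary cuts it off, so their sums cover two rows instead of three.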

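Value-based frames: rangeBetween(x, y)
Boundaries are offsets from the value of the ordering column in the current row.
rangeBetween(-20, 50)
For example, if the current value is 18,
the frame covers the values in [-2, 68].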
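
The arithmetic behind a range frame can be sketched in a few lines of plain Python, assuming an ascending sort on the ordering column. Both helper names are hypothetical. For a current value of 18, offsets (-20, 50) give the bounds [-2, 68], while offsets (20, 50) would give [38, 68].

```python
def range_frame_bounds(current_value, start, end):
    """Return the inclusive value bounds of the frame for an ascending order."""
    return (current_value + start, current_value + end)

def rows_in_range(values, current_value, start, end):
    """Pick every value that falls inside the frame of the current row."""
    lo, hi = range_frame_bounds(current_value, start, end)
    return [v for v in values if lo <= v <= hi]

print(range_frame_bounds(18, -20, 50))
print(range_frame_bounds(18, 20, 50))
print(rows_in_range([-5, 0, 18, 30, 68, 70], 18, -20, 50))
```

Unlike a row-based frame, the number of rows selected here depends on the data: any row whose value falls inside the bounds is included, however many there are.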

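Main functions

API	Purpose
rank	Rank within the partition; tied rows share a rank and the ranks after a tie are skipped
dense_rank	Rank within the partition; tied rows share a rank and no ranks are skipped
row_number	Sequential number of the row within the partition, starting at 1
min	Minimum value over the frame
max	Maximum value over the frame
sum	Sum of the values over the frame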
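
The three ranking functions differ only in how they handle ties. The plain-Python sketch below emulates them on one partition sorted in ascending order. The helper name ranking is hypothetical; in PySpark the corresponding calls would be F.row_number(), F.rank() and F.dense_rank() applied with .over(...) on an ordered window.

```python
def ranking(values):
    """Return (value, row_number, rank, dense_rank) tuples for values sorted ascending."""
    out = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(sorted(values), start=1):
        if value != previous:
            rank = row_number  # rank jumps to the row position, leaving gaps after ties
            dense += 1         # dense_rank only ever advances by one
            previous = value
        out.append((value, row_number, rank, dense))
    return out

table = ranking([30, 10, 20, 20])
for row in table:
    print(row)
```

With the two tied values of 20, row_number still counts 2 and 3, rank gives both rows 2 and then skips to 4, and dense_rank gives both rows 2 and continues with 3.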
