The definition of a window function, quoting an expert: a window function calculates a return value for every input row of a table based on a group of rows. How window functions differ from other functions:
Ordinary function: applied to each record, computes a new column (row count unchanged);
Aggregate function: applied to a group of records (all data split into groups in some way), computes one aggregate value per group (row count shrinks);
Window function: applied to each record; for each record it designates a set of related records and computes a value over them (row count unchanged).
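As a minimal pure-Python sketch (not Spark) of the three behaviors above, using a hypothetical list of salaries:

```python
# Pure-Python illustration (not Spark) of the three kinds of behavior.
rows = [3000, 4200, 4500]

# Ordinary function: one output per input row, row count unchanged.
doubled = [s * 2 for s in rows]

# Aggregate function: one output per group, row count shrinks.
total = sum(rows)

# Window function: one output per input row, but each output is
# computed over a set of related rows (here: a running sum).
running = [sum(rows[:i + 1]) for i in range(len(rows))]

print(doubled)  # [6000, 8400, 9000]
print(total)    # 11700
print(running)  # [3000, 7200, 11700]
```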
Window function syntax: function_name(args) OVER (PARTITION BY clause ORDER BY clause ROWS/RANGE clause)
function_name: the window function being applied (e.g. row_number, rank, sum);
OVER: keyword marking this as a window function rather than an ordinary aggregate;
Clauses inside OVER:
PARTITION BY: partitioning (grouping) columns
ORDER BY: ordering columns
ROWS/RANGE frame clause: controls the window's size and boundaries; there are two kinds (ROWS, RANGE)
ROWS: physical window; rows are selected by their position (index) after sorting
RANGE: logical window; rows are selected by value
There are three main kinds of window functions:
ranking functions
analytic functions
aggregate functions
Data loading
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

# `spark` already exists in the pyspark shell; create it when running as a script
spark = SparkSession.builder.getOrCreate()

schema = (StructType()
          .add('name', StringType(), True)
          .add('department', StringType(), True)
          .add('salary', IntegerType(), True))
df = spark.createDataFrame([
("Tom", "Sales", 4500),
("Georgi", "Sales", 4200),
("Kyoichi", "Sales", 3000),
("Berni", "Sales", 4700),
("Guoxiang", "Sales", 4200),
("Parto", "Finance", 2700),
("Anneke", "Finance", 3300),
("Sumant", "Finance", 3900),
("Jeff", "Marketing", 3100),
("Patricio", "Marketing", 2500)
], schema=schema)
df.createOrReplaceTempView('salary')
df.show()
+--------+----------+------+
| name|department|salary|
+--------+----------+------+
| Tom| Sales| 4500|
| Georgi| Sales| 4200|
| Kyoichi| Sales| 3000|
| Berni| Sales| 4700|
|Guoxiang| Sales| 4200|
| Parto| Finance| 2700|
| Anneke| Finance| 3300|
| Sumant| Finance| 3900|
| Jeff| Marketing| 3100|
|Patricio| Marketing| 2500|
+--------+----------+------+
ranking functions

| SQL | DataFrame | Description |
|---|---|---|
| row_number | rowNumber | Unique sequence number from 1 to n within the partition |
| rank | rank | Ranking in which tied values share the same rank; the next distinct value skips the ranks consumed by the ties (1, 2, 2, 4) |
| dense_rank | denseRank | Like rank, but with no gaps after ties (1, 2, 2, 3) |
| percent_rank | percentRank | Computed as (rank in partition - 1) / (rows in partition - 1); 0 when the partition has a single row |
| ntile | ntile | After sorting the partition, splits it into n buckets and returns the current row's bucket number (starting from 1) |

spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,rank() over(partition by department order by salary) as rank
,dense_rank() over(partition by department order by salary) as dense_rank
,percent_rank() over(partition by department order by salary) as percent_rank
,ntile(2) over(partition by department order by salary) as ntile
FROM salary
""").toPandas()
|   | name | department | salary | index | rank | dense_rank | percent_rank | ntile |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricio | Marketing | 2500 | 1 | 1 | 1 | 0.00 | 1 |
| 1 | Jeff | Marketing | 3100 | 2 | 2 | 2 | 1.00 | 2 |
| 2 | Kyoichi | Sales | 3000 | 1 | 1 | 1 | 0.00 | 1 |
| 3 | Georgi | Sales | 4200 | 2 | 2 | 2 | 0.25 | 1 |
| 4 | Guoxiang | Sales | 4200 | 3 | 2 | 2 | 0.25 | 1 |
| 5 | Tom | Sales | 4500 | 4 | 4 | 3 | 0.75 | 2 |
| 6 | Berni | Sales | 4700 | 5 | 5 | 4 | 1.00 | 2 |
| 7 | Parto | Finance | 2700 | 1 | 1 | 1 | 0.00 | 1 |
| 8 | Anneke | Finance | 3300 | 2 | 2 | 2 | 0.50 | 1 |
| 9 | Sumant | Finance | 3900 | 3 | 3 | 3 | 1.00 | 2 |
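The tie handling described above can be reproduced with a small pure-Python sketch (not Spark) over the Sales partition's sorted salaries:

```python
# Pure-Python sketch of the ranking semantics for the Sales partition;
# salaries are already sorted ascending within the partition.
salaries = [3000, 4200, 4200, 4500, 4700]
n = len(salaries)

row_number = list(range(1, n + 1))
# rank: ties share a rank; the next distinct value skips ahead.
rank = [1 + sum(1 for s in salaries if s < cur) for cur in salaries]
# dense_rank: ties share a rank, no gaps afterwards.
distinct = sorted(set(salaries))
dense_rank = [distinct.index(cur) + 1 for cur in salaries]
# percent_rank = (rank - 1) / (rows in partition - 1)
percent_rank = [(r - 1) / (n - 1) for r in rank]

print(rank)          # [1, 2, 2, 4, 5]
print(dense_rank)    # [1, 2, 2, 3, 4]
print(percent_rank)  # [0.0, 0.25, 0.25, 0.75, 1.0]
```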
analytic functions

| SQL | DataFrame | Description |
|---|---|---|
| cume_dist | cumeDist | Computed as (rows in partition with value <= current row's value) / (total rows in partition) |
| lag | lag | lag(input[, offset[, default]]): value of input at offset rows before the current row; default (NULL if unspecified) when no such row exists |
| lead | lead | Opposite of lag: value of input at offset rows after the current row |

spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,cume_dist() over(partition by department order by salary) as cume_dist
,lag(salary, 2) over(partition by department order by salary) as lag
,lead(salary, 2) over(partition by department order by salary) as lead
FROM salary
""").toPandas()

Note that the column name must be unquoted: lag('salary', 2) would lag the string literal 'salary' rather than the column.

|   | name | department | salary | index | cume_dist | lag | lead |
|---|---|---|---|---|---|---|---|
| 0 | Patricio | Marketing | 2500 | 1 | 0.500000 | None | None |
| 1 | Jeff | Marketing | 3100 | 2 | 1.000000 | None | None |
| 2 | Kyoichi | Sales | 3000 | 1 | 0.200000 | None | 4200 |
| 3 | Georgi | Sales | 4200 | 2 | 0.600000 | None | 4500 |
| 4 | Guoxiang | Sales | 4200 | 3 | 0.600000 | 3000 | 4700 |
| 5 | Tom | Sales | 4500 | 4 | 0.800000 | 4200 | None |
| 6 | Berni | Sales | 4700 | 5 | 1.000000 | 4200 | None |
| 7 | Parto | Finance | 2700 | 1 | 0.333333 | None | 3900 |
| 8 | Anneke | Finance | 3300 | 2 | 0.666667 | None | None |
| 9 | Sumant | Finance | 3900 | 3 | 1.000000 | 2700 | None |
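A pure-Python sketch (not Spark) of the cume_dist, lag and lead semantics for the Sales partition above:

```python
# Pure-Python sketch for the Sales partition, sorted ascending by salary.
salaries = [3000, 4200, 4200, 4500, 4700]
n = len(salaries)

# cume_dist: rows with value <= current value, divided by partition size
cume_dist = [sum(1 for s in salaries if s <= cur) / n for cur in salaries]
# lag(salary, 2): value 2 rows earlier, None when out of range
lag2 = [salaries[i - 2] if i >= 2 else None for i in range(n)]
# lead(salary, 2): value 2 rows later, None when out of range
lead2 = [salaries[i + 2] if i + 2 < n else None for i in range(n)]

print(cume_dist)  # [0.2, 0.6, 0.6, 0.8, 1.0]
print(lag2)       # [None, None, 3000, 4200, 4200]
print(lead2)      # [4200, 4500, 4700, None, None]
```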
aggregate functions

These simply apply the ordinary aggregate functions within a window.

| SQL | Description |
|---|---|
| avg | average |
| sum | sum |
| min | minimum |
| max | maximum |

spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,sum(salary) over(partition by department order by salary) as sum
,avg(salary) over(partition by department order by salary) as avg
,min(salary) over(partition by department order by salary) as min
,max(salary) over(partition by department order by salary) as max
FROM salary
""").toPandas()
|   | name | department | salary | index | sum | avg | min | max |
|---|---|---|---|---|---|---|---|---|
| 0 | Patricio | Marketing | 2500 | 1 | 2500 | 2500.0 | 2500 | 2500 |
| 1 | Jeff | Marketing | 3100 | 2 | 5600 | 2800.0 | 2500 | 3100 |
| 2 | Kyoichi | Sales | 3000 | 1 | 3000 | 3000.0 | 3000 | 3000 |
| 3 | Georgi | Sales | 4200 | 2 | 11400 | 3800.0 | 3000 | 4200 |
| 4 | Guoxiang | Sales | 4200 | 3 | 11400 | 3800.0 | 3000 | 4200 |
| 5 | Tom | Sales | 4500 | 4 | 15900 | 3975.0 | 3000 | 4500 |
| 6 | Berni | Sales | 4700 | 5 | 20600 | 4120.0 | 3000 | 4700 |
| 7 | Parto | Finance | 2700 | 1 | 2700 | 2700.0 | 2700 | 2700 |
| 8 | Anneke | Finance | 3300 | 2 | 6000 | 3000.0 | 2700 | 3300 |
| 9 | Sumant | Finance | 3900 | 3 | 9900 | 3300.0 | 2700 | 3900 |
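A pure-Python sketch (not Spark) of why Georgi and Guoxiang share the same running sum above: with an ORDER BY and no explicit frame, Spark SQL defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so peer rows (equal salaries) fall into the same frame:

```python
# Sales partition, sorted ascending by salary.
salaries = [3000, 4200, 4200, 4500, 4700]

# RANGE frame (the default with ORDER BY): sum over all rows whose
# value is <= the current row's value, so ties are summed together.
range_sum = [sum(s for s in salaries if s <= cur) for cur in salaries]
# ROWS frame: sum over positions up to and including the current row,
# so the two 4200 rows would get different running sums.
rows_sum = [sum(salaries[:i + 1]) for i in range(len(salaries))]

print(range_sum)  # [3000, 11400, 11400, 15900, 20600]
print(rows_sum)   # [3000, 7200, 11400, 15900, 20600]
```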
Window frame clause
ROWS/RANGE frame clause: controls the window's size and boundaries; there are two kinds (ROWS, RANGE)
ROWS: physical window; rows are selected by their position (index) after sorting
RANGE: logical window; rows are selected by value
Syntax: OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end)
The boundary can be one of the following five:
CURRENT ROW: the current row
UNBOUNDED PRECEDING: the first row of the partition
UNBOUNDED FOLLOWING: the last row of the partition
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
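As a pure-Python sketch (not Spark) of how a ROWS frame such as ROWS BETWEEN 1 PRECEDING AND CURRENT ROW bounds the computation:

```python
# Sales partition, sorted ascending by salary.
salaries = [3000, 4200, 4200, 4500, 4700]

# For each row, sum over the frame [1 PRECEDING, CURRENT ROW]:
# the previous row (if any) plus the current row.
frame_sum = [sum(salaries[max(0, i - 1):i + 1]) for i in range(len(salaries))]
print(frame_sum)  # [3000, 7200, 8400, 8700, 9200]
```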
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,row_number() over(partition by department order by salary rows between UNBOUNDED PRECEDING and CURRENT ROW) as index1
FROM salary
""").toPandas()
|   | name | department | salary | index | index1 |
|---|---|---|---|---|---|
| 0 | Patricio | Marketing | 2500 | 1 | 1 |
| 1 | Jeff | Marketing | 3100 | 2 | 2 |
| 2 | Kyoichi | Sales | 3000 | 1 | 1 |
| 3 | Georgi | Sales | 4200 | 2 | 2 |
| 4 | Guoxiang | Sales | 4200 | 3 | 3 |
| 5 | Tom | Sales | 4500 | 4 | 4 |
| 6 | Berni | Sales | 4700 | 5 | 5 |
| 7 | Parto | Finance | 2700 | 1 | 1 |
| 8 | Anneke | Finance | 3300 | 2 | 2 |
| 9 | Sumant | Finance | 3900 | 3 | 3 |
Combined usage
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,salary - (min(salary) over(partition by department order by salary)) as salary_diff
FROM salary
""").toPandas()
|   | name | department | salary | index | salary_diff |
|---|---|---|---|---|---|
| 0 | Patricio | Marketing | 2500 | 1 | 0 |
| 1 | Jeff | Marketing | 3100 | 2 | 600 |
| 2 | Kyoichi | Sales | 3000 | 1 | 0 |
| 3 | Georgi | Sales | 4200 | 2 | 1200 |
| 4 | Guoxiang | Sales | 4200 | 3 | 1200 |
| 5 | Tom | Sales | 4500 | 4 | 1500 |
| 6 | Berni | Sales | 4700 | 5 | 1700 |
| 7 | Parto | Finance | 2700 | 1 | 0 |
| 8 | Anneke | Finance | 3300 | 2 | 600 |
| 9 | Sumant | Finance | 3900 | 3 | 1200 |
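The salary_diff column can be mimicked in pure Python (not Spark) for the Sales partition, subtracting the running minimum from each salary:

```python
# Sales partition, sorted ascending by salary.
salaries = [3000, 4200, 4200, 4500, 4700]

# salary - min(salary) over the frame up to the current row; since the
# partition is sorted ascending, the running minimum is the first value.
diff = [cur - min(salaries[:i + 1]) for i, cur in enumerate(salaries)]
print(diff)  # [0, 1200, 1200, 1500, 1700]
```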
Reference
小哲嗨数
Source: https://blog.csdn.net/ganzheyu/article/details/104590954