SparkSQL | Window Functions

This article introduces window functions in SparkSQL: what they are and how they differ from ordinary functions and aggregate functions. Examples demonstrate the ranking, analytic, and aggregate window functions, the difference between ROWS and RANGE window clauses, how to apply aggregates such as avg and sum over a window, and a few mixed applications.

To borrow a well-known definition of a window function: "a window function calculates a return value for every input row of a table based on a group of rows." How window functions differ from other functions:

Ordinary function: operates on each record and computes a new column (row count unchanged);

Aggregate function: operates on a group of records (all rows are partitioned into groups by some key) and computes one aggregated value per group (row count shrinks);

Window function: operates on each record, and for each record selects a set of related records over which a value is computed (row count unchanged).
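A minimal pure-Python sketch of the three behaviours, using a tiny hypothetical salary list rather than Spark:

```python
from collections import defaultdict

# Hypothetical (department, salary) rows for illustration only.
rows = [("Sales", 4500), ("Sales", 3000), ("Finance", 3300)]

# Ordinary function: one output per input row (row count unchanged).
with_bonus = [(dept, sal, sal * 1.1) for dept, sal in rows]
assert len(with_bonus) == len(rows)

# Aggregate function: one output per group (row count shrinks).
groups = defaultdict(list)
for dept, sal in rows:
    groups[dept].append(sal)
totals = {dept: sum(sals) for dept, sals in groups.items()}
assert len(totals) == 2  # one value per department

# Window function: one output per input row, but each value is computed
# over the row's group (row count unchanged).
windowed = [(dept, sal, sum(groups[dept])) for dept, sal in rows]
assert len(windowed) == len(rows)
```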

Window function syntax: function_name(args) OVER (PARTITION BY clause ORDER BY clause ROWS/RANGE clause)

function_name: the window function being applied;

OVER: keyword marking this as a window function rather than an ordinary aggregate;

Clauses

PARTITION BY: grouping column(s)

ORDER BY: sort column(s)

ROWS/RANGE window clause: controls the window's frame boundaries; there are two kinds (ROWS, RANGE)

ROWS: physical frame; rows are selected by their position (index) after sorting

RANGE: logical frame; rows are selected by value

There are three main kinds of window functions:

ranking functions

analytic functions

aggregate functions

Data loading

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = (StructType()
          .add('name', StringType(), True)
          .add('department', StringType(), True)
          .add('salary', IntegerType(), True))

df = spark.createDataFrame([
    ("Tom", "Sales", 4500),
    ("Georgi", "Sales", 4200),
    ("Kyoichi", "Sales", 3000),
    ("Berni", "Sales", 4700),
    ("Guoxiang", "Sales", 4200),
    ("Parto", "Finance", 2700),
    ("Anneke", "Finance", 3300),
    ("Sumant", "Finance", 3900),
    ("Jeff", "Marketing", 3100),
    ("Patricio", "Marketing", 2500),
], schema=schema)

df.createOrReplaceTempView('salary')
df.show()
```

```
+--------+----------+------+
|    name|department|salary|
+--------+----------+------+
|     Tom|     Sales|  4500|
|  Georgi|     Sales|  4200|
| Kyoichi|     Sales|  3000|
|   Berni|     Sales|  4700|
|Guoxiang|     Sales|  4200|
|   Parto|   Finance|  2700|
|  Anneke|   Finance|  3300|
|  Sumant|   Finance|  3900|
|    Jeff| Marketing|  3100|
|Patricio| Marketing|  2500|
+--------+----------+------+
```

ranking functions

| SQL | DataFrame | Description |
| --- | --- | --- |
| row_number | rowNumber | unique sequence number from 1 to n |
| rank | rank | ranking; tied values get the same rank, and the rank after a tie skips the tied positions (leaves gaps) |
| dense_rank | denseRank | like rank, but without gaps after ties |
| percent_rank | percentRank | computed as (rank within group - 1) / (rows in group - 1); 0 if the group has only one row |
| ntile | ntile | sorts the group's rows and splits them into n buckets; the value is the current row's bucket number (numbered from 1) |

```python
spark.sql("""
SELECT
     name
    ,department
    ,salary
    ,row_number()   over(partition by department order by salary) as index
    ,rank()         over(partition by department order by salary) as rank
    ,dense_rank()   over(partition by department order by salary) as dense_rank
    ,percent_rank() over(partition by department order by salary) as percent_rank
    ,ntile(2)       over(partition by department order by salary) as ntile
FROM salary
""").toPandas()
```

|   | name | department | salary | index | rank | dense_rank | percent_rank | ntile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Patricio | Marketing | 2500 | 1 | 1 | 1 | 0.00 | 1 |
| 1 | Jeff | Marketing | 3100 | 2 | 2 | 2 | 1.00 | 2 |
| 2 | Kyoichi | Sales | 3000 | 1 | 1 | 1 | 0.00 | 1 |
| 3 | Georgi | Sales | 4200 | 2 | 2 | 2 | 0.25 | 1 |
| 4 | Guoxiang | Sales | 4200 | 3 | 2 | 2 | 0.25 | 1 |
| 5 | Tom | Sales | 4500 | 4 | 4 | 3 | 0.75 | 2 |
| 6 | Berni | Sales | 4700 | 5 | 5 | 4 | 1.00 | 2 |
| 7 | Parto | Finance | 2700 | 1 | 1 | 1 | 0.00 | 1 |
| 8 | Anneke | Finance | 3300 | 2 | 2 | 2 | 0.50 | 1 |
| 9 | Sumant | Finance | 3900 | 3 | 3 | 3 | 1.00 | 2 |
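The ranking functions can be emulated in plain Python for a single partition (the Sales salaries from the output above, ordered ascending), which makes each formula explicit:

```python
salaries = [3000, 4200, 4200, 4500, 4700]  # Sales partition, sorted ascending
n = len(salaries)

# row_number: unique 1..n positions.
row_number = list(range(1, n + 1))

# rank: 1 + count of strictly smaller values -> gaps appear after ties.
rank = [1 + sum(s < v for s in salaries) for v in salaries]

# dense_rank: position among distinct values -> no gaps.
distinct = sorted(set(salaries))
dense_rank = [1 + distinct.index(v) for v in salaries]

# percent_rank: (rank - 1) / (n - 1).
percent_rank = [(r - 1) / (n - 1) for r in rank]

def ntile(n_rows, k):
    # Spark's ntile: the first (n_rows % k) buckets get one extra row.
    base, extra = divmod(n_rows, k)
    out = []
    for bucket in range(1, k + 1):
        out += [bucket] * (base + (1 if bucket <= extra else 0))
    return out

assert rank == [1, 2, 2, 4, 5]
assert dense_rank == [1, 2, 2, 3, 4]
assert percent_rank == [0.0, 0.25, 0.25, 0.75, 1.0]
assert ntile(n, 2) == [1, 1, 1, 2, 2]
```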

analytic functions

| SQL | DataFrame | Description |
| --- | --- | --- |
| cume_dist | cumeDist | computed as: rows in the group with value <= current row's value / total rows in the group |
| lag | lag | lag(input, [offset[, default]]): the value of input at offset rows before the current row; default (NULL if omitted) when out of range |
| lead | lead | the opposite of lag: the value of input at offset rows after the current row |

```python
spark.sql("""
SELECT
     name
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,cume_dist()  over(partition by department order by salary) as cume_dist
    ,lag(salary, 2)  over(partition by department order by salary) as lag
    ,lead(salary, 2) over(partition by department order by salary) as lead
FROM salary
""").toPandas()
```

Note that salary must be passed as a column; quoting it as the string literal 'salary' would make lag/lead return that text for every in-range row.

|   | name | department | salary | index | cume_dist | lag | lead |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Patricio | Marketing | 2500 | 1 | 0.500000 | NaN | NaN |
| 1 | Jeff | Marketing | 3100 | 2 | 1.000000 | NaN | NaN |
| 2 | Kyoichi | Sales | 3000 | 1 | 0.200000 | NaN | 4200 |
| 3 | Georgi | Sales | 4200 | 2 | 0.600000 | NaN | 4500 |
| 4 | Guoxiang | Sales | 4200 | 3 | 0.600000 | 3000 | 4700 |
| 5 | Tom | Sales | 4500 | 4 | 0.800000 | 4200 | NaN |
| 6 | Berni | Sales | 4700 | 5 | 1.000000 | 4200 | NaN |
| 7 | Parto | Finance | 2700 | 1 | 0.333333 | NaN | 3900 |
| 8 | Anneke | Finance | 3300 | 2 | 0.666667 | NaN | NaN |
| 9 | Sumant | Finance | 3900 | 3 | 1.000000 | 2700 | NaN |
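The lag/lead semantics are easy to reproduce in plain Python over one ordered partition (the Sales salaries), with None standing in for NULL:

```python
salaries = [3000, 4200, 4200, 4500, 4700]  # Sales partition, sorted ascending

def lag(xs, k, default=None):
    # Value k rows before the current row; default when out of range.
    return [xs[i - k] if i - k >= 0 else default for i in range(len(xs))]

def lead(xs, k, default=None):
    # Value k rows after the current row; default when out of range.
    return [xs[i + k] if i + k < len(xs) else default for i in range(len(xs))]

assert lag(salaries, 2) == [None, None, 3000, 4200, 4200]
assert lead(salaries, 2) == [4200, 4500, 4700, None, None]
```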

aggregate functions

These simply apply the ordinary aggregate functions over a window. Note that with an ORDER BY and no explicit frame clause, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the results below are running aggregates.

| SQL | Description |
| --- | --- |
| avg | average |
| sum | sum |
| min | minimum |
| max | maximum |

```python
spark.sql("""
SELECT
     name
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,sum(salary)  over(partition by department order by salary) as sum
    ,avg(salary)  over(partition by department order by salary) as avg
    ,min(salary)  over(partition by department order by salary) as min
    ,max(salary)  over(partition by department order by salary) as max
FROM salary
""").toPandas()
```

|   | name | department | salary | index | sum | avg | min | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Patricio | Marketing | 2500 | 1 | 2500 | 2500.0 | 2500 | 2500 |
| 1 | Jeff | Marketing | 3100 | 2 | 5600 | 2800.0 | 2500 | 3100 |
| 2 | Kyoichi | Sales | 3000 | 1 | 3000 | 3000.0 | 3000 | 3000 |
| 3 | Georgi | Sales | 4200 | 2 | 11400 | 3800.0 | 3000 | 4200 |
| 4 | Guoxiang | Sales | 4200 | 3 | 11400 | 3800.0 | 3000 | 4200 |
| 5 | Tom | Sales | 4500 | 4 | 15900 | 3975.0 | 3000 | 4500 |
| 6 | Berni | Sales | 4700 | 5 | 20600 | 4120.0 | 3000 | 4700 |
| 7 | Parto | Finance | 2700 | 1 | 2700 | 2700.0 | 2700 | 2700 |
| 8 | Anneke | Finance | 3300 | 2 | 6000 | 3000.0 | 2700 | 3300 |
| 9 | Sumant | Finance | 3900 | 3 | 9900 | 3300.0 | 2700 | 3900 |
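The running-sum behaviour, including the RANGE frame's tie handling, can be checked in plain Python for the Sales partition: because the default frame is bounded by value, every row whose sort key ties with the current row is included, which is why Georgi and Guoxiang (both 4200) share sum = 11400 above.

```python
salaries = [3000, 4200, 4200, 4500, 4700]  # Sales partition, sorted ascending

# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: include every row
# whose value is <= the current row's value (ties are grouped).
running_sum = [sum(s for s in salaries if s <= v) for v in salaries]
running_avg = [sum(s for s in salaries if s <= v) /
               sum(1 for s in salaries if s <= v) for v in salaries]

assert running_sum == [3000, 11400, 11400, 15900, 20600]
assert running_avg == [3000.0, 3800.0, 3800.0, 3975.0, 4120.0]
```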

Window clause

ROWS/RANGE window clause: controls the frame boundaries; there are two kinds (ROWS, RANGE)

ROWS: physical frame; rows are selected by their position (index) after sorting

RANGE: logical frame; rows are selected by value

Syntax: OVER (PARTITION BY … ORDER BY … frame_type BETWEEN start AND end)

There are five boundary specifiers:

CURRENT ROW: the current row

UNBOUNDED PRECEDING: the first row of the partition

UNBOUNDED FOLLOWING: the last row of the partition

n PRECEDING: n rows before the current row

n FOLLOWING: n rows after the current row

(UNBOUNDED by itself denotes the partition boundary, and is always combined with PRECEDING or FOLLOWING.)

```python
spark.sql("""
SELECT
     name
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,row_number() over(partition by department order by salary
                       rows between UNBOUNDED PRECEDING and CURRENT ROW) as index1
FROM salary
""").toPandas()
```

|   | name | department | salary | index | index1 |
| --- | --- | --- | --- | --- | --- |
| 0 | Patricio | Marketing | 2500 | 1 | 1 |
| 1 | Jeff | Marketing | 3100 | 2 | 2 |
| 2 | Kyoichi | Sales | 3000 | 1 | 1 |
| 3 | Georgi | Sales | 4200 | 2 | 2 |
| 4 | Guoxiang | Sales | 4200 | 3 | 3 |
| 5 | Tom | Sales | 4500 | 4 | 4 |
| 6 | Berni | Sales | 4700 | 5 | 5 |
| 7 | Parto | Finance | 2700 | 1 | 1 |
| 8 | Anneke | Finance | 3300 | 2 | 2 |
| 9 | Sumant | Finance | 3900 | 3 | 3 |
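The ROWS/RANGE distinction only shows up when the sort key has ties. A running count over the Sales salaries makes the gap visible in plain Python: ROWS counts physical rows, while RANGE pulls in every row whose value falls inside the bound, so the tied 4200s are grouped:

```python
salaries = [3000, 4200, 4200, 4500, 4700]  # Sales partition, sorted ascending

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: count by position.
rows_count = [i + 1 for i in range(len(salaries))]

# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: count by value,
# so both 4200 rows see the same frame.
range_count = [sum(s <= v for s in salaries) for v in salaries]

assert rows_count == [1, 2, 3, 4, 5]
assert range_count == [1, 3, 3, 4, 5]
```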

Mixed application

```python
spark.sql("""
SELECT
     name
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,salary - (min(salary) over(partition by department order by salary)) as salary_diff
FROM salary
""").toPandas()
```

|   | name | department | salary | index | salary_diff |
| --- | --- | --- | --- | --- | --- |
| 0 | Patricio | Marketing | 2500 | 1 | 0 |
| 1 | Jeff | Marketing | 3100 | 2 | 600 |
| 2 | Kyoichi | Sales | 3000 | 1 | 0 |
| 3 | Georgi | Sales | 4200 | 2 | 1200 |
| 4 | Guoxiang | Sales | 4200 | 3 | 1200 |
| 5 | Tom | Sales | 4500 | 4 | 1500 |
| 6 | Berni | Sales | 4700 | 5 | 1700 |
| 7 | Parto | Finance | 2700 | 1 | 0 |
| 8 | Anneke | Finance | 3300 | 2 | 600 |
| 9 | Sumant | Finance | 3900 | 3 | 1200 |
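The salary_diff column is each salary minus the running minimum within its department, which a short pure-Python loop over the Sales partition reproduces:

```python
salaries = [3000, 4200, 4200, 4500, 4700]  # Sales partition, sorted ascending

# Running minimum up to the current row, then the difference.
running_min = []
cur = float("inf")
for v in salaries:
    cur = min(cur, v)
    running_min.append(cur)
salary_diff = [v - m for v, m in zip(salaries, running_min)]

assert salary_diff == [0, 1200, 1200, 1500, 1700]
```

Because the partition is sorted ascending, the running minimum is simply the first salary; with a descending sort the running minimum would track the current row instead.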

Reference: 小哲嗨数


Source: https://blog.csdn.net/ganzheyu/article/details/104590954
