Spark Window Functions - PySpark

Spark window functions have the following traits:

  • they perform a calculation over a group of rows, called a frame
  • each row has a corresponding frame
  • an aggregate/window function returns a new value for each row, computed over its frame
  • they can be used with both the SQL syntax and the DataFrame API

1. Create a simple dataset

from pyspark.sql import SparkSession, Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Assumes a SparkSession named spark; create one if the shell/notebook does not already provide it.
spark = SparkSession.builder.getOrCreate()

empsalary_data = [
  ("sales",     1,  "Alice",  5000, ["game",  "ski"]),
  ("personnel", 2,  "Olivia", 3900, ["game",  "ski"]),
  ("sales",     3,  "Ella",   4800, ["skate", "ski"]),
  ("sales",     4,  "Ebba",   4800, ["game",  "ski"]),
  ("personnel", 5,  "Lilly",  3500, ["climb", "ski"]),
  ("develop",   7,  "Astrid", 4200, ["game",  "ski"]),
  ("develop",   8,  "Saga",   6000, ["kajak", "ski"]),
  ("develop",   9,  "Freja",  4500, ["game",  "kajak"]),
  ("develop",   10, "Wilma",  5200, ["game",  "ski"]),
  ("develop",   11, "Maja",   5200, ["game",  "farming"])]

empsalary = spark.createDataFrame(empsalary_data,
    schema=["depName", "empNo", "name", "salary", "hobby"])
empsalary.show()

[Figure: output of empsalary.show()]

2. Spark Functions

Among Spark's built-in functions, only aggregate functions can be used together with window functions.

[Table: Spark aggregate functions]

Functions from other categories cannot be applied over a Spark window. In the example below, array_contains (one of the collection functions) is used over a window, and Spark raises an error:

overCategory = Window.partitionBy("depName")

# Spark raises an AnalysisException here: array_contains is not an aggregate function,
# so it cannot be used as a window function.
df = empsalary.withColumn(
    "average_salary_in_dep", array_contains(col("hobby"), "game").over(overCategory)).withColumn(
    "total_salary_in_dep", sum("salary").over(overCategory))
df.show()
## pyspark.sql.functions.array_contains(col, value)
## Collection function: returns True if the array contains the given value. The collection elements and the value must be of the same type.
df = spark.createDataFrame([(['a', 'b', 'c'],), ([],)], ['data'])
df.select(array_contains(df.data, 'a')).collect()
## [Row(array_contains(data, a)=True), Row(array_contains(data, a)=False)]
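
Note that collection functions such as array_contains operate row by row, so no window is needed for them in the first place. A minimal sketch of what was presumably intended, applying array_contains per row and keeping the aggregate over the window:

overCategory = Window.partitionBy("depName")
df = empsalary.withColumn(
    "plays_game", array_contains(col("hobby"), "game")).withColumn(
    "total_salary_in_dep", sum("salary").over(overCategory))
df.select("depName", "name", "plays_game", "total_salary_in_dep").show()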

3. Basic Frame with partitionBy()

A basic frame has the following traits:

  • created by Window.partitionBy on one or more columns
  • each row has a corresponding frame
  • the frame is the same for every row in the same partition
  • aggregate/window functions can be applied to each row + frame to generate a single value

[Figure: frames produced by partitionBy]

As shown in the figure above, we compute the average salary and the total salary within each department:

overCategory = Window.partitionBy("depName")
df = empsalary.withColumn(
"salaries", collect_list("salary").over(overCategory)).withColumn(
"average_salary",(avg("salary").over(overCategory)).cast("int")).withColumn(
"total_salary",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","salaries","average_salary","total_salary")
df.show(20,False)

Here is the output from the previous sample code

[Figure: output of the sample above]

From the output, we can see that the salaries column produced by collect_list has the same value for every row within a window.

4. Ordered Frame with partitionBy() and orderBy()

An ordered frame has the following traits:

  • created by Window.partitionBy on one or more columns
  • followed by orderBy on a column
  • each row has a corresponding frame
  • the frame is not the same for every row within the same partition; by default, the frame contains all previous rows and the current row
  • aggregate/window functions can be applied to each row + frame to generate a single value

Here is the sample code

overCategory = Window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"average_salary",(avg("salary").over(overCategory)).cast("int")).withColumn(
"total_salary",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","salaries","average_salary","average_salary")
df.show(20,False)

[Figure: output of the sample above]

From the output we can see that the salaries column is no longer identical within a window; the values only cover the rows from unboundedPreceding up to currentRow. This is because, when an orderBy is present, the default frame is rangeBetween(Window.unboundedPreceding, Window.currentRow).
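
If the whole partition is needed even with an orderBy, the frame can be widened explicitly (a minimal sketch; the median section below relies on the same trick):

overCategory = Window.partitionBy("depName").orderBy(desc("salary")).rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)
df = empsalary.withColumn(
    "salaries", collect_list("salary").over(overCategory))
df.select("depName", "name", "salary", "salaries").show(20, False)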

5. Rank functions in a group

The following rank functions are supported in Spark:

[Table: rank functions supported in Spark: rank, dense_rank, percent_rank, ntile, row_number]

Here is an example:

overCategory = Window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
    "salaries", collect_list("salary").over(overCategory)).withColumn(
    "rank", rank().over(overCategory)).withColumn(
    "dense_rank", dense_rank().over(overCategory)).withColumn(
    "row_number", row_number().over(overCategory)).withColumn(
    "ntile", ntile(3).over(overCategory)).withColumn(
    "percent_rank", percent_rank().over(overCategory)).select(
    "depName", "empNo", "name", "salary", "rank", "dense_rank", "row_number", "ntile", "percent_rank")
df.show(20, False)

Here is the output:

[Figure: output of the sample above]

Using rank functions, we can quickly get results such as the top 2 salaries per department:

overCategory = Window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
"row_number",row_number().over(overCategory)).filter(
"row_number <= 2").select(
"depName","empNo","name","salary")
df.show(20,False)
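
Note that row_number breaks ties arbitrarily. If tied salaries should all be kept (an assumption about the desired semantics), dense_rank can be used instead; a sketch:

df = empsalary.withColumn(
    "dense_rank", dense_rank().over(overCategory)).filter(
    "dense_rank <= 2").select(
    "depName", "empNo", "name", "salary")
df.show(20, False)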

5、lag & lead in a group

lag & lead 当我们想拿到没行的前一个element或者是后一个的时候会用到

lag 的意思是拿到当前value的前一个

lead 的意思是拿到当前value的后一个

overCategory = Window.partitionBy("depname").orderBy(desc("salary"))
df = empsalary.withColumn(
"lead",lead("salary",1).over(overCategory)).withColumn(
"lag",lag("salary",1).over(overCategory)).select(
"depName","empNo","name","salary","lead","lag")
df.show(20,False)

The output is as follows:

[Figure: output of the sample above]

From the result we can see that, within each partition, lag is null for the first row and lead is null for the last row.

We can then compute the difference between the lag/lead row and the current row; this is a very common step in feature engineering:

diff = df.withColumn(
    "higher_than_next", col("salary") - col("lead")).withColumn(
    "lower_than_previous", col("lag") - col("salary"))

diff.show()

diff will then contain nulls; here we replace the nulls with 0:

diff = df.withColumn(
    "higher_than_next", when(col("lead").isNull(), 0).otherwise(col("salary") - col("lead"))).withColumn(
    "lower_than_previous", when(col("lag").isNull(), 0).otherwise(col("lag") - col("salary")))
diff.show()

Suppose we want to find the rows where the next salary is more than twice the current one:

diff.filter(col("lead") > lit(2) * col("salary")).show(3)

7. Running Total (cumulative sum)

overCategory = Window.partitionBy("depname").orderby(desc("salary"))
running_total = empsalary.withColumn(
"rank",rank().over(overCategory)).withColumn(
"costs",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","rank","costs")
running_total.show(20,False)

[Figure: output of the sample above]
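
The running total works because the default frame of an ordered window accumulates from the start of the partition up to the current row. Since that default frame is range-based, tied salaries share the same accumulated value; if a strict row-by-row total is wanted (an assumption about the intent), the frame can be made explicit:

rowFrame = Window.partitionBy("depName").orderBy(desc("salary")).rowsBetween(
    Window.unboundedPreceding, Window.currentRow)
running_total = empsalary.withColumn(
    "costs", sum("salary").over(rowFrame))
running_total.select("depName", "name", "salary", "costs").show(20, False)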

8. Range Frame (custom frame boundaries)

We can use range functions to change the frame boundary. Such a frame has the following traits:

  • created by Window.partitionBy on one or more columns
  • usually has an orderBy, so that the data in the frame is ordered
  • then followed by rangeBetween or rowsBetween
  • each row has a corresponding frame
  • the frame boundary is controlled by rangeBetween or rowsBetween
  • aggregate/window functions can be applied to each row + frame to generate a single value

There are two range window functions; here are their definitions:

rowsBetween(start,end)
rangeBetween(start,end)

Both functions accept two parameters, [start, end], both inclusive. The values can be Window.unboundedPreceding, Window.unboundedFollowing and Window.currentRow, or a value relative to Window.currentRow, either negative or positive.

rowsBetween determines the frame boundary based on row positions within the window, relative to the current row. Here are a few examples and their meanings (see the sketch after the table):

[Table: rowsBetween examples and their meanings]
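
A few illustrative frame specifications (standard Spark semantics; the concrete examples are assumptions, not recovered from the table image):

w = Window.partitionBy("depName").orderBy("salary")
current_and_next = w.rowsBetween(Window.currentRow, 1)       # current row and the next row
previous_and_current = w.rowsBetween(-1, Window.currentRow)  # previous row and the current row
cumulative = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)  # partition start up to the current row
whole_partition = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)  # every row in the partition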

rangeBetween determines the frame boundary based on row values within the window; the difference compared to rowsBetween is that it compares against the value of the current row (in the orderBy column) rather than against row positions.

Here are the definitions of the constants used in the range functions:

Window.currentRow = 0

Window.unboundedPreceding = Long.MinValue

Window.unboundedFollowing = Long.MaxValue
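
For instance, a sketch of a value-based frame (an assumed example, not from the original post): for each employee, collect the salaries in the same department that are at most 1000 below, and no higher than, the current salary:

overCategory = Window.partitionBy("depName").orderBy("salary").rangeBetween(
    -1000, Window.currentRow)
df = empsalary.withColumn(
    "salaries_within_1000", collect_list("salary").over(overCategory))
df.select("depName", "name", "salary", "salaries_within_1000").show(20, False)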

Below is an example that uses Window.partitionBy with rowsBetween but without orderBy; from the output we can see that the order of rows within the frame is non-deterministic:

overCategory = Window.partitionBy("depName").rowsBetween(
Window.currentRow,1)
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"total_salary",sum("salary").over(overCategory))
df = df.select("depName","empNo","name","salary","salaries","total_salary")
df.show(20,False)

[Figure: output of the sample above]

9. Median

mean (avg) and median are the most common statistics. In a sense, the median is more robust than the mean, because it is less affected by outliers.

[Figure: mean vs. median]

Computing the median with a window:

## First define a UDF. Note that for an even-sized list it returns the upper of the
## two middle elements rather than their average.
@udf("long")
def median_udf(s):
    index = int(len(s) / 2)
    return s[index]

overCategory = Window.partitionBy("depName").orderBy("salary").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)
df = empsalary.withColumn(
    "salaries", collect_list("salary").over(overCategory)).withColumn(
    "median_salary", median_udf(col("salaries")))

df = df.select("depName", "empNo", "name", "salary", "salaries", "median_salary")
df.show(20, False)

The output is as follows:

[Figure: output of the sample above]

Computing the median with groupBy, then joining back to the original table:

dfMedian = empsalary.groupBy("depName").agg(
      sort_array(collect_list("salary")).alias("salaries")).select(
      "depName", "salaries", median_udf(col("salaries")).alias("median_salary"))
df = empsalary.join(broadcast(dfMedian), "depName").select(
        "depName", "empNo", "name", "salary", "salaries", "median_salary")
df.show(20, False)
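
As an alternative, Spark also ships an approximate percentile aggregate, which avoids collecting whole groups onto a single row. A sketch (percentile_approx is callable via expr in older versions, and is also exposed as pyspark.sql.functions.percentile_approx since Spark 3.1):

from pyspark.sql.functions import expr

dfMedian = empsalary.groupBy("depName").agg(
    expr("percentile_approx(salary, 0.5)").alias("median_salary"))
dfMedian.show()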

[Figure: output of the sample above]
Reference: https://knockdata.github.io/spark-window-function-pyspark/