c++ select函数_Spark Window Functions-PySpark（窗口函数）

最新推荐文章于 2024-04-11 17:29:20 发布

晓雪smile

最新推荐文章于 2024-04-11 17:29:20 发布

阅读量249

点赞数

文章标签： c++ select函数

本文链接：https://blog.csdn.net/weixin_29682291/article/details/112124578

版权

Spark Window Functions 有下列的属性

在一组行上面执行计算，这一组行称为Frame
每行row对应一个Frame
给每行返回一个新的值通过aggregate/window 函数
能够使用SQL 语法或者DataFrame API

1、创建一个简单的数据集

from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

empsalary_data = [
  ("sales",     1,  "Alice",  5000, ["game",  "ski"]),
  ("personnel", 2,  "Olivia", 3900, ["game",  "ski"]),
  ("sales",     3,  "Ella",   4800, ["skate", "ski"]),
  ("sales",     4,  "Ebba",   4800, ["game",  "ski"]),
  ("personnel", 5,  "Lilly",  3500, ["climb", "ski"]),
  ("develop",   7,  "Astrid", 4200, ["game",  "ski"]),
  ("develop",   8,  "Saga",   6000, ["kajak", "ski"]),
  ("develop",   9,  "Freja",  4500, ["game",  "kajak"]),
  ("develop",   10, "Wilma",  5200, ["game",  "ski"]),
  ("develop",   11, "Maja",   5200, ["game",  "farming"])]

empsalary=spark.createDataFrame(empsalary_data, 
    schema=["depName", "empNo", "name", "salary", "hobby"])
empsalary.show()

2、Spark Functions

在spark函数中，只有Aggregate Functions 能够和 Window Functions搭配使用

其他类别的函数不能应用于Spark Window中，例如下面的一个例子，使用了函数array_contains，（collection functions的一种），spark会报错

overCategory = Window.partitionBy("depName")

df = empsalary.withColumn(
"average_salary_in_dep",array_contains(col("hobby"),"game").over(overCategory)).withColumn("total_salary_in_dep",sum("salary").over(overCategory))
df.show()
## pyspark.sql.functions.array_contains(col,value)
## Collection 函数，return True if the array contains the given value.The collection elements and value must be of the same type
df = spark.createDataFrame([(['a', 'b', 'c'],),([],)],['data'])
df.select(array_contains(df.data,'a')).collect()
[Row(array_contains(data,a) = True,Row(array_contains(data,a) = False)]

3、Basic Frame with partitionBy（）

A Basic Frame有下列的属性

被一列或者多列的Window.partitionBy生成
每行对应一个Frame
Frame在同一个partition里面是相同的
Aggregate/Window functions 可以运用于每个row+frame 去生成单个的值

如上图所示的，在一个department里面计算平均的salary和总的salary

overCategory = Window.partitionBy("depName")
df = empsalary.withColumn(
"salaries", collect_list("salary").over(overCategory)).withColumn(
"average_salary",(avg("salary").over(overCategory)).cast("int")).withColumn(
"total_salary",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","salaries","average_salary","total_salary")
df.show(20,False)

Here is the output from the previous sample code

From the output,we can see that column salaries by function collect_list has the same values in a window

3、Ordered Frame with partitionBy and orderBy

an Ordered Frame has the following traits

被一个或者是多个columns生成
Followed by orderby on a column
Each row have a corresponding frame
The frame will not be the same for every row within the same partition.By default,the frame contains all previous rows and the currentRow
Aggregate/Window functions can be applied to each row+frame to generate a value

Here is the sample code

overCategory = Window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"average_salary",(avg("salary").over(overCategory)).cast("int")).withColumn(
"total_salary",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","salaries","average_salary","average_salary")
df.show(20,False)

从输出可以看到salaries在一个window里面是不一致的，the values are only from unboudedPreceding until currentRow

4、Rank funtions in a group

下面是在spark中被支持的rank functions

下面是例子

overCategory = window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"rank",rank().over(overCategory)).withColumn(
"dense_rank",dense_rank().over(overCategory)).withColumn(
"row_number",row_number().over(overCategory)).withColumn(
"ntile",ntile(3).over(overCategory)).withColumn(
"percent_rank",percent_rank().over(overCategory)).select(
"depName","empNo","name","salary","rank","dense_rank","row_number","ntile","percent_rank")
df.show(20,False)

下面是输出的结果

利用rank函数，我们能快速的得到类似于top2之类的数据

overCategory = Window.partitionBy("depName").orderBy(desc("salary"))
df = empsalary.withColumn(
"row_number",row_number().over(overCategory)).filter(
"row_number <= 2").select(
"depName","empNo","name","salary")
df.show(20,False)

5、lag & lead in a group

lag & lead 当我们想拿到没行的前一个element或者是后一个的时候会用到

lag 的意思是拿到当前value的前一个

lead 的意思是拿到当前value的后一个

overCategory = Window.partitionBy("depname").orderBy(desc("salary"))
df = empsalary.withColumn(
"lead",lead("salary",1).over(overCategory)).withColumn(
"lag",lag("salary",1).over(overCategory)).select(
"depName","empNo","name","salary","lead","lag")
df.show(20,False)

输出如下

从结果可以看到，lag的第一个值是null，lead的最后一个值是null

然后我们可以计算lag或者lead行跟当前行的差，这种做法在特征加工里面是很常见的一种方式

diff = df.withColumn(
"highter_than_next",col("salary") - col("lead")).withColumn(
"lower_than_previous",col("lag") - col("salary"))

diff.show()

然后diff会有null值，这里用0去替换null

diff = df.withColumn(
"highter_than_next",when(col("lead").isNull(),0).otherwise(col("lead"))).withColumn(
"lower_than_previous",when(col("lag").isNull(),0).otherwise(col("lag")))
diff.show()

假如要找到下一个salary比当前2倍还大的行

diff.filter(col("higher_than_next") > (lit(2)*col("salary"))).show(3)

6、Running Total（计算累计）

overCategory = Window.partitionBy("depname").orderby(desc("salary"))
running_total = empsalary.withColumn(
"rank",rank().over(overCategory)).withColumn(
"costs",sum("salary").over(overCategory)).select(
"depName","empNo","name","salary","rank","costs")
running_total.show(20,False)

7、Range Frame(自定义区间）

我们可以利用range functions去改变frame的边界（boundary）

被生成通过Window.partitionBy 一列或者多列
通常是有orderBy的，所以在frame里面的数据是被排序过的
Then followed by rangeBetween or rowsBetween
每行对应一个frame
frame的边界是被rangeBetween 和 rowsBetween控制的
Aggregate/Window functions可以被应用到row+frame上去生成单个的值

Threre are two range window functions，here are the functions definitions

rowsBetween(start,end)
rangeBetween(start,end)

both functions accept two parameters,[start,end] all inclusive.The parameters value can be Window.unboundedPreceding,Window.unboundedFollowing,and Window.currentRow.Or a value relative to Window.currentRow,either negtive or positive.

rowsBetween get the frame boundary based on the row index in the window compared to currentRow here are a few examples and it's meaning

rangeBetween 拿到frame的边界基于window内的row value，the difference compares to rowsBetween is that it compare with value of the current row

Here is the value definition of the constant values used in range functions

Window.currentRow = 0

Window.unboundedPreceding = Long.MinValue

Window.unboundedFollowing = Long.MaxValue

下面是直接使用Window.partitionBy没有使用orderBy的例子，从输出我们能看到数据是随机的

overCategory = Window.partitionBy("depName").rowsBetween(
Window.currentRow,1)
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"total_salary",sum("salary").over(overCategory))
df = df.select("depName","empNo","name","salary","salaries","total_salary")
df.show(20,False)

8、Median

mean(avg)和median是最常见的统计方法，在某种意义下，median是比mean更稳定的，因为median会过滤掉异常值

使用Window去计算median

## 先定义一个udf函数
@udf("long")
def median_udf(s):
    index = int(len(s) / 2)
    return s[index]
overCategory = Window.partitionBy("depName").orderBy("salary").rowsBetween(
Window.unboundedPreceding,Window.unboundedFollowing)
df = empsalary.withColumn(
"salaries",collect_list("salary").over(overCategory)).withColumn(
"median_salary",median_udf(col("salaried")))

df = df.select("depName","empNo","name","salary","salaries","median_salary")
df.show(20,False)

输出结果如下

使用groupBy 计算median然后join回原表

dfMedian = empsalary.groupBy("depName").agg(
      sort_array(collect_list("salary")).alias("salaries")).select(
      "depName", "salaries", median_udf(col("salaries")).alias("median_salary"))
df = empsalary.join(broadcast(dfMedian), "depName").select(
        "depName", "empNo", "name", "salary", "salaries", "median_salary")
df.show(20, False)

https://knockdata.github.io/spark-window-function-pyspark/knockdata.github.io

晓雪smile

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
c++ select函数_Spark Window Functions-PySpark（窗口函数）

Spark Window Functions 有下列的属性在一组行上面执行计算，这一组行称为Frame每行row对应一个Frame给每行返回一个新的值通过aggregate/window 函数能够使用SQL 语法或者DataFrame API1、创建一个简单的数据集from pyspark.sql import Windowfrom pyspark.sql.types import *from...
复制链接

扫一扫