Spark supports the following grouping constructs:
- group by
- window
- grouping sets
- rollup
- cube
Window aggregate functions
1. Building a window: partition with partitionBy(), order within the partition with orderBy(), and set the frame (the range of rows the aggregate sees) with rowsBetween().
2. Common window functions: rank().over(window), dense_rank().over(window), max(column).over(window). Note that rank() and dense_rank() take no column argument; the ordering comes from the window's orderBy().
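The ranking semantics can be sketched without Spark. A minimal plain-Python model (the sample rows and the function name are made up for illustration) showing that within each partition, rows are ordered descending and ranked, rank() leaves gaps after ties, and dense_rank() does not:

```python
from itertools import groupby

def window_ranks(rows):
    """rows: list of (status, cvr0). Returns (status, cvr0, rank, dense_rank)
    per row, partitioned by status and ordered by cvr0 descending."""
    out = []
    # sort by partition key, then descending by value, so groupby sees
    # one contiguous run per partition
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    for status, grp in groupby(ordered, key=lambda r: r[0]):
        prev, rank, dense = None, 0, 0
        for i, (_, v) in enumerate(grp, start=1):
            if v != prev:
                # rank jumps to the row position (gaps after ties);
                # dense_rank just increments (no gaps)
                rank, dense, prev = i, dense + 1, v
            out.append((status, v, rank, dense))
    return out

print(window_ranks([("a", 0.9), ("a", 0.9), ("a", 0.5), ("b", 0.7)]))
```

The two 0.9 rows tie at rank 1; the 0.5 row then gets rank 3 but dense_rank 2.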
Load a local test file:
val path = "/Users/gao/Desktop/Code/features_dacu.csv"
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load(path)
df.createOrReplaceTempView("df_table")
Test 1: using window functions from the DataFrame API (Scala)
//(1) build the window
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
val windowSpec = Window.partitionBy("status").orderBy(col("cvr0").desc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
//(2) define the max/min window aggregates (min must be imported as well)
import org.apache.spark.sql.functions.{max, min}
val maxPurchasecvr0 = max(df.col("cvr0")).over(windowSpec)
val minPurchasecvr0 = min(df.col("cvr0")).over(windowSpec)
//(3) define the rank/dense_rank window functions
import org.apache.spark.sql.functions.{dense_rank, rank}
val purchaseDenseRank = dense_rank().over(windowSpec)
val purchaseRank = rank().over(windowSpec)
//(4) apply the window functions in a select on the DataFrame (col is already imported in step (1))
df.where("status IS NOT NULL").orderBy("status")
.select(
col("status"),
col("cvr0"),
purchaseRank.alias("cvr0Rank"),
purchaseDenseRank.alias("cvr0DenseRank"),
maxPurchasecvr0.alias("maxPurchasecvr0"),
minPurchasecvr0.alias("minPurchasecvr0")).show()
Test 2: the same window functions in Spark SQL
spark.sql("""
SELECT status, cvr0,
rank() OVER (PARTITION BY status
ORDER BY cvr0 DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as rank,
dense_rank() OVER (PARTITION BY status
ORDER BY cvr0 DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as dRank,
max(cvr0) OVER (PARTITION BY status
ORDER BY cvr0 DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as maxPurchase,
min(cvr0) OVER (PARTITION BY status
ORDER BY cvr0 DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as minPurchase
FROM df_table WHERE status IS NOT NULL ORDER BY status
""").show()
Result:
Grouping sets
Used in analysis for hierarchical grouping: summarizing metrics while rolling up or drilling down across dimensions.
Relationship between grouping sets and group by: GROUPING SETS is equivalent to computing GROUP BY separately for each set of dimensions and combining the results with UNION ALL; grouping_id indicates which grouping set a result row belongs to.
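This equivalence can be sketched without Spark. A minimal plain-Python model (function names and sample rows are made up): each grouping set gets its own GROUP BY, the results are concatenated as in UNION ALL, and columns absent from a set come out as None (NULL in SQL):

```python
from collections import defaultdict

def group_sum(rows, key_idx):
    """GROUP BY the columns whose indexes are in key_idx, summing the
    last column; columns not in the set are replaced by None."""
    acc = defaultdict(int)
    for row in rows:
        key = tuple(row[i] if i in key_idx else None
                    for i in range(len(row) - 1))
        acc[key] += row[-1]
    return [k + (v,) for k, v in acc.items()]

def grouping_sets(rows, sets):
    # GROUPING SETS(s1, s2, ...) == GROUP BY s1 UNION ALL GROUP BY s2 ...
    out = []
    for s in sets:
        out += group_sum(rows, s)
    return out

rows = [("c1", "s1", 2), ("c1", "s2", 3), ("c2", "s1", 4)]
# GROUPING SETS((customerId, stockCode), ()): the trailing empty set ()
# contributes the grand-total row (None, None, 9)
result = grouping_sets(rows, [{0, 1}, set()])
print(result)
```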
HiveQL examples, see reference blog 1: https://www.cnblogs.com/Allen-rg/p/10648231.html [covers grouping sets, rollup, and cube; its usage notes and result walkthroughs are good enough that I won't repeat them here.]
Reference blog 2: https://www.cnblogs.com/wenBlog/p/8440200.html
Scala example 1. Note 1: column names that belong to the same grouping set go inside one pair of parentheses ().
Note 2: including or omitting the final empty parentheses () in GROUPING SETS changes both the meaning and the result (the empty set adds the grand-total row).
spark.sql("""
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode GROUPING SETS((customerId, stockCode),())
ORDER BY CustomerId DESC, stockCode DESC
""").show()
Result:
Scala example 2:
import org.apache.spark.sql.functions.{grouping_id, sum, expr}
dfNoNull.cube("customerId", "stockCode").agg(grouping_id(), sum("Quantity"))
.orderBy(expr("grouping_id()").desc)
.show()
Result:
Reading the result: 1. the grand total with no grouping; 2. totals per grouping, with grouping_id indicating which grouping produced each row.
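How grouping_id() is computed can be sketched as a bitmask (assuming the standard Spark/Hive semantics; the helper below is hypothetical): one bit per cube/rollup column, set to 1 when that column is aggregated away (NULL in the output row), with the first column in the most significant bit:

```python
def grouping_id(active_cols, all_cols):
    """Bitmask over all_cols in declaration order: bit = 0 if the column
    participates in this grouping, 1 if it was aggregated away."""
    gid = 0
    for c in all_cols:
        gid = (gid << 1) | (0 if c in active_cols else 1)
    return gid

cols = ["customerId", "stockCode"]
print(grouping_id({"customerId", "stockCode"}, cols))  # fully grouped rows
print(grouping_id(set(), cols))                        # grand-total row
```

For cube("customerId", "stockCode") the fully grouped rows get 0 and the grand-total row gets 3, which is why ordering by grouping_id() descending puts the grand total first.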
Rollup
Usage example (Scala):
val rolledUpDF = dfNoNull.rollup("Date", "Country").agg(sum("Quantity"))
.selectExpr("Date", "Country", "`sum(Quantity)` as total_quantity")
.orderBy("Date")
rolledUpDF.show()
Result:
Reading the result: 1. the grand total with no grouping; 2. totals for each hierarchical grouping level of the rollup.
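The grouping levels a rollup produces can be sketched in plain Python (the helper name is made up): rollup(c1, ..., cn) is equivalent to GROUPING SETS over the n+1 hierarchical prefixes, giving per-level subtotals plus the grand total:

```python
def rollup_sets(cols):
    """rollup(c1..cn) -> grouping sets (c1..cn), (c1..cn-1), ..., (c1), ()"""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

print(rollup_sets(["Date", "Country"]))
```

So rollup("Date", "Country") yields totals per (Date, Country), per Date, and overall, but never per Country alone; that is the difference from cube.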
Cube
Usage example (Scala):
dfNoNull.cube("Date", "Country").agg(sum(col("Quantity")))
.select("Date", "Country", "sum(Quantity)").orderBy("Date").show()
Result:
Reading the result: 1. the grand total with no grouping; 2. totals for every combination of the cube columns.
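By contrast with rollup, cube(c1, ..., cn) expands to GROUPING SETS over all 2^n subsets of the columns, i.e. every combination of grouped versus aggregated-away. A plain-Python sketch (helper name made up):

```python
from itertools import combinations

def cube_sets(cols):
    """cube(c1..cn) -> all 2^n grouping sets, largest first."""
    return [c for r in range(len(cols), -1, -1)
            for c in combinations(cols, r)]

print(cube_sets(["Date", "Country"]))
```

cube("Date", "Country") therefore adds the per-Country totals that rollup("Date", "Country") omits.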