窗口函数的适用范围:
1.SparkSQL
2.HiveSQL
3.JDBC
一:定义
窗口函数和 GroupBy 最大的区别, 就是 GroupBy 的聚合对每一个组只有一个结果, 而窗口函数可以对每一条数据都有一个结果
说白了, 窗口函数其实就是根据当前数据, 计算其在所在的组中的统计数据
二:窗口函数的逻辑
三:函数的组成
从语法的角度上讲, 窗口函数大致分为两个部分
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
* 函数部分 dense_rank()
* 窗口定义部分 PARTITION BY category ORDER BY revenue DESC
四:窗口函数的类型
五:案例代码
class demo05 {
val spark = SparkSession.builder().master("local[6]").appName("aggregation").getOrCreate()
import spark.implicits._
//创建数据集
val df: DataFrame = Seq(("aa", 80), ("aa", 70), ("aa", 70), ("aa", 60), ("bb", 70), ("bb", 80), ("cc", 70), ("cc", 50), ("cc", 50), ("cc", 90), ("cc", 100))
.toDF("name", "score")
//创建临时表
df.createTempView("class")
@Test
def rank(): Unit ={
//rank窗口函数测试
spark.sql(
"""
|select name , score ,
| rank() over(
| partition by name order by score desc
| ) number
| from class
|
""".stripMargin).show()
}
@Test
def dense_rank(): Unit ={
//dense_rank窗口函数测试
spark.sql(
"""
|select name , score ,
| dense_rank() over(
| partition by name order by score desc
| ) number
| from class
|
""".stripMargin).show()
}
@Test
def row_number(): Unit ={
//row_number窗口函数测试
spark.sql(
"""
|select name , score ,
| row_number() over(
| partition by name order by score desc
| ) number
| from class
|
""".stripMargin).show()
}
@Test
def ntile(): Unit ={
//ntile窗口函数测试
spark.sql(
"""
|select name , score ,
| ntile(2) over(
| partition by name order by score desc
| ) ntile
| from class
|
""".stripMargin).show()
}
@Test
def first_value(): Unit ={
//first_value窗口函数测试
spark.sql(
"""
|
| select name , score ,
| first_value( score ) over(
| partition by name order by score desc
| ) first
| from class
|
|
""".stripMargin).show()
}
@Test
def last_value(): Unit ={
//last_value窗口函数测试
spark.sql(
"""
|
| select name , score ,
| last_value( score ) over(
| partition by name order by score desc
| ) last
| from class
|
|
""".stripMargin).show()
}
@Test
def lag(): Unit ={
//lag窗口函数测试
spark.sql(
"""
|
| select name , score ,
| lag(score,2) over(
| partition by name order by score desc
| ) lag
| from class
|
|
""".stripMargin).show()
}
@Test
def lead(): Unit ={
//lead
spark.sql(
"""
|
| select name , score ,
| lead(score,2) over(
| partition by name order by score desc
| ) lead
| from class
|
|
""".stripMargin).show()
}
}
六:窗口函数的两种使用场景
1.在sql语句中使用
SELECT
product,
category,
revenue
FROM (
SELECT
product,
category,
revenue,
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
FROM productRevenue) tmp
WHERE
rank <= 2
窗口函数在 SQL 中的完整语法如下
function OVER (PARITION BY ... ORDER BY ... FRAME_TYPE BETWEEN ... AND ...)
2.在DataFrame的API中使用
val window: WindowSpec = Window.partitionBy('category)
.orderBy('revenue.desc)
source.select('product, 'category, 'revenue, dense_rank() over window as "rank")
.where('rank <= 2)
.show()
WindowSpec : 窗口的描述符, 描述窗口应该是怎么样的
dense_rank() over window : 表示一个叫做 dense_rank() 的函数作用于每一个窗口