Overview
Use pyspark.sql.Window to produce a sorted collect_list.
Example
Given the following data:
df.show()
+----+-------+-----+
|name|subject|score|
+----+-------+-----+
|李明| 语文| 82|
|李明| 数学| 90|
|李明| 英语| 75|
|陈凯| 语文| 71|
|陈凯| 数学| 83|
|陈凯| 英语| 66|
|王莉| 语文| 85|
|王莉| 数学| 80|
|王莉| 英语| 81|
+----+-------+-----+
The sorted collect_list can be obtained with pyspark.sql.Window as follows:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Running collect_list over a window ordered by score:
# within each name, every row sees the list of scores up to itself,
# so the last row per name carries the complete sorted list.
window_ = Window.partitionBy("name").orderBy("score")

(df.withColumn("score_list", F.collect_list("score").over(window_))
   .groupby("name")
   .agg(F.max("score_list").alias("score_list"))
   .show())
+----+------------+
|name| score_list|
+----+------------+
|王莉|[80, 81, 85]|
|陈凯|[66, 71, 83]|
|李明|[75, 82, 90]|
+----+------------+
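Why does F.max return the full list? The windowed collect_list produces, for each name, a series of prefixes of the sorted scores, and Spark compares arrays element-wise with a proper prefix ranking lower, so the complete list is the maximum. Python lists compare the same way, which this small sketch illustrates (the values are taken from the 陈凯 rows above):

```python
# Prefixes produced by the running window for name=陈凯,
# ordered by score (66, 71, 83)
prefixes = [[66], [66, 71], [66, 71, 83]]

# Lexicographic comparison: a proper prefix always ranks lower,
# so max() selects the complete sorted list
full = max(prefixes)
print(full)  # → [66, 71, 83]
```

As a side note, a window is not strictly required: F.sort_array(F.collect_list("score")) inside a plain groupBy aggregation should reach the same result when sorting by the collected column itself.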