1. Deduplicate while keeping the max/min value per group:
from pyspark.sql import functions as F
df.groupBy('columns1', 'columns2').agg(F.max('column_name'))  # or F.min('column_name') to keep the smallest
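A minimal runnable sketch of this pattern, assuming a local SparkSession and made-up data (column names follow the snippet above):

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 10), ('a', 1, 30), ('b', 2, 20)],
    ['columns1', 'columns2', 'column_name'])
# One row per (columns1, columns2) group, keeping the largest column_name
df.groupBy('columns1', 'columns2').agg(F.max('column_name').alias('column_name')).show()
# Expected rows (display order may vary): ('a', 1, 30) and ('b', 2, 20)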
2. Sort the df by a column and take the top n rows:
from pyspark.sql.window import Window
df.withColumn('rownumber', F.row_number().over(Window.orderBy(df['col'].desc()))) \
    .filter(F.col('rownumber') <= n)  # keep only the top n rows
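A runnable sketch under the same assumptions (hypothetical data, n = 2). Note that a Window.orderBy without partitionBy pulls all rows into a single partition, so this only suits data small enough for one executor:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame([('a', 5), ('b', 9), ('c', 7)], ['uid', 'col'])
n = 2  # hypothetical cutoff
top_n = (df.withColumn('rownumber', F.row_number().over(Window.orderBy(df['col'].desc())))
           .filter(F.col('rownumber') <= n))
top_n.show()  # keeps 'b' (9) and 'c' (7), numbered 1 and 2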
3. Aggregate with agg: sort, then group by uid and collect the wid column into a list:
# Caveat: the shuffle in groupBy does not guarantee that collect_list
# preserves the pre-sort order; see the deterministic variant below.
df1.sort(df1['p'].desc()).groupBy('uid').agg(F.collect_list('wid').alias('value_list'))
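A deterministic alternative (a sketch, using the same assumed columns uid, p, wid): pack p and wid into a struct, collect, and sort the array itself instead of relying on row order:

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.master('local[*]').getOrCreate()
df1 = spark.createDataFrame(
    [('u1', 0.5, 'w2'), ('u1', 0.9, 'w1'), ('u2', 0.7, 'w3')],
    ['uid', 'p', 'wid'])
# sort_array orders the structs by p (descending, since p is the first
# field); 'pairs.wid' then extracts the wid field from each struct
result = (df1.groupBy('uid')
             .agg(F.sort_array(F.collect_list(F.struct('p', 'wid')), asc=False).alias('pairs'))
             .withColumn('value_list', F.col('pairs.wid'))
             .drop('pairs'))
# u1 -> ['w1', 'w2'] (p = 0.9, 0.5), u2 -> ['w3']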