1. Defining a UDF
Passing parameters via F.lit(data)
F.lit actually creates a new literal column, so you can think of it as letting the UDF treat the parameter just like another column.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

# col is a list of "event,count" strings; filters arrives as an array column.
def filter_events(col, filters):
    col_map = {}
    try:
        for event_cnt in col:
            event, cnt = event_cnt.split(",")
            if event in filters:
                col_map[event] = int(cnt)
        return col_map
    except Exception:
        return None

udf_filter = F.udf(filter_events, MapType(StringType(), IntegerType()))

collect_df = (df.groupby("userId", "docId")
                .agg(F.collect_list("envCnt").alias("eventCnt")))

# Build the parameter list as an array column of literals with F.lit:
collect_df.select(
    "userId", "docId",
    udf_filter("eventCnt", F.array(*[F.lit(x) for x in ["haha", "hh", "hehe"]])).alias("actionCnt"),
)
2. Implementing a UDAF
When the data volume is small, you can group first, aggregate the values into a list with collect_list, and then apply a custom merge function to that list to get the aggregate you need.
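A minimal sketch of this approach. The merge logic is plain Python so it can be unit-tested without a Spark session; the function name merge_counts and the column names in the commented Spark wiring are illustrative assumptions, not names from the original code.

```python
# Merge a collect_list of "event,count" strings by summing the count per event.
def merge_counts(rows):
    merged = {}
    for item in rows:
        event, cnt = item.split(",")
        merged[event] = merged.get(event, 0) + int(cnt)
    return merged

# Hypothetical Spark wiring (assumes a DataFrame `df` with columns userId/envCnt):
# from pyspark.sql import functions as F
# from pyspark.sql.types import MapType, StringType, IntegerType
# udf_merge = F.udf(merge_counts, MapType(StringType(), IntegerType()))
# grouped = df.groupBy("userId").agg(F.collect_list("envCnt").alias("events"))
# grouped.withColumn("eventCnt", udf_merge("events"))
```

Because collect_list pulls every value of a group onto one executor, this only stays cheap while each group's list is small; for large groups prefer a built-in aggregate or a pandas UDF.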