First, you need to understand how the array_join and array_sort functions work; for details, see:
https://www.iteblog.com/archives/2459.html
Below is the demo code for Spark 2.4:
select
  row_number() OVER (PARTITION BY 1 ORDER BY 1) id,
  md5(array_join(array_sort(collect_set(f.holder_id)), '|')) association_id,
  current_timestamp() date_modified,
  first(f.date_id) date_id,
  array_join(array_sort(collect_set(f.holder_id)), '|') horder_ids_string,
  size(collect_set(f.holder_id)) holder_count,
  first(h.type) holder_type,
  first(h.type_name) holder_type_name,
  first(f.date_id) dt
from XXXXX f, XXXX h
where f.holder_id = h.id
group by f.association_tmp_id
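The core idea of the association_id column — deduplicate the holder ids (collect_set), sort them (array_sort), join them with '|' (array_join), then md5 the result — can be sketched outside Spark in plain Python. This is only an illustration of the logic; the function name and sample ids are made up:

```python
import hashlib

def association_id(holder_ids):
    # collect_set -> set(), array_sort -> sorted(), array_join(..., '|') -> '|'.join()
    joined = "|".join(str(h) for h in sorted(set(holder_ids)))
    # md5(...) over the joined string gives a stable id for the group
    return joined, hashlib.md5(joined.encode("utf-8")).hexdigest()

# Order and duplicates in the input do not matter:
a = association_id([102, 7, 102, 33])
b = association_id([33, 102, 7])
print(a[0])       # 7|33|102
print(a == b)     # True
```

Because the set is sorted before joining, any two groups containing the same holders produce the same md5, which is exactly why the hash can serve as a deterministic association id.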
Below is the equivalent for Spark 2.3, which lacks array_join and array_sort: sort the rows first, then use collect_list with concat_ws to reproduce the combined effect of array_sort and array_join.
select
  row_number() OVER (PARTITION BY 1 ORDER BY 1) id,
  md5(concat_ws('|', collect_list(cast(holder_id as varchar(20))))) association_id,
  current_timestamp() date_modified,
  date_id,
  concat_ws('|', collect_list(cast(holder_id as varchar(20)))) horder_ids_string,
  size(collect_set(holder_id)) holder_count,
  holder_type,
  holder_type_name,
  dt
from
(
  select f.association_tmp_id,
         f.holder_id,
         f.date_id,
         f.holder_type,
         h.type_name as holder_type_name,
         f.date_id as dt
  from XXXXXXXX f, XXXX h
  where f.holder_id = h.id
  order by f.date_id, f.association_tmp_id, f.holder_id
) t
group by date_id, association_tmp_id, holder_type, holder_type_name, dt
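The Spark 2.3 trick above can also be sketched in plain Python: globally sort the rows by (association_tmp_id, holder_id) as the subquery's ORDER BY does, then group and join, so the list for each group is already in sorted order when it is concatenated. The row data here is made up for illustration:

```python
import hashlib
from itertools import groupby
from operator import itemgetter

# (association_tmp_id, holder_id) pairs, as if from the joined f/h tables
rows = [(1, 102), (1, 7), (2, 5), (1, 33), (2, 41)]

# Step 1: the subquery's ORDER BY — sort before grouping
rows.sort(key=itemgetter(0, 1))

# Step 2: group by association_tmp_id; the per-group list stays in sorted
# order, so concat_ws('|', collect_list(...)) yields the same string as
# array_join(array_sort(...)) would in Spark 2.4
result = {}
for tmp_id, grp in groupby(rows, key=itemgetter(0)):
    joined = "|".join(str(h) for _, h in grp)
    result[tmp_id] = (joined, hashlib.md5(joined.encode("utf-8")).hexdigest())

print(result[1][0])  # 7|33|102
print(result[2][0])  # 5|41
```

One caveat worth knowing: Spark does not formally guarantee that collect_list preserves row order after a shuffle, so the pre-sort trick should be validated on your own data (or replaced with array_sort once you are on Spark 2.4+).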
This is written fairly briefly; if you have any other questions, feel free to ask in the comments and I will reply when I see them.
This post is a sub-question split out from the summary post, purely for easier lookup.
For the full summary, see the pinned post:
"Summary of pyspark and Spark errors, and usage of certain functions".