First, you need to understand how the array_join and array_sort functions work; for details, see:
https://www.iteblog.com/archives/2459.html
Below is the demo code for Spark 2.4:
select
  row_number() OVER (PARTITION BY 1 ORDER BY 1) id,
  md5(array_join(array_sort(collect_set(f.holder_id)), '|')) association_id,
  current_timestamp() date_modified,
  first(f.date_id) date_id,
  array_join(array_sort(collect_set(f.holder_id)), '|') horder_ids_string,
  size(collect_set(f.holder_id)) holder_count,
  first(h.type) holder_type,
  first(h.type_name) holder_type_name,
  first(f.date_id) dt
from XXXXX f, XXXX h
where f.holder_id = h.id
group by f.association_tmp_id
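The core idea of the association_id column — deduplicate the holder ids (collect_set), sort them (array_sort), join them with '|' (array_join), then md5 the result — can be sketched outside Spark in plain Python. This is only an illustration of the logic; the function name and sample ids are made up:

```python
import hashlib

def association_id(holder_ids):
    # collect_set -> set(), array_sort -> sorted(), array_join(..., '|') -> '|'.join()
    joined = "|".join(str(h) for h in sorted(set(holder_ids)))
    # md5(...) over the joined string gives a stable id for the group
    return joined, hashlib.md5(joined.encode("utf-8")).hexdigest()

# Order and duplicates in the input do not matter:
a = association_id([102, 7, 102, 33])
b = association_id([33, 102, 7])
print(a[0])       # 7|33|102
print(a == b)     # True
```

Because the set is sorted before joining, any two groups containing the same holders produce the same md5, which is exactly why the hash can serve as a deterministic association id.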
Below is the equivalent for Spark 2.3, which lacks array_join and array_sort: sort the rows first, then use collect_list with concat_ws to reproduce the combined effect of array_sort and array_join.
select
  row_number() OVER (PARTITION BY 1 ORDER BY 1) id,
  md5(concat_ws('|', collect_list(cast(holder_id as varchar(20))))) association_id,
  current_timestamp() date_modified,
  date_id,
  concat_ws('|', collect_list(cast(holder_id as varchar(20)))) horder_ids_string,
  size(collect_set(holder_id)) holder_count,
  holder_type,
  holder_type_name,
  dt
from
(
  select f.association_tmp_id,
         f.holder_id,
         f.date_id,
         f.holder_type,
         h.type_name as holder_type_name,
         f.date_id as dt
  from XXXXXXXX f, XXXX h
  where f.holder_id = h.id
  order by f.date_id, f.association_tmp_id, f.holder_id
) t
group by date_id, association_tmp_id, holder_type, holder_type_name, dt
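The Spark 2.3 trick above can also be sketched in plain Python: globally sort the rows by (association_tmp_id, holder_id) as the subquery's ORDER BY does, then group and join, so the list for each group is already in sorted order when it is concatenated. The row data here is made up for illustration:

```python
import hashlib
from itertools import groupby
from operator import itemgetter

# (association_tmp_id, holder_id) pairs, as if from the joined f/h tables
rows = [(1, 102), (1, 7), (2, 5), (1, 33), (2, 41)]

# Step 1: the subquery's ORDER BY — sort before grouping
rows.sort(key=itemgetter(0, 1))

# Step 2: group by association_tmp_id; the per-group list stays in sorted
# order, so concat_ws('|', collect_list(...)) yields the same string as
# array_join(array_sort(...)) would in Spark 2.4
result = {}
for tmp_id, grp in groupby(rows, key=itemgetter(0)):
    joined = "|".join(str(h) for _, h in grp)
    result[tmp_id] = (joined, hashlib.md5(joined.encode("utf-8")).hexdigest())

print(result[1][0])  # 7|33|102
print(result[2][0])  # 5|41
```

One caveat worth knowing: Spark does not formally guarantee that collect_list preserves row order after a shuffle, so the pre-sort trick should be validated on your own data (or replaced with array_sort once you are on Spark 2.4+).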
This is written fairly briefly; if you have any other questions, feel free to ask in the comments and I will reply when I see them.
This post is a sub-question split out from the summary post, purely for easier lookup.
For the full summary, see the pinned post:
"Summary of pyspark and Spark errors, and usage of certain functions".