Using collect_list(col) together with array_join(col, delimiter, null_replacement=None) from pyspark.sql.functions

qq_34669699

Last modified: 2023-02-17 16:48:49


First published: 2023-02-17 16:35:45

Article link: https://blog.csdn.net/qq_34669699/article/details/129088653

Using collect_list and array_join together in pyspark.sql.functions

collect_list(col) usage

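Aggregate function: returns a list of objects, duplicates included.
1. The result is an array column, which shows up as a Python list when collected.
2. The returned list may contain duplicate values.
Note that the order of the elements is not guaranteed, because it depends on the order of the rows after a shuffle.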

>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df.show()
+---+
|age|
+---+
|  2|
|  5|
|  5|
+---+
>>> import pyspark.sql.functions as F
>>> df1=df.agg(F.collect_list('age'))
>>> df1.show()
+-----------------+                                                             
|collect_list(age)|
+-----------------+
|        [2, 5, 5]|
+-----------------+

collect_set(col) usage

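Aggregate function: returns a collection of objects with duplicate elements removed.
1. The result is still an array column, but it holds only distinct values, like a set.
2. The returned array contains no duplicate values.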


>>> df2=df.agg(F.collect_set('age'))
>>> df2.show()
+----------------+                                                              
|collect_set(age)|
+----------------+
|          [5, 2]|
+----------------+

array_join(col, delimiter, null_replacement=None) usage

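From the documentation: "Concatenates the elements of column using the delimiter. Null values are replaced with null_replacement if set, otherwise they are ignored."
In other words, the elements of the array column are joined with delimiter; a null element is replaced by null_replacement when that argument is given, and skipped otherwise.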

So the col argument of array_join has to be an array column, which is exactly what collect_list and collect_set return.

>>> df = spark.createDataFrame([(["a", "b", "c"],), (["a", None],)], ['data'])
>>> df.show()
+---------+
|     data|
+---------+
|[a, b, c]|
|     [a,]|
+---------+
>>> df1=df.select(F.array_join(df.data, ",").alias("joined"))
>>> df1.show()
+------+                                                                        
|joined|
+------+
| a,b,c|
|     a|
+------+

Using the null_replacement parameter


>>> df2=df.select(F.array_join(df.data, "*", "无").alias("joined"))
>>> df2.show()
+------+                                                                        
|joined|
+------+
| a*b*c|
|  a*无|
+------+
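
The null handling shown in the two examples above can be summed up in a few lines of plain Python. This is only a model of the documented behaviour, not Spark code: array_join_model is a hypothetical helper written for this note, and it works on ordinary Python lists instead of columns.

```python
from typing import Optional, Sequence


def array_join_model(
    values: Sequence[Optional[str]],
    delimiter: str,
    null_replacement: Optional[str] = None,
) -> str:
    """Plain-Python model of array_join's null handling (hypothetical helper)."""
    parts = []
    for value in values:
        if value is None:
            if null_replacement is None:
                continue  # no replacement given: the null element is skipped
            parts.append(null_replacement)  # replacement given: substitute it
        else:
            parts.append(value)
    return delimiter.join(parts)


print(array_join_model(["a", "b", "c"], ","))    # a,b,c
print(array_join_model(["a", None], ","))        # a
print(array_join_model(["a", None], "*", "无"))  # a*无
```

The three printed values match the joined column in the two df.show() outputs above.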

Using the two together

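Aggregating with collect_list turns the rows of a column into a single value; at this point the result is an array.
array_join can then concatenate that array with whatever separator is needed, such as "|", "$" or ",".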

>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df1=df.agg(F.collect_list('age').alias('ages'))
>>> df1.show()
+---------+
|     ages|
+---------+
|[5, 5, 2]|
+---------+
>>> df2=df1.select(F.array_join(df1.ages, "|@|").alias("joined"))
>>> df2.show()
+---------+                                                                     
|   joined|
+---------+
|5|@|5|@|2|
+---------+