一.问题描述
今天写pyspark遇到一个问题,要实现同mysql的GROUP_CONCAT函数的功能
数据1:
col1 col2
1 a
1 b
1 c
2 d
2 f
想要的结果1:
col1 new_col2
1 a,b,c
2 d,f
如果存在多列是否也可行
数据2:
col1 col2 col3
1 a 100
1 b 200
1 c 300
2 d 400
2 f 500
想要的结果2:
col1 new_col2 new_col3
1 a,b,c 100,200,300
2 d,f 400,500
二.解决方案
pyspark的collect_list可以实现如下需求
代码:
#!/usr/bin/env python
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import collect_list
from pyspark.sql.functions import collect_set
spark_conf = SparkConf().setMaster("local").setAppName("test1")
sc = SparkContext(conf = spark_conf)
spark = SparkSession(sc)
df1 = spark.createDataFrame([('1','a'),('1','b'),('1','c'),('2','d'),('2','f')], ['col1', 'col2'])
df1.groupBy("col1").agg(collect_list('col2').alias('new_col2')).show()
df2 = spark.createDataFrame([('1','a','100'),('1','b','200'),('1','c','300'),('2','d','400'),('2','f','500')], ['col1', 'col2', 'col3'])
df2.groupBy("col1").agg(*[collect_set(col) for col in ['col2','col3']]).show()
sc.stop()
测试记录:
--snip--
21/04/15 17:26:41 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
21/04/15 17:26:41 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/04/15 17:26:41 INFO hive.metastore: Connected to metastore.
+----+---------+
|col1| new_col2|
+----+---------+
| 1|[a, b, c]|
| 2| [d, f]|
+----+---------+
--snip--
21/04/15 17:30:43 INFO scheduler.TaskSetManager: Finished task 64.0 in stage 19.0 (TID 401) in 8 ms on localhost (executor driver) (75/75)
21/04/15 17:30:43 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool
21/04/15 17:30:43 INFO scheduler.DAGScheduler: ResultStage 19 (showString at NativeMethodAccessorImpl.java:0) finished in 0.272 s
21/04/15 17:30:43 INFO scheduler.DAGScheduler: Job 9 finished: showString at NativeMethodAccessorImpl.java:0, took 0.278859 s
+----+-----------------+-----------------+
|col1|collect_set(col2)|collect_set(col3)|
+----+-----------------+-----------------+
| 1| [c, b, a]| [100, 300, 200]|
| 2| [f, d]| [500, 400]|
+----+-----------------+-----------------+
--snip--
参考:
1.https://www.cnblogs.com/TTyb/p/10196544.html