pyspark案例系列3-dataframe实现mysql的group_concat功能

一.问题描述

今天写pyspark遇到一个问题,要实现同mysql的GROUP_CONCAT函数的功能

数据1:

col1   col2
1        a
1        b
1        c
2        d
2        f

想要的结果1:

col1   new_col2
1        a,b,c
2        d,f

如果存在多列是否也可行
数据2:

col1   col2   col3
1        a        100
1        b        200
1        c        300
2        d        400
2        f        500

想要的结果2:

col1   new_col2    new_col3
1        a,b,c   100,200,300
2        d,f       400,500

二.解决方案

pyspark的collect_list可以实现如下需求

代码:

#!/usr/bin/env python
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import collect_list
from pyspark.sql.functions import collect_set

spark_conf = SparkConf().setMaster("local").setAppName("test1")
sc = SparkContext(conf = spark_conf)
spark = SparkSession(sc)

df1 = spark.createDataFrame([('1','a'),('1','b'),('1','c'),('2','d'),('2','f')], ['col1', 'col2'])
df1.groupBy("col1").agg(collect_list('col2').alias('new_col2')).show()

df2 = spark.createDataFrame([('1','a','100'),('1','b','200'),('1','c','300'),('2','d','400'),('2','f','500')], ['col1', 'col2', 'col3'])
df2.groupBy("col1").agg(*[collect_set(col) for col in ['col2','col3']]).show()

sc.stop()

测试记录:

--snip--
21/04/15 17:26:41 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
21/04/15 17:26:41 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/04/15 17:26:41 INFO hive.metastore: Connected to metastore.
+----+---------+
|col1| new_col2|
+----+---------+
|   1|[a, b, c]|
|   2|   [d, f]|
+----+---------+
--snip--
21/04/15 17:30:43 INFO scheduler.TaskSetManager: Finished task 64.0 in stage 19.0 (TID 401) in 8 ms on localhost (executor driver) (75/75)
21/04/15 17:30:43 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool 
21/04/15 17:30:43 INFO scheduler.DAGScheduler: ResultStage 19 (showString at NativeMethodAccessorImpl.java:0) finished in 0.272 s
21/04/15 17:30:43 INFO scheduler.DAGScheduler: Job 9 finished: showString at NativeMethodAccessorImpl.java:0, took 0.278859 s
+----+-----------------+-----------------+
|col1|collect_set(col2)|collect_set(col3)|
+----+-----------------+-----------------+
|   1|        [c, b, a]|  [100, 300, 200]|
|   2|           [f, d]|       [500, 400]|
+----+-----------------+-----------------+
--snip--

参考:
1.https://www.cnblogs.com/TTyb/p/10196544.html

  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值