pyspark: converting each row iterated after groupby into a pandas df

For more on iterating over grouped data after a pyspark groupby, see this article:

https://blog.csdn.net/qq_42363032/article/details/118298108

After grouping in pyspark, the result looks like the rows below; each such row corresponds to one of the small dfs you would get from a pandas groupby.
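For reference, here is a minimal sketch of the pandas side of that analogy. The pdf DataFrame and its values are made up; only the column names mirror the pyspark example that follows.

import pandas as pd

# made-up sample rows with the same columns as the pyspark example
pdf = pd.DataFrame({
    'alpos_id': ['0_2011082923279930', '0_2011082923279930', '0_3071297379437968'],
    'impressions': [222.0, 2269.0, 8.0],
    'ecpm': [14.4595, 14.0899, 73.75],
})

# iterating a pandas groupby yields (group key, small df) pairs directly
for alpos_id, small_df in pdf.groupby('alpos_id'):
    print(alpos_id)
    print(small_df)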

data = ss.createDataFrame(data)


da_gb = data.groupby('alpos_id').agg(
    fn.collect_list('impressions').alias('impressions_list'),
    fn.collect_list('ecpm').alias('ecpm_list')
)

da_gb.show()

[screenshot of the da_gb.show() output]
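The snippet above assumes that ss is an existing SparkSession, that fn is pyspark.sql.functions, and that data holds the raw, ungrouped rows. A minimal setup sketch under those assumptions (the app name is arbitrary):

import pyspark.sql.functions as fn
from pyspark.sql import SparkSession

# the SparkSession used by ss.createDataFrame(...) above
ss = SparkSession.builder.appName('groupby_to_pandas').getOrCreate()

# 'data' can be any pandas DataFrame (or list of Rows) with the columns
# alpos_id, impressions and ecpm, e.g. the made-up pdf from the sketch above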

Converting the pyspark-grouped data, i.e. each row, into a pandas df:

def row_dealwith(data):
    ids = list(data.keys())[0]      # the group id
    values = data.get(ids)          # the aggregated field values for this group
    lens = len(values)              # number of aggregated lists (here 2: impressions_list, ecpm_list)
    # print(ids, values[0], values[1])

    # build the id column, one entry per element of the group
    ids_li = []
    for i in range(len(values[0])):
        ids_li.append(ids)

    # turn the horizontal per-group lists into vertical columns
    zdict = {}
    zdict['alpos_id'] = ids_li
    for i in range(lens):
        zdict[i] = values[i]        # columns keyed 0, 1, ... for each aggregated list

    print(zdict)

    da_gb = pd.DataFrame(zdict)
    print(da_gb)
# map each grouped row to {group id: [impressions_list, ecpm_list]}
dardds = da_gb.rdd.map(lambda row: {row.alpos_id: [row.impressions_list, row.ecpm_list]})

dardds.foreach(row_dealwith)
'''
out:

{'alpos_id': ['0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930', '0_2011082923279930'], 0: [222.0, 2269.0, 212.0, 43.0, 29.0, 172.0, 192.0, 232.0, 288.0, 306.0, 328.0], 1: [14.4595, 14.0899, 14.3868, 12.5581, 12.069, 30.814, 14.1667, 12.6293, 15.5556, 8.5948, 11.2805]}

{'alpos_id': ['0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077', '0_3001461399082077'], 0: [0.2, 0.0, 0.142857142857142, 0.0, 0.181818181818181, 0.3, 0.3125, 0.0, 0.0], 1: [43.6990133333333, 40.1434533333333, 41.21348, 34.8579266666666, 35.2619666666666, 35.6953, 44.22308, 44.4453, 44.18604]}

{'alpos_id': ['0_3071297379437968'], 0: [8.0], 1: [73.75]}

{'alpos_id': ['0_3031798112278383', '0_3031798112278383', '0_3031798112278383', '0_3031798112278383'], 0: [4.0, 62.0, 58.0, 4.0], 1: [2.5, 6.9355, 9.3103, 5.0]}
'''
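Note that foreach runs row_dealwith on the executors, so on a real cluster the dicts above would land in the executor logs rather than on the driver; they show up in the console here because the job runs in local mode. As an alternative sketch (not from the original post), PySpark 3.0+ can hand each group to a pandas function directly via applyInPandas, skipping the collect_list / rdd.map step; per_group and the output schema below are purely illustrative:

import pandas as pd

def per_group(small_df):
    # small_df is a pandas DataFrame holding all rows of one alpos_id group
    return pd.DataFrame({
        'alpos_id': [small_df['alpos_id'].iloc[0]],
        'avg_ecpm': [small_df['ecpm'].mean()],
        'total_impressions': [small_df['impressions'].sum()],
    })

# 'data' is the Spark DataFrame from ss.createDataFrame(data) above
result = data.groupby('alpos_id').applyInPandas(
    per_group, schema='alpos_id string, avg_ecpm double, total_impressions double')
result.show()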


Summary of the key code

import pandas as pd
import pyspark.sql.functions as fn
from pyspark.sql import SparkSession

# ss, fn and pd are assumed by the functions below
ss = SparkSession.builder.appName('groupby_to_pandas').getOrCreate()


def row_dealwith(data):
    ids = list(data.keys())[0]      # the group id
    values = data.get(ids)          # the aggregated field values for this group
    lens = len(values)              # number of aggregated lists (here 2)
    # print(ids, values[0], values[1])

    # build the id column, one entry per element of the group
    ids_li = []
    for i in range(len(values[0])):
        ids_li.append(ids)

    # turn the horizontal per-group lists into vertical columns
    zdict = {}
    zdict['alpos_id'] = ids_li
    for i in range(lens):
        zdict[i] = values[i]        # columns keyed 0, 1, ... for each aggregated list

    print(zdict)

    da_gb = pd.DataFrame(zdict)
    print(da_gb)


def pyspark_gb(data):
    # data: raw, ungrouped rows (e.g. a pandas DataFrame with alpos_id, impressions, ecpm)
    data = ss.createDataFrame(data)

    da_gb = data.groupby('alpos_id').agg(
        fn.collect_list('impressions').alias('impressions_list'),
        fn.collect_list('ecpm').alias('ecpm_list')
    )

    da_gb.show()

    # map each grouped row to {group id: [impressions_list, ecpm_list]}
    dardds = da_gb.rdd.map(lambda row: {row.alpos_id: [row.impressions_list, row.ecpm_list]})
    # print(type(dardds))           # pyspark.rdd.PipelinedRDD

    # print(dardds.take(5))
    # dardds.foreach(lambda x: print(x))
    dardds.foreach(row_dealwith)
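A usage sketch under the same assumptions (the imports and SparkSession come from the top of the summary code); the raw rows here are made up, reusing the id format from the output above:

# hypothetical driver code: build some raw rows and run the pipeline
raw = pd.DataFrame({
    'alpos_id': ['0_3071297379437968', '0_3031798112278383', '0_3031798112278383'],
    'impressions': [8.0, 4.0, 62.0],
    'ecpm': [73.75, 2.5, 6.9355],
})
pyspark_gb(raw)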