pyspark group之后展开列表，计算列表个数

最新推荐文章于 2022-03-16 15:04:38 发布

NoOne-csdn

最新推荐文章于 2022-03-16 15:04:38 发布

阅读量1.1k

点赞数

分类专栏： pyspark python 分布式

本文链接：https://blog.csdn.net/weixin_40161254/article/details/106524840

版权

python 同时被 3 个专栏收录

85 篇文章 0 订阅

订阅专栏

pyspark

63 篇文章 9 订阅

订阅专栏

分布式

7 篇文章 0 订阅

订阅专栏

背景

+-----------+-----------+
|question_id|user_answer|
+-----------+-----------+
|      30530|    122306,|
|      30548|    122378,|
|      30451|    121990,|
|      30530|    122304,|
|      30548|    122378,|
|      30451|    121990,|
|      30530|    122304,|
|      30548|    122378,|
|      30523|    122279,|
|      30543|    122359,|
|      30523|    122279,|
|      30543|    122359,|
|      30493|     122159|
|      30508|     122217|
|      30523|     122278|
|      30493|     122157|
|      30508|     122219|
|      30450|     121986|
|      30523|     122277|
|      30543|     122359|
+-----------+-----------+

按照question_id groupby 合并，把user_answer 按照逗号分隔，合并成一个列表
最后输出每个question_id的每个user_answer 的个数

需求

字符串分隔
group之后列表展平成一个列表
python 删除列表指定元素
pytho 计算列表所有元素出现的次数
pyspark udf自定义函数

solution

先分隔字符串

df=df.withColumn('answers',F.split("user_answer",',')).drop("user_answer")

+-----------+----------+
|question_id|   answers|
+-----------+----------+
|      30530|[122306, ]|
|      30548|[122378, ]|
|      30451|[121990, ]|
|      30530|[122304, ]|
|      30548|[122378, ]|
|      30451|[121990, ]|
|      30530|[122304, ]|
|      30548|[122378, ]|
|      30523|[122279, ]|
|      30543|[122359, ]|
|      30523|[122279, ]|
|      30543|[122359, ]|
|      30493|  [122159]|
|      30508|  [122217]|

出现的问题是列表元素会有空元素
(最后解决)

按照question_id 分组合并列表

        gp=df.groupby("question_id").agg(F.flatten(F.collect_list('answers')).alias("answers"))

+-----------+--------------------+
|question_id|             answers|
+-----------+--------------------+
|      30548|[122378, , 122378...|
|      30519|            [122261]|
|      30523|[122279, , 122279...|
|      30451|[121990, , 121990...|
|      30509|            [122221]|
|      30530|[122306, , 122304...|
|      30493|[122159, 122157, ...|
|      30543|[122359, , 122359...|
|      30536|            [122331]|
|      30508|[122217, 122219, ...|
|      30450|    [121986, 121987]|
|      30529|            [122303]|
|      30485|  [122126, , 122125]|
+-----------+--------------------+

自定义函数删除列表的空元素，计算个数

  def apply(x):
            print("*"*11)
            while '' in x:
                x.remove('')
            res=collections.Counter(x)
            print(res)
            return json.dumps(res)
        # a=F.udf(lambda x:collections.Counter(x))
        a=F.udf(apply)

        gp = gp.withColumn("choices", a(gp.answers))
        gp.show()

-res

+-----------+--------------------+--------------------+
|question_id|             answers|             choices|
+-----------+--------------------+--------------------+
|      30548|[122378, , 122378...|{"122378": 12, "1...|
|      30519|            [122261]|       {"122261": 1}|
|      30523|[122279, , 122279...|{"122279": 3, "12...|
|      30451|[121990, , 121990...|{"121990": 10, "1...|
|      30509|            [122221]|       {"122221": 1}|
|      30530|[122306, , 122304...|{"122306": 7, "12...|
|      30493|[122159, 122157, ...|{"122159": 1, "12...|
|      30543|[122359, , 122359...|{"122359": 3, "12...|
|      30536|            [122331]|       {"122331": 1}|
|      30508|[122217, 122219, ...|{"122217": 1, "12...|
|      30450|    [121986, 121987]|{"121986": 1, "12...|
|      30529|            [122303]|       {"122303": 1}|
|      30485|  [122126, , 122125]|{"122126": 1, "12...|
+-----------+--------------------+--------------------+

perfect

隔天来更

可以判断user_answer 是否以‘,’结尾有就删除，没有就继续
减少计算量
F.when
udf 自定义函数

def apply(s):
   if s.endswith(","):
  	 return s[:-1]
   return s
a = F.udf(apply)
df=df.withColumn('answers', F.when(df.user_answer.endswith(','), a(df.user_answer)).otherwise(df.user_answer))