背景
+-----------+-----------+
|question_id|user_answer|
+-----------+-----------+
| 30530| 122306,|
| 30548| 122378,|
| 30451| 121990,|
| 30530| 122304,|
| 30548| 122378,|
| 30451| 121990,|
| 30530| 122304,|
| 30548| 122378,|
| 30523| 122279,|
| 30543| 122359,|
| 30523| 122279,|
| 30543| 122359,|
| 30493| 122159|
| 30508| 122217|
| 30523| 122278|
| 30493| 122157|
| 30508| 122219|
| 30450| 121986|
| 30523| 122277|
| 30543| 122359|
+-----------+-----------+
按照question_id groupby 合并,把user_answer 按照逗号分隔,合并成一个列表
最后输出每个question_id的每个user_answer 的个数
需求
- 字符串分隔
- group之后列表展平成一个列表
- python 删除列表指定元素
- pytho 计算列表所有元素出现的次数
- pyspark udf自定义函数
solution
- 先分隔字符串
df=df.withColumn('answers',F.split("user_answer",',')).drop("user_answer")
+-----------+----------+
|question_id| answers|
+-----------+----------+
| 30530|[122306, ]|
| 30548|[122378, ]|
| 30451|[121990, ]|
| 30530|[122304, ]|
| 30548|[122378, ]|
| 30451|[121990, ]|
| 30530|[122304, ]|
| 30548|[122378, ]|
| 30523|[122279, ]|
| 30543|[122359, ]|
| 30523|[122279, ]|
| 30543|[122359, ]|
| 30493| [122159]|
| 30508| [122217]|
出现的问题是 列表元素会有空元素
(最后解决)
- 按照question_id 分组合并列表
gp=df.groupby("question_id").agg(F.flatten(F.collect_list('answers')).alias("answers"))
+-----------+--------------------+
|question_id| answers|
+-----------+--------------------+
| 30548|[122378, , 122378...|
| 30519| [122261]|
| 30523|[122279, , 122279...|
| 30451|[121990, , 121990...|
| 30509| [122221]|
| 30530|[122306, , 122304...|
| 30493|[122159, 122157, ...|
| 30543|[122359, , 122359...|
| 30536| [122331]|
| 30508|[122217, 122219, ...|
| 30450| [121986, 121987]|
| 30529| [122303]|
| 30485| [122126, , 122125]|
+-----------+--------------------+
- 自定义函数删除列表的空元素,计算个数
def apply(x):
print("*"*11)
while '' in x:
x.remove('')
res=collections.Counter(x)
print(res)
return json.dumps(res)
# a=F.udf(lambda x:collections.Counter(x))
a=F.udf(apply)
gp = gp.withColumn("choices", a(gp.answers))
gp.show()
-res
+-----------+--------------------+--------------------+
|question_id| answers| choices|
+-----------+--------------------+--------------------+
| 30548|[122378, , 122378...|{"122378": 12, "1...|
| 30519| [122261]| {"122261": 1}|
| 30523|[122279, , 122279...|{"122279": 3, "12...|
| 30451|[121990, , 121990...|{"121990": 10, "1...|
| 30509| [122221]| {"122221": 1}|
| 30530|[122306, , 122304...|{"122306": 7, "12...|
| 30493|[122159, 122157, ...|{"122159": 1, "12...|
| 30543|[122359, , 122359...|{"122359": 3, "12...|
| 30536| [122331]| {"122331": 1}|
| 30508|[122217, 122219, ...|{"122217": 1, "12...|
| 30450| [121986, 121987]|{"121986": 1, "12...|
| 30529| [122303]| {"122303": 1}|
| 30485| [122126, , 122125]|{"122126": 1, "12...|
+-----------+--------------------+--------------------+
perfect
隔天来更
- 可以判断user_answer 是否以‘,’结尾有就删除,没有就继续
减少计算量 - F.when
- udf 自定义函数
def apply(s):
if s.endswith(","):
return s[:-1]
return s
a = F.udf(apply)
df=df.withColumn('answers', F.when(df.user_answer.endswith(','), a(df.user_answer)).otherwise(df.user_answer))