PySpark conditions: counting values by condition in PySpark

Latest recommended article published on 2023-10-21 22:32:44

阿兹猫

I have a DataFrame, a snippet here:

[['u1', 1], ['u2', 0]]

Each row is basically a string field named f and a second element (is_fav) that is either 1 or 0.

What I need to do is group on the first field and count the occurrences of 1s and 0s. I was hoping to do something like:

num_fav = count((col("is_fav") == 1)).alias("num_fav")

num_nonfav = count((col("is_fav") == 0)).alias("num_nonfav")

df.groupBy("f").agg(num_fav, num_nonfav)

It does not work properly: in both cases I get the same result, which is the total count of items in the group, so the filter (whether it is a 1 or a 0) seems to be ignored. Does this depend on how count works?

Solution

There is no filter here. Both col("is_fav") == 1 and col("is_fav") == 0 are just boolean expressions, and count doesn't care about their value as long as it is defined: it counts every row where the expression is not NULL, whether it evaluates to true or false.

There are many ways to solve this, for example by using a simple sum:
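To make the point about count concrete, here is a minimal plain-Python sketch of the semantics. It is a model, not PySpark code: sql_count is a hypothetical helper standing in for count, None stands in for NULL, and the is_fav values are made up for illustration.

```python
def sql_count(values):
    """Model of SQL COUNT(expr): the number of values that are not NULL (None here)."""
    return sum(1 for v in values if v is not None)


is_fav = [1, 0, 1, 1]  # hypothetical is_fav values for a single group

# Analogue of col("is_fav") == 1: every row yields a defined boolean, True or False.
eq_one = [v == 1 for v in is_fav]
eq_zero = [v == 0 for v in is_fav]

# False is still a defined value, so it is counted just like True.
count_eq_one = sql_count(eq_one)
count_eq_zero = sql_count(eq_zero)
print(count_eq_one, count_eq_zero)  # both equal the group size
```

Under this model, the only way to keep a row out of the count is to turn its value into None, which is what the NULL-based fix relies on.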

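from pyspark.sql.functions import count, sum  # note: this sum shadows Python's built-in sum

gpd = df.groupBy("f")

gpd.agg(
    sum("is_fav").alias("fv"),  # is_fav is 0 or 1, so the sum is the number of 1s
    (count("is_fav") - sum("is_fav")).alias("nfv")  # non-NULL rows minus the 1s
)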

or by making the ignored values undefined (a.k.a. NULL):

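from pyspark.sql.functions import col, count, when

exprs = [
    # when() without otherwise() yields NULL for non-matching rows, and count() skips NULLs
    count(when(col("is_fav") == x, True)).alias(c)
    for (x, c) in [(1, "fv"), (0, "nfv")]
]

gpd.agg(*exprs)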
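The two approaches above should produce the same per-group counts. The following plain-Python sketch models both on a few made-up rows so the arithmetic can be checked without a Spark session. It is an illustration only: sql_count and this when are hypothetical stand-ins for the PySpark functions, and None plays the role of NULL.

```python
from collections import defaultdict

# Hypothetical rows in the (f, is_fav) shape from the question.
rows = [("u1", 1), ("u1", 0), ("u1", 1), ("u2", 0)]

# Analogue of groupBy("f"): collect the is_fav values of each group.
groups = defaultdict(list)
for f, is_fav in rows:
    groups[f].append(is_fav)


def sql_count(values):
    """Model of COUNT(expr): the number of non-NULL (non-None) values."""
    return sum(1 for v in values if v is not None)


def when(condition, value):
    """Model of when(condition, value) with no otherwise(): NULL if the condition fails."""
    return value if condition else None


by_sum = {}
by_when = {}
for f, vals in groups.items():
    fv = sum(vals)  # is_fav is 0 or 1, so the sum is the number of 1s
    by_sum[f] = (fv, sql_count(vals) - fv)
    by_when[f] = (
        sql_count([when(v == 1, True) for v in vals]),
        sql_count([when(v == 0, True) for v in vals]),
    )

print(by_sum)   # {'u1': (2, 1), 'u2': (0, 1)}
print(by_when)  # same (fv, nfv) pairs as by_sum
```

One difference worth keeping in mind: the sum-based version assumes is_fav only ever holds 0 or 1, while the when-based version counts exact matches and so would simply ignore any other value.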