PySpark conditionals: conditional aggregation with PySpark

Consider the following DataFrame:

a      b   c d  e
africa 123 1 10 121.2
africa 123 1 10 321.98
africa 123 2 12 43.92
africa 124 2 12 43.92
usa    121 1 12 825.32
usa    121 1 12 89.78
usa    123 2 10 32.24
usa    123 5 21 43.92
canada 132 2 13 63.21
canada 132 2 13 89.23
canada 132 3 21 85.32
canada 131 3 10 43.92
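
(For reference, a minimal sketch of reproducing this sample DataFrame; it assumes a running SparkSession bound to the conventional name spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above
data = [
    ("africa", 123, 1, 10, 121.2),
    ("africa", 123, 1, 10, 321.98),
    ("africa", 123, 2, 12, 43.92),
    ("africa", 124, 2, 12, 43.92),
    ("usa", 121, 1, 12, 825.32),
    ("usa", 121, 1, 12, 89.78),
    ("usa", 123, 2, 10, 32.24),
    ("usa", 123, 5, 21, 43.92),
    ("canada", 132, 2, 13, 63.21),
    ("canada", 132, 2, 13, 89.23),
    ("canada", 132, 3, 21, 85.32),
    ("canada", 131, 3, 10, 43.92),
]
df = spark.createDataFrame(data, ["a", "b", "c", "d", "e"]))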

Now I want to convert the CASE statement below into the equivalent PySpark DataFrame operations.

We can run this CASE statement directly through HiveContext/SQLContext (a sketch of that route follows the query below), but I'm looking for the traditional PySpark DataFrame query.

select
    case
        when c <= 10 then sum(e)
        when c between 10 and 20 then avg(e)
        else 0.00
    end
from table
group by a, b, c, d
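
As mentioned, the SQL route works as-is once the DataFrame is registered as a view. A minimal sketch, assuming Spark 2.x+ (createOrReplaceTempView; older versions used registerTempTable on SQLContext/HiveContext) and the view name t in place of the reserved word table:

df.createOrReplaceTempView("t")

result = spark.sql("""
    select a, b, c, d,
           case
               when c <= 10 then sum(e)
               when c between 10 and 20 then avg(e)
               else 0.00
           end as result
    from t
    group by a, b, c, d
""")
result.show()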

Regards

Anvesh

Solution

You can translate your SQL code directly into DataFrame primitives:

from pyspark.sql.functions import when, sum, avg, col

(df
    .groupBy("a", "b", "c", "d")                   # group by a, b, c, d
    .agg(                                          # select
        when(col("c") <= 10, sum("e"))             # when c <= 10 then sum(e)
        .when(col("c").between(10, 20), avg("e"))  # when c between 10 and 20 then avg(e)
        .otherwise(0.0)                            # else 0.00
        .alias("result")))                         # name the output column
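
Note that when branches are evaluated in order, so a group with c = 10 (which satisfies both conditions) takes the sum(e) branch, matching SQL CASE semantics. As an alternative sketch, the original CASE expression can also be embedded verbatim via expr(), which keeps the SQL text while staying inside the DataFrame API (the result alias is an illustrative name, not from the original post):

from pyspark.sql.functions import expr

(df
    .groupBy("a", "b", "c", "d")
    .agg(expr(                                     # embed the CASE expression as SQL text
        "case when c <= 10 then sum(e) "
        "when c between 10 and 20 then avg(e) "
        "else 0.00 end as result"))
    .show())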
