java spark groupby: using groupBy with aggregate functions in Spark

Latest recommended article published on 2023-08-16 10:17:45

住不起了

Article tags: java spark groupby

I'm trying to chain several operations into a single statement in PySpark, and I'm not sure whether that is possible in my case.

My intention is to avoid having to save the output as a new DataFrame.

My current code is rather simple:

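from pyspark.sql.functions import col, mean, stddev, udf
from pyspark.sql.types import StringType

# encode_time is the user's own Python function that maps START_TIME to a time period
encodeUDF = udf(encode_time, StringType())

# The outer parentheses let the method chain span several lines
(new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME')))
    .groupBy('timePeriod')
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    )
    .show(20, False))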

What I want is to add count() after groupBy, so that the number of records matching each value of the timePeriod column is shown in the output.

When I try groupBy(..).count().agg(..), I get exceptions.

Is there any way to get both the count() and the agg().show() output without splitting the code into two separate statements, e.g.:

new_log_df.withColumn(..).groupBy(..).count()

new_log_df.withColumn(..).groupBy(..).agg(..).show()

Or, better yet, is there a way to get a merged result in the agg().show() output: an extra column that states the number of records matching each row's timePeriod value, e.g.:

timePeriod | Mean | Stddev | Num Of Records

X | 10 | 20 | 315

Solution

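count() can be used inside agg(), because it is computed over the same groupBy expression as the other aggregates. Calling groupBy(..).count() first does not work: count() already returns a DataFrame that contains only the timePeriod and count columns, so the agg(..) that follows can no longer find DOWNSTREAM_SIZE.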

With Python

import pyspark.sql.functions as func

# encodeUDF is the UDF defined in the question; the outer parentheses
# let the method chain span several lines
(new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"]))
    .groupBy("timePeriod")
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")  # rows per group
    )
    .show(20, False))

With Scala

import org.apache.spark.sql.functions._ // for count()

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)

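count(lit(1)) counts every record in the group, because the literal 1 is never null. For every group with a non-null timePeriod this gives the same result as count("timePeriod"). The two differ only for the group whose timePeriod is null, where count("timePeriod") returns 0 because count on a column skips null values.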

With Java

import static org.apache.spark.sql.functions.*;

// Assumes encodeUDF is a UserDefinedFunction; in Java it is invoked through apply()
new_log_df.cache().withColumn("timePeriod", encodeUDF.apply(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);
