spark中DataFrame存在空值是否计算情况说明

一直有个疑问,出现空值数据列,在计算时,是否会被计算?

1. 空值出现在计算字段中

给定某个DataFrame,其中包含空值:

val data1 = Seq(  
 | ("1", "ming", "hlj", null),  
 | ("2", "tian", "jl", "1"),
 | ("3", "ming", "hg", "1"),
 | ("4", "tian", "bj", "1"),
 | ("5", "ming", "tj", "1")
 | ).toDF("useid", "name", "live", "nums")
 
 data1.createOrReplaceTempView("data")
data1.show()


+-----+----+----+----+
|useid|name|live|nums|
+-----+----+----+----+
|    1|ming| hlj|null|
|    2|tian|  jl|   1|
|    3|ming|  hg|   1|
|    4|tian|  bj|   1|
|    5|ming|  tj|   1|
+-----+----+----+----+

进行实验:

spark.sql(s"select name, avg(nums) as nu from data group by name").show()

+----+---+
|name| nu|
+----+---+
|ming|1.0|
|tian|1.0|
+----+---+


spark.sql(s"select name, sum(nums) as nu from data group by name").show()

+----+---+
|name| nu|
+----+---+
|ming|2.0|
|tian|2.0|
+----+---+

运行上述sql语句发现,只要是涉及到空值列的计算,含空值的都会被排除。

spark.sql(s"select name, count(*) as nu from data group by name").show()

+----+---+
|name| nu|
+----+---+
|ming|  3|
|tian|  2|
+----+---+


spark.sql(s"select name, count(nums) as nu from data group by name").show()

+----+---+
|name| nu|
+----+---+
|ming|  2|
|tian|  2|
+----+---+

在涉及到统计数据行数时,如果是针对nums(含空值属性列)count,则会自动排除含空值行;否则不会排除空值行。

spark.sql(s"select *,size(collect_list(nums) over (partition by name)) as freq from data").show()

+-----+----+----+----+----+
|useid|name|live|nums|freq|
+-----+----+----+----+----+
|    1|ming| hlj|null|   2|
|    3|ming|  hg|   1|   2|
|    5|ming|  tj|   1|   2|
|    2|tian|  jl|   1|   2|
|    4|tian|  bj|   1|   2|
+-----+----+----+----+----+

可以通过partition by 在不改变数据的前提下,排除空值,计算出现频次。

2. 空值出现在统计字段中

val data2 = Seq(  
 | ("1", null, "hlj", null),  
 | ("2", "tian", "jl", "1"),
 | ("3", null, "hg", "1"),
 | ("4", "tian", "bj", "1"),
 | ("5", "ming", "tj", "1")
 | ).toDF("useid", "name", "live", "nums")
 
 data2.createOrReplaceTempView("data2")
 
 data2.show()

data2: org.apache.spark.sql.DataFrame = [useid: string, name: string ... 2 more fields]
+-----+----+----+----+
|useid|name|live|nums|
+-----+----+----+----+
|    1|null| hlj|null|
|    2|tian|  jl|   1|
|    3|null|  hg|   1|
|    4|tian|  bj|   1|
|    5|ming|  tj|   1|
+-----+----+----+----+

进行实验

spark.sql(s"select name, count(nums) as nu from data2 group by name").show()

+----+---+
|name| nu|
+----+---+
|null|  1|
|ming|  1|
|tian|  2|
+----+---+

如果空值出现在统计字段中,如group by partition by 则一定会进行计算。

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值