一直有个疑问,出现空值数据列,在计算时,是否会被计算?
1. 空值出现在计算字段中
给定某个DataFrame,其中包含空值:
val data1 = Seq(
| ("1", "ming", "hlj", null),
| ("2", "tian", "jl", "1"),
| ("3", "ming", "hg", "1"),
| ("4", "tian", "bj", "1"),
| ("5", "ming", "tj", "1")
| ).toDF("useid", "name", "live", "nums")
data1.createOrReplaceTempView("data")
data1.show()
+-----+----+----+----+
|useid|name|live|nums|
+-----+----+----+----+
| 1|ming| hlj|null|
| 2|tian| jl| 1|
| 3|ming| hg| 1|
| 4|tian| bj| 1|
| 5|ming| tj| 1|
+-----+----+----+----+
进行实验:
spark.sql(s"select name, avg(nums) as nu from data group by name").show()
+----+---+
|name| nu|
+----+---+
|ming|1.0|
|tian|1.0|
+----+---+
spark.sql(s"select name, sum(nums) as nu from data group by name").show()
+----+---+
|name| nu|
+----+---+
|ming|2.0|
|tian|2.0|
+----+---+
运行上述sql语句发现,只要是涉及到空值列的计算,含空值的都会被排除。
spark.sql(s"select name, count(*) as nu from data group by name").show()
+----+---+
|name| nu|
+----+---+
|ming| 3|
|tian| 2|
+----+---+
spark.sql(s"select name, count(nums) as nu from data group by name").show()
+----+---+
|name| nu|
+----+---+
|ming| 2|
|tian| 2|
+----+---+
在涉及到统计数据行数时,如果是针对nums(含空值属性列)count,则会自动排除含空值行;否则不会排除空值行。
spark.sql(s"select *,size(collect_list(nums) over (partition by name)) as freq from data").show()
+-----+----+----+----+----+
|useid|name|live|nums|freq|
+-----+----+----+----+----+
| 1|ming| hlj|null| 2|
| 3|ming| hg| 1| 2|
| 5|ming| tj| 1| 2|
| 2|tian| jl| 1| 2|
| 4|tian| bj| 1| 2|
+-----+----+----+----+----+
可以通过partition by 在不改变数据的前提下,排除空值,计算出现频次。
2. 空值出现在统计字段中
val data2 = Seq(
| ("1", null, "hlj", null),
| ("2", "tian", "jl", "1"),
| ("3", null, "hg", "1"),
| ("4", "tian", "bj", "1"),
| ("5", "ming", "tj", "1")
| ).toDF("useid", "name", "live", "nums")
data2.createOrReplaceTempView("data2")
data2.show()
data2: org.apache.spark.sql.DataFrame = [useid: string, name: string ... 2 more fields]
+-----+----+----+----+
|useid|name|live|nums|
+-----+----+----+----+
| 1|null| hlj|null|
| 2|tian| jl| 1|
| 3|null| hg| 1|
| 4|tian| bj| 1|
| 5|ming| tj| 1|
+-----+----+----+----+
进行实验
spark.sql(s"select name, count(nums) as nu from data2 group by name").show()
+----+---+
|name| nu|
+----+---+
|null| 1|
|ming| 1|
|tian| 2|
+----+---+
如果空值出现在统计字段中,如group by partition by 则一定会进行计算。