Computing Row/Column Missing Counts and Rates for a Spark DataFrame (Scala)

Per-column missing count/rate (how many rows are null in each column)

After feature engineering produces a DataFrame of features, how do we compute the missing count and missing rate per row and per column?

import org.apache.spark.sql.functions._  // col, when, lit, ...

val DataDF = Seq(
  ("Ram", null, "MCA", "Bangalore"),
  (null, "25", null, null),
  (null, "26", "BE", null),
  ("Raju", "21", "Btech", "Chennai")
).toDF("name", "age", "degree", "Place")

// column names and total row count
val columns = DataDF.columns
val cnt = DataDF.count()

// null count for each column
val missing_cnt = columns.map(x => DataDF.select(col(x)).where(col(x).isNull).count)

// null rate for each column, kept to 4 decimal places
val missing_rate = columns.map(x => ((DataDF.select(col(x)).where(col(x).isNull).count.toDouble / cnt).formatted("%.4f")))

// zip the column names with the counts/rates into DataFrames
// null-count DataFrame
val result_cnt = sc.parallelize(missing_cnt.zip(columns)).toDF("missing_cnt", "column_name")

// null-rate DataFrame
val result_ratio = sc.parallelize(missing_rate.zip(columns)).toDF("missing_rate", "column_name")

Result:

scala> DataDF.show
+----+----+------+---------+
|name| age|degree|    Place|
+----+----+------+---------+
| Ram|null|   MCA|Bangalore|
|null|  25|  null|     null|
|null|  26|    BE|     null|
|Raju|  21| Btech|  Chennai|
+----+----+------+---------+

scala> result_cnt.show
+-----------+-----------+
|missing_cnt|column_name|
+-----------+-----------+
|          2|       name|
|          1|        age|
|          1|     degree|
|          2|      Place|
+-----------+-----------+

scala> result_ratio.show
+------------+-----------+
|missing_rate|column_name|
+------------+-----------+
|      0.5000|       name|
|      0.2500|        age|
|      0.2500|     degree|
|      0.5000|      Place|
+------------+-----------+
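Note that the per-column loop above launches one Spark job per column. A sketch of a single-pass alternative, assuming the same `DataDF` (all functions used are standard `org.apache.spark.sql.functions`):

```scala
import org.apache.spark.sql.functions.{col, lit, round, sum, when}

// One aggregation over all columns at once: each column contributes a
// sum(when(isNull, 1)) expression, so Spark scans the data only once.
val aggExprs = DataDF.columns.map { c =>
  sum(when(col(c).isNull, 1).otherwise(0)).alias(c)
}
val nullCounts = DataDF.agg(aggExprs.head, aggExprs.tail: _*)

// Divide by the row count to get per-column rates, kept to 4 decimal places.
val total = DataDF.count()
val nullRates = nullCounts.select(
  DataDF.columns.map(c => round(col(c) / lit(total.toDouble), 4).alias(c)): _*
)
```

This yields one wide row (one value per column) instead of the long `column_name`/`missing_cnt` layout, but avoids rescanning the data for every column.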

Per-row missing count/rate (how many columns are null in each row)


val DataDF = Seq(
  ("Ram", null, "MCA", "Bangalore"),
  (null, "25", null, null),
  (null, "26", "BE", null),
  ("Raju", "21", "Btech", "Chennai")
).toDF("name", "age", "degree", "Place")

// add a <column>_null indicator column (1 = null, 0 = not null) for every column
val df2 = DataDF.columns.foldLeft(DataDF)((df, c) => df.withColumn(c + "_null", when(col(c).isNull, 1).otherwise(0)))

df2.createOrReplaceTempView("ShortConsum")
// builds " name_null+age_null+degree_null+Place_null as null_count "
val sql_str_null = DataDF.columns.map(x => x + "_null").mkString(" ", "+", " as null_count ")
val sql_str_full = DataDF.columns.mkString("select ", ",", " , " + sql_str_null + " from ShortConsum")
val RowMissCount = spark.sql(sql_str_full).select("null_count") // a single column, "null_count"

val ColNum = DataDF.columns.size.toDouble

// missing rate for each row
val RowMissCountRatio = RowMissCount.withColumn("ColNum", lit(ColNum)).withColumn("missing_ratio", col("null_count") / col("ColNum")).drop("ColNum") // now two columns: null_count | missing_ratio

Result:

scala> DataDF.show
+----+----+------+---------+
|name| age|degree|    Place|
+----+----+------+---------+
| Ram|null|   MCA|Bangalore|
|null|  25|  null|     null|
|null|  26|    BE|     null|
|Raju|  21| Btech|  Chennai|
+----+----+------+---------+

scala> df2.show
+----+----+------+---------+---------+--------+-----------+----------+
|name| age|degree|    Place|name_null|age_null|degree_null|Place_null|
+----+----+------+---------+---------+--------+-----------+----------+
| Ram|null|   MCA|Bangalore|        0|       1|          0|         0|
|null|  25|  null|     null|        1|       0|          1|         1|
|null|  26|    BE|     null|        1|       0|          0|         1|
|Raju|  21| Btech|  Chennai|        0|       0|          0|         0|
+----+----+------+---------+---------+--------+-----------+----------+

scala> RowMissCountRatio.show
+----------+-------------+
|null_count|missing_ratio|
+----------+-------------+
|         1|         0.25|
|         3|         0.75|
|         2|          0.5|
|         0|          0.0|
+----------+-------------+
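The temp-view/SQL round-trip is not strictly necessary: the same per-row count can be built by summing the indicator expressions directly with the DataFrame API. A sketch under the same assumptions (`DataDF` as above):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// Build one when(isNull, 1, 0) expression per column and sum them with
// Column's + operator into a single null_count expression.
val nullCountCol = DataDF.columns
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)

// Attach the per-row count and divide by the column count for the rate.
val rowMiss = DataDF
  .withColumn("null_count", nullCountCol)
  .withColumn("missing_ratio", col("null_count") / lit(DataDF.columns.size.toDouble))
```

This keeps everything as column expressions, so no intermediate view, SQL string building, or extra `<column>_null` columns are needed.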
