行缺失数/率 统计
进行特征工程得到特征后,如何统计dataframe格式特征行/列的缺失率?
val DataDF = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
//列名
val columns=DataDF.columns
val cnt=DataDF.count()
// 统计每列的缺失记录数
val missing_cnt=columns.map(x=>DataDF.select(col(x)).where(col(x).isNull).count)
// 统计每列的缺失率,并保留4位小数
val missing_rate=columns.map(x=>((DataDF.select(col(x)).where(col(x).isNull).count.toDouble/cnt).formatted("%.4f")))
// 将列名和缺失率拼接起来,组成一个dataframe //缺失数 df
val result_cnt=sc.parallelize(missing_cnt.zip(columns)).toDF("missing_cnt","column_name")
//缺失率 df
val result_ratio=sc.parallelize(missing_rate.zip(columns)).toDF("missing_rate","column_name")
结果:
scala> DataDF.show
+----+----+------+---------+
|name| age|degree| Place|
+----+----+------+---------+
| Ram|null| MCA|Bangalore|
|null| 25| null| null|
|null| 26| BE| null|
|Raju| 21| Btech| Chennai|
+----+----+------+---------+
scala> result_cnt.show
+-----------+-----------+
|missing_cnt|column_name|
+-----------+-----------+
| 2| name|
| 1| age|
| 1| degree|
| 2| Place|
+-----------+-----------+
scala> result_ratio.show
+------------+-----------+
|missing_rate|column_name|
+------------+-----------+
| 0.5000| name|
| 0.2500| age|
| 0.2500| degree|
| 0.5000| Place|
+------------+-----------+
列缺失数/率 统计
val DataDF = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
val df2 = DataDF.columns.foldLeft(DataDF)( (df,c) => df.withColumn(c+"_null", when(col(c).isNull,1).otherwise(0) ) )
df2.createOrReplaceTempView("ShortConsum")
val sql_str_null = DataDF.columns.map( x => x+"_null").mkString(" ","+"," as null_count ")
val sql_str_full = DataDF.columns.mkString( "select ", ",", " , " + sql_str_null + " from ShortConsum")
val RowMissCount = spark.sql(sql_str_full).select("null_count") //只有一列"null_count"
val ColNum = DataDF.columns.size.toDouble
//各行缺失率
val RowMissCountRatio = RowMissCount.withColumn("ColNum",lit(ColNum)).withColumn("missing_ratio",col("null_count")/col("ColNum")).drop("ColNum")//此时只有两列null_count|missing_ratio
结果:
scala> DataDF.show
+----+----+------+---------+
|name| age|degree| Place|
+----+----+------+---------+
| Ram|null| MCA|Bangalore|
|null| 25| null| null|
|null| 26| BE| null|
|Raju| 21| Btech| Chennai|
+----+----+------+---------+
scala> df2.show
+----+----+------+---------+---------+--------+-----------+----------+
|name| age|degree| Place|name_null|age_null|degree_null|Place_null|
+----+----+------+---------+---------+--------+-----------+----------+
| Ram|null| MCA|Bangalore| 0| 1| 0| 0|
|null| 25| null| null| 1| 0| 1| 1|
|null| 26| BE| null| 1| 0| 0| 1|
|Raju| 21| Btech| Chennai| 0| 0| 0| 0|
+----+----+------+---------+---------+--------+-----------+----------+
scala> RowMissCountRatio.show
+----------+-------------+
|null_count|missing_ratio|
+----------+-------------+
| 1| 0.25|
| 3| 0.75|
| 2| 0.5|
| 0| 0.0|
+----------+-------------+