Null representations in Spark DataFrames and converting between null and non-null values


String-typed columns

1. Replacing null with another value

When the DataFrame is created, a missing value is written as:

null

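import spark.implicits._  // needed for toDF; already in scope in spark-shell

val df = Seq("a", null, "c", "b").toDF("col1")
df.show()
// Replace real nulls in col1 with the string "qqq"
val df4 = df.na.fill("qqq", Seq("col1"))
df4.show()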
df: org.apache.spark.sql.DataFrame = [col1: string]
+----+
|col1|
+----+
|   a|
|null|
|   c|
|   b|
+----+

df4: org.apache.spark.sql.DataFrame = [col1: string]
+----+
|col1|
+----+
|   a|
| qqq|
|   c|
|   b|
+----+
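
When several columns need different defaults, na.fill also accepts a map from column name to replacement value. The following is a minimal sketch, assuming a local SparkSession; the DataFrame `people` and its column names `name` and `age` are made up for illustration and do not come from the example above.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("fill-demo").getOrCreate()
import spark.implicits._

// Hypothetical data: one missing value in each column.
val people = Seq[(String, Option[Double])](
  ("a", Some(1.0)),
  (null, Some(2.0)),
  ("c", None)
).toDF("name", "age")

// Each column gets its own replacement; the value type should match the column type.
val filled = people.na.fill(Map("name" -> "unknown", "age" -> 0.0))
filled.show()
```

A string replacement is only applied to string columns and a numeric replacement only to numeric columns, so the value type has to match the column being filled.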

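2. Converting another value to null

Here the replacement written by regexp_replace is the string literal

"null"

which is not a real null. df contains no "NullNone" value, so the call below changes nothing, and the null shown in df2 is the real null carried over from df. That is why na.fill can still replace it afterwards.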

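import org.apache.spark.sql.functions.{col, regexp_replace}
// Replaces the substring "NullNone" with the string "null"; real nulls pass through unchanged
val df2 = df.withColumn("col1", regexp_replace(col("col1"), "NullNone", "null"))
df2.show()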

df2: org.apache.spark.sql.DataFrame = [col1: string]
+----+
|col1|
+----+
|   a|
|null|
|   c|
|   b|
+----+

// na.fill only touches real nulls, here the one carried over from df
val df3 = df2.na.fill("qqq", Seq("col1"))
df3.show()


df3: org.apache.spark.sql.DataFrame = [col1: string]
+----+
|col1|
+----+
|   a|
| qqq|
|   c|
|   b|
+----+
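
Because regexp_replace can only produce strings, turning a placeholder into a real null needs a different expression. One option is when(...).otherwise(...) with lit(null). This is a minimal sketch, not the original author's method: the placeholder "NullNone" is taken from the snippet above, while the DataFrame `raw` and the session setup are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("to-null-demo").getOrCreate()
import spark.implicits._

// Hypothetical input that actually contains the placeholder string.
val raw = Seq("a", "NullNone", "c", "b").toDF("col1")

// Rows equal to the placeholder become a real null; all other rows are kept as they are.
val withNull = raw.withColumn(
  "col1",
  when(col("col1") === "NullNone", lit(null).cast("string")).otherwise(col("col1"))
)

// isNull only matches real nulls, so this counts the converted row.
println(withNull.filter(col("col1").isNull).count())

// na.fill replaces the converted row, which it would not do for the string "null".
val refilled = withNull.na.fill("qqq", Seq("col1"))
refilled.show()
```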

Numeric columns such as Double

Compared with String columns, the feature column has to be cast back to its numeric type at the end.

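    import spark.implicits._
    import org.apache.spark.sql.functions.{col, regexp_replace}

    // Every column starts out as a string
    var data1 = Seq(
      ("0.0", "1002", "1", "1.5", "bai"),
      ("1.0", "2004", "2", "2.1", "wang"),
      ("0.0", "3007", "2", "2.1", "wang"),
      ("0.0", "4004", "3", "3.4", "wa"),
      ("1.0", "5007", "3", "3.4", "wa"),
      ("1.0", "17009", null, "5.9", "wei"),
      ("0.0", "18010", "12", "5.9", "wei")
    ).toDF("label", "AMOUNT", "Pclass", "name", "MAC_id")
    // Shown before the cast, which is why Pclass prints as 1, 2, ... in the table
    data1.show()

    // Cast Pclass to double so that it can be filled with a numeric value
    data1 = data1.withColumn("Pclass", col("Pclass").cast("double"))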
+-----+------+------+----+------+
|label|AMOUNT|Pclass|name|MAC_id|
+-----+------+------+----+------+
|  0.0|  1002|     1| 1.5|   bai|
|  1.0|  2004|     2| 2.1|  wang|
|  0.0|  3007|     2| 2.1|  wang|
|  0.0|  4004|     3| 3.4|    wa|
|  1.0|  5007|     3| 3.4|    wa|
|  1.0| 17009|  null| 5.9|   wei|
|  0.0| 18010|    12| 5.9|   wei|
+-----+------+------+----+------+


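    // Not defined in the original post; inferred from the output to hold only the double column
    val ever_colName_list = Array("Pclass")

    var result_data = data1
    // Fill nulls in the listed numeric columns with the sentinel -100.0
    result_data = result_data.na.fill(-100.0, ever_colName_list)
    result_data.show()
    println(result_data.dtypes.toMap)
    for (cln <- ever_colName_list) {
      // regexp_replace works on the string form of the column and writes the text "null" over the sentinel
      result_data = result_data.withColumn(cln, regexp_replace(col(cln), "^-100\\.0$", "null"))
      // Casting back to double turns the unparseable text "null" into a real null (non-ANSI mode)
      result_data = result_data.withColumn(cln, col(cln).cast("double"))
    }
    result_data.show()
    println(result_data.dtypes.toMap)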

Result:
+-----+------+------+----+------+
|label|AMOUNT|Pclass|name|MAC_id|
+-----+------+------+----+------+
|  0.0|  1002|   1.0| 1.5|   bai|
|  1.0|  2004|   2.0| 2.1|  wang|
|  0.0|  3007|   2.0| 2.1|  wang|
|  0.0|  4004|   3.0| 3.4|    wa|
|  1.0|  5007|   3.0| 3.4|    wa|
|  1.0| 17009|-100.0| 5.9|   wei|
|  0.0| 18010|  12.0| 5.9|   wei|
+-----+------+------+----+------+

Map(name -> StringType, label -> StringType, Pclass -> DoubleType, AMOUNT -> StringType, MAC_id -> StringType)
+-----+------+------+----+------+
|label|AMOUNT|Pclass|name|MAC_id|
+-----+------+------+----+------+
|  0.0|  1002|   1.0| 1.5|   bai|
|  1.0|  2004|   2.0| 2.1|  wang|
|  0.0|  3007|   2.0| 2.1|  wang|
|  0.0|  4004|   3.0| 3.4|    wa|
|  1.0|  5007|   3.0| 3.4|    wa|
|  1.0| 17009|  null| 5.9|   wei|
|  0.0| 18010|  12.0| 5.9|   wei|
+-----+------+------+----+------+

Map(name -> StringType, label -> StringType, Pclass -> DoubleType, AMOUNT -> StringType, MAC_id -> StringType)
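
The round trip through a string can be avoided. The following is a sketch under stated assumptions rather than the original author's method: the sentinel -100.0 and the column name Pclass come from the example above, while the small DataFrame `scores`, the list `numericCols` and the session setup are illustrative. Comparing against the sentinel and substituting a typed null keeps the column as DoubleType throughout, so no final cast is needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.DoubleType

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("numeric-null-demo").getOrCreate()
import spark.implicits._

// Hypothetical data: -100.0 marks a missing Pclass value.
val scores = Seq(("1002", 1.0), ("17009", -100.0), ("18010", 12.0)).toDF("AMOUNT", "Pclass")

// Hypothetical list of numeric columns that use the sentinel.
val numericCols = Seq("Pclass")

// Replace the sentinel with a typed null, column by column, without leaving DoubleType.
val restored = numericCols.foldLeft(scores) { (acc, c) =>
  acc.withColumn(c, when(col(c) === -100.0, lit(null).cast(DoubleType)).otherwise(col(c)))
}

restored.show()
restored.printSchema()
```

Using foldLeft instead of a var plus a for loop is only a stylistic choice; the per-column expression is the part that matters.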



