Removing null and NaN
To remove null and NaN values from a DataFrame, use drop. dataframe.na gives access to the functions for handling null and NaN, and drop deletes every row that contains either:
df.na.drop()
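Besides the no-argument form, na.drop has overloads that control which rows count as "missing". The following is a minimal sketch, not part of the original example: the object name NaDropSketch, the helper sample, and the id/name/score columns are hypothetical, chosen so that both null and NaN appear.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaDropSketch {
  // Hypothetical sample data: Option columns carry nulls, the Double column carries NaN.
  def sample(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq[(Option[Int], Option[String], Double)](
      (Some(1), Some("a"), 1.0),
      (Some(2), None, 2.0),
      (Some(3), Some("c"), Double.NaN),
      (None, None, Double.NaN)
    )).toDF("id", "name", "score")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NaDropSketch")
      .master("local[1]")
      .getOrCreate()
    val df = sample(spark)

    // Default ("any"): drop a row if any column is null or NaN.
    df.na.drop().show()
    // "all": drop a row only if every column is null or NaN.
    df.na.drop("all").show()
    // Restrict the check to the listed columns.
    df.na.drop(Seq("name")).show()

    spark.stop()
  }
}
```

Restricting the check to specific columns is usually the safer choice when only some columns are required downstream, since the default form discards a row for a gap in any column.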
Removing empty strings
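To remove empty strings, use dataframe.where with a condition on the column. Rows where the column is null are dropped as well, because comparing null with '' evaluates to null rather than true: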
df.where("colname <> '' ")
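The same filter can also be written with the Column API instead of a SQL string. The sketch below is an illustration under assumptions: the object name EmptyStringFilterSketch and its helper methods are hypothetical, and the sample rows mirror the ones used in this article. It also shows one way to keep null rows while dropping only empty strings, by testing isNull explicitly.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object EmptyStringFilterSketch {
  // Same rows as the article's example: one null sentence and one empty sentence.
  def sample(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq[(Int, String)](
      (1, "asf"), (2, "2143"), (3, "rfds"), (4, null), (5, "")
    )).toDF("label", "sentence")

  // Column-API equivalent of where("sentence <> ''"); null rows are dropped too.
  def dropEmpty(df: DataFrame): DataFrame =
    df.where(col("sentence") =!= "")

  // Keeps null rows and drops only the empty strings.
  def dropEmptyKeepNull(df: DataFrame): DataFrame =
    df.where(col("sentence").isNull || col("sentence") =!= "")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EmptyStringFilterSketch")
      .master("local[1]")
      .getOrCreate()
    val df = sample(spark)
    dropEmpty(df).show()
    dropEmptyKeepNull(df).show()
    spark.stop()
  }
}
```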
Example code
package com.spark.test.offline.filter

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
  * Created by szh on 2020/5/30.
  */
object NullEmptyFilter {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setAppName("TTyb")
      .setMaster("local")

    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    // Sample data: row 4 has a null sentence, row 5 an empty string.
    val sentenceDataFrame = spark.createDataFrame(Seq(
      (1, "asf"),
      (2, "2143"),
      (3, "rfds"),
      (4, null),
      (5, "")
    )).toDF("label", "sentence")

    // Unfiltered data.
    sentenceDataFrame.show()

    // Drop rows containing null or NaN (removes row 4).
    sentenceDataFrame.na.drop().show()

    // Drop rows with an empty sentence; the null row is filtered out as well.
    sentenceDataFrame.where("sentence <> ''").show()

    spark.stop()
  }
}
Output (the two filtered results; the unfiltered table printed by the first show() is omitted):
+-----+--------+
|label|sentence|
+-----+--------+
|    1|     asf|
|    2|    2143|
|    3|    rfds|
|    5|        |
+-----+--------+

+-----+--------+
|label|sentence|
+-----+--------+
|    1|     asf|
|    2|    2143|
|    3|    rfds|
+-----+--------+