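1. Requirement
We have a table with many duplicate rows (here, rows that share the same id), but each duplicate carries a different operation date. The goal is to deduplicate the table and keep only the row with the most recent operation date for each id.
Analysis: this calls for a window function. Partition the rows by id, order each partition by date in descending order, and keep the first row of each partition.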
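Before turning to Spark, the same idea can be sketched on plain Scala collections: group the rows by key, then keep the row with the greatest date in each group. The `Record` case class and the `latestPerId` helper are hypothetical names used only for this illustration; a missing date is modeled as `None`, which is an assumption about how to represent null here.

```scala
// Hypothetical row type for illustration; a missing date is modeled as None.
final case class Record(id: Int, label: Int, date: Option[String], text: String)

object LatestPerKey {
  // Keep, for each id, the record with the greatest date.
  // The standard Ordering for Option places None before Some,
  // so a dated row always wins over an undated one.
  // ISO-style dates (yyyy-MM-dd) compare correctly as plain strings.
  def latestPerId(rows: Seq[Record]): Seq[Record] =
    rows
      .groupBy(_.id)        // one group per id (the "partition")
      .values
      .map(_.maxBy(_.date)) // newest row in each group
      .toSeq
      .sortBy(_.id)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record(1, 13, Some("2023-04-05"), "aa"),
      Record(2, 23, None, "bb"),
      Record(2, 223, Some("2023-05-08"), "cc"),
      Record(1, 14, Some("2023-04-09"), "ee")
    )
    latestPerId(rows).foreach(println)
  }
}
```

This only mirrors the logic on in-memory data; on a real table the window function does the same work in a distributed way.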
2. Code example
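import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Test")
      .getOrCreate()

    // Sample data: duplicate ids with different (possibly null) operation dates
    val df = spark.createDataFrame(Seq(
      (1, 13, "2023-04-05", "aa"),
      (2, 23, null, "bb"),
      (2, 223, "2023-05-08", "cc"),
      (2, 224, "2023-07-09", "dd"),
      (1, 14, "2023-04-09", "ee"),
      (5, 14, null, "ff"),
      (5, 23, null, "hh")
    )).toDF("id", "label", "date", "text")

    // Rank rows within each id by date, newest first (descending order puts nulls last)
    val windowSpec = Window.partitionBy("id").orderBy(col("date").desc)

    // Keep only the top-ranked row per id, then drop the helper column
    val new_dataframe = df.withColumn("row_number", row_number().over(windowSpec))
      .filter(col("row_number") === 1)
      .drop("row_number")

    new_dataframe.show()
    spark.stop()
  }
}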
3. Output
+---+-----+----------+----+
| id|label|      date|text|
+---+-----+----------+----+
|  1|   14|2023-04-09|  ee|
|  2|  224|2023-07-09|  dd|
|  5|   14|      null|  ff|
+---+-----+----------+----+
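
Note that for id 5 both rows have a null date, so they tie in the ordering. Spark does not guarantee which of the tied rows `row_number` ranks first, so the `ff` row in the output may differ between runs or environments. A minimal sketch of one way to make the result deterministic, assuming `label` is an acceptable tie-breaker (that choice, and the `dedupLatest` helper name, are assumptions and not part of the original requirement):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupLatest {
  // Hypothetical helper: keeps one row per id, preferring the newest date.
  def dedupLatest(df: DataFrame): DataFrame = {
    val windowSpec = Window
      .partitionBy("id")
      // Null dates are sorted last explicitly; label breaks ties (assumed tie-breaker).
      .orderBy(col("date").desc_nulls_last, col("label").desc)

    df.withColumn("row_number", row_number().over(windowSpec))
      .filter(col("row_number") === 1)
      .drop("row_number")
  }
}
```

With the sample data, this variant keeps the row with label 23 (text `hh`) for id 5, because the larger label wins the tie; the rows kept for ids 1 and 2 are unchanged.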