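Method 1: use SQL
SQL statement: select date, phone from (select *, row_number() over (partition by phone order by date) num from tmp_table1) t where t.num = 1
The inner query numbers the rows within each phone group in date order, and the outer query keeps only row 1, so each phone appears once, with its earliest date.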
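To make the semantics of that query concrete, here is a minimal sketch in plain Scala collections (no Spark) that does the same thing on an in-memory list: partition by phone, order each partition by date, keep the first row. The `Record` case class, the `RowNumberSketch` object, and the sample values are all made up for illustration; the original only tells us that the table has `date` and `phone` columns.

```scala
// Hypothetical row type; only the column names `date` and `phone` come from the query.
final case class Record(date: String, phone: String)

object RowNumberSketch {
  // Emulates: row_number() over (partition by phone order by date) ... where num = 1
  // Assumes dates are ISO-formatted strings, so string order equals date order.
  def firstPerPhone(rows: Seq[Record]): Seq[Record] =
    rows
      .groupBy(_.phone)       // partition by phone
      .values
      .map(_.minBy(_.date))   // order by date and keep the row numbered 1
      .toSeq
      .sortBy(_.phone)        // fixed output order, for display only

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record("2017-08-03", "13800000001"),
      Record("2017-08-01", "13800000001"),
      Record("2017-08-02", "13800000002")
    )
    firstPerPhone(rows).foreach(println)
  }
}
```

The point of the sketch is only that the `order by` inside the window decides which duplicate survives; in Spark the same work is distributed across partitions rather than done in one in-memory `groupBy`.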
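An example:
import org.apache.spark.sql.SparkSession

/**
  * Deduplicate on two fields.
  * Created by prince on 2017/8/8.
  */
object SQLDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SQLDemo").master("local").getOrCreate()
    // Read the `student` table from MySQL over JDBC
    // (the MySQL JDBC driver must be on the classpath).
    val input = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test")
      .option("dbtable", "student")
      .option("user", "root")
      .option("password", "123456")
      .load()
    // Register the DataFrame as a temporary view so it can be queried with SQL.
    input.createOrReplaceTempView("temp")
    // Keep one row per (age, sex) pair: number the rows in each group and keep row 1.
    // Note: ordering by age inside a partition that already fixes age leaves every row tied,
    // so which row of each group survives is arbitrary.
    val out = spark.sql("select name, age, sex from (" +
      "select *, row_number() over (partition by age, sex order by age asc) num from temp" +
      ") t " +
      "where t.num = 1")
    out.show()
  }
}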
Database contents: [image]
Output: [image]
Method 2: use the DataFrame's built-in function
val df2 = df1.dropDuplicates(Seq("phone"))
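The two methods can be compared side by side on a small in-memory DataFrame. The following is a sketch under stated assumptions, not the author's code: the sample rows, the `DropDuplicatesSketch` object, and the `earliestPerPhone` helper are hypothetical, and `df1` is assumed to have the `date` and `phone` columns used in the SQL of method 1. As far as I know, `dropDuplicates` makes no promise about which of the duplicate rows it keeps, so when a specific row is wanted (here, the earliest date) the window-function form is the safer choice.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DropDuplicatesSketch {
  // DataFrame API counterpart of method 1:
  // row_number() over (partition by phone order by date), keep num = 1.
  // Hypothetical helper; assumes `df` has `date` and `phone` columns.
  def earliestPerPhone(df: DataFrame): DataFrame = {
    val byPhone = Window.partitionBy("phone").orderBy("date")
    df.withColumn("num", row_number().over(byPhone))
      .where(col("num") === 1)
      .drop("num")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("DropDuplicatesSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for df1.
    val df1 = Seq(
      ("2017-08-03", "13800000001"),
      ("2017-08-01", "13800000001"),
      ("2017-08-02", "13800000002")
    ).toDF("date", "phone")

    // Method 2: one row per phone; which duplicate is kept is not specified.
    val df2 = df1.dropDuplicates(Seq("phone"))
    df2.show()

    // Method 1 via the DataFrame API: one row per phone, always the earliest date.
    earliestPerPhone(df1).show()

    spark.stop()
  }
}
```

Both results contain one row per phone. They differ only in whether the surviving row is chosen deterministically.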