Assume that I am having a DataFrame like :
val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)
I want to remove duplicate rows for column "a" based on the value of column "b". i.e, if there are duplicate rows for column "a", I want to keep the one with larger value for "b". For the above example, after processing, I need only
{"a":3, "b":9, "c":22, "d":12}
and
{"a":1, "b":4, "c":23, "d":12}
Spark DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach, I can do a map().reduceByKey(), but what DataFrame specific operation is there to do this?
Appreciate some help, thanks.
解决方案
You can use window function in sparksql to achieve this.
df.registerTempTable("x")
sqlContext.sql("SELECT a, b,c,d FROM( SELECT *, ROW_NUMBER()OVER(PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect