Checking for duplicate records
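In Spark this check is typically `dfTraining.groupBy("user", "event").count.filter($"count" > 1)`. The same semantics can be sketched on plain Scala collections (hypothetical sample data standing in for `dfTraining`, which is assumed to have `user`, `event`, and `timestamp` columns):

```scala
// Hypothetical sample rows: (user, event, timestamp)
val records = Seq(("u1", "click", 1L), ("u1", "click", 3L), ("u2", "view", 2L))

// Group by the (user, event) key and keep keys that occur more than once,
// mirroring groupBy("user","event").count.filter($"count" > 1)
val dupKeys = records
  .groupBy(r => (r._1, r._2))
  .filter { case (_, rows) => rows.size > 1 }
  .keys
  .toList
// dupKeys == List(("u1", "click"))
```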
Deduplicate, keeping the row with the latest timestamp (window function)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Number rows within each (user, event) partition, newest timestamp first,
// then keep only the first row per partition and drop the helper column
val w = Window.partitionBy($"user", $"event").orderBy($"timestamp".desc)
val dfResult = dfTraining
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
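Spark aside, the window-function dedup amounts to "per (user, event) group, keep the row with the largest timestamp". A plain-Scala sketch with hypothetical sample data:

```scala
// Hypothetical sample rows: (user, event, timestamp)
val events = Seq(("u1", "click", 1L), ("u1", "click", 3L), ("u2", "view", 2L))

val latest = events
  .groupBy(e => (e._1, e._2))   // like Window.partitionBy($"user", $"event")
  .values
  .map(_.maxBy(_._3))           // like ordering by timestamp desc and taking rn == 1
  .toList
  .sortBy(e => (e._1, e._2))    // sorted only to make the result deterministic
// latest == List(("u1", "click", 3L), ("u2", "view", 2L))
```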
Deduplicate, keeping the latest timestamp (Spark's built-in dropDuplicates). Note: dropDuplicates does not specify which duplicate it retains; this pattern relies on the partition-local ordering produced by sortWithinPartitions, a common trick but not something the API contract guarantees.
val dfResult1 = dfTraining
  .repartition($"user", $"event")            // colocate each key in one partition
  .sortWithinPartitions($"timestamp".desc)   // newest row first within each partition
  .dropDuplicates("user", "event")           // keeps the first row seen per key
dfResult1.count
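The idea behind sort-then-dropDuplicates can be sketched on plain Scala collections (hypothetical data): once rows are sorted by timestamp descending within each key, keeping the first occurrence per (user, event) retains the latest record.

```scala
// Hypothetical rows, already sorted newest-first per key: (user, event, timestamp)
val rows = Seq(("u1", "click", 3L), ("u1", "click", 1L), ("u2", "view", 2L))

// Keep the first row seen for each (user, event) key, like dropDuplicates
val deduped = rows.foldLeft(Vector.empty[(String, String, Long)]) { (acc, r) =>
  if (acc.exists(a => (a._1, a._2) == (r._1, r._2))) acc else acc :+ r
}
// deduped == Vector(("u1", "click", 3L), ("u2", "view", 2L))
```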