>>> col_names = ["name", "date", "score"]
>>> value = [
... ("Ali", "20200101", 10.0),
... ("Ali", "20200102", 10.0),
... ("Ali", "20200103", 10.0),
... ("Ali", "20200104", 10.0),
... ("Ali", "20200101", 9.0),
... ("Ali", "20200102", 9.0),
... ]
>>> df = spark.createDataFrame(value, col_names)
>>> df.show()
+----+--------+-----+
|name| date|score|
+----+--------+-----+
| Ali|20200101| 10.0|
| Ali|20200102| 10.0|
| Ali|20200103| 10.0|
| Ali|20200104| 10.0|
| Ali|20200101| 9.0|
| Ali|20200102| 9.0|
+----+--------+-----+
>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as F
>>> window = Window.partitionBy("name", "score").orderBy(df["date"].desc())
>>> df = df.withColumn("topn", F.row_number().over(window))
>>> df.show()
+----+--------+-----+----+
|name|    date|score|topn|
+----+--------+-----+----+
| Ali|20200104| 10.0|   1|
| Ali|20200103| 10.0|   2|
| Ali|20200102| 10.0|   3|
| Ali|20200101| 10.0|   4|
| Ali|20200102|  9.0|   1|
| Ali|20200101|  9.0|   2|
+----+--------+-----+----+
Deduplicating in PySpark: keep the row with the latest date (the most concise way)