Consider the following scenario:
df = spark.createDataFrame(
    [("anhui", 1, '2019-06-15 13:20'),
     ("anhui", 2, '2019-06-17 13:42'),
     ("anhui", 3, '2019-06-15 13:42'),
     ("anhui", 4, '2019-06-6 13:40'),
     ("anhui", 5, '2019-06-14 14:40'),
     ("anhui", 6, "'2019-06-15 13:42'"),
     ("beijing", 1, '2019-06-9 13:42'),
     ("beijing", 2, '2019-06-14 13:42'),
     ("beijing", 3, '2019-06-12 13:42')],
    ["province", "nums", "time"]
)
df.show()
+--------+----+------------------+
|province|nums| time|
+--------+----+------------------+
| anhui| 1| 2019-06-15 13:20|
| anhui| 2| 2019-06-17 13:42|
| anhui| 3| 2019-06-15 13:42|
| anhui| 4| 2019-06-6 13:40|
| anhui| 5| 2019-06-14 14:40|
| anhui| 6|'2019-06-15 13:42'|
| beijing| 1| 2019-06-9 13:42|
| beijing| 2| 2019-06-14 13:42|
| beijing| 3| 2019-06-12 13:42|
+--------+----+------------------+
Suppose we need to shift the "time" column downward within each "province" group, ordered by "nums" ascending. In PySpark this can be done with a window function, as follows:
from pyspark.sql import Window
from pyspark.sql.functions import lag

# Partition by province, order rows by nums within each partition
w = Window.partitionBy("province").orderBy("nums")
for i in range(1, 6):
    # lag(col, offset) returns the value `offset` rows before the current row
    df = df.withColumn("time_offset_" + str(i), lag("time", offset=i).over(w))
The result is:
+--------+----+------------------+----------------+----------------+----------------+----------------+----------------+
|province|nums| time| time_offset_1| time_offset_2| time_offset_3| time_offset_4| time_offset_5|
+--------+----+------------------+----------------+----------------+----------------+----------------+----------------+
| beijing| 1| 2019-06-9 13:42| null| null| null| null| null|
| beijing| 2| 2019-06-14 13:42| 2019-06-9 13:42| null| null| null| null|
| beijing| 3| 2019-06-12 13:42|2019-06-14 13:42| 2019-06-9 13:42| null| null| null|
| anhui| 1| 2019-06-15 13:20| null| null| null| null| null|
| anhui| 2| 2019-06-17 13:42|2019-06-15 13:20| null| null| null| null|
| anhui| 3| 2019-06-15 13:42|2019-06-17 13:42|2019-06-15 13:20| null| null| null|
| anhui| 4| 2019-06-6 13:40|2019-06-15 13:42|2019-06-17 13:42|2019-06-15 13:20| null| null|
| anhui| 5| 2019-06-14 14:40| 2019-06-6 13:40|2019-06-15 13:42|2019-06-17 13:42|2019-06-15 13:20| null|
| anhui| 6|'2019-06-15 13:42'|2019-06-14 14:40| 2019-06-6 13:40|2019-06-15 13:42|2019-06-17 13:42|2019-06-15 13:20|
+--------+----+------------------+----------------+----------------+----------------+----------------+----------------+
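To see why the output looks this way, the per-group shift that `lag` performs can be sketched in plain Python, without Spark. The helper `lag_by_group` below is hypothetical (not part of any library): it sorts each province's rows by "nums", then prepends `offset` None values, exactly as `lag(...).over(w)` does at partition boundaries.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as the DataFrame above
rows = [
    ("anhui", 1, "2019-06-15 13:20"),
    ("anhui", 2, "2019-06-17 13:42"),
    ("anhui", 3, "2019-06-15 13:42"),
    ("anhui", 4, "2019-06-6 13:40"),
    ("anhui", 5, "2019-06-14 14:40"),
    ("anhui", 6, "'2019-06-15 13:42'"),
    ("beijing", 1, "2019-06-9 13:42"),
    ("beijing", 2, "2019-06-14 13:42"),
    ("beijing", 3, "2019-06-12 13:42"),
]

def lag_by_group(rows, offset):
    """Shift 'time' down by `offset` within each province, ordered by nums.

    Returns a dict mapping (province, nums) -> shifted time (None where
    no earlier row exists in the partition, like lag over a window).
    """
    out = {}
    for province, group in groupby(sorted(rows), key=itemgetter(0)):
        g = sorted(group, key=itemgetter(1))       # order by nums
        times = [t for _, _, t in g]
        shifted = [None] * offset + times[:-offset]  # pad the top, drop the tail
        for (p, n, _), s in zip(g, shifted):
            out[(p, n)] = s
    return out

lag1 = lag_by_group(rows, 1)
# The first row of each partition has no predecessor, so it gets None,
# matching the nulls in the time_offset_1 column above.
```

This mirrors the window semantics: the shift never crosses partition boundaries, which is why "beijing" rows only ever see other "beijing" times.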