Just like with database connections, you can use mapPartitions to instantiate a limited number of such objects:
In [1]: from datetime import date
   ...: from astral import Astral
   ...:
   ...: df = spark.createDataFrame(
   ...:     ((date(2019, 10, 4), 0),
   ...:      (date(2019, 10, 4), 19)),
   ...:     schema=("date", "longitude"))
   ...:
   ...:
   ...: def solar_noon(rows):
   ...:     a = Astral()  # instantiate the class once per partition
   ...:     return ((a.solar_noon_utc(date=r.date, longitude=r.longitude), *r)
   ...:             for r in rows)  # reuse the same Astral instance for all rows in this partition
   ...:
   ...:
   ...: (df.rdd
   ...:  .mapPartitions(solar_noon)
   ...:  .toDF(schema=("solar_noon_utc", *df.columns))
   ...:  .show()
   ...: )
   ...:
+-------------------+----------+---------+
|     solar_noon_utc|      date|longitude|
+-------------------+----------+---------+
|2019-10-04 13:48:58|2019-10-04| 0|
|2019-10-04 12:32:58|2019-10-04| 19|
+-------------------+----------+---------+
This is quite efficient: the solar_noon function is shipped to each worker, and the class is instantiated only once per partition, even though a partition may hold many rows.
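The same pattern applies to the database connections mentioned at the start of this section. Below is a minimal sketch of that idea; the SQLite file path, the cities(id, name) table, and the city_id field on the rows are all hypothetical and only serve to illustrate opening one connection per partition:

import sqlite3

def enrich_with_city_name(rows):
    # open one connection per partition, not one per row
    conn = sqlite3.connect("/tmp/lookup.db")  # assumed database path
    try:
        cur = conn.cursor()
        for r in rows:
            # reuse the same connection and cursor for every row in this partition
            cur.execute("SELECT name FROM cities WHERE id = ?", (r.city_id,))
            yield (*r, cur.fetchone()[0])
    finally:
        conn.close()  # close once, after the whole partition has been consumed

# applied the same way as solar_noon above:
# df.rdd.mapPartitions(enrich_with_city_name)

Because mapPartitions hands the function an iterator over the whole partition, the expensive setup and teardown happen once per partition rather than once per row.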