自Spark 1.5以来,您可以使用许多数据处理功能:
import datetime
from pyspark.sql.functions import year, month, dayofmonth
elevDF = sc.parallelize([
(datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])
elevDF.select(
year("date").alias('year'),
month("date").alias('month'),
dayofmonth("date").alias('day')
).show()
# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# +----+-----+---+
与任何其他RDD一样,您可以使用简单的map:elevDF = sqlContext.createDataFrame(sc.parallelize([
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))
(elevDF
.map(lambda (date, hour, value): (date.year, date.month, date.day))
.collect())
结果是:[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
顺便说一下:datetime.datetime无论如何都要存储一个小时,所以单独保存它似乎是在浪费内存。