In video 29 (Taobao data analysis) of the course 《基于pyspark的大数据分析》 (Big Data Analysis with PySpark), the source code is as follows:
# Requirement: group by session_id and count rows, i.e. per-session PV (page views)
session_pv = sqlContext.sql("""
SELECT
session_id, COUNT(1) AS cnt
FROM
tmp_page_views
GROUP BY
session_id
ORDER BY
cnt DESC
LIMIT
10
""").map(lambda output: output.session_id + "\t" + str(output.cnt))
for result in session_pv.collect():
    print result
Running this in my environment raises an error:
AttributeError: 'DataFrame' object has no attribute 'map'
Cause: my environment runs Spark 2.1.1, while the instructor's original example code was written against Spark 1.6.1.
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map().
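So under Spark 2.x the fix is to insert .rdd before the map call. A minimal sketch of the corrected code (assuming the tmp_page_views temp table is already registered, as in the course example):

# Fix for Spark 2.x: DataFrame no longer has .map(), so convert to an RDD first
session_pv = sqlContext.sql("""
SELECT
    session_id, COUNT(1) AS cnt
FROM
    tmp_page_views
GROUP BY
    session_id
ORDER BY
    cnt DESC
LIMIT
    10
""").rdd.map(lambda output: output.session_id + "\t" + str(output.cnt))

for result in session_pv.collect():
    print(result)

The only change is .rdd.map(...) in place of .map(...); the rest of the pipeline (collect and print) works unchanged.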