While using PySpark, I ran into the following error:
Traceback (most recent call last):
File "C:\Users\xxx\xxx\xxx\xxx.py", line 90, in <module>
df.toPandas().to_csv('before_eda.csv')
File "C:\Anaconda\envs\spark\lib\site-packages\pyspark\sql\dataframe.py", line 2142, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
File "C:\Anaconda\envs\spark\lib\site-packages\pyspark\sql\dataframe.py", line 532, in collect
with SCCallSiteSync(self._sc) as css:
File "C:\Anaconda\envs\spark\lib\site-packages\pyspark\traceback_utils.py", line 72, in __enter__
self._context._jsc.setCallSite(self._call_site)
AttributeError: 'NoneType' object has no attribute 'setCallSite'
I searched around quite a bit, but everything I found seems to revolve around SparkSession; see for example these two posts:
AttributeError: 'NoneType' object has no attribute 'setCallSite'
pyspark AttributeError: 'NoneType' object has no attribute 'setCallSite'
Specifically, they suggest adding the following code:
df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
df._sc = spark._sc
where the spark variable is defined like this:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
However, I was using a SparkContext, so that approach didn't work for me. What I ended up using was:
df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
df._sc._jsc = sc._jsc
After that the code ran. The traceback shows that `self._context._jsc` is `None`, and the second line replaces that missing Java context with the live `_jsc` from `sc`.
A minimal example:
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.python.pyspark.shell import spark  # importing the shell module gives us a ready-made SparkSession named `spark`
conf = SparkConf().setAppName("spark_part").setMaster("local[4]")
sc = SparkContext(conf=conf)
rdd = sc.textFile("./xxx.csv")
# some processing
# ...
# GetRows is my parsing helper (definition omitted) that turns a line into a dict of column values
df = rdd.map(lambda line: Row(**GetRows(line))).toDF()  # convert to a DataFrame
# ---------- key lines ----------
df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
df._sc._jsc = sc._jsc
# -------------------------------
df.toPandas().to_csv('test.csv')
sc.stop()