Environment: Mac, PyCharm, Spark 3.0.0
Original post: https://hang-hu.github.io/python/2018/10/11/Python-in-worker-has-different-version-3.5-than-that-in-driver-3.6-PySpark-cannot-run-with-different-minor-versions.html
PySpark error in PyCharm: Python In Worker Has Different Version 3.5 Than That In Driver 3.6, PySpark Cannot Run With Different Minor Versions
17/10/04 15:30:50 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/python/pyspark/worker.py", line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The problem is obvious: the Python version used by Spark's workers does not match the Python version of the driver. This took me a whole afternoon to sort out. If Spark runs fine in the terminal under both the machine's default Python environment and a conda Python environment, yet PyCharm keeps throwing the error above, try adding an environment variable in PyCharm's Run/Debug Configuration:
PYSPARK_PYTHON /usr/bin/python3
If the error persists, try this instead (replace ~ with /Users/your-user-name, i.e. write out the absolute path, since the ~ may not be expanded):
PYSPARK_PYTHON ~/anaconda3/bin/python3
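As an alternative to PyCharm's Run/Debug Configuration, you can pin the interpreter from inside the script itself, before the SparkSession is created. A minimal sketch, which simply points both driver and workers at whatever interpreter is running the script so the two versions can never diverge:

```python
import os
import sys

# Use the interpreter that is executing this script for both sides.
# Must run BEFORE SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("driver and workers will use:", sys.executable)

# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local[*]").getOrCreate()
```

This is handy in PyCharm because it works regardless of which run configuration launches the script.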
The other approaches, editing ~/.bash_profile or spark-env.sh, have been written up everywhere already.
Happy environment configuring!
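For completeness, those approaches boil down to a couple of export lines; the anaconda3 path below is just an example and should match your own interpreter:

```shell
# In ~/.bash_profile (or ~/.zshrc) on the driver machine:
export PYSPARK_PYTHON=/Users/your-user-name/anaconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=/Users/your-user-name/anaconda3/bin/python3

# Or in $SPARK_HOME/conf/spark-env.sh, which applies to every Spark launch:
# export PYSPARK_PYTHON=/Users/your-user-name/anaconda3/bin/python3
```

Note that variables set in ~/.bash_profile are only picked up by PyCharm if it is launched from a shell that sourced the file, which is exactly why the Run/Debug Configuration route above is more reliable.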