The configuration was done step by step, so that problems could be isolated and solved one at a time:
- spark-submit on the cluster: local mode and yarn mode run successfully
- spark-submit off the cluster: local mode and yarn mode run successfully
- interactive python shell on the cluster: local mode and yarn mode run successfully
- interactive python shell off the cluster: local mode and yarn mode run successfully
- local mode runs successfully in the notebook
- yarn mode runs successfully in the notebook
The problems encountered along the way, and their fixes, are listed below.
1. spark-submit --master yarn on the Spark cluster fails (encountered in step 2):
problem: ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state
Cause and fix:
https://blog.csdn.net/clever_wr/article/details/77092754
Result: solved.
2. Running the following in the python3 shell fails (encountered in step 3):
from pyspark.sql import SparkSession
error: No module named pyspark
Cause and fix:
Add the environment variable below on every machine in the cluster (adjust it to your installation, e.g. the py4j-0.10.7-src.zip version):
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
Result: solved; the following now runs correctly in local mode:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("yarnSparkSession").master("local").getOrCreate()
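Since the name of the bundled py4j zip changes between Spark releases, a small helper can build the PYTHONPATH value instead of hard-coding it. This is an illustrative sketch, not part of the setup above; the default zip name is the one from this installation:

```python
import os

def pyspark_pythonpath(spark_home, py4j_zip="py4j-0.10.7-src.zip"):
    """Build the PYTHONPATH entries PySpark needs: the python/ directory
    plus the bundled py4j source zip (whose name varies by Spark release)."""
    return os.pathsep.join([
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", py4j_zip),
    ])

# The value to export (prepend it to any existing PYTHONPATH):
print(pyspark_pythonpath("/usr/local/spark-2.3.1-bin-hadoop2.7"))
```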
3. Off-cluster spark-submit to the cluster's yarn fails (encountered in step 2):
Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark
Cause and fix:
- Download the yarn (or Hadoop) configuration files from the cluster to the local machine;
- then set YARN_CONF_DIR or HADOOP_CONF_DIR in spark/conf/spark-env.sh;
- resubmit.
Result: solved.
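The precondition Spark enforces here can be sketched in a few lines; this is an illustrative helper for verifying the environment before submitting, not Spark's actual code:

```python
import os

def check_yarn_conf(env=os.environ):
    """Mirror Spark's check: master 'yarn' needs HADOOP_CONF_DIR or
    YARN_CONF_DIR set so the client can locate the cluster's config."""
    if not (env.get("HADOOP_CONF_DIR") or env.get("YARN_CONF_DIR")):
        raise RuntimeError(
            "When running with master 'yarn' either HADOOP_CONF_DIR "
            "or YARN_CONF_DIR must be set in the environment."
        )

# Passes when either variable is present:
check_yarn_conf({"HADOOP_CONF_DIR": "/opt/hadoop/etc/hadoop"})
```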
4. python3 shell, master("local") fails (encountered in step 3):
ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Cause and fix:
The local default python is 2.7; repoint python at python3:
ln -s /usr/local/bin/python3.5 /usr/bin/python
Result: solved.
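The check that produces this error compares only the major.minor version of the driver and worker interpreters; a small sketch of that comparison (illustrative, modeled on the check in pyspark's worker.py, not the actual code):

```python
import sys

def version_tag(info=sys.version_info):
    """major.minor tag, the granularity at which PySpark
    compares the driver and worker interpreters."""
    return "%d.%d" % (info[0], info[1])

# A 2.7 worker against a 3.5 driver fails this comparison:
assert version_tag((2, 7, 0)) != version_tag((3, 5, 2))
```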
5. python3 shell, master("yarn") fails (encountered in step 3):
ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 176, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Cause and fix:
A python3 environment needs to be configured uniformly across the cluster; see:
http://www.cnblogs.com/nucdy/p/8569606.html
Result: runs successfully.
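To confirm which interpreter the executors actually pick up after the cluster-wide change, a mapped function can report each worker's version. The commented spark.sparkContext call assumes an active SparkSession named spark, as in the examples above:

```python
import sys

def worker_python_version(_):
    """Return the major.minor of whatever interpreter runs this task."""
    return "%d.%d" % sys.version_info[:2]

# On a live yarn session, sample the executors' interpreters, e.g.:
# versions = (spark.sparkContext.parallelize(range(8))
#                  .map(worker_python_version).distinct().collect())
# Every executor should report the same python3 version as the driver.
```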
6. When starting yarn: WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException (encountered in step 3)
Cause and fix:
See: https://blog.csdn.net/zyz0225/article/details/81515384
Not acted on.
Result: no impact so far.
7. Running spark code from jupyter notebook (submitted to yarn) fails:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Cause and fix:
The local jupyter notebook runs python 3.5, while the cluster servers have both python 2.7 and 3.5 installed, defaulting to 2.7.
For various reasons the cluster servers cannot be modified, so the fix has to be made locally.
Fix (add at the very top of the code):
import os
os.environ["PYSPARK_PYTHON"] = "python3"
Result: runs successfully!
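A slightly fuller version of the same fix also pins the driver side. PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are both standard PySpark environment variables; pinning the driver to python3 here is an assumption matching the notebook setup above:

```python
import os

# Must run before the SparkSession is created: executors launch the
# interpreter named in PYSPARK_PYTHON, the driver uses PYSPARK_DRIVER_PYTHON.
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"  # assumed: the notebook itself runs python3
```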