Trying to run a PySpark program on Oozie:
First, configure yarn-env.sh to fix the problem of the pyspark library (and related modules) not being found:
export SPARK_HOME=/usr/share/spark
$ hdfs dfs -copyFromLocal py4j.zip /user/oozie/share/lib/spark
$ hdfs dfs -copyFromLocal pyspark.zip /user/oozie/share/lib/spark
[Problem not solved.]
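One companion change that sometimes matters here (a hedged suggestion, not something these notes confirm) is making the NodeManagers' Python able to import pyspark and py4j by also exporting PYTHONPATH in yarn-env.sh; <version> stands for whatever py4j zip ships under $SPARK_HOME/python/lib:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$PYTHONPATH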
For now, let's first get a standalone spark-submit run working, then come back to running through Oozie.
Running with spark-submit alone, with no extra arguments, succeeds.
With --master yarn-cluster
it fails, and the YARN web UI on port 8088 shows this error:
Application application_1486993422162_0016 failed 2 times due to AM Container for appattempt_1486993422162_0016_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://bigdata-master:8088/cluster/app/application_1486993422162_0016Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://bigdata/user/hadoop/.sparkStaging/application_1486993422162_0016/spark1.py
java.io.FileNotFoundException: File does not exist: hdfs://bigdata/user/hadoop/.sparkStaging/application_1486993422162_0016/spark1.py
[Attempt 1] In the .py file, comment out the line
# conf = conf.setMaster("local[*]")
so that Spark picks the master on its own (presumably the master hardcoded in the code was conflicting with the yarn-cluster submission).
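For context, a minimal sketch of what the driver setup looks like after this change (the app name is illustrative; the original script is not shown in these notes):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark1")
# conf = conf.setMaster("local[*]")  # commented out: let spark-submit's --master decide
sc = SparkContext(conf=conf)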
Run the command again:
spark-submit --master yarn-cluster pythonApp/lib/spark1.py
[Success: 8088 no longer reports the error!]
[Failure elsewhere, though: with local[*] removed, a standalone spark-submit now dies with 17/02/15 16:18:11 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.]
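One way to keep both invocations working (a sketch, assuming no standalone master is also being injected via spark-defaults.conf, which the error above hints may actually be happening here) is to fall back to local[*] only when spark-submit did not supply a master:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark1")
if not conf.contains("spark.master"):  # nothing set via --master or defaults
    conf = conf.setMaster("local[*]")
sc = SparkContext(conf=conf)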
[Attempt 2 (not tried)] Building on Attempt 1:
add the file's path via sc (note: addFile is a method of SparkContext, not SparkConf)
sc.addFile("hdfs://<filepath_on_hdfs>/optimize-spark.py")
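A hedged sketch of what Attempt 2 would look like in full (the HDFS path is the placeholder from above; for Python code that must be importable on the executors, addPyFile is the variant that also extends their PYTHONPATH):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark1")
sc = SparkContext(conf=conf)
# distribute the file to every node running the job
sc.addFile("hdfs://<filepath_on_hdfs>/optimize-spark.py")
# or, to make it importable on the executors:
# sc.addPyFile("hdfs://<filepath_on_hdfs>/optimize-spark.py")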
[Back on Oozie, it still errors out: py4j.zip and pyspark.zip cannot be found.]
[Attempt 1]
Change the job's properties:
set master from local[*] to yarn-cluster
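A sketch of the change, assuming the common Oozie layout where job.properties defines a master variable that the workflow's <master> element references:

# job.properties (before: master=local[*])
master=yarn-cluster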