jupyter-notebook 以yarn模式运行的出现的问题及解决方法
之前用pyspark虚拟机只跑了单机程序,现在想试试分布式运算。
在做之前找了书和博客来看,总是有各种各样的问题,无法成功。现在特记录一下过程:
这里一共有两个虚拟机,一个做master,一个做slave1
- 虚拟机slave1安装spark
slave1之前已经安装了hadoop,并且可以成功进行Hadoop集群运算。这里就不多说了。
将master的spark安装包复制到slave1,
(1)进入到spark/conf文件夹中,将slaves.template复制成slaves,在里面添加slave1
(2)增加路径到/etc/profile
master与slave1都要做(1),(2)的步骤
-
slave1安装anaconda
可以用scp直接将master的anaconda复制过来,接下来修改/etc/profile就可。上面的图已经显示了修改的内容 -
启动,这时候遇到了好多问题
在master终端输入start-all.sh,使用jps查看,master和slave1都能正常启动
在master终端输入
HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=yarn-client pyspark
看资料说,如果没有在spark.env.sh中配置HADOOP_CONF_DIR,需要像上面代码在终端写出。这时候,jupyter-notebook可以成功启动,但是我在其中写入sc.master看它是何种模式运行时,却给我报了好多错误
[root@master home]#HADOOP_CONF_IR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
[I 18:58:24.475 NotebookApp]
[nb_conda_kernels] enabled, 2 kernels found
[I 18:58:25.101 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 18:58:25.101 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 18:58:25.163 NotebookApp]
[nb_anacondacloud] enabled
[I 18:58:25.167 NotebookApp] [nb_conda] enabled
[I 18:58:25.167 NotebookApp] Serving
notebooks from local directory: /home
[I 18:58:25.167 NotebookApp] 0 active
kernels
[I 18:58:25.168 NotebookApp] The Jupyter
Notebook is running at: http://localhost:8888/
[I 18:58:25.168 NotebookApp] Use Control-C
to stop this server and shut down all kernels (twice to skip confirmation).
[I 18:58:33.844 NotebookApp] Kernel
started: c15aabde-b441-45f2-b78d-9933e6534c27
Exception in thread "main"
java.lang.Exception: When running with master 'yarn-client' either
HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at
org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
at
org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
at
org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[IPKernelApp] WARNING | Unknown error in
handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
[I 19:00:33.829 NotebookApp] Saving file at
/Untitled2.ipynb
[I 19:00:57.754 NotebookApp] Creating new
notebook in
[I 19:00:59.174 NotebookApp] Kernel
started: ebfbdfd5-2343-4149-9fef-28877967d6c6
Exception in thread "main"
java.lang.Exception: When running with master 'yarn-client' either
HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at
org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
at
org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
at
org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[IPKernelApp] WARNING | Unknown error in
handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
[I 19:01:12.315 NotebookApp] Saving file at
/Untitled3.ipynb
^C[I 19:01:15.971 NotebookApp] interrupted
Serving notebooks from local directory:
/home
2 active kernels
The Jupyter Notebook is running at:
http://localhost:8888/
Shutdown this notebook server (y/[n])? y
[C 19:01:17.674 NotebookApp] Shutdown
confirmed
[I 19:01:17.675 NotebookApp] Shutting down
kernels
[I 19:01:18.189 NotebookApp] Kernel
shutdown: ebfbdfd5-2343-4149-9fef-28877967d6c6
[I 19:01:18.190 NotebookApp] Kernel
shutdown: c15aabde-b441-45f2-b78d-9933e6534c27
通过日志显示:
Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
于是配置spark.env.sh
再次运行:
[root@master conf]#
HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop pyspark --master yarn --deploy-mode
client
[TerminalIPythonApp] WARNING | Subcommand
`ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely
want to use `jupyter notebook` in the future
[I 19:15:28.816 NotebookApp]
[nb_conda_kernels] enabled, 2 kernels found
[I 19:15:28.923 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 19:15:28.923 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 19:15:28.986 NotebookApp]
[nb_anacondacloud] enabled
[I 19:15:28.989 NotebookApp] [nb_conda]
enabled
[I 19:15:28.990 NotebookApp] Serving
notebooks from local directory: /hadoop/spark/conf
[I 19:15:28.990 NotebookApp] 0 active
kernels
[I 19:15:28.990 NotebookApp] The Jupyter
Notebook is running at: http://localhost:8888/
[I 19:15:28.990 NotebookApp] Use Control-C
to stop this server and shut down all kernels (twice to skip confirmation).
[I 19:15:44.862 NotebookApp] Creating new
notebook in
[I 19:15:45.742 NotebookApp] Kernel
started: 98d8605a-804a-47af-83fb-2efc8b5a3d60
Setting default log level to
"WARN".
To adjust logging level use
sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/11/20 19:15:48 WARN
util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
18/11/20 19:15:51 WARN yarn.Client: Neither
spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading
libraries under SPARK_HOME.
[W 19:15:55.943 NotebookApp] Timeout
waiting for kernel_info reply from 98d8605a-804a-47af-83fb-2efc8b5a3d60
18/11/20 19:16:11 ERROR spark.SparkContext:
Error initializing SparkContext.
org.apache.spark.SparkException: Yarn
application has already ended! It might have been killed or unable to launch
application master.
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at
org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at
py4j.Gateway.invoke(Gateway.java:236)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at
py4j.GatewayConnection.run(GatewayConnection.java:214)
at
java.lang.Thread.run(Thread.java:748)
18/11/20 19:16:11 ERROR
client.TransportClient: Failed to send RPC 7790789781121901013 to
/192.168.127.131:55928: java.nio.channels.ClosedChannelException
java.