Installing Spark on YARN on CentOS

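1. Overview

First, a point of emphasis: this post covers installing Spark on YARN, not Spark standalone.
For Spark on YARN it is enough to install and configure Spark on any one node that can act as a Hadoop client. Spark runs inside YARN, so all you need is a client-like installation that submits Spark's dependencies and the user's job to YARN.
Many Spark-on-YARN guides online have you install Spark on several nodes in a master-slave layout first and only then submit to YARN. None of that is necessary. The steps below walk through what a Spark-on-YARN installation actually requires.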

2. Installation

1. Download the matching version

Find the version you want in the Apache release archive and download it:

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.6.tgz

tar -xzf spark-2.3.0-bin-hadoop2.6.tgz  -C /usr/local/
cd /usr/local/
mv spark-2.3.0-bin-hadoop2.6/  spark

2. Configure Spark

1. The current environment

Before Spark is installed, the relevant environment settings are:

cat /etc/profile

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

In other words, only the Hadoop settings are present.

2. Add the Spark settings

Add the Spark variables to /etc/profile:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

Apply the change immediately:

source /etc/profile

3. Configure spark-env.sh

From the Spark directory (/usr/local/spark), copy the template:

cp conf/spark-env.sh.template conf/spark-env.sh

Then add the following to it:

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

If you want to use your own Scala installation you can also set SCALA_HOME. If you leave it unset, Spark uses the Scala libraries it ships with.

export SCALA_HOME=/usr/share/scala
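
Since Spark finds YARN through HADOOP_CONF_DIR, it may be worth checking that the directory really holds the client configuration before going further. A minimal sketch, assuming the usual core-site.xml and yarn-site.xml file names; the function name check_hadoop_conf is made up for this post:

```shell
# Hypothetical helper: succeeds only if the given directory looks like a
# Hadoop client configuration directory (contains core-site.xml and yarn-site.xml).
check_hadoop_conf() {
  # Use the argument if given, else HADOOP_CONF_DIR, else the path used in this post.
  local dir="${1:-${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}}"
  local f
  for f in core-site.xml yarn-site.xml; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing ${dir}/${f}" >&2
      return 1
    fi
  done
  echo "ok: ${dir}"
}
```

Running check_hadoop_conf with no arguments after sourcing spark-env.sh checks whatever HADOOP_CONF_DIR currently points at.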

3. Test with spark-shell

After this simple setup, Spark on YARN is installed. Note that the machine must have a working Hadoop client: Spark reads the client configuration and uses it to submit jobs to YARN.
Let's test it with spark-shell:



[root@dev-03 spark]# spark-shell --master yarn --deploy-mode client

2020-08-10 15:06:45 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-08-10 15:06:51 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2020-08-10 15:06:51 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2020-08-10 15:06:52 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2020-08-10 15:07:07 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
2020-08-10 15:07:07 ERROR TransportClient:233 - Failed to send RPC 6571362941741698630 to /10.76.5.198:26773: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2020-08-10 15:07:07 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 6571362941741698630 to /10.76.5.198:26773: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
	at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2020-08-10 15:07:07 ERROR Utils:91 - Uncaught exception in thread Yarn application state monitor

...
...


java.lang.IllegalStateException: Spark context stopped while waiting for backend
  at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:669)
  at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:177)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:558)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
  ... 55 elided
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

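As you can see, spark-shell reports errors on startup, and the messages are vague: all they say is that a channel was closed.
How this was tracked down is described in detail later, to keep things readable; the fix comes first.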

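4. Fixing the problem

The problem is caused by YARN settings: YARN's virtual memory check kills the container the Spark job runs in (the troubleshooting section at the end shows the evidence).

On the Hadoop master machine, stop YARN: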

cd /usr/local/hadoop/sbin/
./stop-yarn.sh

Edit /usr/local/hadoop/etc/hadoop/yarn-site.xml and add:

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

Note that this change has to be made on every node of the Hadoop cluster.
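
One way to push the edited file to the other nodes is a small scp loop. This is only a sketch: the function name sync_yarn_site and the SCP override variable are made up here, and it assumes passwordless SSH as root and the same Hadoop path on every node.

```shell
# Hypothetical helper: copy the local yarn-site.xml to the same path on each host.
# SCP can be overridden (e.g. SCP=echo) to preview the commands without copying.
sync_yarn_site() {
  local conf="${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}/yarn-site.xml"
  local host
  for host in "$@"; do
    "${SCP:-scp}" "${conf}" "root@${host}:${conf}"
  done
}
```

For the hosts that appear in the logs of this post, the call would be sync_yarn_site dev-02.com dev-03.com; substitute your own node names.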

Then start YARN again:

cd /usr/local/hadoop/sbin/

./start-yarn.sh

5. Run spark-shell again


[root@dev-03 spark]#  spark-shell --master yarn --deploy-mode client
2020-08-10 17:29:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-08-10 17:29:16 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2020-08-10 17:29:16 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2020-08-10 17:29:16 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
2020-08-10 17:29:17 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://dev-03.com:4043
Spark context available as 'sc' (master = yarn, app id = application_1597051689954_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

It works.
The corresponding Application Master log is:


2020-08-10 17:29:30 INFO  YarnAllocator:54 - Launching container container_1597051689954_0001_01_000002 on host dev-02.com for executor with ID 1
2020-08-10 17:29:30 INFO  YarnAllocator:54 - Received 1 containers from YARN, launching executors on 1 of them.
...
2020-08-10 17:29:30 INFO  YarnAllocator:54 - Launching container container_1597051689954_0001_01_000003 on host dev-03.com for executor with ID 2
2020-08-10 17:29:30 INFO  YarnAllocator:54 - Received 1 containers from YARN, launching executors on 1 of them.

In other words, the job occupies three YARN containers: one runs the Application Master and the other two run executors.

6. Submit one of Spark's bundled jobs
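
A bundled job that is commonly used for this kind of smoke test is SparkPi. The sketch below assumes the examples jar that ships in the spark-2.3.0-bin-hadoop2.6 package (examples/jars/spark-examples_2.11-2.3.0.jar); the function name submit_spark_pi and the SPARK_SUBMIT override are hypothetical conveniences, not part of Spark.

```shell
# Hypothetical wrapper: submit the bundled SparkPi example to YARN in client mode.
# $1 is the number of partitions SparkPi should use (defaults to 10).
# SPARK_SUBMIT can be overridden (e.g. SPARK_SUBMIT=echo) to print the arguments
# instead of submitting.
submit_spark_pi() {
  local slices="${1:-10}"
  local spark_home="${SPARK_HOME:-/usr/local/spark}"
  "${SPARK_SUBMIT:-spark-submit}" \
    --master yarn \
    --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    "${spark_home}/examples/jars/spark-examples_2.11-2.3.0.jar" \
    "${slices}"
}
```

Calling submit_spark_pi 100 submits the job; in client mode the driver runs locally, so the "Pi is roughly ..." line should show up in the console output.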

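3. Summary

Installing Spark for Spark on YARN is very simple and requires no Spark cluster. Setting SCALA_HOME is not required either, because Spark ships with the Scala libraries it needs: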


[root@dev-03 spark]# ll /usr/local/spark/jars/ |grep  scala

-rw-r--r-- 1 1311767953 1876110778   515645 Feb 23  2018 jackson-module-scala_2.11-2.6.7.1.jar
-rw-r--r-- 1 1311767953 1876110778 15487351 Feb 23  2018 scala-compiler-2.11.8.jar
-rw-r--r-- 1 1311767953 1876110778  5744974 Feb 23  2018 scala-library-2.11.8.jar
-rw-r--r-- 1 1311767953 1876110778   802818 Feb 23  2018 scalap-2.11.8.jar
-rw-r--r-- 1 1311767953 1876110778   423753 Feb 23  2018 scala-parser-combinators_2.11-1.0.4.jar
-rw-r--r-- 1 1311767953 1876110778  4573750 Feb 23  2018 scala-reflect-2.11.8.jar
-rw-r--r-- 1 1311767953 1876110778   671138 Feb 23  2018 scala-xml_2.11-1.0.5.jar

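4. Troubleshooting in detail

Tracking down the error above took quite a while, mainly because at first I only looked at the console output, which contains nothing useful.
Only later did it occur to me to look at the NodeManager logs,
which can be viewed in the YARN web UI: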


http://dev-01.com:8088/cluster/apps

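The application just submitted is listed there. It has already failed and finished, so the only way in is the History link in the Tracking UI column on the far right.
That shows the Application Master log for the application, which contains nothing abnormal either. The error must therefore have happened earlier, and the remaining option is to go through all the logs of that NodeManager.
Click the link in the Node column to open the node's information page, then open the Tools menu on the left and choose Local logs.
This lists every log on that NodeManager. Open yarn-root-nodemanager-bj3-stag-search-03.com.log.
There are a lot of entries, so filter for WARN level: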


2020-08-10 16:58:04,840 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1597047333349_0006_02_000001 has processes older than 1 iteration running over the configured limit. Limit=2254857728, current usage = 2542718976
2020-08-10 16:58:04,840 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=32553,containerID=container_1597047333349_0006_02_000001] is running beyond virtual memory limits. Current usage: 334.6 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.

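This says the container was given 1 GB of physical memory and used 334.6 MB, and was given 2.1 GB of virtual memory but used 2.4 GB, so YARN killed it.
In other words, the failure happened right at the container allocated for the Application Master.
By default YARN allows 2.1 GB of virtual memory for every 1 GB of physical memory allocated to a container (controlled by yarn.nodemanager.vmem-pmem-ratio).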
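
The numbers in the WARN lines can be checked with a little integer arithmetic. This is a rough sketch rather than YARN's actual code: the helper names are made up, and the ratio is passed as tenths (21 for 2.1) because shell arithmetic has no floating point.

```shell
# Hypothetical helper: virtual memory limit in MB for a container.
# $1 = physical memory of the container in MB
# $2 = yarn.nodemanager.vmem-pmem-ratio multiplied by 10 (default 21, i.e. 2.1)
vmem_limit_mb() {
  local pmem_mb="$1"
  local ratio_x10="${2:-21}"
  echo $(( pmem_mb * ratio_x10 / 10 ))
}

# Hypothetical helper: convert a byte count from the log into whole MB (rounded down).
bytes_to_mb() {
  echo $(( $1 / 1024 / 1024 ))
}
```

vmem_limit_mb 1024 gives 2150, which agrees with bytes_to_mb 2254857728 for the Limit value in the log, while the reported usage, bytes_to_mb 2542718976, is 2424. The container was therefore roughly 274 MB over its virtual memory limit. With the ratio raised to 4 (vmem_limit_mb 1024 40) the limit becomes 4096 MB.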

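1. Option 1: change the YARN configuration

The configuration used here disables the virtual memory check and also raises the virtual-to-physical memory ratio. Either change alone should be enough to stop the kill: with the check disabled the ratio is no longer enforced, and a ratio of 4 already allows the 2.4 GB seen above.

Edit /usr/local/hadoop/etc/hadoop/yarn-site.xml and add: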

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

After the change, the NodeManager log looks like this:


2020-08-10 17:29:26,892 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1597051689954_0001_01_000001
2020-08-10 17:29:26,914 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2091 for container-id container_1597051689954_0001_01_000001: 66.6 MB of 1 GB physical memory used; 2.2 GB of 4 GB virtual memory used
2020-08-10 17:29:29,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2091 for container-id container_1597051689954_0001_01_000001: 330.6 MB of 1 GB physical memory used; 2.4 GB of 4 GB virtual memory used

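2. Option 2: change the physical memory of the Application Master and the executors

In practice the virtual memory a container needs at startup is not strictly proportional to its physical memory; it is simply somewhat larger. Since yarn.nodemanager.vmem-pmem-ratio defaults to 2.1, we can in theory raise the virtual memory threshold by requesting more physical memory for the Application Master and the executors, and the job will then run.
Leave yarn-site.xml without the Option 1 changes and only change how the job is submitted: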


 spark-shell --master yarn --deploy-mode client \
 --conf spark.yarn.am.memory=1500M \
 --executor-memory 1500M   \
 --num-executors 1    

The log of the Application Master's container, viewed through YARN (on bj3-stag-search-03.com:8042), is:


2020-08-11 09:48:54 INFO  RMProxy:98 - Connecting to ResourceManager at dev-01.com/10.76.0.98:8030
2020-08-11 09:48:54 INFO  YarnRMClient:54 - Registering the ApplicationMaster
2020-08-11 09:48:54 INFO  YarnAllocator:54 - Will request 1 executor container(s), each with 1 core(s) and 1884 MB memory (including 384 MB of overhead)
2020-08-11 09:48:54 INFO  YarnAllocator:54 - Submitted 1 unlocalized container requests.
2020-08-11 09:48:54 INFO  ApplicationMaster:54 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
2020-08-11 09:48:55 INFO  AMRMClientImpl:361 - Received new token for : dev-03.com:16935
2020-08-11 09:48:55 INFO  YarnAllocator:54 - Launching container container_1597065725323_0003_01_000002 on host dev-03.com for executor with ID 1
2020-08-11 09:48:55 INFO  YarnAllocator:54 - Received 1 containers from YARN, launching executors on 1 of them.
2020-08-11 09:48:55 INFO  ContainerManagementProtocolProxy:81 - yarn.client.max-cached-nodemanagers-proxies : 0
2020-08-11 09:48:55 INFO  ContainerManagementProtocolProxy:260 - Opening proxy : dev-03.com:16935

A container, container_1597065725323_0003_01_000002, was launched on dev-03.com.
So this job started two containers in total: one runs the Application Master on bj3-stag-search-03.com (container_1597065725323_0003_01_000001),
and the other runs the executor on dev-03.com (container_1597065725323_0003_01_000002).

The NodeManager log on the Application Master's node shows how the Application Master's container was sized:



2020-08-11 09:48:51,792 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1597065725323_0003_01_000001 transitioned from LOCALIZING to LOCALIZED
2020-08-11 09:48:51,814 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1597065725323_0003_01_000001 transitioned from LOCALIZED to RUNNING
2020-08-11 09:48:51,832 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /usr/local/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1597065725323_0003/container_1597065725323_0003_01_000001/default_container_executor.sh]
2020-08-11 09:48:53,448 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1597065725323_0003_01_000001
2020-08-11 09:48:53,457 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19654 for container-id container_1597065725323_0003_01_000001: 227.5 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2020-08-11 09:48:56,468 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19654 for container-id container_1597065725323_0003_01_000001: 357.4 MB of 2 GB physical memory used; 3.4 GB of 4.2 GB virtual memory used
2020-08-11 09:48:59,477 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19654 for container-id container_1597065725323_0003_01_000001: 342.0 MB of 2 GB physical memory used; 3.4 GB of 4.2 GB virtual memory used

The Application Master was given 2 GB of physical memory and used only 342 MB, and was given 4.2 GB of virtual memory and used 3.4 GB,
so the Application Master's container started successfully.

On dev-03.com, the executor's container looks like this:


2020-08-11 09:48:58,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1597065725323_0003_01_000002 transitioned from LOCALIZING to LOCALIZED
2020-08-11 09:48:58,886 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1597065725323_0003_01_000002 transitioned from LOCALIZED to RUNNING
2020-08-11 09:48:58,901 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /usr/local/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1597065725323_0003/container_1597065725323_0003_01_000002/default_container_executor.sh]
2020-08-11 09:49:00,126 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1597065725323_0003_01_000002
2020-08-11 09:49:00,134 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 11023 for container-id container_1597065725323_0003_01_000002: 223.8 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2020-08-11 09:49:03,143 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 11023 for container-id container_1597065725323_0003_01_000002: 327.3 MB of 2 GB physical memory used; 3.4 GB of 4.2 GB virtual memory used


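The executor's allocation is fine as well.
Memory is allocated in 1 GB increments, and a request that is not a whole number of gigabytes is rounded up (here the 1884 MB request became a 2 GB container).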
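
The 1884 MB and 2 GB figures in the logs above can be reproduced with a little arithmetic. The sketch below assumes Spark 2.3's default overhead rule (10% of the requested memory, with a 384 MB floor) and that YARN rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb (1024 MB by default). Both are my reading of the defaults, not something the logs state outright, and the function name container_mb is made up.

```shell
# Hypothetical helper: estimate the size in MB of the YARN container that a
# Spark memory setting turns into.
# $1 = requested memory in MB (--executor-memory or spark.yarn.am.memory)
# $2 = assumed allocation increment in MB (default 1024)
container_mb() {
  local heap_mb="$1"
  local min_alloc_mb="${2:-1024}"
  # Overhead: 10% of the request, but never less than 384 MB.
  local overhead_mb=$(( heap_mb / 10 ))
  if [ "${overhead_mb}" -lt 384 ]; then
    overhead_mb=384
  fi
  local request_mb=$(( heap_mb + overhead_mb ))
  # Round the request up to the next multiple of the increment.
  echo $(( (request_mb + min_alloc_mb - 1) / min_alloc_mb * min_alloc_mb ))
}
```

container_mb 1500 gives 2048, matching the 2 GB containers in Option 2. container_mb 512 gives 1024, which would explain the 1 GB Application Master container in the failing run if the client-mode default of 512 MB for spark.yarn.am.memory was in effect.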

References
https://www.cnblogs.com/freeweb/p/5898850.html
https://cloud.tencent.com/developer/article/1010903
