Big Data in Practice, Lesson 15 (Part 2): Spark-Core03

zhikanjiani

Article link: https://blog.csdn.net/zhikanjiani/article/details/99682566

1. YARN Overview

1.1 Background of YARN
1.2 YARN Architecture
1.3 YARN Execution Flow

2. Spark on YARN Overview
- 2.1 Launching Spark on YARN
- 2.2 Using Spark on YARN
- Production tuning points

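1. YARN Overview

Spark can run in Local mode, in Standalone mode, and on other cluster managers. Whatever the run mode, the application code stays the same; the only difference is the value passed to --master.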

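1.1 Background of YARN

Hadoop clusters, Spark Standalone clusters, MPI clusters and so on each have their own peak and off-peak periods.
==> When every framework has its own dedicated cluster, overall resource utilization is low: MapReduce jobs are submitted to Hadoop, Spark jobs are submitted to Standalone. How can resource management and scheduling be unified?

This is where YARN, and with it Spark on YARN, comes in: unified resource management and scheduling.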

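[Figure]
Reading the figure from the bottom up:

At the bottom is HDFS (reliable storage).
On top of HDFS sits YARN, which handles cluster resource management.
On top of YARN run the compute frameworks: batch jobs; interactive engines such as Tez (Hive can run on Tez, MR, or Spark); online serving with HBase; streaming with Storm/Flink; in-memory computation with Spark.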

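Summary: with YARN in place, all of these frameworks can run on top of it. YARN can be thought of as an operating-system-level framework for resource management and scheduling.
==>
Multiple compute frameworks share the cluster's resources, allocated on demand ==> overall cluster utilization goes up.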

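1.2 YARN Architecture

Key points: the responsibilities of the RM (ResourceManager), NM (NodeManager), AM (ApplicationMaster) and Container, and the retry mechanism.

1.3 YARN Execution Flow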

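[Figure: YARN execution flow]

1. The client submits the job to the ResourceManager.
2. The RM starts an ApplicationMaster on one of the NM nodes (the AM lives on an NM and runs inside a container).
3. The AM requests resources from the RM; say it is granted 3 containers.
4. The AM asks the NMs to launch those containers and run tasks in them (the containers in the figure). For a MapReduce job the tasks are MapTasks or ReduceTasks; for a Spark job the containers run Spark executors, which in turn run the tasks.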

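2. Spark on YARN Overview

[Figure]
Reading the figure: the main method runs in the Spark driver, where it creates the SparkContext and builds up a set of RDDs; an action operator then triggers a job. An application consists of 1 driver + N executors.

An executor is a process, and our tasks run inside it. There is usually more than one executor. In the figure, the green job maps to the green tasks and the red job to the red tasks.
A single executor can run multiple tasks.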

MapReduce (process-based):

each task in its own process: a map runs as a MapTask and a reduce as a ReduceTask, and each of these is its own process
when a task completes, the process goes away

Spark (thread-based):

many tasks can run concurrently in a single process (the executor)

this process sticks around for the lifetime of the Spark application;
even when no jobs are running, the executors stay up and simply sit idle

Advantages:

speed
tasks can start up very quickly: the process is already there, so a task starts immediately and goes straight to processing data
in-memory: computation is done in memory

Cluster Manager:
A Spark application is submitted to a cluster manager (CM).
Local, Standalone, YARN, Mesos, K8s ==> pluggable

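Application Master (AM):
Every application running on YARN has an AM, and the AM process runs in the application's first container. The AM negotiates with the RM for resources, then passes the granted resources to the NMs, which launch the containers that use them.

Worker: this concept does not exist on YARN. Executors run inside containers, so a container's memory must be larger than the executor memory it hosts (memory of container > executor memory).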

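In on-YARN mode:
Spark is nothing more than a client.
As long as the client has permission to submit to YARN, that is all it takes.

Unpack the Spark distribution on whichever machine you want to submit from and it is ready to use.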

2. Using Spark on YARN

The basics of submitting with spark-submit are covered here:
http://spark.apache.org/docs/latest/submitting-applications.html

Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.
In short, Spark on YARN has been available since 0.6.0.

2.1 Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.

The directory in question is the one holding files such as mapred-site.xml and yarn-site.xml; Spark reads them to locate HDFS and the YARN ResourceManager.

The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.

If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application's configuration (driver, executors, and the AM when running in client mode).

Note the direction: such properties and variables are not picked up automatically, so they have to be set explicitly on the Spark side.

Deploy mode:
client: the driver runs locally, on the machine that submits the application
cluster: the driver runs inside the cluster

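In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster.

Cluster mode: the driver runs in the AM process on the cluster, so once the application has been initialized successfully, the client (the submitting script) can be closed.

In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Client mode: the AM is used only to request resources, and because the driver lives in the client, the terminal window that submitted the application must stay open.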
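As a rough sketch of how the two deploy modes are selected on the command line, the snippet below only prints the two spark-submit invocations rather than running them. The SparkPi class ships with Spark's examples; the jar path is a placeholder and the helper name submit_cmd is made up for this illustration, neither comes from the lesson.

```shell
# Placeholder values: point these at your own application.
APP_CLASS="org.apache.spark.examples.SparkPi"
APP_JAR="/path/to/spark-examples.jar"   # hypothetical path

# submit_cmd <deploy-mode>: print the spark-submit command line for that mode
submit_cmd() {
    echo "spark-submit --master yarn --deploy-mode $1 --class $APP_CLASS $APP_JAR 10"
}

submit_cmd client    # driver stays in this terminal
submit_cmd cluster   # driver runs inside the AM on the cluster

# To actually submit, on a machine with Spark and HADOOP_CONF_DIR set:
#   eval "$(submit_cmd cluster)"
```

With the cluster variant the terminal can be closed once the application is accepted; with the client variant closing it kills the driver.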

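2.2 Using Spark on YARN

2.2.1 Production tuning points:

Submitting in client mode is slow. How can that be improved?

Change the log level, starting from the template:
cd $SPARK_HOME/conf --> this directory contains the log4j template (log4j.properties.template)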
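A minimal sketch of that change, assuming the template's root logger line has the usual Spark 2.x form log4j.rootCategory=INFO, console. The helper name set_root_level is invented for this example.

```shell
# set_root_level <log4j.properties file> <LEVEL>
# Rewrites the level on the log4j.rootCategory line and keeps the appender part.
set_root_level() {
    file=$1
    level=$2
    sed "s/^log4j\.rootCategory=[A-Z]*/log4j.rootCategory=$level/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Typical use (assumes a standard Spark 2.x conf directory):
#   cd "$SPARK_HOME/conf"
#   cp log4j.properties.template log4j.properties
#   set_root_level log4j.properties WARN
```

Working on a copy leaves the template untouched, so the original INFO setting is easy to restore.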

2.2.2 Spark on YARN, Client Mode

Test from spark-shell:
./spark-shell --master yarn    (client mode is the default)
--num-executors NUM    defaults to 2
--executor-memory MEM    defaults to 1g

Starting it without any configuration changes:

spark-shell --master yarn fails with the following message:
Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
[hadoop@hadoop002 conf]$ pwd
/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/conf
[hadoop@hadoop002 conf]$ ll
total 44
-rwxr-xr-x 1 hadoop hadoop 4465 May 29 21:48 spark-env.sh
-rwxr-xr-x 1 hadoop hadoop 4221 May 23 12:09 spark-env.sh.template
First copy the template: cp spark-env.sh.template spark-env.sh
Then append the following at the end of spark-env.sh:
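export JAVA_HOME=/usr/java/jdk1.8.0_45
export SCALA_HOME=/home/hadoop/app/scala-2.11.12
# SPARK_WORKER_MEMORY and SPARK_MASTER_IP only matter in Standalone mode
export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=master
export HADOOP_HOME=/home/hadoop/app/hadoop
# This is the setting that spark-shell --master yarn actually requires
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop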
[hadoop@hadoop002 hadoop]$ spark-shell --master yarn
Spark context available as 'sc' (master = yarn, app id = application_1558975098660_0003).
Compare with the startup line printed by spark-shell --master local[2], where the app id has a local- prefix instead:
Spark context available as 'sc' (master = local[2], app id = local-<timestamp>).
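Before launching, it can help to confirm that the configuration directory really contains the client-side files. The sketch below is one way to do that; the helper name check_conf_dir and the choice of core-site.xml and yarn-site.xml as the files to look for are assumptions made for this example, not a list taken from the lesson.

```shell
# check_conf_dir <dir>: succeed only if <dir> holds the expected config files
check_conf_dir() {
    dir=$1
    for f in core-site.xml yarn-site.xml; do   # assumed minimal set of files
        if [ ! -f "$dir/$f" ]; then
            echo "missing $dir/$f" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}

# Spark accepts either variable, so check whichever one is set.
conf_dir="${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}"
if [ -n "$conf_dir" ]; then
    check_conf_dir "$conf_dir" || true
else
    echo "neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set"
fi
```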

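Open the YARN web UI.

It shows the following three values. Where do they come from?

Running Containers: 3
Allocated CPU VCores: 3
Allocated Memory MB: 5120

The Executors page shows 1 driver and 2 executors (spark-shell), since --num-executors defaults to 2. The 3 containers are those 2 executors plus the AM.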
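One way to account for the 5120 MB is sketched below. It rests on two assumptions that the lesson does not state: that Spark adds a memory overhead of max(384 MB, 10% of the requested memory) to each container request, and that YARN rounds every request up to a multiple of yarn.scheduler.minimum-allocation-mb, taken here as 1024 MB. The helper name container_mb is made up for the example.

```shell
MIN_ALLOC_MB=1024      # assumed yarn.scheduler.minimum-allocation-mb
MIN_OVERHEAD_MB=384    # assumed minimum memory overhead per container

# container_mb <requested_mb>: print the container size YARN would allocate
container_mb() {
    requested=$1
    overhead=$(( requested / 10 ))
    if [ "$overhead" -lt "$MIN_OVERHEAD_MB" ]; then
        overhead=$MIN_OVERHEAD_MB
    fi
    total=$(( requested + overhead ))
    # round up to the next multiple of MIN_ALLOC_MB
    echo $(( (total + MIN_ALLOC_MB - 1) / MIN_ALLOC_MB * MIN_ALLOC_MB ))
}

am_mb=$(container_mb 512)         # spark.yarn.am.memory default: 512m
executor_mb=$(container_mb 1024)  # --executor-memory default: 1g
num_executors=2                   # --num-executors default

echo "containers: $(( 1 + num_executors ))"
echo "memory MB:  $(( am_mb + num_executors * executor_mb ))"
```

Under these assumptions the AM request of 512 + 384 = 896 MB becomes a 1024 MB container, each executor request of 1024 + 384 = 1408 MB becomes 2048 MB, and 1024 + 2 x 2048 = 5120 MB. With one vcore per container, the 3 containers also give the 3 allocated vcores.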

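The processes can also be inspected from the command line (jps, or ps):

1. ps -ef | grep CoarseGrainedExecutorBackend
   these are the executor processes, with their process IDs
2. ps -ef | grep ExecutorLauncher
   this is the AM process in client mode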

Testing on Spark on YARN

Test 1:

sc.parallelize(List(1,1,1,2,3,3,3)).count 		// count is an action, so it triggers a job
Then check the job in the UI.

Word count test:

sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect

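After restarting YARN and running it again, the job has two stages: reduceByKey involves a shuffle, and the shuffle splits the job into two stages.
Because this is client mode, once you exit spark-shell the application UI can no longer be opened.

The instructor (PK哥) hit an error when running this: Compression.codec.com.hadoop.compression.lzo.lzpCodec not found
To track it down, go back to etc/hadoop and check the configuration files mapred-site.xml, hdfs-site.xml and yarn-site.xml.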

Spark Properties:

Property Name                       Default
spark.yarn.am.memory                512m
spark.yarn.am.cores                 1
spark.yarn.max.executor.failures    numExecutors * 2, with minimum of 3

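When Spark on YARN starts in client mode, the log contains this line:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

This step is expensive: everything under $SPARK_HOME/jars is packaged and uploaded for each submission.
Setting spark.yarn.jars or spark.yarn.archive avoids the upload and makes the warning go away.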
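A sketch of one possible setup: upload the jars to HDFS once and point spark.yarn.jars at them. The target directory /spark/jars is a hypothetical choice for this example (the NameNode address is the one used in the word count test), and the snippet only prints the commands and the config line instead of executing them.

```shell
# Hypothetical HDFS location for the shared Spark jars.
HDFS_JARS_DIR="hdfs://hadoop002:9000/spark/jars"

# Print the one-time upload commands and the spark-defaults.conf entry.
print_yarn_jars_setup() {
    echo "hdfs dfs -mkdir -p $HDFS_JARS_DIR"
    echo "hdfs dfs -put \$SPARK_HOME/jars/*.jar $HDFS_JARS_DIR/"
    echo "# then add to \$SPARK_HOME/conf/spark-defaults.conf:"
    echo "spark.yarn.jars $HDFS_JARS_DIR/*.jar"
}

print_yarn_jars_setup
```

Once the property is set, submissions reference the jars already on HDFS rather than repackaging the local jars directory each time.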

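2.2.3 Spark on YARN, Cluster Mode

Starting the shell directly in cluster mode fails:
./spark-shell --master yarn --deploy-mode cluster
Cluster deploy mode is not applicable to Spark shells.
Cluster mode does not support spark-shell.

The driver runs on the cluster, so its logs are fetched with a dedicated command:

yarn logs -applicationId <app Id>

The ID of a job running on YARN is called the application ID; it always starts with application_.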
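As a small illustration, the application ID can be pulled out of the spark-shell startup line and dropped into the yarn logs command. The helper name extract_app_id is invented for this sketch; the sample line is the one shown earlier in this lesson.

```shell
# extract_app_id: read text on stdin, print the first YARN application ID found
extract_app_id() {
    grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

line="Spark context available as 'sc' (master = yarn, app id = application_1558975098660_0003)."
app_id=$(printf '%s\n' "$line" | extract_app_id)

# Print the command rather than run it, since yarn may not be on this machine.
echo "yarn logs -applicationId $app_id"
```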

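Spark on YARN summary:

Driver: local (client mode) / on the cluster (cluster mode)
Client mode:
AM: requesting resources only
Cluster mode:
AM: requesting resources and, because it hosts the driver, task scheduling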
