Spark on YARN
1. Why YARN Emerged
Hadoop (MapReduce)
Spark Standalone
MPI
...and other distributed frameworks
When each framework manages its own cluster, overall resource utilization is poor, so unified resource management and scheduling is needed.
With YARN, multiple computing frameworks can share one cluster, with resources allocated on demand, which improves cluster resource utilization.
2. YARN Architecture
Each component's responsibilities and its retry mechanism (how work is re-run after a failure):
RM (ResourceManager): the cluster-wide master. It arbitrates resources among all applications and tracks the NodeManagers. If the RM fails it can be restarted (or, with HA enabled, a standby RM takes over).
NM (NodeManager): the per-node agent. It launches and monitors containers and reports resource usage and node health to the RM. If an NM dies, the RM marks its containers as failed and the affected AMs re-request them elsewhere.
AM (ApplicationMaster): one per application. It negotiates containers from the RM and works with NMs to launch tasks. If the AM fails, the RM restarts it in a new container, up to a configured retry limit.
Container: the unit of resource allocation (memory + CPU cores) in which the AM and tasks run. When a container fails, its AM requests a replacement and re-runs the work.
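As a practical way to see these components, the Hadoop `yarn` CLI can inspect a running cluster. This is a sketch that assumes a cluster is up and the Hadoop CLI is on your PATH; `<applicationId>` is a placeholder for a real application ID.

```shell
# NodeManagers and their state, as tracked by the RM
yarn node -list

# Running applications (each one has its own AM)
yarn application -list

# Status of a single application (placeholder ID)
yarn application -status <applicationId>

# Aggregated logs from the application's containers (placeholder ID)
yarn logs -applicationId <applicationId>
```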
3. YARN Execution Flow
1) The client submits an application to the RM.
2) The RM allocates the first container and launches the AM in it.
3) The AM registers with the RM and requests containers for its tasks.
4) The RM grants containers; the AM asks the corresponding NMs to launch them.
5) Tasks run inside the containers and report progress back to the AM.
6) When the application finishes, the AM deregisters with the RM and the resources are released.
4. Spark on YARN Overview
MapReduce: process-based
each task runs in its own process: MapTask, ReduceTask
when a task completes, its process goes away
Spark: thread-based
many tasks can run concurrently in a single process
this process sticks around for the lifetime of the Spark application,
even when no jobs are running
Advantages:
speed: tasks can start up very quickly
in-memory: data can be cached in memory and shared across tasks
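The thread-based model above is what `--executor-cores` controls on YARN: each executor is one long-lived JVM process, and the cores setting caps how many task threads run in it concurrently. A sketch, assuming a YARN cluster; the class name and jar are placeholders.

```shell
# 4 executor processes, each running up to 2 concurrent task threads.
# (Placeholder app: org.example.MyApp / myapp.jar)
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  --class org.example.MyApp myapp.jar
```

Because the executor JVMs stay alive across jobs, later tasks skip process startup entirely, which is where the quick task launch comes from.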
5. Cluster Manager
Spark Application => CM (requests resources from the cluster manager)
It can run in any of the following environments:
Local, Standalone, YARN, Mesos, K8s => pluggable
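The pluggable cluster manager is selected with the `--master` option of `spark-submit`; only the URL changes across environments. A sketch: the class name and jar are placeholders, and each line assumes the corresponding cluster manager is actually reachable.

```shell
# One spark-submit per cluster manager; only --master differs.
spark-submit --master 'local[*]'                --class Main app.jar   # local threads
spark-submit --master spark://master:7077      --class Main app.jar   # Spark Standalone
spark-submit --master yarn                     --class Main app.jar   # YARN
spark-submit --master mesos://master:5050      --class Main app.jar   # Mesos
spark-submit --master k8s://https://master:6443 --class Main app.jar  # Kubernetes
```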
6. ApplicationMaster (AM)
YARN Application ==> AM (first container). Every application running on YARN has an AM, which runs in the application's first container. The AM requests resources from the RM; the RM then has NMs start containers in which the executors run.
7.Worker
Yarn executor runs in container(memory of container > executor-memory)executor运行在container中,yarn中container内存必须要大于executor的内存
Spark仅仅只是一个客户端而已
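The reason the container must be larger than `--executor-memory` is off-heap overhead: Spark requests a container sized as executor memory plus `spark.executor.memoryOverhead`, which defaults to max(384 MiB, 10% of executor memory). A minimal sketch of that arithmetic:

```shell
# Approximate container size YARN must grant for one executor:
# executor memory plus Spark's default memory overhead,
# max(384 MiB, 10% of executor memory).
executor_memory_mb=4096
overhead_mb=$(( executor_memory_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
container_mb=$(( executor_memory_mb + overhead_mb ))
echo "container needs at least ${container_mb} MiB"
```

So asking for a 4 GiB executor actually requires YARN to grant a container of roughly 4.4 GiB; if the cluster's maximum container size is smaller, the request is rejected.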
8. spark-shell Does Not Support Cluster Mode
spark-shell is interactive, and interactive sessions only work in yarn-client mode.
1. Driver: Local/Cluster. There are client and cluster modes: in client mode the driver runs on the local (submitting) machine; in cluster mode it runs inside the AM on one of the cluster's machines.
2. Client (Local) mode:
AM: only requests resources
Cluster mode:
AM: requests resources
    and handles task scheduling (the driver runs inside the AM)
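The two deploy modes above map directly onto `--deploy-mode`; interactive shells are restricted to client mode because the driver must sit in your terminal. A sketch, assuming a YARN cluster; the class name and jar are placeholders.

```shell
# Interactive: the driver is the local REPL, so client mode is required.
spark-shell --master yarn --deploy-mode client

# Batch job: the driver can run inside the AM on a cluster node.
spark-submit --master yarn --deploy-mode cluster --class Main app.jar

# Not allowed: spark-submit rejects cluster deploy mode for Spark shells.
# spark-shell --master yarn --deploy-mode cluster
```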