Spark Core

最新推荐文章于 2022-08-25 21:36:33 发布

一杯敬朝阳一杯敬月光

最新推荐文章于 2022-08-25 21:36:33 发布

阅读量103

点赞数

分类专栏：大数据文章标签： spark 大数据

本文链接：https://blog.csdn.net/qq_xuanshuang/article/details/117958707

版权

大数据专栏收录该内容

15 篇文章 1 订阅

订阅专栏

概念相关

Application：基于Spark的应用程序 = 1个driver + n个executors组成
- User program built on Spark.
- Consists of a driver program and executors on the cluster.
Application jar
- A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program：这个程序有一个main方法，并且创建了SparkContext
- The process running the main() function of the application and creating the SparkContext
Cluster manager：一个用于获取集群资源的外部服务
- An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Deploy mode
- Distinguishes where the driver process runs.用来区分Driver运行的地方
  - In "cluster" mode, the framework launches the driver inside of the cluster.cluster模式会在集群启动driver，运行在AM里面。
  - In "client" mode, the submitter launches the driver outside of the cluster.若是client模式，在集群之外启动driver，可以理解为在本节点的进程上启动
Worker node
- any node that can run application code in the cluster，任何可以在集群之上运行application的节点，例如standalone模式可以将其理解slave节点，就是我们slaves配置的那个，yarn就是node manager
Executor
- A process launched for an application on a worker node：work node上用来运行application的进程
- runs tasks and keeps data in memory or disk storage across them.Executor还能将数据缓存到内存或者磁盘
- Each application has its own executors.每一个application都有其独有的executors
Task
- A unit of work that will be sent to one executor，Task是工作单元（从driver发起），它会被送到一个executor上去运行。
Job
- A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)，一个job里面可以有多个并行执行的task（只有action才会触发job，transform并不会，一个action对应一个job，一个job有多个task）
- you'll see this term used in the driver's logs.
Stage
- Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce);一个job会被拆成一系列的tasks，
- you'll see this term used in the driver's logs.
- 一个Stage的边界往往是从某个地方取数据开始，到shuffle的结束。

一个job是action触发的，一个job里面可能会有一个到多个stage，stage里面有一堆的task，task运行在exectuor里面，executor是跑在worker之上的。

运行架构及注意事项

Spark VS Hadoop

Hadoop

一个MapReduce程序 = 1个Job
一个Job = 1个N个Task（Map/Reduce）
一个Task对应一个进程
Task运行时开启进程，Task执行完毕后销毁进程，对于多个Task来说，开销是比较大的，即使能够JVM共享

Spark

Application = Driver（main方法中创建SparkContext） + Executors
一个Application = 0到多个Job
一个Job = 一个Action
一个Job = 1到N个Stage
一个Stage = 1到N个Task
一个Task对应一个线程，多个Task可以以并行的方式运行在一个Exectuor中

Lineage机制

RDD之间的依赖关系就是Lineage，Spark会把RDD的依赖关系存起来。

An RDD is an immutable(不可变的), deterministically re-computable（能重新计算得出相同结果的）, distributed（分布式的） dataset. Each RDD remembers the lineage of deterministic operations that were used on a fault-tolerant input dataset to create it.
If any partition of an RDD is lost due to a worker node failure, then that partition can be re-computed from the original fault-tolerant dataset using the lineage of operations.若因为工作节点的故障导致RDD的部分分区的数据丢失，那么可以依据RDD的血统从一个原始的可容错的数据集上重新计算该分区。
Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.假设RDD上的transformations都是确定的，那么无论Spark集群出现何种故障，我们最终得到的RDD的转换总是相同的。

依赖

RDD 的 Transformation 函数中，又分为窄依赖(narrow dependency)和宽依赖(wide dependency)的操作。窄依赖跟宽依赖的区别是是否发生 shuffle(洗牌) 操作.宽依赖会发生 shuffle 操作.。

窄依赖：一个父RDD的partition至多被子RDD的某个partition使用一次，例如map、filter、union等属于第一类窄依赖，join with inputs co-partitioned（对输入进行协同划分的join操作，也就是说先按照key分组然后shuffle write的时候一个父分区对应一个子分区）则为第二类窄依赖，join一般都是宽依赖，只有mapjoin或者broadcast join是不需要shuffle的。MapReduce作业会有数据落地，窄依赖中间不会有数据落地。

宽依赖：一个父RDD的partition会被子RDD的partition使用多次，有shuffle

区分这两种依赖很有用。首先，窄依赖允许在一个集群节点上以流水线的方式（pipeline）计算所有父分区。例如，逐个元素地执行map、然后filter操作；而宽依赖则需要首先计算好所有父分区数据，然后在节点之间进行Shuffle，这与MapReduce类似。第二，窄依赖能够更有效地进行失效节点的恢复，即只需重新计算丢失RDD分区的父分区，而且不同节点之间可以并行计算；而对于一个宽依赖关系的Lineage图，单个节点失效可能导致这个RDD的所有祖先丢失部分分区，因而需要整体重新计算。