Spark Core Terminology Explained

bingo_liu

Article link: https://blog.csdn.net/bingo_liu/article/details/54947401

Application

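An Application is simply a program submitted to Spark with spark-submit,
for example SparkPi, the pi-estimation program in the Spark examples.
spark-shell is also an application, because on startup it creates a SparkContext object named sc.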

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
 * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
...
}

Driver Program

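The driver is mainly responsible for scheduling tasks and for coordinating with the executors and the cluster manager. There are two deploy modes, client and cluster. In client mode the driver runs on the machine from which the application was submitted; in cluster mode the driver is launched on one of the machines in the cluster, and the submitter does not choose which one. The diagram taken from the official Spark site gives a rough picture of what the driver does.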

[Figure: diagram from the official Spark site illustrating the role of the driver]

Cluster Manager

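An external service that manages cluster resources. At the time of writing, Spark supports three cluster managers: standalone, YARN, and Mesos. Spark's built-in standalone mode covers the resource-management needs of most Spark environments; YARN or Mesos is generally worth considering only when several computing frameworks share the same cluster.
Worker Node: any node in the cluster that can run application code, comparable to a slave node in Hadoop.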

Executor

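A worker process launched for an application on a Worker Node. The process is responsible for running tasks and for keeping data in memory or on disk. Note that, by default in standalone mode, each application has only one Executor on a given Worker Node (setting spark.executor.cores allows more than one per worker). Inside the Executor, the application's tasks are processed concurrently by multiple threads.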

/**
 * Spark executor, backed by a threadpool to run tasks.
 *
 * This can be used with Mesos, YARN, and the standalone scheduler.
 * An internal RPC interface (at the moment Akka) is used for communication with the driver,
 * except in the case of Mesos fine-grained mode.
 */
private[spark] class Executor(
    executorId: String,
    executorHostname: String,
    env: SparkEnv,
    userClassPath: Seq[URL] = Nil,
    isLocal: Boolean = false)
  extends Logging {
  ...
}
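
The "one process, many threads" idea can be illustrated without Spark at all. The following is a minimal sketch, assuming a plain JVM thread pool stands in for the executor; `ToyExecutor` and `runAll` are hypothetical names invented for this example and are not part of Spark.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Toy stand-in for an executor process: one JVM, one thread pool, many tasks.
// ToyExecutor and runAll are hypothetical names, not Spark classes.
object ToyExecutor {
  // Runs every task on a fixed pool of `cores` threads and returns the
  // results in the same order as the tasks were given.
  def runAll[T](cores: Int, tasks: Seq[() => T]): Seq[T] = {
    val pool = Executors.newFixedThreadPool(cores)
    try {
      // Each task is submitted to the pool and may run concurrently with others.
      val futures = tasks.map { t =>
        pool.submit(new Callable[T] { def call(): T = t() })
      }
      // Block until every task has finished and collect its result.
      futures.map(_.get())
    } finally {
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }

  def main(args: Array[String]): Unit = {
    // Four "partitions", each summed by its own task on a two-thread pool.
    val partitions = Seq(1 to 10, 11 to 20, 21 to 30, 31 to 40)
    val tasks = partitions.map(p => () => p.sum)
    println(runAll(2, tasks)) // List(55, 155, 255, 355)
  }
}
```

With two threads and four tasks, at most two tasks run at any moment, which is roughly how the number of cores given to an executor bounds its task parallelism.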

Job

Every action, such as count or saveAsTextFile, corresponds to one job instance, and that job consists of many tasks computed in parallel.
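
To make the action-to-job relationship concrete, here is a toy model in plain Scala. It is only a sketch: `ToyContext`, `ToyRDD`, and `jobsRun` are hypothetical names made up for illustration, and the model ignores partitions and parallelism entirely. It shows that transformations merely record work, while each action call starts exactly one job.

```scala
// Toy model of "one action = one job". None of these classes exist in Spark.
object ToyJobs {
  // Counts how many jobs have been triggered; a stand-in for the driver's bookkeeping.
  final class ToyContext {
    var jobsRun: Int = 0
    def parallelize[T](data: Seq[T]): ToyRDD[T] = new ToyRDD[T](this, () => data)
  }

  // A dataset described by a deferred computation; nothing runs until an action is called.
  final class ToyRDD[T](ctx: ToyContext, compute: () => Seq[T]) {
    // Transformations: only wrap the deferred computation in one more step.
    def map[U](f: T => U): ToyRDD[U] = new ToyRDD[U](ctx, () => compute().map(f))
    def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD[T](ctx, () => compute().filter(p))

    // Actions: each call counts as one job and evaluates the recorded steps.
    def count(): Long = { ctx.jobsRun += 1; compute().size.toLong }
    def collect(): Seq[T] = { ctx.jobsRun += 1; compute() }
  }

  def main(args: Array[String]): Unit = {
    val ctx = new ToyContext
    val multiplesOfFour = ctx.parallelize(1 to 10).map(_ * 2).filter(_ % 4 == 0)
    println(ctx.jobsRun)             // 0: transformations alone start no job
    println(multiplesOfFour.count()) // 5
    println(ctx.jobsRun)             // 1: the count() action started one job
  }
}
```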

Task

The unit of work that the Driver sends to an Executor. A task usually processes one split of the data, and a split is generally the size of one block.

/**
 * A unit of execution. We have two kinds of Task's in Spark:
 *
 *  - [[org.apache.spark.scheduler.ShuffleMapTask]]
 *  - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](
    val stageId: Int,
    val stageAttemptId: Int,
    val partitionId: Int,
    internalAccumulators: Seq[Accumulator[Long]]) extends Serializable {
  ...
}
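
Since a task usually handles one block-sized split, a rough task count for reading a file is just a ceiling division. The sketch below assumes a 128 MB block size purely for illustration; `SplitEstimate` and `numSplits` are hypothetical names, and the real number of splits depends on the input format and its settings.

```scala
// Back-of-the-envelope estimate of how many splits (and so tasks) a file yields.
// SplitEstimate and numSplits are hypothetical names used only for this example.
object SplitEstimate {
  // Number of block-sized splits needed to cover the file (ceiling division).
  def numSplits(fileBytes: Long, blockBytes: Long): Long = {
    require(fileBytes >= 0 && blockBytes > 0, "sizes must be non-negative, block size positive")
    (fileBytes + blockBytes - 1) / blockBytes
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // Assumed sizes: a 1000 MB file stored in 128 MB blocks.
    println(numSplits(1000 * mb, 128 * mb)) // 8
  }
}
```

Seven full blocks cover 896 MB, so the remaining 104 MB needs an eighth split, and the first stage of a job reading this file would run about eight tasks.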

Stage

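A job is split into many tasks, and each group of tasks is called a stage. This closely resembles the map and reduce tasks of MapReduce. Stage boundaries are drawn as follows: a stage generally begins by reading external data or shuffle data, and it generally ends when a shuffle occurs (for example, at a reduceByKey operation) or when the whole job finishes, for example when the data is written to HDFS or another storage system.
In general a job is cut into a certain number of stages, and the stages run in order. To understand how stages are cut, you first need the concepts of narrow dependency and wide dependency from the Spark paper. They are easy to tell apart: look at whether the data in a partition of the parent RDD flows into more than one partition of the child RDD. If it flows into only one, the dependency is narrow; otherwise it is wide. Stages are split at the wide dependencies.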
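
The splitting rule can be sketched as a small fold over a lineage. This is a deliberately simplified model, not Spark's scheduler: it assumes a linear chain of operations (real lineages form a DAG), and `StageSplit`, `Op`, `Narrow`, and `Wide` are hypothetical names introduced for this example.

```scala
// Toy stage splitter for a linear lineage. Not Spark's scheduler.
object StageSplit {
  sealed trait Dep
  case object Narrow extends Dep
  case object Wide extends Dep

  // One step in the lineage: an operation name plus its dependency on the previous step.
  final case class Op(name: String, dep: Dep)

  // Groups operations into stages: a wide dependency starts a new stage,
  // a narrow dependency stays in the current one.
  def stages(lineage: List[Op]): List[List[String]] =
    lineage.foldLeft(List.empty[List[String]]) {
      case (Nil, op)               => List(List(op.name)) // first op opens the first stage
      case (acc, Op(name, Wide))   => acc :+ List(name)
      case (acc, Op(name, Narrow)) => acc.init :+ (acc.last :+ name)
    }

  def main(args: Array[String]): Unit = {
    val wordCount = List(
      Op("textFile", Narrow),
      Op("flatMap", Narrow),
      Op("map", Narrow),
      Op("reduceByKey", Wide), // shuffle: stage boundary
      Op("filter", Narrow)
    )
    println(stages(wordCount))
    // List(List(textFile, flatMap, map), List(reduceByKey, filter))
  }
}
```

The word-count style lineage ends up as two stages: everything before the shuffle is pipelined together, and the RDD produced by reduceByKey opens the second stage.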