Spark's Main Roles
Spark Control Processes
| Role | Description |
| --- | --- |
| Driver | Application entry point; contains the SparkContext instance |
| Master | In charge of scheduling and resource orchestration |
| Worker | A control process that runs on a cluster node; responsible for node state and for running executors |
| Executor | A JVM process launched on a worker; allocated per application and in charge of executing that application's tasks |
Spark Execution Hierarchy
Example
Submit a job and inspect the monitoring view on the Spark UI page:
Job:
- Job 0 and Job 1 are each generated by an action operator.
- Job 1 contains one skipped stage, which indicates that an RDD is reused.
Job 0
Stage 0:
1) The 2 tasks of Stage 0, split off by repartition, read from HDFS in parallel.
2) Each stage's detail page shows shuffle volume, GC time, run duration, and other metrics.
Stage 1:
1) 2 tasks run.
Stage 2:
Job 1
- In the DAG visualization and the stage list, you can see which stage was skipped.
- Stage 4 is triggered and directly reuses the stages that were already divided.
- Stage 3 is not executed; its parsing is skipped entirely.
Stage 4:
Stage 5:
Job Analysis
Key points (see the sketch below):
- The number of jobs is determined by the number of action operators.
- The number of tasks in a job is determined by the number of partitions.
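A minimal sketch that would reproduce the UI pattern described above. The input path and the name pairRDD1 are assumptions, since the original example code is not shown. Two actions yield Job 0 and Job 1; because both jobs share the same shuffle dependencies, Job 1's upstream map stages show as skipped (their shuffle output already exists):

import org.apache.spark.{SparkConf, SparkContext}

object JobStageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("job-stage-demo").setMaster("local[2]"))

    // repartition(2) => 2 tasks per downstream stage
    val words = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
      .repartition(2)
      .flatMap(_.split(" "))
    val pairRDD1 = words.map((_, 1)).reduceByKey(_ + _)

    pairRDD1.count()             // action #1 -> Job 0
    pairRDD1.map(_._1).collect() // action #2 -> Job 1; upstream stages show as "skipped"
    sc.stop()
  }
}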
Properties of an RDD (from the source comment in RDD.scala):
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
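These five properties map onto members of org.apache.spark.rdd.RDD; a simplified, paraphrased view of the real signatures:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class SimplifiedRDD[T] {
  // 2) a function for computing each split
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 1) a list of partitions
  protected def getPartitions: Array[Partition]
  // 3) a list of dependencies on other RDDs
  protected def getDependencies: Seq[Dependency[_]]
  // 4) optionally, a Partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
  // 5) optionally, preferred locations to compute each split on
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}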
Stage Division:
Stages are divided according to the narrow/wide dependencies between RDDs:
- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
- Wide dependency: one partition of the parent RDD is used by multiple partitions of the child RDD (that is, each partition of the child RDD may depend on all partitions of the parent RDD; these are shuffle operations).
Summary: a new stage is created whenever a wide dependency is encountered (see the sketch below).
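For example, assuming an existing SparkContext sc and a hypothetical path: the narrow transformations pipeline into a single stage, while the reduceByKey shuffle opens a new one:

val rdd1 = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical path
val rdd2 = rdd1.flatMap(_.split(" ")).map((_, 1)) // narrow deps: same stage as rdd1
val rdd3 = rdd2.reduceByKey(_ + _)                // wide dep (shuffle): new stage boundary
rdd3.collect()                                    // 1 job, 2 stages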
Q&A:
- When an RDD loads data, is all of it loaded at once? No.
When the SparkContext reads a data file, a LineRecordReader object reads it line by line, driven by the parameters (file path, read start offset, read end offset). The input is wrapped into splits for the subsequent computation (see the sketch below).
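A small demonstration of this laziness, assuming an existing SparkContext sc and a hypothetical path. mapPartitions receives an Iterator backed by the RecordReader, and records are pulled one at a time, only as downstream code asks for them:

val lineLengths = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
  .mapPartitions { iter =>  // iter is backed by the RecordReader; nothing read yet
    iter.map(_.length)      // evaluated lazily, one line per next() call
  }
lineLengths.take(5)         // reads only as many lines as needed for 5 results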
Source code of the iterative (per-record) computation:
ResultTask.scala
override def runTask(context: TaskContext): U = {
  ...
  // Pass the partition to process, plus the task context, into the user function
  func(context, rdd.iterator(partition, context))
}
RDD.scala
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // A storage level is set: read from the cache if present, otherwise compute and cache
    getOrCompute(split, context)
  } else {
    // No caching: compute the partition, or read it from a checkpoint if one exists
    computeOrReadCheckpoint(split, context)
  }
}
/**
* Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
*/
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    // The RDD has been checkpointed: read the data via the checkpoint RDD (now the first parent)
    firstParent[T].iterator(split, context)
  } else {
    // Otherwise compute the partition directly
    compute(split, context)
  }
}
MapPartitionsRDD.compute
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  f(context, split.index, firstParent[T].iterator(split, context))
HadoopRDD.scala
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
  // (In the real source the lines below live inside an anonymous NextIterator[(K, V)]
  // that compute() wraps in an InterruptibleIterator; the excerpt is flattened here.)
  // Get the Hadoop job configuration
  private val jobConf = getJobConf()
  // Metrics object that tracks bytes and records read
  private val inputMetrics = context.taskMetrics().inputMetrics
  // Bytes already read before this split, used to compute increments
  private val existingBytesRead = inputMetrics.bytesRead
  // Record which file block this task is reading
  split.inputSplit.value match {
    case fs: FileSplit =>
      InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
    case _ =>
      InputFileBlockHolder.unset()
  }
  // Create the RecordReader for this split
  reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
  private val key: K = if (reader == null) null.asInstanceOf[K] else reader.createKey()
  private val value: V = if (reader == null) null.asInstanceOf[V] else reader.createValue()
override def getNext(): (K, V) = {
  try {
    finished = !reader.next(key, value)
  } catch {
    case e: FileNotFoundException if ignoreMissingFiles =>
      logWarning(s"Skipped missing file: ${split.inputSplit}", e)
      finished = true
    // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
    case e: FileNotFoundException if !ignoreMissingFiles => throw e
    case e: IOException if ignoreCorruptFiles =>
      logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
      finished = true
  }
  if (!finished) {
    inputMetrics.incRecordsRead(1)
  }
  if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    updateBytesRead()
  }
  (key, value)
}
The InterruptibleIterator then iterates over the records via hasNext and next:
@DeveloperApi
class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T])
  extends Iterator[T] {

  def hasNext: Boolean = {
    // If the task has been marked as interrupted (killed), this throws and stops iteration
    context.killTaskIfInterrupted()
    delegate.hasNext
  }

  def next(): T = delegate.next()
}
Returning up the chain
When HadoopRDD.compute finishes, it yields the iterator of the initial RDD's records. Control returns to MapPartitionsRDD.compute:

override def compute(split: Partition, context: TaskContext): Iterator[U] =
  f(context, split.index, firstParent[T].iterator(split, context))

which can now apply the function f to produce the second RDD's records, and so on, returning layer by layer. The overall logic resembles a nest of higher-order function calls: f(f(…f(compute())…)).
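A toy illustration of this nesting using plain Scala iterators (compute, f1, and f2 are made-up stand-ins for HadoopRDD.compute and the per-RDD functions):

// Stand-in for HadoopRDD.compute: produces the source iterator
val compute: () => Iterator[String] = () => Iterator("a b", "c d e")
// Each downstream RDD wraps its parent's iterator in its function f
val f1: Iterator[String] => Iterator[String] = _.flatMap(_.split(" ")) // like flatMap
val f2: Iterator[String] => Iterator[Int] = _.map(_.length)            // like map

// f(f(compute())): records are pulled through the chain lazily, one at a time
val result = f2(f1(compute()))
result.foreach(println)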
Is RDD reuse a reuse of the DAG logic or of the data?
1) Logic reuse: as in the example sketch above, when the code runs sequentially and hits the first action operator, the driver backtracks the lineage; once that backtracking finishes and the DAGScheduler has divided the lineage into stages, the parsed result is kept. When the second action triggers backtracking and it reaches pairRDD1, the already-parsed DAG is found, so no re-parsing is needed and it is reused directly.
2) How is data reused?
By using the persistence operators: cache, persist, checkpoint.
- cache() = persist(StorageLevel.MEMORY_ONLY)
- checkpoint(): launches a separate job just to persist the RDD, writing it to disk (caching first is recommended so the lineage is not recomputed). See the sketch below.
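A minimal sketch of data reuse via persistence, assuming an existing SparkContext sc (the paths and the name pairRDD1 are hypothetical):

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/ckpt")             // hypothetical checkpoint directory

val pairRDD1 = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
  .map(line => (line, 1))
  .reduceByKey(_ + _)

pairRDD1.persist(StorageLevel.MEMORY_ONLY) // equivalent to pairRDD1.cache()
pairRDD1.checkpoint()                      // materialized by a separate job at the first action

pairRDD1.count()   // action #1: computes, caches, then the checkpoint job writes to disk
pairRDD1.collect() // action #2: served from the cache/checkpoint, no recomputation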
Diagram showing the difference between persisted and non-persisted execution ->
Code walkthrough:
Task flow: