Spark Notes 9: Task/TaskContext

The DAGScheduler ultimately creates a task set and submits it to the TaskScheduler, so first we need to look at how a task is defined and executed.
A Task is a unit of execution.

Task: the basic unit executed by an executor, and also the smallest unit of work in a Spark job. It carries essentially the same meaning as a task handed to a Java executor.
/**
 * A unit of execution. We have two kinds of Task's in Spark:
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]
 * - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {
Main members:
final def run(attemptId: Long): T = {
  context = new TaskContext(stageId, partitionId, attemptId, runningLocally = false)
  context.taskMetrics.hostname = Utils.localHostName()
  taskThread = Thread.currentThread()
  if (_killed) {
    kill(interruptThread = false)
  }
  runTask(context)
}

def runTask(context: TaskContext): T

// Map output tracker epoch. Will be set by TaskScheduler.
var epoch: Long = -1

var metrics: Option[TaskMetrics] = None

// Task context, to be initialized in run().
@transient protected var context: TaskContext = _
// The actual Thread on which the task is running, if any. Initialized in run().
@volatile @transient private var taskThread: Thread = _
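
To make the template pattern here easier to see, below is a minimal, self-contained sketch: the ToyTask/ToyTaskContext/ToySumTask names are invented for illustration and are not the real org.apache.spark classes. The point is the same division of labor as above: run() is final and owns context setup plus the kill flag, while subclasses only fill in runTask().

// Toy stand-in for TaskContext: just carries the ids the task needs.
class ToyTaskContext(val stageId: Int, val partitionId: Int, val attemptId: Long)

abstract class ToyTask[T](val stageId: Int, val partitionId: Int) {
  @volatile private var _killed = false

  // Final: every task goes through the same setup and kill check before doing real work.
  final def run(attemptId: Long): T = {
    val context = new ToyTaskContext(stageId, partitionId, attemptId)
    if (_killed) {
      throw new InterruptedException(s"task $partitionId killed before it started")
    }
    runTask(context) // subclass-specific work
  }

  // The only thing a concrete task has to implement.
  def runTask(context: ToyTaskContext): T

  def kill(): Unit = { _killed = true }
}

// A toy "result"-style task: sums one in-memory partition.
class ToySumTask(stageId: Int, partitionId: Int, data: Seq[Int])
  extends ToyTask[Long](stageId, partitionId) {
  override def runTask(context: ToyTaskContext): Long = data.map(_.toLong).sum
}

// new ToySumTask(0, 3, Seq(1, 2, 3)).run(attemptId = 1L)  // => 6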

/**
 * Handles transmission of tasks and their dependencies, because this can be slightly tricky. We
 * need to send the list of JARs and files added to the SparkContext with each task to ensure that
 * worker nodes find out about it, but we can't make it part of the Task because the user's code in
 * the task might depend on one of the JARs. Thus we serialize each task as multiple objects, by
 * first writing out its dependencies.
 */
private[spark] object Task {
/**
 * Serialize a task and the current app dependencies (files and JARs added to the SparkContext)
 */
def serializeWithDependencies(
    task: Task[_],
    currentFiles: HashMap[String, Long],
    currentJars: HashMap[String, Long],
    serializer: SerializerInstance): ByteBuffer
/**
* Deserialize the list of dependencies in a task serialized with serializeWithDependencies,
* and return the task itself as a serialized ByteBuffer. The caller can then update its
* ClassLoaders and deserialize the task.
*
* @return (taskFiles, taskJars, taskBytes)
*/
def deserializeWithDependencies(serializedTask: ByteBuffer)
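
The idea is that dependencies go onto the wire before the opaque task bytes, so a worker can fetch files and JARs and fix up its ClassLoader before it ever tries to deserialize the task itself. Below is a rough, self-contained sketch of that layout; ToyTaskWire is a made-up name and the real Spark byte format is only similar in spirit, not guaranteed to be identical.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.ByteBuffer
import scala.collection.mutable.HashMap

object ToyTaskWire {
  def serializeWithDependencies(taskBytes: Array[Byte],
                                files: HashMap[String, Long],
                                jars: HashMap[String, Long]): ByteBuffer = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    def writeMap(m: HashMap[String, Long]): Unit = {
      out.writeInt(m.size)
      for ((name, timestamp) <- m) { out.writeUTF(name); out.writeLong(timestamp) }
    }
    writeMap(files)            // files the worker must download first
    writeMap(jars)             // JARs the worker must add to its ClassLoader
    out.writeInt(taskBytes.length)
    out.write(taskBytes)       // the task itself stays opaque at this point
    out.flush()
    ByteBuffer.wrap(bos.toByteArray)
  }

  def deserializeWithDependencies(buf: ByteBuffer)
      : (HashMap[String, Long], HashMap[String, Long], ByteBuffer) = {
    val in = new DataInputStream(new ByteArrayInputStream(
      buf.array(), buf.arrayOffset() + buf.position(), buf.remaining()))
    def readMap(): HashMap[String, Long] = {
      val m = new HashMap[String, Long]
      for (_ <- 0 until in.readInt()) { m(in.readUTF()) = in.readLong() }
      m
    }
    val files = readMap()
    val jars = readMap()
    val taskBytes = new Array[Byte](in.readInt())
    in.readFully(taskBytes)
    (files, jars, ByteBuffer.wrap(taskBytes))  // caller updates ClassLoaders, then deserializes the task
  }
}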
ShuffleMapTask: the task that corresponds to transformation operations; its main job is to produce the data that action operations need. In other words, it is the task that actions depend on, so it has to run first.
/**
 * A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
 * specified in the ShuffleDependency).
 *
 * See [[org.apache.spark.scheduler.Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcast version of the RDD and the ShuffleDependency. Once deserialized,
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 */
private[spark] class ShuffleMapTask(
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, partition.index) with Logging {
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    return writer.stop(success = true).get
  } catch {
    case e: Exception =>
      if (writer != null) {
        writer.stop(success = false)
      }
      throw e
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
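
From the user API side, ShuffleMapTasks show up wherever a shuffle dependency is introduced. A quick illustration, assuming an existing SparkContext named sc (the task counts below are simply what this partitioning implies, not captured output):

import org.apache.spark.SparkContext._   // needed for reduceByKey in older Spark versions

// Stage 0: one ShuffleMapTask per partition of `words` writes its bucketed map output;
// stage 1 then runs ResultTasks over the shuffled data.
val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // shuffle boundary here
counts.collect()   // triggers both stages: stage 0 = ShuffleMapTasks, stage 1 = ResultTasks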
ResultTask: the task that corresponds to action operations, i.e. it sits at the leaf nodes of the dependency tree.
/**
* A task that sends back the output to the driver application.
*
* See [[Task]] for more information.
*
* @param stageId id of the stage this task belongs to
* @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
* partition of the given RDD. Once deserialized, the type should be
* (RDD[T], (TaskContext, Iterator[T]) => U).
* @param partition partition of the RDD this task is associated with
* @param locs preferred task execution locations for locality scheduling
* @param outputId index of the task in this job (a job can launch tasks on only a subset of the
* input RDD's partitions).
*/
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(partition, context))
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
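
The (TaskContext, Iterator[T]) => U shape of func above is exactly what the public runJob API accepts, and actions such as count() and collect() are thin wrappers over it. A small sketch, again assuming an existing SparkContext sc:

import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// One ResultTask per partition applies this function to the partition's iterator and
// ships the resulting (partitionId, localSum) pair back to the driver.
val perPartition: Array[(Int, Int)] = sc.runJob(
  rdd,
  (ctx: TaskContext, iter: Iterator[Int]) => (ctx.partitionId, iter.sum))

val total = perPartition.map(_._2).sum   // same value rdd.reduce(_ + _) would return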











Posted on 2015-01-24 00:07 by 过雁

Reposted from: https://www.cnblogs.com/zwCHAN/p/4245302.html
