Spark Notes 9: Task/TaskContext

The DAGScheduler ultimately creates a task set and submits it to the TaskScheduler, so first we need to look at how a task is defined and executed.
A Task is a unit of execution.

Task: the basic unit executed by an executor, and also the smallest unit of work in a Spark job. It carries essentially the same meaning as a task handed to a Java executor.
/**
 * A unit of execution. We have two kinds of Task's in Spark:
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]
 * - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {
Main members:
final def run(attemptId: Long): T = {
  context = new TaskContext(stageId, partitionId, attemptId, runningLocally = false)
  context.taskMetrics.hostname = Utils.localHostName()
  taskThread = Thread.currentThread()
  if (_killed) {
    kill(interruptThread = false)
  }
  runTask(context)
}

def runTask(context: TaskContext): T

// Map output tracker epoch. Will be set by TaskScheduler.
var epoch: Long = -1

var metrics: Option[TaskMetrics] = None

// Task context, to be initialized in run().
@transient protected var context: TaskContext = _
// The actual Thread on which the task is running, if any. Initialized in run().
@volatile @transient private var taskThread: Thread = _
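
To make the template pattern here easier to see, below is a minimal, self-contained sketch: the ToyTask/ToyTaskContext/ToySumTask names are invented for illustration and are not the real org.apache.spark classes. The point is the same division of labor as above: run() is final and owns context setup plus the kill flag, while subclasses only fill in runTask().

// Toy stand-in for TaskContext: just carries the ids the task needs.
class ToyTaskContext(val stageId: Int, val partitionId: Int, val attemptId: Long)

abstract class ToyTask[T](val stageId: Int, val partitionId: Int) {
  @volatile private var _killed = false

  // Final: every task goes through the same setup and kill check before doing real work.
  final def run(attemptId: Long): T = {
    val context = new ToyTaskContext(stageId, partitionId, attemptId)
    if (_killed) {
      throw new InterruptedException(s"task $partitionId killed before it started")
    }
    runTask(context) // subclass-specific work
  }

  // The only thing a concrete task has to implement.
  def runTask(context: ToyTaskContext): T

  def kill(): Unit = { _killed = true }
}

// A toy "result"-style task: sums one in-memory partition.
class ToySumTask(stageId: Int, partitionId: Int, data: Seq[Int])
  extends ToyTask[Long](stageId, partitionId) {
  override def runTask(context: ToyTaskContext): Long = data.map(_.toLong).sum
}

// new ToySumTask(0, 3, Seq(1, 2, 3)).run(attemptId = 1L)  // => 6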

/**
 * Handles transmission of tasks and their dependencies, because this can be slightly tricky. We
 * need to send the list of JARs and files added to the SparkContext with each task to ensure that
 * worker nodes find out about it, but we can't make it part of the Task because the user's code in
 * the task might depend on one of the JARs. Thus we serialize each task as multiple objects, by
 * first writing out its dependencies.
 */
private[spark] object Task {
/**
 * Serialize a task and the current app dependencies (files and JARs added to the SparkContext)
 */
def serializeWithDependencies(
    task: Task[_],
    currentFiles: HashMap[String, Long],
    currentJars: HashMap[String, Long],
    serializer: SerializerInstance): ByteBuffer
/**
* Deserialize the list of dependencies in a task serialized with serializeWithDependencies,
* and return the task itself as a serialized ByteBuffer. The caller can then update its
* ClassLoaders and deserialize the task.
*
* @return (taskFiles, taskJars, taskBytes)
*/
def deserializeWithDependencies(serializedTask: ByteBuffer)
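
The idea is that dependencies go onto the wire before the opaque task bytes, so a worker can fetch files and JARs and fix up its ClassLoader before it ever tries to deserialize the task itself. Below is a rough, self-contained sketch of that layout; ToyTaskWire is a made-up name and the real Spark byte format is only similar in spirit, not guaranteed to be identical.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.ByteBuffer
import scala.collection.mutable.HashMap

object ToyTaskWire {
  def serializeWithDependencies(taskBytes: Array[Byte],
                                files: HashMap[String, Long],
                                jars: HashMap[String, Long]): ByteBuffer = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    def writeMap(m: HashMap[String, Long]): Unit = {
      out.writeInt(m.size)
      for ((name, timestamp) <- m) { out.writeUTF(name); out.writeLong(timestamp) }
    }
    writeMap(files)            // files the worker must download first
    writeMap(jars)             // JARs the worker must add to its ClassLoader
    out.writeInt(taskBytes.length)
    out.write(taskBytes)       // the task itself stays opaque at this point
    out.flush()
    ByteBuffer.wrap(bos.toByteArray)
  }

  def deserializeWithDependencies(buf: ByteBuffer)
      : (HashMap[String, Long], HashMap[String, Long], ByteBuffer) = {
    val in = new DataInputStream(new ByteArrayInputStream(
      buf.array(), buf.arrayOffset() + buf.position(), buf.remaining()))
    def readMap(): HashMap[String, Long] = {
      val m = new HashMap[String, Long]
      for (_ <- 0 until in.readInt()) { m(in.readUTF()) = in.readLong() }
      m
    }
    val files = readMap()
    val jars = readMap()
    val taskBytes = new Array[Byte](in.readInt())
    in.readFully(taskBytes)
    (files, jars, ByteBuffer.wrap(taskBytes))  // caller updates ClassLoaders, then deserializes the task
  }
}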
ShuffleMapTask: the task that corresponds to transformation operations; its main job is to produce the data that action operations need. In other words, it is the task that actions depend on, so it has to run first.
/**
 * A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
 * specified in the ShuffleDependency).
 *
 * See [[org.apache.spark.scheduler.Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcast version of the RDD and the ShuffleDependency. Once deserialized,
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 */
private[spark] class ShuffleMapTask(
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, partition.index) with Logging {
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    return writer.stop(success = true).get
  } catch {
    case e: Exception =>
      if (writer != null) {
        writer.stop(success = false)
      }
      throw e
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
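
From the user API side, ShuffleMapTasks show up wherever a shuffle dependency is introduced. A quick illustration, assuming an existing SparkContext named sc (the task counts below are simply what this partitioning implies, not captured output):

import org.apache.spark.SparkContext._   // needed for reduceByKey in older Spark versions

// Stage 0: one ShuffleMapTask per partition of `words` writes its bucketed map output;
// stage 1 then runs ResultTasks over the shuffled data.
val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // shuffle boundary here
counts.collect()   // triggers both stages: stage 0 = ShuffleMapTasks, stage 1 = ResultTasks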
ResultTask: the task that corresponds to action operations, i.e. it sits at the leaf nodes of the dependency tree.
/**
* A task that sends back the output to the driver application.
*
* See [[Task]] for more information.
*
* @param stageId id of the stage this task belongs to
* @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
* partition of the given RDD. Once deserialized, the type should be
* (RDD[T], (TaskContext, Iterator[T]) => U).
* @param partition partition of the RDD this task is associated with
* @param locs preferred task execution locations for locality scheduling
* @param outputId index of the task in this job (a job can launch tasks on only a subset of the
* input RDD's partitions).
*/
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(partition, context))
  } finally {
    context.executeOnCompleteCallbacks()
  }
}
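
The (TaskContext, Iterator[T]) => U shape of func above is exactly what the public runJob API accepts, and actions such as count() and collect() are thin wrappers over it. A small sketch, again assuming an existing SparkContext sc:

import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// One ResultTask per partition applies this function to the partition's iterator and
// ships the resulting (partitionId, localSum) pair back to the driver.
val perPartition: Array[(Int, Int)] = sc.runJob(
  rdd,
  (ctx: TaskContext, iter: Iterator[Int]) => (ctx.partitionId, iter.sum))

val total = perPartition.map(_._2).sum   // same value rdd.reduce(_ + _) would return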











Posted on 2015-01-24 00:07 by 过雁

Reposted from: https://www.cnblogs.com/zwCHAN/p/4245302.html
