Source-code analysis of resource and task scheduling in standalone mode

Overall flow chart

(figure omitted)

Overview of standalone mode

Standalone mode is Spark's built-in resource scheduling framework. It involves three main roles:

  • client: the client that submits the application and carries the business logic to be run
  • master: responsible for resource management
  • worker: responsible for task execution

A Spark application has a Driver, i.e. the user-submitted program. The Driver process can run either on the Client or on a Worker; the difference is as follows:

  • client: the user starts the Client, the Driver process is launched on the Client, and the SparkContext is instantiated inside the Driver

  • worker: the user starts the Client, the Client submits the application to the Master, and the Master schedules resources and designates a Worker node on which to launch the Driver process

(figures omitted)

Execution flow

(figure omitted)

SparkContext has three core components: DAGScheduler, TaskScheduler, and SchedulerBackend.

1) Once the Spark cluster is up, the Worker nodes keep communicating with the Master through a heartbeat mechanism;

2) After the SparkContext connects to the Master, it requests resources; the Master allocates Worker resources based on the Worker heartbeats and starts Executor processes on the Workers;

3) The SparkContext parses the program code into a DAG and submits it to the DAGScheduler;

4) The DAGScheduler splits the DAG into multiple stages at wide-dependency boundaries; each stage is a TaskSet containing multiple tasks;

5) Each TaskSet is submitted to the TaskScheduler, which assigns its tasks to Workers and hands them to the Executor processes; each Executor creates a thread pool to run the tasks and reports progress back to the SparkContext until the tasks finish;

6) Once all tasks have finished, the SparkContext unregisters from the Master and releases the resources.
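
As a concrete illustration of this flow, here is a minimal sketch of a job with one shuffle: the reduceByKey introduces a wide dependency, so the DAGScheduler splits the job into two stages when the count() action triggers it (the master URL and input path below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and input path; any standalone cluster and text file will do.
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///tmp/input.txt") // narrow transformations -> stage 0
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)                             // wide dependency -> shuffle boundary, stage 1

    println(counts.count())                           // action: triggers sc.runJob -> DAGScheduler -> TaskScheduler
    sc.stop()
  }
}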

Starting the Spark cluster

To start the Spark cluster, look at the start-all.sh script under sbin; it starts the Master first:

# Start Master
"${SPARK_HOME}/sbin"/start-master.sh

# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh

Looking at sbin/start-master.sh, we see that it runs the org.apache.spark.deploy.master.Master class. Let's follow Master in the source, starting from its main method:

private[deploy] object Master extends Logging {
  val SYSTEM_NAME = "sparkMaster"
  val ENDPOINT_NAME = "Master"

  def main(argStrings: Array[String]) {
    Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(
      exitOnUncaughtException = false))
    Utils.initDaemon(log)
    val conf = new SparkConf
    val args = new MasterArguments(argStrings, conf)

    /**
      * Create the RPC environment and the Endpoint (RPC = remote procedure call). In Spark, the Master and Worker roles each have their own Endpoint, which acts as their communication mailbox.
      */
    val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
    rpcEnv.awaitTermination()
  }

The main method calls startRpcEnvAndEndpoint to create the RpcEnv and the Endpoint. The RpcEnv is the remote communication environment that receives and processes messages, and the Master registers itself with it. Master, Driver, Worker and Executor all have their own Endpoint, which works like a mailbox: roughly speaking, to communicate with another node you first need its mailbox. When the Master starts, it registers its Endpoint in the RpcEnv in order to receive and process messages.
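
To make the Endpoint/EndpointRef idea concrete, here is a tiny self-contained sketch of the contract described above; it is not Spark's private[spark] RPC API, just a model of "register an endpoint, get its reference, talk only through the reference":

trait SimpleEndpoint {
  def onStart(): Unit = {}                       // called once the endpoint is registered
  def receive: PartialFunction[Any, Unit]        // handles incoming messages
}

class SimpleEndpointRef(endpoint: SimpleEndpoint) {
  // Deliver a message to the endpoint's "mailbox"
  def send(msg: Any): Unit =
    endpoint.receive.applyOrElse(msg, (m: Any) => println(s"unsupported message: $m"))
}

class SimpleRpcEnv {
  private val endpoints = scala.collection.mutable.Map[String, SimpleEndpointRef]()

  def setupEndpoint(name: String, endpoint: SimpleEndpoint): SimpleEndpointRef = {
    val ref = new SimpleEndpointRef(endpoint)
    endpoints(name) = ref
    endpoint.onStart()                            // mirrors the OnStart message placed into the Inbox
    ref
  }
}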

Stepping into startRpcEnvAndEndpoint:

/**
   * Start the Master and return a three tuple of:
   *   (1) The Master RpcEnv
   *   (2) The web UI bound port
   *   (3) The REST server bound port, if any
   */
  def startRpcEnvAndEndpoint(
      host: String,
      port: Int,
      webUiPort: Int,
      conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
    val securityMgr = new SecurityManager(conf)

    /**
      * Create the RpcEnv; [Driver, Master, Worker, Executor] endpoints will later be registered with it.
      */
    val rpcEnv: RpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)

    /**
      * Register the Master with the RpcEnv.
      *
      * rpcEnv.setupEndpoint(name, new Master)
      * instantiates the Master class, which extends ThreadSafeRpcEndpoint and ultimately the RpcEndpoint trait.
      *     An Endpoint provides:
      *     onStart()         : starts this Endpoint
      *     receive()         : receives messages
      *     receiveAndReply() : receives a message and sends a reply
      *  Each Endpoint also has a reference; to message an Endpoint, another party only needs to hold its EndpointRef.
      *
      * The masterEndpoint below is the Master's RpcEndpointRef.
      * An RpcEndpointRef provides:
      *     send() : send a message
      *     ask()  : send a request and wait for the reply.
      */
    val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
      new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))

    val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
    (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)

Follow the create method to see how the RpcEnv is created:

create(name, host, host, port, conf, securityManager, 0, clientMode)

Step into create again:

// Build the RPC environment config
    val config = RpcEnvConfig(conf, name, bindAddress, advertiseAddress, port, securityManager,
      numUsableCores, clientMode)
    // Create a NettyRpcEnv via the factory pattern
    new NettyRpcEnvFactory().create(config)

A NettyRpcEnvFactory object is created and its create method is called; step into it:

// Instantiate the NettyRpcEnv, which is returned later
    val nettyEnv =
      new NettyRpcEnv(sparkConf, javaSerializerInstance, config.advertiseAddress,
        config.securityManager, config.numUsableCores)

    if (!config.clientMode) {
      val startNettyRpcEnv: Int => (NettyRpcEnv, Int) = { actualPort =>
        nettyEnv.startServer(config.bindAddress, actualPort)
        (nettyEnv, nettyEnv.address.port)
      }

Some initialization happens during this instantiation:

private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)

  private val streamManager = new NettyStreamManager(this)

  private val transportContext = new TransportContext(transportConf,
    new NettyRpcHandler(dispatcher, this, streamManager))

  • Dispatcher: holds the message queues and forwards/dispatches messages
  • TransportContext: can create the NettyRpcHandler

Dispatcher

First, look at the logic that runs when the Dispatcher is instantiated:

/** Thread pool used for dispatching messages. */
  private val threadpool: ThreadPoolExecutor = {
    val availableCores =
      if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
    val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
      math.max(2, availableCores))
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }

During Dispatcher instantiation a threadpool is created, and each thread in it runs a MessageLoop:

/** Message loop used for dispatching messages. */
  private class MessageLoop extends Runnable {
    override def run(): Unit = {
      try {
        while (true) {
          try {
            val data = receivers.take()
            if (data == PoisonPill) {
              // Put PoisonPill back so that other MessageLoops can see it.
              receivers.offer(PoisonPill)
              return
            }
            data.inbox.process(Dispatcher.this)
          } catch {
            case NonFatal(e) => logError(e.getMessage, e)
          }
        }
      } catch {
        case _: InterruptedException => // exit
        case t: Throwable =>
          try {
            // Re-submit a MessageLoop so that Dispatcher will still work if
            // UncaughtExceptionHandler decides to not kill JVM.
            threadpool.execute(new MessageLoop)
          } finally {
            throw t
          }
      }
    }
  }

Inside MessageLoop, receivers.take() keeps pulling an item (data) off the receivers message queue and then calls data.inbox.process(Dispatcher.this).

receivers itself is created when the Dispatcher is initialized, so at this point the message-receiving machinery is up and running.

// Track the receivers whose inboxes may contain messages.
  private val receivers = new LinkedBlockingQueue[EndpointData]
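
The take()/PoisonPill pattern above can be reproduced in a few lines. This stand-alone sketch (not Spark code) shows how blocking consumer threads drain a LinkedBlockingQueue and how re-offering a sentinel stops every loop:

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object MessageLoopSketch extends App {
  case object PoisonPill
  private val queue = new LinkedBlockingQueue[Any]()
  private val pool = Executors.newFixedThreadPool(2)

  for (_ <- 0 until 2) pool.execute(new Runnable {
    override def run(): Unit = {
      while (true) {
        val msg = queue.take()                    // blocks until a message is available
        if (msg == PoisonPill) {
          queue.offer(PoisonPill)                 // put it back so the other loops can exit too
          return
        }
        println(s"${Thread.currentThread().getName} processed $msg")
      }
    }
  })

  (1 to 5).foreach(i => queue.put(s"message-$i"))
  queue.put(PoisonPill)                           // stop all message loops
  pool.shutdown()
}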

The EndpointData objects placed into this queue instantiate an Inbox when they are created:

private class EndpointData(
      val name: String,
      val endpoint: RpcEndpoint,
      val ref: NettyRpcEndpointRef) {
    val inbox = new Inbox(ref, endpoint)
  }

When the Inbox is instantiated, its messages list is initialized and an OnStart case object is immediately added to it:

protected val messages = new java.util.LinkedList[InboxMessage]()
inbox.synchronized {
    messages.add(OnStart)
  }

The call to data.inbox.process(Dispatcher.this):

while (true) {
      safelyCall(endpoint) {
        message match {
          case RpcMessage(_sender, content, context) =>
            try {
              endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
                throw new SparkException(s"Unsupported message $message from ${_sender}")
              })
            } catch {
              case e: Throwable =>
                context.sendFailure(e)
                // Throw the exception -- this exception will be caught by the safelyCall function.
                // The endpoint's onError function will be called.
                throw e
            }

          case OneWayMessage(_sender, content) =>
            endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
              throw new SparkException(s"Unsupported message $message from ${_sender}")
            })

          case OnStart =>
            endpoint.onStart()

When the message type is OnStart, endpoint.onStart() is called.

TransportContext

Leaving NettyRpcEnv instantiation for now, return to NettyRpcEnvFactory's create method and continue:

if (!config.clientMode) {
      val startNettyRpcEnv: Int => (NettyRpcEnv, Int) = { actualPort =>
        nettyEnv.startServer(config.bindAddress, actualPort)
        (nettyEnv, nettyEnv.address.port)
      }

Step into the startServer method:

def startServer(bindAddress: String, port: Int): Unit = {
    val bootstraps: java.util.List[TransportServerBootstrap] =
      if (securityManager.isAuthenticationEnabled()) {
        java.util.Arrays.asList(new AuthServerBootstrap(transportConf, securityManager))
      } else {
        java.util.Collections.emptyList()
      }
    server = transportContext.createServer(bindAddress, port, bootstraps)
    dispatcher.registerRpcEndpoint(
      RpcEndpointVerifier.NAME, new RpcEndpointVerifier(this, dispatcher))
  }

transportContext.createServer starts the Netty RPC service; it instantiates a TransportServer object:

 /** Create a server which will attempt to bind to a specific host and port. */
  public TransportServer createServer(
      String host, int port, List<TransportServerBootstrap> bootstraps) {
    return new TransportServer(this, host, port, rpcHandler, bootstraps);
  }

In the TransportServer constructor, init is called inside a try block to perform the initialization:

boolean shouldClose = true;
    try {
      init(hostToBind, portToBind);
      shouldClose = false;
    } finally {
      if (shouldClose) {
        JavaUtils.closeQuietly(this);
      }
    }

The init method initializes Netty's bossGroup and workerGroup and starts the Netty RPC service through the ServerBootstrap; the bootstrap's childHandler then initializes the network communication pipeline:

private void init(String hostToBind, int portToBind) {

    IOMode ioMode = IOMode.valueOf(conf.ioMode());

    // Netty boss group
    EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
      conf.getModuleName() + "-boss");

    // Netty worker group
    EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
      conf.getModuleName() + "-server");

    PooledByteBufAllocator allocator = NettyUtils.createPooledByteBufAllocator(
      conf.preferDirectBufs(), true /* allowCache */, conf.serverThreads());

    // Netty server bootstrap
    bootstrap = new ServerBootstrap()
      .group(bossGroup, workerGroup)
      .channel(NettyUtils.getServerChannelClass(ioMode))
      .option(ChannelOption.ALLOCATOR, allocator)
      .option(ChannelOption.SO_REUSEADDR, !SystemUtils.IS_OS_WINDOWS)
      .childOption(ChannelOption.ALLOCATOR, allocator);

    this.metrics = new NettyMemoryMetrics(
      allocator, conf.getModuleName() + "-server", conf);

    if (conf.backLog() > 0) {
      bootstrap.option(ChannelOption.SO_BACKLOG, conf.backLog());
    }

    if (conf.receiveBuf() > 0) {
      bootstrap.childOption(ChannelOption.SO_RCVBUF, conf.receiveBuf());
    }

    if (conf.sendBuf() > 0) {
      bootstrap.childOption(ChannelOption.SO_SNDBUF, conf.sendBuf());
    }

    // Set the childHandler
    bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
      @Override
      protected void initChannel(SocketChannel ch) {
        RpcHandler rpcHandler = appRpcHandler;
        for (TransportServerBootstrap bootstrap : bootstraps) {
          rpcHandler = bootstrap.doBootstrap(ch, rpcHandler);
        }
        context.initializePipeline(ch, rpcHandler);
      }
    });

    InetSocketAddress address = hostToBind == null ?
        new InetSocketAddress(portToBind): new InetSocketAddress(hostToBind, portToBind);
    // Bind the address
    channelFuture = bootstrap.bind(address);
    channelFuture.syncUninterruptibly();

    port = ((InetSocketAddress) channelFuture.channel().localAddress()).getPort();
    logger.debug("Shuffle server started on port: {}", port);
  }

The childHandler configured on the bootstrap initializes the network communication pipeline:

// Initialize the network communication pipeline
        context.initializePipeline(ch, rpcHandler);

public TransportChannelHandler initializePipeline(
      SocketChannel channel,
      RpcHandler channelRpcHandler) {
    try {
      // Create a custom ChannelHandler
      TransportChannelHandler channelHandler = createChannelHandler(channel, channelRpcHandler);
      // Add the required handlers to the pipeline
      channel.pipeline()
        .addLast("encoder", ENCODER)
        .addLast(TransportFrameDecoder.HANDLER_NAME, NettyUtils.createFrameDecoder())
        .addLast("decoder", DECODER)
        .addLast("idleStateHandler", new IdleStateHandler(0, 0, conf.connectionTimeoutMs() / 1000))
        // NOTE: Chunks are currently guaranteed to be returned in the order of request, but this
        // would require more logic to guarantee if this were not part of the same event loop.
        .addLast("handler", channelHandler);
      return channelHandler;
    } catch (RuntimeException e) {
      logger.error("Error while initializing Netty pipeline", e);
      throw e;
    }
  }

While the network pipeline is being initialized, a channelHandler object is created to process messages; its job is to create and handle both the client's request messages and the response messages:

/**
   * Creates the server- and client-side handler which is used to handle both RequestMessages and
   * ResponseMessages. The channel is expected to have been successfully created, though certain
   * properties (such as the remoteAddress()) may not be available yet.
   */
  private TransportChannelHandler createChannelHandler(Channel channel, RpcHandler rpcHandler) {
    TransportResponseHandler responseHandler = new TransportResponseHandler(channel);
    TransportClient client = new TransportClient(channel, responseHandler);
    TransportRequestHandler requestHandler = new TransportRequestHandler(channel, client,
      rpcHandler, conf.maxChunksBeingTransferred());
    return new TransportChannelHandler(client, responseHandler, requestHandler,
      conf.connectionTimeoutMs(), closeIdleConnections);
  }

The TransportChannelHandler is built from the three handlers above (responseHandler, client, requestHandler), and it has a channelRead method for reading incoming messages:

@Override
  public void channelRead(ChannelHandlerContext ctx, Object request) throws Exception {
    // Determine whether the current message is a request or a response
    if (request instanceof RequestMessage) {
      requestHandler.handle((RequestMessage) request);
    } else if (request instanceof ResponseMessage) {
      responseHandler.handle((ResponseMessage) request);
    } else {
      ctx.fireChannelRead(request);
    }
  }

Take requestHandler as an example: handle -> processRpcRequest((RpcRequest) request) leads to rpcHandler.receive, which here is NettyRpcHandler's receive:

private void processRpcRequest(final RpcRequest req) {
    // rpcHandler is the NettyRpcHandler passed in, so this calls NettyRpcHandler.receive
    try {
      rpcHandler.receive(reverseClient, req.body().nioByteBuffer(), new RpcResponseCallback() {
        @Override
        public void onSuccess(ByteBuffer response) {
          respond(new RpcResponse(req.requestId, new NioManagedBuffer(response)));
        }

        @Override
        public void onFailure(Throwable e) {
          respond(new RpcFailure(req.requestId, Throwables.getStackTraceAsString(e)));
        }
      });
    } catch (Exception e) {
      logger.error("Error while invoking RpcHandler#receive() on RPC id " + req.requestId, e);
      respond(new RpcFailure(req.requestId, Throwables.getStackTraceAsString(e)));
    } finally {
      req.body().release();
    }
  }
override def receive(
      client: TransportClient,
      message: ByteBuffer,
      callback: RpcResponseCallback): Unit = {
    val messageToDispatch = internalReceive(client, message)
    // The dispatcher delivers this remote message, eventually reaching the postMessage method
    dispatcher.postRemoteMessage(messageToDispatch, callback)
  }

Follow dispatcher.postRemoteMessage:


  /** Posts a message sent by a remote endpoint. */
  def postRemoteMessage(message: RequestMessage, callback: RpcResponseCallback): Unit = {
    val rpcCallContext =
      new RemoteNettyRpcCallContext(nettyEnv, callback, message.senderAddress)
    val rpcMessage = RpcMessage(message.senderAddress, message.content, rpcCallContext)
    postMessage(message.receiver.name, rpcMessage, (e) => callback.onFailure(e))
  }

In postRemoteMessage, whether the message is a request or a reply, execution eventually reaches this postMessage:

/**
   * Posts a message to a specific endpoint.
   *
   * @param endpointName name of the endpoint.
   * @param message the message to post
   * @param callbackIfStopped callback function if the endpoint is stopped.
   */
  private def postMessage(
      endpointName: String,
      message: InboxMessage,
      callbackIfStopped: (Exception) => Unit): Unit = {
    val error = synchronized {
      // Look up the target endpoint (mailbox) by name
      val data = endpoints.get(endpointName)
      if (stopped) {
        Some(new RpcEnvStoppedException())
      } else if (data == null) {
        Some(new SparkException(s"Could not find $endpointName."))
      } else {
        // Put the message into the endpoint's inbox
        data.inbox.post(message)
        // Mark this endpoint as having pending messages so a MessageLoop will pick it up
        receivers.offer(data)
        None
      }
    }
    // We don't need to call `onStop` in the `synchronized` block
    error.foreach(callbackIfStopped)
  }

This method puts the message into the inbox; since the message-processing loop is already running, the message will be picked up. At this point the RpcEnv is fully up, and next the Master registers itself with it:

 val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
      new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))

The Master has registered its own Endpoint and can now receive and process messages; the Master has started successfully.

The Worker starts in the same way, and with that the Spark cluster is up.

Submitting a job: asking the Master to launch the Driver

Submitting the user program

After writing a Spark application, the user submits it with the spark-submit script; the command is:

./bin/spark-submit  --class <main-class>  --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> ... # other options
<application-jar> [application-arguments]

For example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.1.20:7077 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 5 /path/to/examples.jar  1000

See the official documentation for details: http://spark.apache.org/docs/latest/submitting-applications.html
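
If you prefer submitting from code rather than the shell, the launcher module offers a programmatic counterpart; this sketch mirrors the command above (the Spark home and jar path are placeholders):

import org.apache.spark.launcher.SparkLauncher

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                            // placeholder install location
      .setAppResource("/path/to/examples.jar")               // placeholder application jar
      .setMainClass("org.apache.spark.examples.SparkPi")
      .setMaster("spark://192.168.1.20:7077")
      .setDeployMode("cluster")
      .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
      .addAppArgs("1000")
      .launch()                                              // forks a spark-submit child process
    process.waitFor()
  }
}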

The spark-submit script

Location: spark-2.4.5\bin\spark-submit

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

The bin/spark-class script

Location: spark-2.4.5\bin\spark-class

# -z is true when the variable has zero length, i.e. SPARK_HOME is unset or empty
if [ -z "${SPARK_HOME}" ]; then
  # Source find-spark-home from this script's directory to set SPARK_HOME
  source "$(dirname "$0")"/find-spark-home
fi

# Load the Spark environment variables
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
# -n is true when the variable is non-empty
if [ -n "${JAVA_HOME}" ]; then
  # If JAVA_HOME is set, use its java binary as RUNNER
  RUNNER="${JAVA_HOME}/bin/java"
else
  # Otherwise check whether a java command is on the PATH
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    # Exit if java cannot be found
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
# Check whether ${SPARK_HOME}/jars exists
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

# If the jars directory is missing (and we are not in a test), report an error and exit
if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  # Otherwise set the launch classpath
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# For tests
# When SPARK_TESTING is set, unset the Hadoop/YARN conf dirs
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

build_command() {
  # Run the launcher Main class to build the final command
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  # Append the exit code
  printf "%d\0" $?
}

# Create an array
CMD=()
# Read build_command's NUL-delimited output into the CMD array
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# Array length
COUNT=${#CMD[@]}
# Index of the last element
LAST=$((COUNT - 1))
# The last element is the exit code ($?) appended above
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# If the exit code is not a number, print the output and exit
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

# If the exit code is non-zero, exit with it
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

# Strip the trailing exit code; CMD now holds the actual command built from "$@"
CMD=("${CMD[@]:0:$LAST}")
# Execute it
exec "${CMD[@]}"

Execution steps:

  1. Validate $SPARK_HOME/conf and the Spark dependency directory $SPARK_HOME/jars
  2. Concatenate the validated directories into the LAUNCH_CLASSPATH variable
  3. Define $JAVA_HOME/bin/java as the RUNNER variable
  4. Call build_command() to build the command to execute
  5. Read the command produced by build_command() into the CMD array in a loop, and finally run it with exec

The final CMD command that gets executed looks like this:

/opt/jdk1.8/bin/java -Dhdp.version=2.6.0.3-8 -cp /usr/hdp/current/spark2-historyserver/conf/:/usr/hdp/2.6.0.3-8/spark2/jars/*:/usr/hdp/current/hadoop-client/conf org.apache.spark.deploy.SparkSubmit --master spark://192.168.1.20:7077 \
--deploy-mode cluster --class org.apache.spark.examples.SparkPi --executor-memory 2G --total-executor-cores 5 ../examples/jars/sparkexamples_2.11-2.1.0.2.6.0.3-8.jar 1000

So the program's entry point is org.apache.spark.deploy.SparkSubmit.

The SparkSubmit companion object's main method instantiates a SparkSubmit instance named submit and calls submit.doSubmit(args):

def doSubmit(args: Array[String]): Unit

  • Parse the arguments: new SparkSubmitArguments(args) is instantiated; its constructor calls loadEnvironmentArguments(), which loads the environment settings. If no action is specified, it defaults to submit: action = Option(action).getOrElse(SUBMIT)

  • Pattern-match on the action variable:

    appArgs.action match {
          case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
          case SparkSubmitAction.KILL => kill(appArgs)
          case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
          case SparkSubmitAction.PRINT_VERSION => printVersion()
    
  • submit calls doRunMain, which in turn calls runMain

  • In runMain, val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args) initializes these variables:

    (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
    
    • Based on the master and deploy-mode arguments, set the corresponding clusterManager and deploy mode
    • Then, based on the other args, set childArgs, childClasspath, sysProps and childMainClass, and return them

    Pay special attention to childMainClass: it is the class that ultimately launches the Driver.

Assume the deploy mode is standalone cluster:

// In legacy standalone cluster mode, use Client as a wrapper around the user class
        childMainClass = STANDALONE_CLUSTER_SUBMIT_CLASS
private[deploy] val STANDALONE_CLUSTER_SUBMIT_CLASS = classOf[ClientApp].getName()
// Load the class
      mainClass = Utils.classForName(childMainClass)
val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
      mainClass.newInstance().asInstanceOf[SparkApplication]

Here childMainClass is first loaded and assigned to mainClass; mainClass is then instantiated as a SparkApplication.

 app.start(childArgs.toArray, sparkConf)

Next, the start method is called; since childMainClass is of the ClientApp type, this invokes ClientApp's start:

private[spark] class ClientApp extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    val driverArgs = new ClientArguments(args)

    if (!conf.contains("spark.rpc.askTimeout")) {
      conf.set("spark.rpc.askTimeout", "10s")
    }
    Logger.getRootLogger.setLevel(driverArgs.logLevel)

    val rpcEnv =
      RpcEnv.create("driverClient", Utils.localHostName(), 0, conf, new SecurityManager(conf))

    val masterEndpoints = driverArgs.masters.map(RpcAddress.fromSparkURL).
      map(rpcEnv.setupEndpointRef(_, Master.ENDPOINT_NAME))
    rpcEnv.setupEndpoint("client", new ClientEndpoint(rpcEnv, driverArgs, masterEndpoints, conf))

    rpcEnv.awaitTermination()
  }

An Endpoint for submitting the current job is registered with the RpcEnv; once an Endpoint is registered, the new ClientEndpoint's onStart method is guaranteed to run:

val command = new Command(mainClass,
          Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
          sys.env, classPathEntries, libraryPathEntries, javaOpts)

        val driverDescription = new DriverDescription(
          driverArgs.jarUrl,
          driverArgs.memory,
          driverArgs.cores,
          driverArgs.supervise,
          command)
        asyncSendToMasterAndForwardReply[SubmitDriverResponse](
          RequestSubmitDriver(driverDescription))

In this method a RequestSubmitDriver message is sent, corresponding to orange step 2 in the flow chart.

Why is onStart guaranteed to be called?

 /** Message loop used for dispatching messages. */
  private class MessageLoop extends Runnable {
    override def run(): Unit = {
      try {
        while (true) {
          try {
            /**
              * Take a message off the receivers queue; it is an EndpointData, which wraps an Inbox instance,
              * so the inbox's process function is called directly to handle the message.
              * When an endpoint is registered, an OnStart message is put into its inbox, so the first message taken out is OnStart.
              */
            val data = receivers.take()
            if (data == PoisonPill) {
              // Put PoisonPill back so that other MessageLoops can see it.
              receivers.offer(PoisonPill)
              return
            }
            data.inbox.process(Dispatcher.this)
          } catch {
            case NonFatal(e) => logError(e.getMessage, e)
          }
        }
      } catch {
        case _: InterruptedException => // exit
        case t: Throwable =>
          try {
            // Re-submit a MessageLoop so that Dispatcher will still work if
            // UncaughtExceptionHandler decides to not kill JVM.
            threadpool.execute(new MessageLoop)
          } finally {
            throw t
          }
      }
    }
// In Inbox.process, the OnStart case calls the Endpoint's onStart function
          case OnStart =>
            endpoint.onStart()
            if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
              inbox.synchronized {
                if (!stopped) {
                  enableConcurrent = true
                }
              }
            }

Back in ClientEndpoint.onStart, org.apache.spark.deploy.worker.DriverWrapper is wrapped into a Command, the Command is wrapped into a DriverDescription, and then the client asks the Master to launch the Driver.

On the Master side, the MessageLoop's run method takes the message out and calls data.inbox.process(Dispatcher.this); step into process:

/**
   * Process stored messages.
   */
  def process(dispatcher: Dispatcher): Unit = {
    var message: InboxMessage = null
    inbox.synchronized {
      if (!enableConcurrent && numActiveThreads != 0) {
        return
      }
      message = messages.poll()
      if (message != null) {
        numActiveThreads += 1
      } else {
        return
      }
    }
    while (true) {
      safelyCall(endpoint) {
        message match {
          case RpcMessage(_sender, content, context) =>
            try {
              endpoint.receiveAndReply(context).applyOrElse[Any, Unit](content, { msg =>
                throw new SparkException(s"Unsupported message $message from ${_sender}")
              })

The Master's receiveAndReply method receives this request message; step into receiveAndReply:

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case RequestSubmitDriver(description) =>
      if (state != RecoveryState.ALIVE) {
        val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
          "Can only accept driver submissions in ALIVE state."
        context.reply(SubmitDriverResponse(self, false, None, msg))
      } else {
        logInfo("Driver submitted " + description.command.mainClass)
        val driver = createDriver(description)
        persistenceEngine.addDriver(driver)
        waitingDrivers += driver
        drivers.add(driver)
        schedule()

        // TODO: It might be good to instead have the submission client poll the master to determine
        //       the current status of the driver. For now it's simply "fire and forget".

        context.reply(SubmitDriverResponse(self, true, Some(driver.id),
          s"Driver successfully submitted as ${driver.id}"))
      }

Here the Master's state is checked first; if it is ALIVE, a driver is created from the description wrapped earlier. This driver is actually a DriverInfo, which holds the Driver's metadata. A SubmitDriverResponse message is then sent back, corresponding to orange step 3.

  • Back on the Client side, in the inbox we can see:
case OneWayMessage(_sender, content) =>
            endpoint.receive.applyOrElse[Any, Unit](content, { msg =>
              throw new SparkException(s"Unsupported message $message from ${_sender}")
            })

Step into the Client's receive method:

override def receive: PartialFunction[Any, Unit] = {

    case SubmitDriverResponse(master, success, driverId, message) =>
      logInfo(message)
      if (success) {
        activeMasterEndpoint = master
        pollAndReportStatus(driverId.get)
      } else if (!Utils.responseFromBackup(message)) {
        System.exit(-1)
      }

If the message type is SubmitDriverResponse, pollAndReportStatus(driverId.get) is called:

/* Find out driver status then exit the JVM */
  def pollAndReportStatus(driverId: String): Unit = {
    // Since ClientEndpoint is the only RpcEndpoint in the process, blocking the event loop thread
    // is fine.
    logInfo("... waiting before polling master for driver state")
    Thread.sleep(5000)
    logInfo("... polling master for driver state")
    // Send RequestDriverStatus, corresponding to orange step 4
    val statusResponse =
      activeMasterEndpoint.askSync[DriverStatusResponse](RequestDriverStatus(driverId))
    if (statusResponse.found) {
      logInfo(s"State of $driverId is ${statusResponse.state.get}")
      // Worker node, if present
      (statusResponse.workerId, statusResponse.workerHostPort, statusResponse.state) match {
        case (Some(id), Some(hostPort), Some(DriverState.RUNNING)) =>
          logInfo(s"Driver running on $hostPort ($id)")
        case _ =>
      }
      // Exception, if present
      statusResponse.exception match {
        case Some(e) =>
          logError(s"Exception from cluster was: $e")
          e.printStackTrace()
          System.exit(-1)
        case _ =>
          System.exit(0)
      }
    } else {
      logError(s"ERROR: Cluster master did not recognize $driverId")
      System.exit(-1)
    }
  }

Sending RequestDriverStatus corresponds to orange step 4; exiting the JVM process corresponds to orange step 5.

  • When the Master handled the RequestSubmitDriver message above, we also saw it add the newly created DriverInfo to waitingDrivers (private val waitingDrivers = new ArrayBuffer[DriverInfo]) and then enter the schedule() method:
/**
   * Schedule the currently available resources among waiting apps. This method will be called
   * every time a new app joins or resource availability changes.
   */
  private def schedule(): Unit = {
    // Check the master's state
    if (state != RecoveryState.ALIVE) {
      return
    }
    // Drivers take strict precedence over executors
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    // Number of alive (usable) workers
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0
    // Take each driver's info out of waitingDrivers
    for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
      // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
      // start from the last worker that was assigned a driver, and continue onwards until we have
      // explored all alive workers.
      var launched = false
      var numWorkersVisited = 0
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1

        /**
          * Compare the worker's free memory and cores with the driver's requirements; if sufficient, call launchDriver.
          * 
          */

        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          launchDriver(worker, driver)
          waitingDrivers -= driver
          launched = true
        }
        curPos = (curPos + 1) % numWorkersAlive
      }
    }

    /**
      * Launch Executor processes on the workers
      */
    startExecutorsOnWorkers()
  }

launchDriver corresponds to purple step 1:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
    logInfo("Launching driver " + driver.id + " on worker " + worker.id)
    worker.addDriver(driver)
    driver.worker = Some(worker)
    // Send a LaunchDriver message to the worker's endpoint
    worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
    driver.state = DriverState.RUNNING
  }

The receive method in spark-2.4.5\core\src\main\scala\org\apache\spark\deploy\worker\Worker.scala matches LaunchDriver and instantiates a DriverRunner, corresponding to purple step 2:

 case LaunchDriver(driverId, driverDesc) =>
      logInfo(s"Asked to launch driver $driverId")
      val driver = new DriverRunner(
        conf,
        driverId,
        workDir,
        sparkHome,
        driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
        self,
        workerUri,
        securityMgr)
      drivers(driverId) = driver
      // Start the Driver: this initializes org.apache.spark.deploy.worker.DriverWrapper and runs its main method
      driver.start()

      coresUsed += driverDesc.cores
      memoryUsed += driverDesc.mem

The Driver being launched here is exactly the mainClass = "org.apache.spark.deploy.worker.DriverWrapper" mentioned earlier: launching the Driver means launching the DriverWrapper class, i.e. creating a Driver process on the Worker. Starting the Driver initializes org.apache.spark.deploy.worker.DriverWrapper and runs its main method, corresponding to purple step 3. The mainClass inside that method is the Application we actually submitted:

// Delegate to supplied main class
        val clazz = Utils.classForName(mainClass)
        // Get the main method of the submitted application
        val mainMethod = clazz.getMethod("main", classOf[Array[String]])

        /**
          * Invoke the main method of the user-submitted application,
          * which will in turn instantiate the SparkContext.
          */
        mainMethod.invoke(null, extraArgs.toArray[String])
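
The reflection pattern used here can be reproduced stand-alone. In this sketch (class names are illustrative, not Spark's), a class is resolved by name, its static main(Array[String]) forwarder is looked up, and it is invoked with a null receiver:

object TargetApp {
  def main(args: Array[String]): Unit = println("TargetApp running with: " + args.mkString(", "))
}

object InvokeMainSketch {
  def main(args: Array[String]): Unit = {
    val clazz = Class.forName("TargetApp")                       // Scala emits a static forwarder for main
    val mainMethod = clazz.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array("a", "b"))                     // null receiver because main is static
  }
}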

The Spark Driver starts and registers the Application with the Master

Once the Driver has been launched on the Worker, the SparkContext is instantiated. In the SparkContext class we can find:

  • Create the SparkEnv from the SparkConf, corresponding to red step 1:
// Create the Spark execution environment (cache, map output tracker, etc)
    _env = createSparkEnv(_conf, isLocal, listenerBus)
  • Create the TaskScheduler and the StandaloneSchedulerBackend, corresponding to red steps 2 and 3:
/**
      * sched and ts are the StandaloneSchedulerBackend and the TaskSchedulerImpl, respectively
      */
    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    // Calls TaskSchedulerImpl's start method
    _taskScheduler.start()

Step into createTaskScheduler:

// In standalone mode, jobs are submitted with a "spark://" master URL
      case SPARK_REGEX(sparkUrl) =>
        // Create the TaskSchedulerImpl object (scheduler)
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        // Call TaskSchedulerImpl.initialize to wire in the backend
        scheduler.initialize(backend)
        (backend, scheduler)

Since the job is submitted in standalone mode, this case matches. A TaskSchedulerImpl object named scheduler is created and passed into StandaloneSchedulerBackend(scheduler, sc, masterUrls); the returned backend is the cluster-level scheduler backend for the standalone environment, while scheduler is the TaskScheduler itself.

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

scheduler.initialize(backend) stores the backend as a field of the scheduler and builds the scheduling pool (FIFO or FAIR); note also that StandaloneSchedulerBackend extends CoarseGrainedSchedulerBackend.
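
The schedulingMode matched in initialize() is driven by configuration; as a small illustration (the application name is arbitrary), requesting FAIR mode looks like this:

import org.apache.spark.SparkConf

object SchedulerModeExample {
  // "spark.scheduler.mode" defaults to FIFO; setting it to FAIR makes
  // initialize() pick FairSchedulableBuilder instead of FIFOSchedulableBuilder.
  val conf: SparkConf = new SparkConf()
    .setAppName("scheduler-mode-example")
    .set("spark.scheduler.mode", "FAIR")
}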

  • Back to TaskSchedulerImpl's start method:
override def start() {
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleWithFixedDelay(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }

The start method simply calls backend.start(); since backend is a StandaloneSchedulerBackend, this invokes StandaloneSchedulerBackend's start():

override def start() {
    super.start()

    // SPARK-21159. The scheduler backend should only try to connect to the launcher when in client
    // mode. In cluster mode, the code that submits the application to the Master needs to connect
    // to the launcher instead.
    if (sc.deployMode == "client") {
      launcherBackend.connect()
    }

Enter super.start(): this method registers the Driver-side Endpoint with the RpcEnv (creating it first, corresponding to red step 4):

override def start() {
    val properties = new ArrayBuffer[(String, String)]
    for ((key, value) <- scheduler.sc.conf.getAll) {
      if (key.startsWith("spark.")) {
        properties += ((key, value))
      }
    }

    // TODO (prashant) send conf instead of properties
    /**
      * Create the Driver's Endpoint
      */
    driverEndpoint = createDriverEndpointRef(properties)
  }

  // Register this DriverEndpoint with the rpcEnv
  protected def createDriverEndpointRef(
      properties: ArrayBuffer[(String, String)]): RpcEndpointRef = {
    rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
  }

  protected def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
    new DriverEndpoint(rpcEnv, properties)
  }

After backend.start() has created the DriverEndpoint, it continues with the following code, which builds the application description and registers the application with the Master:

val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)

A StandaloneAppClient object is created, corresponding to red step 5. Entering client.start(), it simply registers an endpoint with the rpcEnv:

def start() {
    // Just launch an rpcEndpoint; it will call back into the listener.
    /**
      * rpcEnv.setupEndpoint creates the ClientEndpoint.
      * As soon as an Endpoint is registered, its onStart method is guaranteed to be called.
      */
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }

Creating the ClientEndpoint corresponds to red step 6.

Step into ClientEndpoint's onStart() method:

 override def onStart(): Unit = {
      try {
        // Register the application with the master
        registerWithMaster(1)
      } catch {
        case e: Exception =>
          logWarning("Failed to connect to master", e)
          markDisconnected()
          stop()
      }
    }

This registration corresponds to red step 7.

registerWithMaster() registers the current application with the Master; because the Master can be highly available, it registers with all Masters. Inside it we find tryRegisterAllMasters():

/**
     * Register with all masters asynchronously. It will call `registerWithMaster` every
     * REGISTRATION_TIMEOUT_SECONDS seconds until exceeding REGISTRATION_RETRIES times.
     * Once we connect to a master successfully, all scheduling work and Futures will be cancelled.
     *
     * nthRetry means this is the nth attempt to register with master.
     */
    private def registerWithMaster(nthRetry: Int) {
      // tryRegisterAllMasters attempts to register the application with every Master
      registerMasterFutures.set(tryRegisterAllMasters())
      registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
        override def run(): Unit = {
          if (registered.get) {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerMasterThreadPool.shutdownNow()
          } else if (nthRetry >= REGISTRATION_RETRIES) {
            markDead("All masters are unresponsive! Giving up.")
          } else {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerWithMaster(nthRetry + 1)
          }
        }
      }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
    }

Inside tryRegisterAllMasters(), a reference to the Master's Endpoint is obtained and a RegisterApplication message is sent to the Master; the Master's receive method then matches the RegisterApplication case:


    /**
     *  Register with all masters asynchronously and returns an array `Future`s for cancellation.
     */
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      for (masterAddress <- masterRpcAddresses) yield {
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered.get) {
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            // Get a reference to the Master's Endpoint
            val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            // Send RegisterApplication to the master; the Master's receive method matches this type
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        })
      }
    }

How the Master handles a RegisterApplication message:

case RegisterApplication(description, driver) =>
      // TODO Prevent repeated registrations from some driver
      // Ignore if this master is in STANDBY state
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        // Wrap the application info
        val app = createApplication(description, driver)
        // Register the app
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        // Send a RegisteredApplication message back to the Driver
        driver.send(RegisteredApplication(app.id, self))
        // Call the general-purpose schedule() method again
        schedule()
      }

This corresponds to red step 8.

At this point the flow of the Driver registering the Application with the Master is complete.

The Master sends messages to launch Executors

Continuing with the schedule() method above: earlier we stepped into launchDriver(worker, driver); now we come back out and continue with startExecutorsOnWorkers():

/**
   * Schedule and launch executors on workers
   */
  private def startExecutorsOnWorkers(): Unit = {
    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
    // Get the submitted apps from waitingApps
    for (app <- waitingApps) {
      // Number of cores to use per Executor for this application
      val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
      // If the cores left is less than the coresPerExecutor,the cores left will not be allocated
      if (app.coresLeft >= coresPerExecutor) {
        // Filter out workers that don't have enough resources to launch an executor
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= coresPerExecutor)
          .sortBy(_.coresFree).reverse

        /**
          * Decide how many cores each worker contributes and how many Executors to launch on it;
          * the returned assignedCores is the number of cores each worker should allocate to this application.
          */
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

        // Now that we've decided how many cores to allocate on each worker, let's allocate them
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          // Allocate this worker's resources to Executors
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
        }
      }
    }
  }
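
scheduleExecutorsOnWorkers, called above, decides how many cores each usable worker contributes. A toy version of that decision (not the real method) shows the effect of the spreadOut flag:

object AssignCoresSketch {
  def assignCores(freeCores: Array[Int], coresNeeded: Int, spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var left = coresNeeded
    var pos = 0
    while (left > 0 && assigned.zip(freeCores).exists { case (a, f) => a < f }) {
      if (assigned(pos) < freeCores(pos)) {
        assigned(pos) += 1
        left -= 1
      }
      // spreadOut: always move to the next worker; otherwise fill this worker before moving on
      if (spreadOut || assigned(pos) == freeCores(pos)) pos = (pos + 1) % freeCores.length
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    println(assignCores(Array(4, 4, 4), 6, spreadOut = true).mkString(","))   // 2,2,2
    println(assignCores(Array(4, 4, 4), 6, spreadOut = false).mkString(","))  // 4,2,0
  }
}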

scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps) returns how many cores are assigned on each Worker. Then allocateWorkerResourceToExecutors is called with the current application, the cores assigned on that worker, the number of cores per Executor, and the usable worker:

/**
   * Allocate a worker's resources to one or more executors.
   * @param app the info of the application which the executors belong to
   * @param assignedCores number of cores on this worker for this application
   * @param coresPerExecutor number of cores per executor
   * @param worker the worker info
   */
  private def allocateWorkerResourceToExecutors(
      app: ApplicationInfo,
      assignedCores: Int,
      coresPerExecutor: Option[Int],
      worker: WorkerInfo): Unit = {
    // If the number of cores per executor is specified, we divide the cores assigned
    // to this worker evenly among the executors with no remainder.
    // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
    val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
    // How many cores each Executor gets
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    for (i <- 1 to numExecutors) {
      val exec = app.addExecutor(worker, coresToAssign)
      // Launch the Executor on the worker
      launchExecutor(worker, exec)
      app.state = ApplicationState.RUNNING
    }
  }

In launchExecutor(worker, exec), the Worker's mailbox reference is obtained and a LaunchExecutor message is sent to it, specifying how many cores and how much memory the Executor needs; the Worker's receive method then matches the LaunchExecutor type. This corresponds to blue step 1:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec)

    /**
      * Get the worker's mailbox reference and send it a LaunchExecutor message
      */
    worker.endpoint.send(LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))

    /**
      * Get the driver's mailbox reference and send it an ExecutorAdded message
      */
    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }

The Worker receives the message and performs blue steps 2 and 3:

//  Receive the LaunchExecutor message
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create local dirs for the executor. These are passed to the executor via the
          // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
          // application finishes.
          val appLocalDirs = appDirectories.getOrElse(appId, {
            val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
            val dirs = localRootDirs.flatMap { dir =>
              try {
                val appDir = Utils.createDirectory(dir, namePrefix = "executor")
                Utils.chmod700(appDir)
                Some(appDir.getAbsolutePath())
              } catch {
                case e: IOException =>
                  logWarning(s"${e.getMessage}. Ignoring this directory.")
                  None
              }
            }.toSeq
            if (dirs.isEmpty) {
              throw new IOException("No subfolder can be created in " +
                s"${localRootDirs.mkString(",")}.")
            }
            dirs
          })
          appDirectories(appId) = appLocalDirs
          // Create an ExecutorRunner object

          val manager = new ExecutorRunner(
            appId,
            execId,

            /**
              * appDesc contains Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", ...);
              * its first argument is the Executor backend class.
              */

            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_,
            memory_,
            self,
            workerId,
            host,
            webUi.boundPort,
            publicAddress,
            sparkHome,
            executorDir,
            workerUri,
            conf,
            appLocalDirs, ExecutorState.RUNNING)
          executors(appId + "/" + execId) = manager
          // Start the ExecutorRunner
          manager.start()
          coresUsed += cores_
          memoryUsed += memory_
          // Send an ExecutorStateChanged message to the master
          sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))

The manager starts the Executor, i.e. the CoarseGrainedExecutorBackend class. In CoarseGrainedExecutorBackend's main method we can see the reverse registration with the Driver:

val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
      
      // Register the Executor's endpoint; this triggers CoarseGrainedExecutorBackend's onStart method
      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      env.rpcEnv.awaitTermination()

CoarseGrainedExecutorBackend's onStart() method is then called:

override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    // Get the Driver's reference from the RPC env and register this Executor back with the Driver
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      // Got the Driver's reference
      driver = Some(ref)
      /**
        * Register the Executor's info back with the Driver, i.e. with the DriverEndpoint inside the
        * CoarseGrainedSchedulerBackend seen earlier; DriverEndpoint's receiveAndReply matches RegisterExecutor.
        */
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        // Always receive `true`. Just ignore it
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }

In this method the Driver's reference is obtained from the RPC environment and the Executor registers itself back with the Driver; the ref here points to the DriverEndpoint in CoarseGrainedSchedulerBackend. In CoarseGrainedSchedulerBackend we then find the case that matches RegisterExecutor and completes the reverse registration:

/**
            * Get the Executor's mailbox reference and send it a message saying the Executor has been registered.
            * CoarseGrainedExecutorBackend's receive method is waiting for this; once it matches, the Executor is started.
            */
          executorRef.send(RegisteredExecutor)

The Driver side tells the Executor side that it has been registered; once the match succeeds, the Executor is started. Look at the Executor-side case that matches RegisteredExecutor and starts the Executor:

override def receive: PartialFunction[Any, Unit] = {
    // The Driver's message confirms the Executor's registration was accepted, so create the Executor
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

Once the Executor backend is told that its registration succeeded, it creates the actual Executor, which contains a thread pool for running tasks:

private val threadPool = {
    val threadFactory = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat("Executor task launch worker-%d")
      .setThreadFactory(new ThreadFactory {
        override def newThread(r: Runnable): Thread =
          // Use UninterruptibleThread to run tasks so that we can allow running codes without being
          // interrupted by `Thread.interrupt()`. Some issues, such as KAFKA-1894, HADOOP-10622,
          // will hang forever if some methods are interrupted.
          new UninterruptibleThread(r, "unused") // thread name will be set by ThreadFactoryBuilder
      })
      .build()
    Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
  }
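
The way tasks later use this pool can be sketched in a few lines: each task is wrapped in a Runnable (Spark's TaskRunner) and handed to the cached pool. The class below is a stand-in, not Spark's TaskRunner:

import java.util.concurrent.{Executors, TimeUnit}

object TaskPoolSketch {
  private class FakeTaskRunner(taskId: Long) extends Runnable {
    override def run(): Unit =
      println(s"task $taskId running on ${Thread.currentThread().getName}")
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newCachedThreadPool()
    (1L to 3L).foreach(id => pool.execute(new FakeTaskRunner(id)))
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
  }
}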

At this point the Executor has been created.

The Master receives ExecutorStateChanged and performs blue step 4:

//  The master receives the ExecutorStateChanged message
    case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        case Some(exec) =>
          val appInfo = idToApp(appId)
          val oldState = exec.state
          exec.state = state

          if (state == ExecutorState.RUNNING) {
            assert(oldState == ExecutorState.LAUNCHING,
              s"executor $execId state transfer from $oldState to RUNNING is illegal")
            appInfo.resetRetryCount()
          }
          // Send an ExecutorUpdated message to the driver
          exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

Spark task scheduling

With the Executors created, scheduling can begin. The user program needs an action operator to trigger computation; let's take count() as an example.

Location: spark-2.4.5\core\src\main\scala\org\apache\spark\rdd\RDD.scala

/**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
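
For example, assuming an existing SparkContext named sc, the following call (illustrative data) goes straight into the runJob chain traced below:

val n = sc.parallelize(1 to 100, numSlices = 4).count()   // one job, four tasks
println(n)                                                // 100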

sc is the SparkContext object; step into runJob:

/**
   * Run a job on all partitions in an RDD and return the results in an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

Step into runJob again:

 /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

Step into the next runJob overload:

/**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   * The function that is run against each partition additionally takes `TaskContext` argument.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

Stepping into runJob once more, we find rdd.doCheckpoint(), which walks back through the entire RDD lineage. The rdd parameter passed into runJob is used in the call to dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get):

/**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @param resultHandler callback to pass each result to
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Inside that method, keep following the rdd parameter: it is passed to submitJob; step into submitJob to see where rdd is used:

/**
   * Run an action job on the given RDD and pass all the results to the resultHandler function as
   * they arrive.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @note Throws `Exception` when the job fails
   */
  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }
/**
   * Submit an action job to the scheduler.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @return a JobWaiter object that can be used to block until the job finishes executing
   *         or can be used to cancel the job.
   *
   * @throws IllegalArgumentException when partitions ids are illegal
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    // submit the job; eventProcessLoop is a DAGSchedulerEventProcessLoop object
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }
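
The partitions argument is exactly what the range check above guards. As the first() note in the doc comment hints, a job does not have to touch every partition; a hedged sketch of one of the public runJob overloads running on just the first two partitions (again assuming a SparkContext named sc):

// Sketch: run an action on a subset of partitions; this is why runJob/submitJob
// take partitions: Seq[Int] (e.g. first() initially only needs partition 0).
val rdd = sc.parallelize(1 to 8, numSlices = 4)
val partialSums: Array[Int] =
  sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum, Seq(0, 1))
// one entry per requested partition, filled through the resultHandler shown earlier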

The rdd is passed into eventProcessLoop.post as part of a JobSubmitted event. Inside post, eventQueue.put(event) places the submitted job on a queue:

/**
   * Put the event into the event queue. The event thread will process it later.
   */
  def post(event: E): Unit = {
    // put the submitted event on the queue
    eventQueue.put(event)
  }

Look at the type of eventQueue:

private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

eventQueue lives in the EventLoop class, which also runs a daemon thread that takes events off the queue and processes them; the processing logic is in onReceive(event):

private[spark] val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }
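
Conceptually this is a plain producer/consumer loop. The sketch below is a simplified stand-in (a hypothetical class, not Spark's actual EventLoop) showing the same pattern: post() enqueues an event, and a daemon thread takes events off a LinkedBlockingDeque and hands them to a handler:

import java.util.concurrent.LinkedBlockingDeque

// Simplified sketch of the EventLoop pattern (not Spark source).
class SimpleEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          handle(queue.take())          // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() interrupts a blocked take()
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt() }
}

eventProcessLoop plays exactly this role, with DAGSchedulerEvent as the event type and doOnReceive as the handler.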

onReceive is abstract here, so to see the actual logic we look at the concrete subclass of EventLoop that eventProcessLoop is an instance of, DAGSchedulerEventProcessLoop:

private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {

  private[this] val timer = dagScheduler.metricsSource.messageProcessingTimer

  /**
   * The main event loop of the DAG scheduler.
   */
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

doOnReceive(event) pattern-matches on the event and, for JobSubmitted, runs handleJobSubmitted:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

handleJobSubmitted creates the final ResultStage and then calls submitStage(finalStage), which recursively finds the missing parent Stages along the wide/narrow dependencies and submits them:

private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: BarrierJobSlotsNumberCheckFailed =>
        logWarning(s"The job $jobId requires to run a barrier stage that requires more slots " +
          "than the total number of slots in the cluster currently.")
        // If jobId doesn't exist in the map, Scala coverts its value null to 0: Int automatically.
        val numCheckFailures = barrierJobIdToNumTasksCheckFailures.compute(jobId,
          new BiFunction[Int, Int, Int] {
            override def apply(key: Int, value: Int): Int = value + 1
          })
        if (numCheckFailures <= maxFailureNumTasksCheck) {
          messageScheduler.schedule(
            new Runnable {
              override def run(): Unit = eventProcessLoop.post(JobSubmitted(jobId, finalRDD, func,
                partitions, callSite, listener, properties))
            },
            timeIntervalNumTasksCheck,
            TimeUnit.SECONDS
          )
          return
        } else {
          // Job failed, clear internal data.
          barrierJobIdToNumTasksCheckFailures.remove(jobId)
          listener.jobFailed(e)
          return
        }

      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    // Job submitted, clear internal data.
    barrierJobIdToNumTasksCheckFailures.remove(jobId)

    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions".format(
      job.jobId, callSite.shortForm, partitions.length))
    logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))

    val jobSubmissionTime = clock.getTimeMillis()
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
    // recursively find and submit the stages
    submitStage(finalStage)
  }
/** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug(s"submitStage($stage (name=${stage.name};" +
        s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          // split the stage into a TaskSet and submit it
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            // recurse into the missing parent stage first
            submitStage(parent)
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }
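
To make the recursion concrete, consider an illustrative word-count job (assuming a SparkContext named sc; the input path is made up). The shuffle introduced by reduceByKey is a wide dependency, so submitStage(finalStage) first recurses into and submits the parent ShuffleMapStage, and only then submits the final ResultStage:

// Illustration only; the HDFS path is hypothetical.
val words = sc.textFile("hdfs:///tmp/wordcount-input")
  .flatMap(_.split("\\s+"))
  .map((_, 1))                          // narrow dependencies: same stage
val counts = words.reduceByKey(_ + _)   // wide (shuffle) dependency: new stage boundary
counts.collect()                        // action -> JobSubmitted -> handleJobSubmitted -> submitStage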

After the stages are determined, submitMissingTasks(stage, jobId.get) turns the stage into tasks that will be sent to the Executors, and submits them as a TaskSet:

      // submit the tasks as a TaskSet
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))

To hand the tasks over, TaskSchedulerImpl.submitTasks ends by calling backend.reviveOffers(), which is implemented in CoarseGrainedSchedulerBackend:

// at the end of TaskSchedulerImpl.submitTasks:
backend.reviveOffers()

// CoarseGrainedSchedulerBackend:
override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

This corresponds to pink arrow 2 in the flow diagram.

Note: pink arrow 1 works as follows:

  • During SparkContext initialization, the TaskSchedulerImpl object's start method is called, which calls backend.start(), i.e. CoarseGrainedSchedulerBackend.start(), where the Driver's Endpoint is created:

    /**
      * Create the Driver's Endpoint reference (RpcEndpointRef)
      */
    driverEndpoint = createDriverEndpointRef(properties)

The inner class DriverEndpoint inside CoarseGrainedSchedulerBackend receives the message in its receive method:

case ReviveOffers =>
        makeOffers()

// Make fake resource offers on all executors
    private def makeOffers() {
      // Make sure no executor is killed while some task is launching on it
      val taskDescs = withLock {
        // Filter out executors under killing
        val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
        val workOffers = activeExecutors.map {
          case (id, executorData) =>
            new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
              Some(executorData.executorAddress.hostPort))
        }.toIndexedSeq
        // calls TaskSchedulerImpl.resourceOffers
        scheduler.resourceOffers(workOffers)
      }
      if (!taskDescs.isEmpty) {
        // launch the tasks
        launchTasks(taskDescs)
      }
    }

This corresponds to pink arrow 3.

Follow the launchTasks method, which corresponds to pink arrow 4:

// Launch tasks returned by a set of resource offers
    private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        val serializedTask = TaskDescription.encode(task)
        if (serializedTask.limit() >= maxRpcMessageSize) {
          Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        }
        else {
          val executorData = executorDataMap(task.executorId)
          executorData.freeCores -= scheduler.CPUS_PER_TASK

          logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")
          // send a LaunchTask message to the Executor
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
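
If a serialized task exceeds this limit, the whole TaskSet is aborted with the message above. The limit is the spark.rpc.message.maxSize setting (in MB, 128 by default); a hedged configuration sketch, bearing in mind that large closures are usually better handled with broadcast variables:

import org.apache.spark.SparkConf

// Example value only; 256 MB is an arbitrary illustrative limit.
val conf = new SparkConf()
  .setAppName("big-task-demo")
  .set("spark.rpc.message.maxSize", "256")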

On the Executor side, CoarseGrainedExecutorBackend.receive handles the LaunchTask message:

case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        executor.launchTask(this, taskDesc)
      }

The Executor's thread pool then runs the task, which corresponds to pink arrow 6:

def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription)
    runningTasks.put(taskDescription.taskId, tr)
    threadPool.execute(tr)
  }
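
Stripped of Spark's bookkeeping, launchTask boils down to "wrap the work in a Runnable and hand it to a thread pool". A simplified sketch of that pattern (not Spark source; the real Executor maintains a similar cached thread pool internally):

import java.util.concurrent.Executors

// Simplified illustration of the launchTask pattern.
val threadPool = Executors.newCachedThreadPool()
val work: Runnable = () => println(s"task running on ${Thread.currentThread().getName}")
threadPool.execute(work)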

At this point, task scheduling is complete.

Task execution and result return

A thread in the Executor's pool runs the task via TaskRunner.run(); when the task finishes, its status is reported back, which corresponds to green arrow 1:

setTaskFinishedAndClearInterruptStatus()
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

Follow the statusUpdate method; note that this is CoarseGrainedExecutorBackend.statusUpdate on the Executor side, which sends a StatusUpdate message to the Driver:

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
    val msg = StatusUpdate(executorId, taskId, state, data)
    driver match {
      // send a StatusUpdate message to the DriverEndpoint
      case Some(driverRef) => driverRef.send(msg)
      case None => logWarning(s"Drop $msg because has not yet connected to driver")
    }
  }

The DriverEndpoint receives the StatusUpdate message and carries out green arrows 3 and 4:

override def receive: PartialFunction[Any, Unit] = {
      case StatusUpdate(executorId, taskId, state, data) =>
        // delegates to TaskSchedulerImpl.statusUpdate
        scheduler.statusUpdate(taskId, state, data.value)
        if (TaskState.isFinished(state)) {
          executorDataMap.get(executorId) match {
            case Some(executorInfo) =>
              executorInfo.freeCores += scheduler.CPUS_PER_TASK
              // back to makeOffers, which calls scheduler.resourceOffers() again
              makeOffers(executorId)
            case None =>
              // Ignoring the update since we don't know about the executor.
              logWarning(s"Ignored task status update ($taskId state $state) " +
                s"from unknown executor with ID $executorId")
          }
        }

Summary

(image placeholder)

The flow in detail:

  1. After the cluster starts, each Worker node reports its resources to the Master, so the Master has a full picture of the cluster's resources.
  2. When a Spark Application is submitted, a DAG is built from the dependencies between its RDDs. On the Driver side, Spark creates two objects for scheduling: DAGScheduler and TaskScheduler.
  3. DAGScheduler is the high-level scheduler. It cuts the DAG into Stages at the wide (shuffle) dependencies and submits each Stage to the TaskScheduler as a TaskSet (the TaskScheduler is the low-level scheduler; a TaskSet is simply a collection wrapping the individual tasks of a stage, i.e. the stage's parallel tasks).
  4. The TaskScheduler iterates over the TaskSet and sends each task to an Executor on a worker node, where it is executed by the Executor's thread pool.
  5. The Executor reports each task's progress back to the TaskScheduler. If a task fails, the TaskScheduler retries it by resending it to an Executor, 3 times by default. If it still fails after those retries, the stage containing the task fails. A failed stage is retried by the DAGScheduler, which resends the TaskSet to the TaskScheduler, 4 times by default. If the stage still fails after that, the job fails, and with it the Application. So in the worst case a task is retried 3 * 4 = 12 times.
  6. Besides retrying failed tasks, the TaskScheduler also handles straggling tasks (tasks running much more slowly than their peers): it can launch a duplicate task running the same logic, and whichever copy finishes first wins, its result being the one used. This is Spark's speculative execution, which is disabled by default and enabled via the spark.speculation property (see the configuration sketch after this list).
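
The retry and speculation behaviour described above is driven by ordinary Spark configuration. A hedged sketch using the standard property names (the values are just examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.task.maxFailures", "4")              // task attempts: 1 run + 3 retries
  .set("spark.stage.maxConsecutiveAttempts", "4")  // stage-level retry limit
  .set("spark.speculation", "true")                // speculative execution, off by default
  .set("spark.speculation.multiplier", "1.5")      // how much slower than the median counts as a straggler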

