8. Worker Internals

tianlan996

Column: Spark Principles and Source Code (based on Spark 2.4.0)

Article link: https://blog.csdn.net/tianlan996/article/details/86578546

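Part 1: The worker launches a driver

1 The master sends the worker a launch-driver message, worker.endpoint.send(LaunchDriver(driver.id, driver.desc)); see Section 7.

2 On receiving the message, the worker enters the LaunchDriver branch of its message handler.

3 It creates a DriverRunner instance and calls its start method.

4 DriverRunner.start creates a thread, which does the following:

4.1 Creates the driver's working directory.

4.2 Downloads the user jar into the driver's working directory.

4.3 Builds the command that launches the driver and starts the driver with a ProcessBuilder.

4.4 Sends a driver-state-changed message to the current worker.

5 The worker updates its record of CPU cores and memory in use.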

    // 2. On receiving the message, the worker handles the LaunchDriver case
    case LaunchDriver(driverId, driverDesc) =>
      logInfo(s"Asked to launch driver $driverId")
      // 3. Create a DriverRunner instance and call its start method
      val driver = new DriverRunner(
        conf,
        driverId,
        workDir,
        sparkHome,
        driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
        self,
        workerUri,
        securityMgr)
      drivers(driverId) = driver
      driver.start()

      // 5. Update the worker's record of cores and memory in use
      coresUsed += driverDesc.cores
      memoryUsed += driverDesc.mem


  /** Starts a thread to run and manage the driver. */
  private[worker] def start() = {
    // 4. DriverRunner.start creates a thread
    new Thread("DriverRunner for " + driverId) {
      override def run() {
        var shutdownHook: AnyRef = null
        try {
          shutdownHook = ShutdownHookManager.addShutdownHook { () =>
            logInfo(s"Worker shutting down, killing driver $driverId")
            kill()
          }

          // prepare driver jars and run driver
          val exitCode = prepareAndRunDriver()

          // set final state depending on if forcibly killed and process exit code
          finalState = if (exitCode == 0) {
            Some(DriverState.FINISHED)
          } else if (killed) {
            Some(DriverState.KILLED)
          } else {
            Some(DriverState.FAILED)
          }
        } catch {
          case e: Exception =>
            kill()
            finalState = Some(DriverState.ERROR)
            finalException = Some(e)
        } finally {
          if (shutdownHook != null) {
            ShutdownHookManager.removeShutdownHook(shutdownHook)
          }
        }

        // notify worker of final driver state, possible exception
        // 4.4 Send the driver-state-changed message to the current worker
        worker.send(DriverStateChanged(driverId, finalState.get, finalException))
      }
    }.start()
  }

  private[worker] def prepareAndRunDriver(): Int = {
    val driverDir = createWorkingDirectory() // 4.1 Create the driver's working directory
    val localJarFilename = downloadUserJar(driverDir) // 4.2 Download the jar into that directory

    def substituteVariables(argument: String): String = argument match {
      case "{{WORKER_URL}}" => workerUrl
      case "{{USER_JAR}}" => localJarFilename
      case other => other
    }

    // TODO: If we add ability to submit multiple jars they should also be added here
    val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
      driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables) // 4.3 Build the launch command

    runDriver(builder, driverDir, driverDesc.supervise) // 4.3 Start the driver with the ProcessBuilder
  }
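The placeholder substitution in prepareAndRunDriver can be tried in isolation. The following is a minimal sketch, not the Spark implementation: the object name, the curried signature, and all sample values (worker URL, jar path, argument list) are assumptions made for illustration. Only the two placeholder strings come from the code above.

```scala
object DriverCommandSketch {
  // Replace the two placeholders when they appear as whole arguments;
  // every other argument is passed through unchanged.
  def substituteVariables(workerUrl: String, localJar: String)(argument: String): String =
    argument match {
      case "{{WORKER_URL}}" => workerUrl
      case "{{USER_JAR}}"   => localJar
      case other            => other
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical command template and values, for illustration only.
    val template = Seq("org.example.Main", "{{WORKER_URL}}", "{{USER_JAR}}", "--verbose")
    val resolved = template.map { arg =>
      substituteVariables("spark://Worker@host-a:7078", "/tmp/driver-1/app.jar")(arg)
    }
    println(resolved.mkString(" "))
  }
}
```

Note that the match is on the whole argument, so a placeholder embedded inside a longer string is left as it is.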


  private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
    builder.directory(baseDir)
    def initialize(process: Process): Unit = {
      // Redirect stdout and stderr to files
      val stdout = new File(baseDir, "stdout")
      CommandUtils.redirectStream(process.getInputStream, stdout)

      val stderr = new File(baseDir, "stderr")
      val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
      val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
      Files.append(header, stderr, StandardCharsets.UTF_8)
      CommandUtils.redirectStream(process.getErrorStream, stderr)
    }
    runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise) // 4.3 Start the driver with the ProcessBuilder
  }
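The way DriverRunner.start picks the final state from the exit code and the killed flag is easy to misread, so here is a small standalone sketch of just that decision. The State type and the object name are stand-ins invented for this example (Spark uses DriverState); the branch order mirrors the start method quoted above.

```scala
object DriverFinalStateSketch {
  // Stand-in for the DriverState values used in the snippet above.
  sealed trait State
  case object Finished extends State
  case object Killed extends State
  case object Failed extends State

  // Same branch order as in DriverRunner.start: the exit code is checked
  // first, so a zero exit code yields Finished even if `killed` is set.
  def finalState(exitCode: Int, killed: Boolean): State =
    if (exitCode == 0) Finished
    else if (killed) Killed
    else Failed

  def main(args: Array[String]): Unit = {
    println(finalState(exitCode = 0, killed = false))
    println(finalState(exitCode = 143, killed = true))
    println(finalState(exitCode = 1, killed = false))
  }
}
```

The ERROR state is not modelled here because it is set in the catch branch, when launching the driver throws an exception.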

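Part 2: The worker launches an executor

1 The master sends the worker a launch-executor message, worker.endpoint.send(LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)); see Section 7.

2 On receiving the message, the worker enters the LaunchExecutor branch of its message handler.

3 It creates the executor's working directory.

4 It creates an ExecutorRunner instance and calls its start method.

5 ExecutorRunner.start creates a thread, which does the following:

5.1 Builds the command that launches the executor and starts the executor with a ProcessBuilder.

5.2 Sends an executor-state-changed message to the current worker.

6 The worker sends an ExecutorStateChanged message to the master.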

    // 2. On receiving the message, the worker handles the LaunchExecutor case
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's working directory
          // 3. Create the executor's working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create local dirs for the executor. These are passed to the executor via the
          // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
          // application finishes.
          val appLocalDirs = appDirectories.getOrElse(appId, {
            val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
            val dirs = localRootDirs.flatMap { dir =>
              try {
                val appDir = Utils.createDirectory(dir, namePrefix = "executor")
                Utils.chmod700(appDir)
                Some(appDir.getAbsolutePath())
              } catch {
                case e: IOException =>
                  logWarning(s"${e.getMessage}. Ignoring this directory.")
                  None
              }
            }.toSeq
            if (dirs.isEmpty) {
              throw new IOException("No subfolder can be created in " +
                s"${localRootDirs.mkString(",")}.")
            }
            dirs
          })
          appDirectories(appId) = appLocalDirs
          // 4. Create an ExecutorRunner instance and call its start method
          val manager = new ExecutorRunner(
            appId,
            execId,
            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_,
            memory_,
            self,
            workerId,
            host,
            webUi.boundPort,
            publicAddress,
            sparkHome,
            executorDir,
            workerUri,
            conf,
            appLocalDirs, ExecutorState.RUNNING)
          executors(appId + "/" + execId) = manager
          manager.start()
          coresUsed += cores_
          memoryUsed += memory_
          // 6. Send the ExecutorStateChanged message to the master
          sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
        } catch {
          case e: Exception =>
            logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
            if (executors.contains(appId + "/" + execId)) {
              executors(appId + "/" + execId).kill()
              executors -= appId + "/" + execId
            }
            sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
              Some(e.toString), None))
        }
      }
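The resource accounting in the LaunchExecutor branch (the executors map keyed by appId + "/" + execId, plus coresUsed and memoryUsed) can be sketched on its own. This is a simplified model under stated assumptions: the class and method names are hypothetical, and the `finished` method that gives resources back is an assumption added for symmetry, since the code that releases resources is not shown in this article.

```scala
import scala.collection.mutable

object WorkerBookkeepingSketch {
  final case class ExecutorSlot(cores: Int, memoryMb: Int)

  // Hypothetical, simplified stand-in for the worker's accounting fields.
  final class Bookkeeping(totalCores: Int, totalMemoryMb: Int) {
    private val executors = mutable.HashMap.empty[String, ExecutorSlot]
    private var coresUsed = 0
    private var memoryUsed = 0

    def coresFree: Int = totalCores - coresUsed
    def memoryFree: Int = totalMemoryMb - memoryUsed

    // Mirrors the launch path above: record the executor under
    // appId + "/" + execId, then add its cores and memory to the totals.
    def launched(appId: String, execId: Int, cores: Int, memoryMb: Int): String = {
      val fullId = appId + "/" + execId
      executors(fullId) = ExecutorSlot(cores, memoryMb)
      coresUsed += cores
      memoryUsed += memoryMb
      fullId
    }

    // Assumption: when an executor ends, its resources are returned.
    def finished(fullId: String): Unit =
      executors.remove(fullId).foreach { slot =>
        coresUsed -= slot.cores
        memoryUsed -= slot.memoryMb
      }
  }

  def main(args: Array[String]): Unit = {
    val worker = new Bookkeeping(totalCores = 8, totalMemoryMb = 16384)
    val id = worker.launched("app-0001", 0, cores = 2, memoryMb = 1024)
    println(s"$id -> free cores: ${worker.coresFree}, free memory: ${worker.memoryFree}")
    worker.finished(id)
    println(s"after finish -> free cores: ${worker.coresFree}")
  }
}
```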


  private[worker] def start() {
    // 5. ExecutorRunner.start creates a thread
    workerThread = new Thread("ExecutorRunner for " + fullId) {
      // 5.1 Build the executor launch command and start the executor with a ProcessBuilder
      override def run() { fetchAndRunExecutor() }
    }
    workerThread.start()
    // Shutdown hook that kills actors on shutdown.
    shutdownHook = ShutdownHookManager.addShutdownHook { () =>
      // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
      // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
      if (state == ExecutorState.RUNNING) {
        state = ExecutorState.FAILED
      }
      killProcess(Some("Worker shutting down")) }
  }

  /**
   * Download and run the executor described in our ApplicationDescription
   */
  private def fetchAndRunExecutor() {
    try {
      // Launch the process
      val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
        memory, sparkHome.getAbsolutePath, substituteVariables)
      val command = builder.command()
      val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
      logInfo(s"Launch command: $formattedCommand")

      builder.directory(executorDir)
      builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
      // In case we are running this from within the Spark Shell, avoid creating a "scala"
      // parent process for the executor command
      builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

      // Add webUI log urls
      val baseUrl =
        if (conf.getBoolean("spark.ui.reverseProxy", false)) {
          s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
        } else {
          s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
        }
      builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
      builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

      process = builder.start()
      val header = "Spark Executor Command: %s\n%s\n\n".format(
        formattedCommand, "=" * 40)

      // Redirect its stdout and stderr to files
      val stdout = new File(executorDir, "stdout")
      stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

      val stderr = new File(executorDir, "stderr")
      Files.write(header, stderr, StandardCharsets.UTF_8)
      stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

      // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
      // or with nonzero exit code
      val exitCode = process.waitFor()
      state = ExecutorState.EXITED
      val message = "Command exited with code " + exitCode
      // 5.2 Send the executor-state-changed message to the current worker
      worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
    } catch {
      case interrupted: InterruptedException =>
        logInfo("Runner thread for executor " + fullId + " interrupted")
        state = ExecutorState.KILLED
        killProcess(None)
      case e: Exception =>
        logError("Error running executor", e)
        state = ExecutorState.FAILED
        killProcess(Some(e.toString))
    }
  }
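The log URLs that fetchAndRunExecutor puts into the executor's environment depend on whether spark.ui.reverseProxy is enabled. The sketch below isolates that string construction as pure functions so the two shapes can be compared. The object and function names are made up for this example, and the sample values in main are hypothetical; the URL templates and environment variable names are taken from the code above.

```scala
object ExecutorLogUrlSketch {
  // Same two URL templates as in fetchAndRunExecutor above.
  def baseUrl(reverseProxy: Boolean, workerId: String, publicAddress: String,
      webUiPort: Int, appId: String, execId: Int): String =
    if (reverseProxy) {
      s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
    } else {
      s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
    }

  // The two environment variables handed to the executor process.
  def logUrlEnv(base: String): Map[String, String] = Map(
    "SPARK_LOG_URL_STDERR" -> s"${base}stderr",
    "SPARK_LOG_URL_STDOUT" -> s"${base}stdout")

  def main(args: Array[String]): Unit = {
    // Hypothetical identifiers, for illustration only.
    val direct = baseUrl(false, "worker-1", "host-a", 8081, "app-0001", 0)
    logUrlEnv(direct).foreach { case (k, v) => println(s"$k=$v") }
  }
}
```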

The material below draws on:

https://www.cnblogs.com/jcchoiling/p/6436185.html

https://blog.csdn.net/chic_data/article/details/77317730

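The process launched for an executor is called CoarseGrainedExecutorBackend; the information needed to launch it is carried in the application description (app desc).

Step 1: The Master instructs the Worker to launch an Executor.
Step 2: On receiving the instruction, the Worker uses an ExecutorRunner to start a new thread, and that thread launches a separate process in which the Executor runs.
Step 3: The Executor is registered with the Driver by sending RegisterExecutor. The registration is sent from the separate process that the Worker launched, not from the Worker itself.

Question: why start a new process rather than send the message to the Driver from inside the existing Worker process? The Worker's job is to manage the resources of the current machine and to report resource changes to the Master. It is not meant for computation, so computation must not run inside the Worker. In addition, a Spark cluster may serve many applications, and each application needs its own Executors; if each Executor did not get its own process, a crash in one application would bring the others down as well.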

1. When CoarseGrainedExecutorBackend starts, "registering the Executor" with the Driver really means registering the ExecutorBackend instance; it has no direct relationship to an Executor instance.

  override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      // Keep the reference to the driver endpoint
      driver = Some(ref)
      // Send the RegisterExecutor message to the driver
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        // Always receive `true`. Just ignore it
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }

2. CoarseGrainedExecutorBackend is the name of the process in which the Executor runs. The Executor is the object that actually processes Tasks, and internally it runs them on a thread pool.

  // Executor class
  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription) // a TaskRunner is created for every task
    runningTasks.put(taskDescription.taskId, tr) // track it in the running-task map
    threadPool.execute(tr) // hand it to the thread pool for execution
  }
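The pattern in launchTask (wrap the task in a Runnable, record it in a concurrent map, submit it to a thread pool) can be reproduced with plain java.util.concurrent classes. The sketch below is an illustration under assumptions, not Spark's Executor: MiniExecutor, its cached thread pool, and the choice to remove a task from the map in a finally block are all made up for this example.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

object TaskPoolSketch {
  // One runner per task; it removes itself from the map when it ends.
  final class TaskRunner(val taskId: Long, body: () => Unit,
      running: ConcurrentHashMap[Long, TaskRunner]) extends Runnable {
    override def run(): Unit =
      try body() finally running.remove(taskId)
  }

  // Hypothetical, heavily simplified stand-in for the Executor class.
  final class MiniExecutor {
    private val threadPool = Executors.newCachedThreadPool()
    val runningTasks = new ConcurrentHashMap[Long, TaskRunner]()

    def launchTask(taskId: Long)(body: () => Unit): Unit = {
      val tr = new TaskRunner(taskId, body, runningTasks)
      runningTasks.put(taskId, tr) // track the task before it starts
      threadPool.execute(tr)       // run it on a pool thread
    }

    // Stop accepting tasks and wait for the submitted ones to end.
    def shutdownAndWait(): Boolean = {
      threadPool.shutdown()
      threadPool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }

  def main(args: Array[String]): Unit = {
    val done = new AtomicLong(0)
    val executor = new MiniExecutor
    (1L to 4L).foreach { id =>
      executor.launchTask(id)(() => { done.incrementAndGet(); () })
    }
    executor.shutdownAndWait()
    println(s"tasks completed: ${done.get()}")
  }
}
```

Because each task runs on a pool thread, launchTask returns immediately and the caller never blocks on the task body.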

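3. CoarseGrainedExecutorBackend and Executor are in one-to-one correspondence.

4. CoarseGrainedExecutorBackend is a message endpoint (it implements ThreadSafeRpcEndpoint). It can send messages to the Driver and receive instructions from the Driver, such as launching a Task.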

  override def receive: PartialFunction[Any, Unit] = {
    // When registration with the driver succeeds, the driver replies with RegisteredExecutor.
    // Create the Executor object, which does the actual task execution.
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

    ...

    // Launch a task
    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        executor.launchTask(this, taskDesc) // calls Executor.launchTask, which runs the task on the thread pool
      }

   ...
  }

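5. In the Driver process there are two crucial endpoints:

a) The first is ClientEndpoint, which is mainly responsible for registering the current application with the Master. It is an inner member of AppClient (named StandaloneAppClient in Spark 2.x).

b) The other is DriverEndpoint, which drives the whole application at run time. It is an inner member of CoarseGrainedSchedulerBackend.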

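6. The Driver wraps the ExecutorBackend's information in an ExecutorData object and registers it in the Driver's in-memory data structure executorDataMap, which is a member of CoarseGrainedSchedulerBackend. In other words, the registration ultimately goes to CoarseGrainedSchedulerBackend.

7. At run time it is DriverEndpoint that writes the information into executorDataMap. CoarseGrainedSchedulerBackend therefore knows every ExecutorBackend process allocated to the current application, and inside each ExecutorBackend process an Executor object is responsible for running the actual Tasks. The write is guarded with the synchronized keyword so that concurrent writes to executorDataMap are safe.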
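The idea of guarding a registration map with synchronized can be shown with a tiny standalone registry. This is a sketch under assumptions rather than the CoarseGrainedSchedulerBackend code: the Registry class, the fields of this ExecutorData, and the choice to reject a duplicate executor id by returning false are all invented for the example.

```scala
import scala.collection.mutable

object ExecutorRegistrySketch {
  // Hypothetical, reduced stand-in for the real ExecutorData.
  final case class ExecutorData(host: String, totalCores: Int)

  final class Registry {
    // A plain HashMap is not thread safe, so every access goes
    // through `synchronized` on the registry itself.
    private val executorDataMap = mutable.HashMap.empty[String, ExecutorData]

    // Returns false if the id is already registered (an assumption
    // made for this sketch), true after a successful insert.
    def register(executorId: String, data: ExecutorData): Boolean = synchronized {
      if (executorDataMap.contains(executorId)) {
        false
      } else {
        executorDataMap.put(executorId, data)
        true
      }
    }

    def size: Int = synchronized { executorDataMap.size }
  }

  def main(args: Array[String]): Unit = {
    val registry = new Registry
    println(registry.register("0", ExecutorData("host-a", 4)))
    println(registry.register("0", ExecutorData("host-b", 4)))
    println(registry.size)
  }
}
```

Doing the contains check and the put inside one synchronized block is what makes the check-then-insert step atomic.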

8. When CoarseGrainedExecutorBackend receives the RegisteredExecutor message from DriverEndpoint, it creates the Executor instance, and it is this Executor instance that actually performs the Task computation.

How does the Executor work?

1. When the Driver sends a Task, it is sent to the CoarseGrainedExecutorBackend RpcEndpoint, not directly to the Executor. The Executor is not a message endpoint, so it can never receive remote messages directly.

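// Note: this form of the handler differs from the Spark 2.4.0 version quoted
// earlier in this article (which uses TaskDescription.decode); it appears to
// come from an older Spark release.
case LaunchTask(data) =>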
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }

2. After the ExecutorBackend receives the message from the Driver, it calls launchTask to hand the work to the Executor.
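The ordering constraint described here (the backend only has an Executor after RegisteredExecutor arrives, and a LaunchTask before that is an error) can be modelled as a small message handler. This is a toy model with invented names: MiniBackend, the Either return type, and the simplified LaunchTask payload are assumptions for illustration, and the real backend exits the process instead of returning an error value.

```scala
import scala.collection.mutable.ArrayBuffer

object BackendMessageSketch {
  sealed trait Message
  case object RegisteredExecutor extends Message
  final case class LaunchTask(taskId: Long) extends Message

  // Hypothetical stand-in for the backend's receive logic.
  final class MiniBackend {
    private var executorReady = false
    val launched: ArrayBuffer[Long] = ArrayBuffer.empty[Long]

    def receive(msg: Message): Either[String, Unit] = msg match {
      case RegisteredExecutor =>
        executorReady = true // stands in for `executor = new Executor(...)`
        Right(())
      case LaunchTask(_) if !executorReady =>
        Left("Received LaunchTask command but executor was null")
      case LaunchTask(id) =>
        launched += id // stands in for executor.launchTask(...)
        Right(())
    }
  }

  def main(args: Array[String]): Unit = {
    val backend = new MiniBackend
    println(backend.receive(LaunchTask(1L)))
    println(backend.receive(RegisteredExecutor))
    println(backend.receive(LaunchTask(1L)))
  }
}
```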

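Before looking at how CoarseGrainedExecutorBackend starts up, it helps to know what it is. The Executor is responsible for computation, that is, for running tasks, while creating and maintaining the Executor object is the job of CoarseGrainedExecutorBackend. At run time CoarseGrainedExecutorBackend is a separate process, which can be seen on a Worker node with Java's jps command.

CoarseGrainedExecutorBackend is a subclass of RpcEndpoint and can communicate with the Driver over RPC. Its lifecycle method onStart deserves close attention, to see which actions it performs.
CoarseGrainedExecutorBackend maintains two fields, executor and driver: executor runs tasks, and driver is used to communicate with the Driver.
ExecutorBackend declares the abstract method statusUpdate, which is responsible for returning the Executor's computation results to the Driver.
Finally, CoarseGrainedExecutorBackend is a process at Spark run time, and the Executor runs inside that process.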
