This series of articles analyzes and studies the source code of Spark 2.2.1.
SparkSubmit.scala contains three objects and one class: SparkSubmitAction, SparkSubmit, SparkSubmitUtils, and OptionAssigner.
(1) First, let's look at SparkSubmitAction.
SparkSubmitAction is an enumeration accessible only within the deploy package, used to determine the type of request issued by the spark-submit command.
There are three action types: submit, kill, and request the status of an application; the latter two operations are currently supported only in standalone and Mesos cluster modes.
The source is as follows:
private[deploy] object SparkSubmitAction extends Enumeration {
  type SparkSubmitAction = Value
  val SUBMIT, KILL, REQUEST_STATUS = Value
}
(2) Next, SparkSubmitUtils is also an object; it is a helper for SparkSubmit and mainly provides methods that are used internally by SparkSubmit.
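To make this concrete, below is a hedged sketch of one of its most visible duties: when spark-submit is given --packages, prepareSubmitEnvironment asks SparkSubmitUtils to resolve the Maven coordinates through Ivy into local jar paths. The method names come from my reading of the 2.2.x source, and the coordinate value is just an example; treat the exact signatures as assumptions to verify against the code.

package org.apache.spark.deploy

// Hedged sketch: resolve a --packages style coordinate into local jars via SparkSubmitUtils.
// SparkSubmitUtils is private[spark], hence the package declaration above.
object PackagesResolutionSketch {
  def main(args: Array[String]): Unit = {
    val coordinates = "org.apache.commons:commons-lang3:3.5" // example --packages value
    // buildIvySettings(extra remote repositories, custom ivy path) -- neither is set here
    val ivySettings = SparkSubmitUtils.buildIvySettings(None, None)
    // Returns a comma-separated list of paths to the resolved jars
    val resolvedJars = SparkSubmitUtils.resolveMavenCoordinates(coordinates, ivySettings)
    println(s"Resolved jars: $resolvedJars")
  }
}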
(3) Finally, let's look at SparkSubmit, which is a very important object. It starts by defining constants for the cluster managers, deploy modes, and special resource names it needs to distinguish:
// Cluster managers
private val YARN = 1
private val STANDALONE = 2
private val MESOS = 4
private val LOCAL = 8
private val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL

// Deploy modes
private val CLIENT = 1
private val CLUSTER = 2
private val ALL_DEPLOY_MODES = CLIENT | CLUSTER

// Special primary resource names that represent shells rather than application jars.
private val SPARK_SHELL = "spark-shell"
private val PYSPARK_SHELL = "pyspark-shell"
private val SPARKR_SHELL = "sparkr-shell"
private val SPARKR_PACKAGE_ARCHIVE = "sparkr.zip"
private val R_PACKAGE_ARCHIVE = "rpkg.zip"
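Note that these constants are powers of two: a single Int can then encode a set of cluster managers or deploy modes, and membership is checked with a bitwise AND. As far as I understand the code, this is how OptionAssigner entries are matched against the current cluster manager and deploy mode in prepareSubmitEnvironment. A minimal, self-contained sketch of the idea:

// Self-contained sketch of the bit-mask technique behind these constants.
object ClusterManagerMaskSketch {
  private val YARN = 1
  private val STANDALONE = 2
  private val MESOS = 4
  private val LOCAL = 8
  private val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL

  // An option that applies to several cluster managers is encoded as the OR of their masks;
  // testing whether it applies to the current one is a single AND.
  def appliesTo(optionMask: Int, clusterManager: Int): Boolean =
    (optionMask & clusterManager) != 0

  def main(args: Array[String]): Unit = {
    val yarnOrMesos = YARN | MESOS
    println(appliesTo(yarnOrMesos, YARN))            // true
    println(appliesTo(yarnOrMesos, STANDALONE))      // false
    println(appliesTo(ALL_CLUSTER_MGRS, STANDALONE)) // true
  }
}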
Let's take a look at the main() method:
override def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args) // build a SparkSubmitArguments object from the CLI arguments
  if (appArgs.verbose) {
    // in verbose mode, print the parsed arguments; sensitive values are redacted when printed
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs) // submit an application via spark-submit
    case SparkSubmitAction.KILL => kill(appArgs) // kill an application via spark-submit; currently only standalone and Mesos cluster modes
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs) // request an application's status via spark-submit; currently only standalone and Mesos cluster modes
  }
}
As the main method shows, it matches on the action to perform one of three operations: submitting an application, killing an application, or requesting an application's status; the latter two are currently supported only in standalone and Mesos cluster modes.
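Before drilling into each action, here is a hypothetical sketch (not part of the source) of how spark-submit's CLI flags end up as these actions: --kill and --status are real spark-submit options, and SparkSubmitArguments turns them into the KILL and REQUEST_STATUS actions respectively. Both classes are private[deploy], hence the package declaration; the master URL and submission IDs below are placeholders, and the exact parsing behaviour should be verified against SparkSubmitArguments.scala.

package org.apache.spark.deploy

// Hypothetical sketch: mapping spark-submit CLI flags to SparkSubmitAction values.
object ActionDispatchSketch {
  def main(args: Array[String]): Unit = {
    val killArgs = new SparkSubmitArguments(Seq(
      "--master", "spark://master:6066",
      "--kill", "driver-20180101000000-0000")) // placeholder submission id
    assert(killArgs.action == SparkSubmitAction.KILL)

    val statusArgs = new SparkSubmitArguments(Seq(
      "--master", "spark://master:6066",
      "--status", "driver-20180101000000-0000"))
    assert(statusArgs.action == SparkSubmitAction.REQUEST_STATUS)
  }
}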
Let's look at the submit operation first.
The submit method first uses the arguments passed from the CLI to set up the appropriate classpath, system properties, and application arguments for the given deploy mode, and then, with that environment in place, invokes the application's main method.
The details are as follows:
private def submit(args: SparkSubmitArguments): Unit = {
  // prepare the submission environment
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          if (e.getStackTrace().length == 0) {
            // scalastyle:off println
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            // scalastyle:on println
            exitFn(1)
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }

  if (args.isStandaloneCluster && args.useRest) {
    try {
      // scalastyle:off println
      printStream.println("Running Spark using the REST application submission protocol.")
      // scalastyle:on println
      doRunMain()
    } catch {
      // Fail over to use the legacy submission gateway
      case e: SubmitRestConnectionException =>
        printWarning(s"Master endpoint ${args.master} was not a REST server. " +
          "Falling back to legacy submission gateway instead.")
        args.useRest = false
        submit(args)
    }
  // In all other modes, just run the main class as prepared
  } else {
    doRunMain()
  }
}
From this we can see that submit calls doRunMain, and doRunMain in turn calls runMain, which triggers the application's main method.
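One detail worth pulling out of doRunMain is the proxy-user branch: when --proxy-user is given, the application's main method is executed inside Hadoop's UserGroupInformation.doAs so that it runs with the proxy user's credentials. Below is a minimal, hedged sketch of that idiom on its own; the user name "alice" is just a placeholder, and a real cluster would additionally require proxy-user permissions to be configured in Hadoop.

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Minimal sketch of the UserGroupInformation proxy-user pattern used by doRunMain.
object ProxyUserSketch {
  def runAs(proxy: String)(body: => Unit): Unit = {
    val proxyUser = UserGroupInformation.createProxyUser(proxy, UserGroupInformation.getCurrentUser())
    // Everything inside run() executes as the proxy user.
    proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = body
    })
  }

  def main(args: Array[String]): Unit = {
    runAs("alice") {
      println(s"Running as: ${UserGroupInformation.getCurrentUser().getUserName}")
    }
  }
}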
The runMain method is as follows:
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  // scalastyle:off println
  if (verbose) {
    printStream.println(s"Main class:\n$childMainClass")
    printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
    // sysProps may contain sensitive information, so redact before printing
    printStream.println(s"System properties:\n${Utils.redact(sysProps).mkString("\n")}")
    printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
    printStream.println("\n")
  }
  // scalastyle:on println

  // choose the class loader: child-first if spark.driver.userClassPathFirst is set
  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)

  // add the child jars to the classpath
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  // set the prepared system properties
  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }

  var mainClass: Class[_] = null
  try {
    // load the main class by name
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        // scalastyle:off println
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      e.printStackTrace(printStream)
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        // scalastyle:off println
        printStream.println(s"Failed to load hive class.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }

  // SPARK-4170: warn if the main class is a subclass of scala.App, which may not work correctly
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }

  // look up main(Array[String]) via reflection and make sure it is a static method
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }

  // unwrap reflection/proxy wrappers to find the real cause of a failure
  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }

  try {
    // invoke the application's main method with the child arguments
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
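The core of runMain is plain JVM reflection: load the class by name, look up main(Array[String]), check that it is static, and invoke it with the child arguments. Below is a tiny, self-contained illustration of that pattern outside of Spark; HelloApp and its argument are made up for the example, and the file should be saved without a package so that Class.forName("HelloApp") resolves.

import java.lang.reflect.Modifier

// The class whose main() will be located and called reflectively.
object HelloApp {
  def main(args: Array[String]): Unit = println(s"Hello, ${args.mkString(" ")}!")
}

// Mirrors the core of runMain: resolve the class, fetch main(Array[String]),
// verify it is static, then invoke it with the prepared arguments.
object ReflectiveLaunchSketch {
  def main(args: Array[String]): Unit = {
    val mainClass = Class.forName("HelloApp") // Utils.classForName does roughly this
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers),
      "The main method in the given main class must be static")
    mainMethod.invoke(null, Array("Spark")) // Array("Spark") plays the role of childArgs
  }
}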
Next, let's look at the kill operation.
The kill operation takes the submission ID and the master passed from the CLI and cancels an existing submission via a POST over the REST protocol. It applies only to standalone and Mesos cluster modes.
private def kill(args: SparkSubmitArguments): Unit = {
  new RestSubmissionClient(args.master)
    .killSubmission(args.submissionToKill)
}
Finally, let's look at the requestStatus operation.
This operation takes the submission ID and the master passed from the CLI and retrieves the submission's status via a GET over the REST protocol. It also applies only to standalone and Mesos cluster modes.
private def requestStatus(args: SparkSubmitArguments): Unit = {
  new RestSubmissionClient(args.master)
    .requestSubmissionStatus(args.submissionToRequestStatusFor)
}
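To make the REST interaction more tangible, here is a hedged sketch of roughly what these two calls do on the wire. The standalone REST server listens on port 6066 by default and exposes kill/status endpoints under /v1/submissions/; the host name and submission ID below are placeholders, and the real RestSubmissionClient also sets JSON headers, sends a client version, and parses the response, none of which is shown here.

import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Hedged sketch: the HTTP calls underlying killSubmission (POST) and
// requestSubmissionStatus (GET) against the standalone REST server.
object RestSubmissionSketch {
  private def call(url: String, method: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod(method)
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }

  def main(args: Array[String]): Unit = {
    val restMaster = "http://spark-master:6066"      // placeholder REST endpoint
    val submissionId = "driver-20180101000000-0000"  // placeholder submission id
    // kill(...) boils down to a POST to the kill endpoint:
    println(call(s"$restMaster/v1/submissions/kill/$submissionId", "POST"))
    // requestStatus(...) boils down to a GET on the status endpoint:
    println(call(s"$restMaster/v1/submissions/status/$submissionId", "GET"))
  }
}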
(4) Finally, here is the full source of SparkSubmit.scala. The annotations added in this article reflect my own understanding, while the English comments come from the source itself; there may be inaccuracies, so corrections and better interpretations are welcome.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.deploy
import java.io.{File, IOException}
import java.lang.reflect.{InvocationTargetException, Modifier, UndeclaredThrowableException}
import java.net.URL
import java.nio.file.Files
import java.security.PrivilegedExceptionAction
import java.text.ParseException
import scala.annotation.tailrec
import scala.collection.mutable.{ArrayBuffer, HashMap, Map}
import scala.util.Properties
import org.apache.commons.lang3.StringUtils
import org.apache.hadoop.conf.{Configuration => HadoopConfiguration}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation
import org.apache.ivy.Ivy
import org.apache.ivy.core.LogOptions
import org.apache.ivy.core.module.descriptor._
import org.apache.ivy.core.module.id.{ArtifactId, ModuleId, ModuleRevisionId}
import org.apache.ivy.core.report.ResolveReport
import org.apache.ivy.core.resolve.ResolveOptions
import org.apache.ivy.core.retrieve.RetrieveOptions
import org.apache.ivy.core.settings.IvySettings
import org.apache.ivy.plugins.matcher.GlobPatternMatcher
import org.apache.ivy.plugins.repository.file.FileRepository
import org.apache.ivy.plugins.resolver.{ChainResolver, FileSystemResolver, IBiblioResolver}
import org.apache.spark._
import org.apache.spark.api.r.RUtils
import org.apache.spark.deploy.rest._
import org.apache.spark.launcher.SparkLauncher
import org.apache.spark.util._
/**
 * Whether to submit, kill, or request the status of an application.
 * The latter two operations are currently supported only for standalone and Mesos cluster modes.
 */
private[deploy] object SparkSubmitAction extends Enumeration {
type SparkSubmitAction = Value
val SUBMIT, KILL, REQUEST_STATUS = Value
}
/**
 * Main gateway of launching a Spark application.
 *
 * This program handles setting up the classpath with relevant Spark dependencies and provides
 * a layer over the different cluster managers and deploy modes that Spark supports.
 */
object SparkSubmit extends CommandLineUtils {
// Cluster managers
private val YARN = 1
private val STANDALONE = 2
private val MESOS = 4
private val LOCAL = 8
private val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL
// Deploy modes
private val CLIENT = 1
private val CLUSTER = 2
private val ALL_DEPLOY_MODES = CLIENT | CLUSTER