Spark Internals (1): How Spark Submits an Application to Yarn (Source Code Walkthrough)
Submitting the Job via the Launch Script
What actually starts is a SparkSubmit JVM process
- The application is submitted with a script like the following (note that --deploy-mode defaults to client):
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-2.4.5.jar \
10
- Open the spark-submit file in the bin directory to see what it does:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
- It simply delegates to the bin/spark-class script, which ultimately assembles a command of the following form (simplified):
exec ${JAVA_HOME}/bin/java org.apache.spark.deploy.SparkSubmit
- A class launched with bin/java runs in its own JVM process, so let's look at the main method of SparkSubmit:
override def main(args: Array[String]): Unit = {
  val submit = new SparkSubmit() {
    self =>

    override def doSubmit(args: Array[String]): Unit = {
      try {
        super.doSubmit(args)
      } catch {
        case e: SparkUserAppException =>
          exitFn(e.exitCode)
      }
    }
  }
  submit.doSubmit(args)
}
Performing the Submit
- The code below is abridged to the key parts. Following submit.doSubmit(args) into super.doSubmit(args), we see:
def doSubmit(args: Array[String]): Unit = {
  // Whether logging still needs to be re-initialized before the user app starts
  val uninitLog = initializeLogIfNecessary(true, silent = true)

  val appArgs = parseArguments(args)
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    case SparkSubmitAction.PRINT_VERSION => printVersion()
  }
}
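- For reference, the SparkSubmitAction matched on above is just a small Scala Enumeration; the sketch below is abridged from the 2.4.x source as I recall it, so treat the exact member list as an approximation:
private[deploy] object SparkSubmitAction extends Enumeration {
  type SparkSubmitAction = Value
  val SUBMIT, KILL, REQUEST_STATUS, PRINT_VERSION = Value
}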
Parsing the Arguments
- Stepping into parseArguments(args), we can see that it returns an instance of SparkSubmitArguments:
protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
  new SparkSubmitArguments(args)
}
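- A natural question: where do the parse() and handle() used in the constructor below come from? SparkSubmitArguments extends Spark's option-parser base class, roughly as follows (abridged from the 2.4.x source from memory, so take the exact signature as an assumption):
// SparkSubmitArgumentsParser is a thin Scala alias for the Java
// org.apache.spark.launcher.SparkSubmitOptionParser, which supplies parse() and handle()
private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
  extends SparkSubmitArgumentsParser with Logging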
- Since this is a Scala class, its primary constructor body runs on instantiation, so the following code executes:
var master: String = null
var deployMode: String = null
var mainClass: String = null
var action: SparkSubmitAction = null
// Parse the list of spark-submit command-line options
parse(args.asJava)
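- As a quick aside on why this code runs at construction time: in Scala, statements written directly in a class body belong to the primary constructor, so they execute as soon as the instance is created. A minimal, self-contained sketch (the names here are made up for illustration):
class ArgsDemo(args: Array[String]) {
  // Everything written directly in the class body is part of the primary
  // constructor, so it runs as soon as `new ArgsDemo(...)` is evaluated.
  var master: String = null
  println(s"constructor sees ${args.length} arguments")
}
// e.g. in the REPL: new ArgsDemo(Array("--master", "yarn"))
// prints: constructor sees 2 arguments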
- The interesting part is parse(args.asJava): it uses a regular expression to split each option into a key and a value and hands the pair to handle(name, value); a simplified sketch of this loop follows the mapping below:
// SparkSubmitArguments.scala (abridged)
override protected def handle(opt: String, value: String): Boolean = {
  opt match {
    case MASTER =>
      master = value
    case CLASS =>
      mainClass = value
    case DEPLOY_MODE =>
      if (value != "client" && value != "cluster") {
        error("--deploy-mode must be either \"client\" or \"cluster\"")
      }
      deployMode = value
    // ... handling of the remaining options elided ...
  }
  true // simplified: tell the parser to keep processing arguments
}
- As you can see, the method pattern-matches each command-line option onto a field:
--master yarn => master
--deploy-mode cluster => deployMode
--class org.apache.spark.examples.SparkPi (or your own main class, e.g. a WordCount) => mainClass
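- To make the key/value matching concrete, here is a small self-contained Scala sketch of the idea. This is not the real SparkSubmitOptionParser (which is Java code in the launcher module), just an illustration of splitting "--key=value" / "--key value" pairs and dispatching them to a handle function:
object OptionParseSketch {
  // Matches "--key=value" style options
  private val EqSeparated = "(--[^=]+)=(.+)".r

  // Simplified stand-in for the parse loop: turn each option into a
  // (name, value) pair and hand it to `handle`, the way the real parser
  // hands pairs to SparkSubmitArguments.handle.
  def parse(args: List[String])(handle: (String, String) => Unit): Unit = args match {
    case EqSeparated(name, value) :: rest =>
      handle(name, value); parse(rest)(handle)
    case name :: value :: rest if name.startsWith("--") =>
      handle(name, value); parse(rest)(handle)
    case _ =>
      () // anything left would be treated as the application jar and its arguments
  }

  def main(args: Array[String]): Unit = {
    parse(List("--master", "yarn", "--deploy-mode=cluster")) { (name, value) =>
      println(s"$name -> $value")
    }
    // prints:
    // --master -> yarn
    // --deploy-mode -> cluster
  }
}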
Submitting
- Since action = Option(action).getOrElse(SUBMIT), the default action is SUBMIT, so we step into submit(appArgs, uninitLog):
private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      // (proxy-user handling elided)
    } else {
      runMain(args, uninitLog)
    }
  }

  if (args.isStandaloneCluster && args.useRest) {
    // (standalone-cluster REST submission elided)
  } else {
    doRunMain()
  }
}
Running the Child Class's Main Method with the Submitted Arguments
- Because we are submitting to Yarn rather than a standalone cluster over REST, execution goes into doRunMain() and from there into runMain(args, uninitLog):
private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
  val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)

  Thread.currentThread.setContextClassLoader(loader)
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  var mainClass: Class[_] = null
  mainClass = Utils.classForName(childMainClass)

  val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
    mainClass.newInstance().asInstanceOf[SparkApplication]
  } else {
    new JavaMainApplication(mainClass)
  }

  app.start(childArgs.toArray, sparkConf)
}
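- Two cases are handled when building app: if childMainClass implements SparkApplication (as YarnClusterApplication does), it is instantiated directly; otherwise the plain main class is wrapped in a JavaMainApplication, whose start method just invokes the user class's static main via reflection. Abridged roughly from the 2.4.x source (a paraphrase, not an exact quote):
private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {
  override def start(args: Array[String], conf: SparkConf): Unit = {
    // Look up the user class's static main(String[]) and invoke it reflectively
    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    mainMethod.invoke(null, args)
  }
}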
Preparing the Submit Environment
- prepareSubmitEnvironment is a key method, and its return value matters just as much. Starting from the returned tuple (childArgs, childClasspath, sparkConf, childMainClass) and searching upward for where childMainClass is assigned, we find:
cluster mode:
childMainClass = org.apache.spark.deploy.yarn.YarnClusterApplication
client mode:
childMainClass = mainClass (the class we passed with --class)
Here we are mainly interested in Yarn's cluster mode, so childMainClass is YarnClusterApplication.
- Set the context class loader, which the reflection below relies on:
Thread.currentThread.setContextClassLoader(loader)
- Load the class by its name (childMainClass):
mainClass = Utils.classForName(childMainClass)
- Create an instance via reflection and cast it to SparkApplication:
val app: SparkApplication = mainClass.newInstance().asInstanceOf[SparkApplication]
- Invoke the start method on that instance (YarnClusterApplication.start in cluster mode):
app.start(childArgs.toArray, sparkConf)
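- For Yarn cluster mode, that start call lands in org.apache.spark.deploy.yarn.YarnClusterApplication, which essentially builds a YARN Client and runs it; that is where the application is actually submitted to the ResourceManager. Abridged roughly from the 2.4.x source (a paraphrase, so treat the exact lines as an approximation):
private[spark] class YarnClusterApplication extends SparkApplication {
  override def start(args: Array[String], conf: SparkConf): Unit = {
    // In yarn mode, jars/files are distributed through the YARN cache,
    // so these entries are dropped from the conf here
    conf.remove("spark.jars")
    conf.remove("spark.files")

    // Parse the child arguments and submit the application to YARN
    new Client(new ClientArguments(args), conf).run()
  }
}
- The Client here lives in org.apache.spark.deploy.yarn and is responsible for setting up and launching the ApplicationMaster on the cluster.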