Spark: Environment Update and Application Start

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license; please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/qq_27639777/article/details/89434774

Overview

During SparkContext initialization the environment may have been modified, so the environment is updated first and the application is started afterwards:

postEnvironmentUpdate()
postApplicationStart()

Environment Update

Before updating the environment, SparkContext first checks whether the taskScheduler has been created; if it has, the update proceeds as follows:

  1. Parse the JAR and file configuration parameters;

  2. Collect the detailed environment configuration: JVM parameters, Spark properties, system properties, the classpath, and so on;

  3. Build a SparkListenerEnvironmentUpdate event and post it via the listenerBus.

     /** Post the environment update event once the task scheduler is ready */
     private def postEnvironmentUpdate() {
       if (taskScheduler != null) {
         val schedulingMode = getSchedulingMode.toString
         val addedJarPaths = addedJars.keys.toSeq
         val addedFilePaths = addedFiles.keys.toSeq
         val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
           addedFilePaths)
         val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
         listenerBus.post(environmentUpdate)
       }
     }
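The guard-then-post pattern above can be sketched in plain Scala. This is a minimal stand-in, not Spark's API: `EnvironmentUpdate`, `ListenerBusSketch`, and `postEnvironmentUpdateSketch` are all hypothetical names, and the readiness flag stands in for the `taskScheduler != null` check.

```scala
// Minimal sketch (hypothetical names, not Spark's API) of the guard-then-post
// pattern: the event is only posted once the task scheduler is ready.
case class EnvironmentUpdate(details: Map[String, Seq[(String, String)]])

class ListenerBusSketch {
  private var posted = List.empty[EnvironmentUpdate]
  def post(event: EnvironmentUpdate): Unit = { posted = event :: posted }
  def events: List[EnvironmentUpdate] = posted.reverse
}

def postEnvironmentUpdateSketch(taskSchedulerReady: Boolean,
    bus: ListenerBusSketch): Unit = {
  if (taskSchedulerReady) {
    // In the real code this map comes from SparkEnv.environmentDetails.
    val details = Map("Spark Properties" -> Seq(("spark.scheduler.mode", "FIFO")))
    bus.post(EnvironmentUpdate(details))
  }
}
```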
    

Parsing JARs and Files

During SparkContext initialization, if the spark.jars property is set, the JARs it lists are added by the addJar method under the path given by httpFileServer's jarDir variable, and the files listed in spark.files are added by the addFile method under the path given by httpFileServer's fileDir variable.

Utils.scala
def getUserJars(conf: SparkConf): Seq[String] = {
  val sparkJars = conf.getOption("spark.jars")
  sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
}

_jars = Utils.getUserJars(_conf)
_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
  .toSeq.flatten

def jars: Seq[String] = _jars
// Add each JAR given through the constructor
if (jars != null) {
  jars.foreach(addJar)
}

def files: Seq[String] = _files
if (files != null) {
  files.foreach(addFile)
}
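The comma-separated parsing that getUserJars and the spark.files handling perform can be tried out in isolation. This sketch uses a hypothetical helper (not Spark code) with the same `split`/`filter(_.nonEmpty)` logic, showing how empty segments are dropped:

```scala
// Hypothetical helper mirroring the parsing in getUserJars: split on commas
// and drop empty segments, e.g. from "a.jar,,b.jar" or a trailing comma.
def parseCommaSeparated(value: Option[String]): Seq[String] =
  value.map(_.split(",").toSeq).getOrElse(Seq.empty).filter(_.nonEmpty)
```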

Fetching the Environment Configuration

The static method SparkEnv.environmentDetails collects the JVM parameters, Spark properties, system properties, classpath entries, and other settings that affect the environment.

/**
 * Return a map representing the JVM information, Spark properties, system properties,
 * and classpath. Each map key denotes a category of information, and each map value is
 * a sequence of the corresponding (key, value) property pairs. This is used mainly for
 * the SparkListenerEnvironmentUpdate event.
 */
private[spark]
def environmentDetails(
    conf: SparkConf,
    schedulingMode: String,
    addedJars: Seq[String],
    addedFiles: Seq[String]): Map[String, Seq[(String, String)]] = {

  import Properties._
  val jvmInformation = Seq(
    ("Java Version", s"$javaVersion ($javaVendor)"),
    ("Java Home", javaHome),
    ("Scala Version", versionString)
  ).sorted

  // Spark properties
  // This includes the scheduling mode whether or not it is configured (used by SparkUI)
  val schedulerMode =
    if (!conf.contains("spark.scheduler.mode")) {
      Seq(("spark.scheduler.mode", schedulingMode))
    } else {
      Seq[(String, String)]()
    }
  val sparkProperties = (conf.getAll ++ schedulerMode).sorted

  // System properties that are not java classpaths
  val systemProperties = Utils.getSystemProperties.toSeq
  val otherProperties = systemProperties.filter { case (k, _) =>
    k != "java.class.path" && !k.startsWith("spark.")
  }.sorted

  // Class paths, including all the added jars and files
  val classPathEntries = javaClassPath
    .split(File.pathSeparator)
    .filterNot(_.isEmpty)
    .map((_, "System Classpath"))
  val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User"))
  val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted
  // Assemble the result map
  Map[String, Seq[(String, String)]](
    "JVM Information" -> jvmInformation,
    "Spark Properties" -> sparkProperties,
    "System Properties" -> otherProperties,
    "Classpath Entries" -> classPaths)
}
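A simplified, self-contained sketch of the same assembly (covering only the JVM-information, Spark-properties, and system-properties parts; the function name and parameters are hypothetical) illustrates the category map and the classpath/Spark-property filtering:

```scala
import scala.util.Properties

// Simplified sketch of SparkEnv.environmentDetails: build a category map and
// filter out the classpath entry and Spark properties from the system properties.
def environmentDetailsSketch(
    sparkConf: Map[String, String],
    systemProps: Seq[(String, String)]): Map[String, Seq[(String, String)]] = {
  val jvmInformation = Seq(
    ("Java Version", Properties.javaVersion),
    ("Scala Version", Properties.versionString)
  ).sorted
  val otherProps = systemProps.filter { case (k, _) =>
    k != "java.class.path" && !k.startsWith("spark.")
  }.sorted
  Map(
    "JVM Information" -> jvmInformation,
    "Spark Properties" -> sparkConf.toSeq.sorted,
    "System Properties" -> otherProps)
}
```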

Posting the Environment Update Event

The SparkListenerEnvironmentUpdate event is built and posted via the listenerBus. It is received by EnvironmentListener and JobProgressListener, and ultimately affects what is shown on the SparkUI.

EnvironmentListener#onEnvironmentUpdate
// Update the information shown on the EnvironmentTab
override def onEnvironmentUpdate(environmentUpdate: SparkListenerEnvironmentUpdate) {
  synchronized {
    val environmentDetails = environmentUpdate.environmentDetails
    jvmInformation = environmentDetails("JVM Information")
    sparkProperties = environmentDetails("Spark Properties")
    systemProperties = environmentDetails("System Properties")
    classpathEntries = environmentDetails("Classpath Entries")
  }
}

JobProgressListener#onEnvironmentUpdate
// Update the schedulingMode information shown on the SparkUI
override def onEnvironmentUpdate(environmentUpdate: SparkListenerEnvironmentUpdate) {
  synchronized {
    schedulingMode = environmentUpdate
      .environmentDetails("Spark Properties").toMap
      .get("spark.scheduler.mode")
      .map(SchedulingMode.withName)
  }
}
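The way JobProgressListener recovers an enum value from the string property can be sketched with a stand-in Enumeration. `SchedulingModeSketch` and `extractSchedulingMode` are hypothetical names; Spark's real enum is SchedulingMode.

```scala
// Stand-in for Spark's SchedulingMode enum.
object SchedulingModeSketch extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

// Mirrors JobProgressListener#onEnvironmentUpdate: look up the property in the
// posted Spark properties and map the string back to the enum value.
def extractSchedulingMode(
    sparkProperties: Seq[(String, String)]): Option[SchedulingModeSketch.Value] =
  sparkProperties.toMap.get("spark.scheduler.mode").map(SchedulingModeSketch.withName)
```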

Application Start

Starting the Spark application is simple: it just posts a SparkListenerApplicationStart event on the listenerBus, as the following code shows:

/** Post the application start event */
private def postApplicationStart() {
  // Note: this code assumes the task scheduler has been initialized and has already
  // contacted the cluster manager to obtain an application ID.
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}

The SparkListenerApplicationStart event is received by ApplicationEventListener, ExecutorsListener, and JobProgressListener at the same time.
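The fan-out of a single event to several listeners can be sketched as follows. All names here are hypothetical stand-ins; Spark's real bus delivers SparkListenerApplicationStart to its registered listeners in the same one-event-many-receivers fashion.

```scala
// Hypothetical event and listeners sketching the fan-out of an application
// start event: each listener picks out the fields it cares about.
case class AppStart(appName: String, time: Long)

trait ListenerSketch { def onApplicationStart(event: AppStart): Unit }

class NameListener extends ListenerSketch {
  var appName: Option[String] = None
  def onApplicationStart(e: AppStart): Unit = { appName = Some(e.appName) }
}

class TimeListener extends ListenerSketch {
  var startTime: Long = -1L
  def onApplicationStart(e: AppStart): Unit = { startTime = e.time }
}

// Deliver one event to every registered listener.
def postToAll(listeners: Seq[ListenerSketch], event: AppStart): Unit =
  listeners.foreach(_.onApplicationStart(event))
```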

ApplicationEventListener#onApplicationStart
// Capture the application-level configuration information
override def onApplicationStart(applicationStart: SparkListenerApplicationStart) {
  appName = Some(applicationStart.appName)
  appId = applicationStart.appId
  appAttemptId = applicationStart.appAttemptId
  startTime = Some(applicationStart.time)
  sparkUser = Some(applicationStart.sparkUser)
}

ExecutorsListener#onApplicationStart
// Find the executorId of the driver (the driver acts as a special executor) and
// attach the executor log URLs used when tasks run
override def onApplicationStart(
    applicationStart: SparkListenerApplicationStart): Unit = {
  applicationStart.driverLogs.foreach { logs =>
    val storageStatus = activeStorageStatusList.find { s =>
      s.blockManagerId.executorId == SparkContext.LEGACY_DRIVER_IDENTIFIER ||
      s.blockManagerId.executorId == SparkContext.DRIVER_IDENTIFIER
    }
    storageStatus.foreach { s =>
      val eid = s.blockManagerId.executorId
      val taskSummary = executorToTaskSummary.getOrElseUpdate(eid, ExecutorTaskSummary(eid))
      // Note: taskSummary is only a local reference, but getOrElseUpdate returns the
      // object stored in executorToTaskSummary, so the assignment below does update the map.
      taskSummary.executorLogs = logs.toMap
    }
  }
}
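A subtle point in the code above: taskSummary is only a local reference, yet the assignment to executorLogs still takes effect, because getOrElseUpdate returns the very object stored in the map. A minimal sketch (hypothetical names) demonstrates this:

```scala
import scala.collection.mutable

// Stand-in for ExecutorTaskSummary: a case class with a mutable field.
case class TaskSummarySketch(executorId: String,
    var executorLogs: Map[String, String] = Map.empty)

// Mirrors ExecutorsListener#onApplicationStart: getOrElseUpdate inserts a fresh
// summary if absent and returns the stored object, so mutating it through the
// local reference is visible through the map afterwards.
def recordDriverLogs(
    summaries: mutable.HashMap[String, TaskSummarySketch],
    eid: String,
    logs: Map[String, String]): Unit = {
  val taskSummary = summaries.getOrElseUpdate(eid, TaskSummarySketch(eid))
  taskSummary.executorLogs = logs
}
```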

JobProgressListener#onApplicationStart
// Record the application's start time
override def onApplicationStart(appStarted: SparkListenerApplicationStart) {
  startTime = appStarted.time
}