Without further ado, here is the source code:
```scala
def getCallSite(skipClass: String => Boolean = sparkInternalExclusionFunction): CallSite = {
  // Keep crawling up the stack trace until we find the first function not inside of the spark
  // package. We track the last (shallowest) contiguous Spark method. This might be an RDD
  // transformation, a SparkContext function (such as parallelize), or anything else that leads
  // to instantiation of an RDD. We also track the first (deepest) user method, file, and line.
  var lastSparkMethod = "<unknown>"
  var firstUserFile = "<unknown>"
  var firstUserLine = 0
  var insideSpark = true
  var callStack = new ArrayBuffer[String]() :+ "<unknown>"

  Thread.currentThread.getStackTrace().foreach { ste: StackTraceElement =>
    // When running under some profilers, the current stack trace might contain some bogus
    // frames. This is intended to ensure that we don't crash in these situations by
    // ignoring any frames that we can't examine.
    if (ste != null && ste.getMethodName != null &&
        !ste.getMethodName.contains("getStackTrace")) {
      if (insideSpark) {
        if (skipClass(ste.getClassName)) {
          lastSparkMethod = if (ste.getMethodName == "<init>") {
            // Spark method is a constructor; get its class name
            ste.getClassName.substring(ste.getClassName.lastIndexOf('.') + 1)
          } else {
            ste.getMethodName
          }
          callStack(0) = ste.toString // Put last Spark method on top of the stack trace.
        } else {
          if (ste.getFileName != null) {
            firstUserFile = ste.getFileName
            if (ste.getLineNumber >= 0) {
              firstUserLine = ste.getLineNumber
            }
          }
          callStack += ste.toString
          insideSpark = false
        }
      } else {
        callStack += ste.toString
      }
    }
  }

  val callStackDepth = System.getProperty("spark.callstack.depth", "20").toInt
  val shortForm =
    if (firstUserFile == "HiveSessionImpl.java") {
      // To be more user friendly, show a nicer string for queries submitted from the JDBC
      // server.
      "Spark JDBC Server Query"
    } else {
      s"$lastSparkMethod at $firstUserFile:$firstUserLine"
    }
  val longForm = callStack.take(callStackDepth).mkString("\n")

  CallSite(shortForm, longForm)
}
```
First of all, this method returns a CallSite object. CallSite is a case class defined in Utils.scala (alongside the Utils object, in the org.apache.spark.util package). Here is its source:
```scala
/** CallSite represents a place in user code. It can have a short and a long form. */
private[spark] case class CallSite(shortForm: String, longForm: String)

private[spark] object CallSite {
  val SHORT_FORM = "callSite.short"
  val LONG_FORM = "callSite.long"
  val empty = CallSite("", "")
}
```
This is a case class. Case classes are commonly used as data carriers, much like a VO (value object) in Java. This one carries two strings, shortForm and longForm: literally, a short form and a long form of the call site. (The SHORT_FORM and LONG_FORM constants in the companion object are just the property keys, "callSite.short" and "callSite.long", used to pass these values around.) So what exactly do these two forms contain? Look at the getCallSite() method.
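As a quick aside, here is a minimal sketch (hypothetical names, not Spark's actual code) of how a case class behaves as a plain data carrier, the way CallSite does:

```scala
// A stand-in for Spark's CallSite, just to show case-class behavior.
case class CallSiteExample(shortForm: String, longForm: String)

object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    val cs = CallSiteExample("count at App.scala:10", "full stack trace...")
    // Case classes get field accessors, structural equals/hashCode,
    // toString and copy for free -- no boilerplate getters needed.
    println(cs.shortForm)
    val copied = cs.copy(shortForm = "map at App.scala:12")
    println(copied == cs)                                            // false
    println(cs == CallSiteExample("count at App.scala:10", "full stack trace...")) // true
  }
}
```

This is exactly why CallSite needs no methods of its own: it only ferries the two strings from where the stack is inspected to where they are displayed.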
```scala
Thread.currentThread.getStackTrace().foreach { ste: StackTraceElement =>
  // When running under some profilers, the current stack trace might contain some bogus
  // frames. This is intended to ensure that we don't crash in these situations by
  // ignoring any frames that we can't examine.
  if (ste != null && ste.getMethodName != null &&
      !ste.getMethodName.contains("getStackTrace")) {
    if (insideSpark) {
      if (skipClass(ste.getClassName)) {
        lastSparkMethod = if (ste.getMethodName == "<init>") {
          // Spark method is a constructor; get its class name
          ste.getClassName.substring(ste.getClassName.lastIndexOf('.') + 1)
        } else {
          ste.getMethodName
        }
        callStack(0) = ste.toString // Put last Spark method on top of the stack trace.
      } else {
        if (ste.getFileName != null) {
          firstUserFile = ste.getFileName
          if (ste.getLineNumber >= 0) {
            firstUserLine = ste.getLineNumber
          }
        }
        callStack += ste.toString
        insideSpark = false
      }
    } else {
      callStack += ste.toString
    }
  }
}
```

Reading this code, we can see that the method takes the current thread's stack trace and walks it frame by frame, writing each frame whose class name matches certain rules to the top of the stack (callStack(0)). (As a special case, if the call came from HiveSessionImpl.java, i.e. the query was submitted through the JDBC Thrift server, shortForm is set to "Spark JDBC Server Query".) The matching rules come from the default filter function:

```scala
/** Default filtering function for finding call sites using `getCallSite`. */
private def sparkInternalExclusionFunction(className: String): Boolean = {
  // A regular expression to match classes of the internal Spark API's
  // that we want to skip when finding the call site of a method.
  val SPARK_CORE_CLASS_REGEX =
    """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]""".r
  val SPARK_SQL_CLASS_REGEX = """^org\.apache\.spark\.sql.*""".r
  val SCALA_CORE_CLASS_PREFIX = "scala"
  val isSparkClass = SPARK_CORE_CLASS_REGEX.findFirstIn(className).isDefined ||
    SPARK_SQL_CLASS_REGEX.findFirstIn(className).isDefined
  val isScalaClass = className.startsWith(SCALA_CORE_CLASS_PREFIX)
  // If the class is a Spark internal class or a Scala class, then exclude.
  isSparkClass || isScalaClass
}
```

In other words, class names matching `org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]` or `org\.apache\.spark\.sql.*` (or starting with `scala`) are treated as Spark-internal.
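To make the exclusion rules concrete, here is a small standalone check of the same regexes against a few sample class names (the class names are chosen purely for illustration; only the patterns themselves come from the Spark source):

```scala
object ExclusionDemo {
  // The same patterns used by Spark's sparkInternalExclusionFunction.
  private val SPARK_CORE_CLASS_REGEX =
    """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]""".r
  private val SPARK_SQL_CLASS_REGEX = """^org\.apache\.spark\.sql.*""".r

  // Returns true when the class should be skipped (Spark-internal or Scala).
  def isExcluded(className: String): Boolean =
    SPARK_CORE_CLASS_REGEX.findFirstIn(className).isDefined ||
      SPARK_SQL_CLASS_REGEX.findFirstIn(className).isDefined ||
      className.startsWith("scala")

  def main(args: Array[String]): Unit = {
    println(isExcluded("org.apache.spark.rdd.RDD"))        // true: Spark core class
    println(isExcluded("org.apache.spark.sql.Dataset"))    // true: Spark SQL class
    println(isExcluded("scala.collection.immutable.List")) // true: Scala class
    println(isExcluded("com.example.MyApp"))               // false: user code
  }
}
```

The last case is the interesting one: a user's own class fails every pattern, so getCallSite treats it as the first user frame and records its file and line.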
For frames whose classes match these patterns, the method name is assigned to lastSparkMethod and the frame is written to callStack(0), the top of the stack. Note that this happens for every matching frame, each write overwriting the previous one, so what survives is the last (shallowest) Spark method. You can observe the resulting stack information by running the official LogQuery example.
The first frame that does not match the two regular expressions above is the first user frame: its file name is stored in firstUserFile, its line number in firstUserLine, and insideSpark is set to false, so that every subsequent frame is simply appended to the end of callStack.
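The whole mechanism can be sketched in miniature. The helper below is hypothetical (it is not Spark's exact code, and firstUserFrame is a name invented here), but it shows the same idea: walk the current thread's stack and report the first frame whose class falls outside a set of "internal" packages:

```scala
object CallSiteSketch {
  // Find the first stack frame whose class the caller does NOT want to skip,
  // ignoring null/bogus frames and the getStackTrace frame itself,
  // mirroring the guards in Spark's getCallSite.
  def firstUserFrame(skip: String => Boolean): Option[StackTraceElement] =
    Thread.currentThread.getStackTrace
      .filter(ste => ste != null && ste.getMethodName != null &&
        !ste.getMethodName.contains("getStackTrace"))
      .find(ste => !skip(ste.getClassName))

  def main(args: Array[String]): Unit = {
    // For this demo, treat JDK classes as "internal", the way Spark
    // treats org.apache.spark.* and scala.* classes.
    val skip = (cn: String) => cn.startsWith("java.") || cn.startsWith("jdk.")
    firstUserFrame(skip).foreach { ste =>
      // Format it like Spark's shortForm: "<method> at <file>:<line>"
      println(s"${ste.getMethodName} at ${ste.getFileName}:${ste.getLineNumber}")
    }
  }
}
```

This is the essence of shortForm: one line saying which call, in which user file, at which line, led into Spark, while longForm keeps up to spark.callstack.depth (default 20) raw frames for debugging.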