Without further ado, here is the source code:
```scala
def getCallSite(skipClass: String => Boolean = sparkInternalExclusionFunction): CallSite = {
  // Keep crawling up the stack trace until we find the first function not inside of the spark
  // package. We track the last (shallowest) contiguous Spark method. This might be an RDD
  // transformation, a SparkContext function (such as parallelize), or anything else that leads
  // to instantiation of an RDD. We also track the first (deepest) user method, file, and line.
  var lastSparkMethod = "<unknown>"
  var firstUserFile = "<unknown>"
  var firstUserLine = 0
  var insideSpark = true
  var callStack = new ArrayBuffer[String]() :+ "<unknown>"

  Thread.currentThread.getStackTrace().foreach { ste: StackTraceElement =>
    // When running under some profilers, the current stack trace might contain some bogus
    // frames. This is intended to ensure that we don't crash in these situations by
    // ignoring any frames that we can't examine.
    if (ste != null && ste.getMethodName != null &&
        !ste.getMethodName.contains("getStackTrace")) {
      if (insideSpark) {
        if (skipClass(ste.getClassName)) {
          lastSparkMethod = if (ste.getMethodName == "<init>") {
            // Spark method is a constructor; get its class name
            ste.getClassName.substring(ste.getClassName.lastIndexOf('.') + 1)
          } else {
            ste.getMethodName
          }
          callStack(0) = ste.toString // Put last Spark method on top of the stack trace.
        } else {
          if (ste.getFileName != null) {
            firstUserFile = ste.getFileName
            if (ste.getLineNumber >= 0) {
              firstUserLine = ste.getLineNumber
            }
          }
          callStack += ste.toString
          insideSpark = false
        }
      } else {
        callStack += ste.toString
      }
    }
  }

  val callStackDepth = System.getProperty("spark.callstack.depth", "20").toInt
  val shortForm =
    if (firstUserFile == "HiveSessionImpl.java") {
      // To be more user friendly, show a nicer string for queries submitted from the JDBC
      // server.
      "Spark JDBC Server Query"
    } else {
      s"$lastSparkMethod at $firstUserFile:$firstUserLine"
    }
  val longForm = callStack.take(callStackDepth).mkString("\n")

  CallSite(shortForm, longForm)
}
```
First of all, this method returns a CallSite object. CallSite is a case class defined in Utils.scala (alongside the Utils object, in the org.apache.spark.util package). Here is its source:
```scala
/** CallSite represents a place in user code. It can have a short and a long form. */
private[spark] case class CallSite(shortForm: String, longForm: String)

private[spark] object CallSite {
  val SHORT_FORM = "callSite.short"
  val LONG_FORM = "callSite.long"
  val empty = CallSite("", "")
}
```
This is a case class. Case classes are commonly used as data carriers, much like a VO (value object) in Java. This one carries two strings, shortForm and longForm: literally, a short form and a long form of the call site. (The SHORT_FORM and LONG_FORM constants in the companion object are just the property keys, "callSite.short" and "callSite.long", used to pass these values around.) So what exactly do these two forms contain? Look at the getCallSite() method.
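As a quick aside, here is a minimal sketch (hypothetical names, not Spark's actual code) of how a case class behaves as a plain data carrier, the way CallSite does:

```scala
// A stand-in for Spark's CallSite, just to show case-class behavior.
case class CallSiteExample(shortForm: String, longForm: String)

object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    val cs = CallSiteExample("count at App.scala:10", "full stack trace...")
    // Case classes get field accessors, structural equals/hashCode,
    // toString and copy for free -- no boilerplate getters needed.
    println(cs.shortForm)
    val copied = cs.copy(shortForm = "map at App.scala:12")
    println(copied == cs)                                            // false
    println(cs == CallSiteExample("count at App.scala:10", "full stack trace...")) // true
  }
}
```

This is exactly why CallSite needs no methods of its own: it only ferries the two strings from where the stack is inspected to where they are displayed.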
```scala
Thread.currentThread.getStackTrace().foreach { ste: StackTraceElement =>
  // When running under some profilers, the current stack trace might contain some bogus
  // frames. This is intended to ensure that we don't crash in these situations by
  // ignoring any frames that we can't examine.
  if (ste != null && ste.getMethodName != null &&
      !ste.getMethodName.contains("getStackTrace")) {
    if (insideSpark) {
      if (skipClass(ste.getClassName)) {
        lastSparkMethod = if (ste.getMethodName == "<init>") {
          // Spark method is a constructor; get its class name
          ste.getClassName.substring(ste.getClassName.lastIndexOf('.') + 1)
        } else {
          ste.getMethodName
        }
        callStack(0) = ste.toString // Put last Spark method on top of the stack trace.
      } else {
        if (ste.getFileName != null) {
          firstUserFile = ste.getFileName
          if (ste.getLineNumber >= 0) {
            firstUserLine = ste.getLineNumber
          }
        }
        callStack += ste.toString
        insideSpark = false
      }
    } else {
      callStack += ste.toString
    }
  }
}
```

Reading this code, we can see that the method takes the current thread's stack trace and walks it frame by frame, writing each frame whose class name matches certain rules to the top of the stack (callStack(0)). (As a special case, if the call came from HiveSessionImpl.java, i.e. the query was submitted through the JDBC Thrift server, shortForm is set to "Spark JDBC Server Query".) The matching rules come from the default filter function:

```scala
/** Default filtering function for finding call sites using `getCallSite`. */
private def sparkInternalExclusionFunction(className: String): Boolean = {
  // A regular expression to match classes of the internal Spark API's
  // that we want to skip when finding the call site of a method.
  val SPARK_CORE_CLASS_REGEX =
    """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]""".r
  val SPARK_SQL_CLASS_REGEX = """^org\.apache\.spark\.sql.*""".r
  val SCALA_CORE_CLASS_PREFIX = "scala"
  val isSparkClass = SPARK_CORE_CLASS_REGEX.findFirstIn(className).isDefined ||
    SPARK_SQL_CLASS_REGEX.findFirstIn(className).isDefined
  val isScalaClass = className.startsWith(SCALA_CORE_CLASS_PREFIX)
  // If the class is a Spark internal class or a Scala class, then exclude.
  isSparkClass || isScalaClass
}
```

In other words, class names matching `org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]` or `org\.apache\.spark\.sql.*` (or starting with `scala`) are treated as Spark-internal.
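To make the exclusion rules concrete, here is a small standalone check of the same regexes against a few sample class names (the class names are chosen purely for illustration; only the patterns themselves come from the Spark source):

```scala
object ExclusionDemo {
  // The same patterns used by Spark's sparkInternalExclusionFunction.
  private val SPARK_CORE_CLASS_REGEX =
    """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.broadcast)?\.[A-Z]""".r
  private val SPARK_SQL_CLASS_REGEX = """^org\.apache\.spark\.sql.*""".r

  // Returns true when the class should be skipped (Spark-internal or Scala).
  def isExcluded(className: String): Boolean =
    SPARK_CORE_CLASS_REGEX.findFirstIn(className).isDefined ||
      SPARK_SQL_CLASS_REGEX.findFirstIn(className).isDefined ||
      className.startsWith("scala")

  def main(args: Array[String]): Unit = {
    println(isExcluded("org.apache.spark.rdd.RDD"))        // true: Spark core class
    println(isExcluded("org.apache.spark.sql.Dataset"))    // true: Spark SQL class
    println(isExcluded("scala.collection.immutable.List")) // true: Scala class
    println(isExcluded("com.example.MyApp"))               // false: user code
  }
}
```

The last case is the interesting one: a user's own class fails every pattern, so getCallSite treats it as the first user frame and records its file and line.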
For frames whose classes match these patterns, the method name is assigned to lastSparkMethod and the frame is written to callStack(0), the top of the stack. Note that this happens for every matching frame, each write overwriting the previous one, so what survives is the last (shallowest) Spark method. You can observe the resulting stack information by running the official LogQuery example.
The first frame that does not match the two regular expressions above is the first user frame: its file name is stored in firstUserFile, its line number in firstUserLine, and insideSpark is set to false, so that every subsequent frame is simply appended to the end of callStack.
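The whole mechanism can be sketched in miniature. The helper below is hypothetical (it is not Spark's exact code, and firstUserFrame is a name invented here), but it shows the same idea: walk the current thread's stack and report the first frame whose class falls outside a set of "internal" packages:

```scala
object CallSiteSketch {
  // Find the first stack frame whose class the caller does NOT want to skip,
  // ignoring null/bogus frames and the getStackTrace frame itself,
  // mirroring the guards in Spark's getCallSite.
  def firstUserFrame(skip: String => Boolean): Option[StackTraceElement] =
    Thread.currentThread.getStackTrace
      .filter(ste => ste != null && ste.getMethodName != null &&
        !ste.getMethodName.contains("getStackTrace"))
      .find(ste => !skip(ste.getClassName))

  def main(args: Array[String]): Unit = {
    // For this demo, treat JDK classes as "internal", the way Spark
    // treats org.apache.spark.* and scala.* classes.
    val skip = (cn: String) => cn.startsWith("java.") || cn.startsWith("jdk.")
    firstUserFrame(skip).foreach { ste =>
      // Format it like Spark's shortForm: "<method> at <file>:<line>"
      println(s"${ste.getMethodName} at ${ste.getFileName}:${ste.getLineNumber}")
    }
  }
}
```

This is the essence of shortForm: one line saying which call, in which user file, at which line, led into Spark, while longForm keeps up to spark.callstack.depth (default 20) raw frames for debugging.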