Notes on a Spark 2.3 bug: creating a temporary function over a JDBC connection to the Thrift Server

彼岸枫雪非

Article link: https://blog.csdn.net/u012543819/article/details/106113826

Problem description

Our production environment currently runs Spark 2.3. A customer recently started using UDFs for some of their work, following these steps:

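1. Write the UDF and package it as UDF.jar.
2. Connect to the Thrift Server (running in YARN mode) with beeline or JDBC and run a CREATE TEMPORARY FUNCTION ... USING JAR statement that points at udf.jar. The temporary function can then be used in the current session.
3. When finished, delete the jar.

After that, SQL executed in other sessions fails with FileNotFoundException: File XXX does not exist. Restarting the Thrift service makes the problem go away.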

Locating the problem

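Why would other tasks still need the jar after it has been deleted? With that question in mind, I went digging through the source code.
I started where the error is raised, around line 801 of Executor.scala: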

  /**
   * Download any missing dependencies if we receive a new set of files and JARs from the
   * SparkContext. Also adds any new JARs we fetched to the class loader.
   */
  private def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]) {
    lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
    synchronized {
      // Fetch missing dependencies
      for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
        logInfo("Fetching " + name + " with timestamp " + timestamp)
        // This call is where the exception is thrown.
        // Fetch file with useCache mode, close cache for local mode.
        Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
          env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
        currentFiles(name) = timestamp
      }
      initThreadCurrentJars()
      // ... remaining code omitted

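This method downloads the jar and file dependencies of a task. It first compares timestamps: an entry is fetched when the executor has no record of it, or when the recorded timestamp is older than the one the task carries. The fetch checks the executor's local cache first and, on a miss, downloads from the file system that the URI points to.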
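
The timestamp check can be isolated in a few lines. The following is a minimal sketch, not Spark code: DependencyTracker, staleEntries and markFetched are names made up for illustration, and the paths are placeholders.

```scala
import scala.collection.mutable

// Simplified model of the check in Executor.updateDependencies.
// DependencyTracker, staleEntries and markFetched are illustrative names, not Spark APIs.
object DependencyTracker {
  // name -> timestamp of the copy this executor has already fetched
  val currentFiles = mutable.HashMap[String, Long]()

  // Entries that must be fetched: unknown to this executor,
  // or recorded with an older timestamp than the task carries.
  def staleEntries(newFiles: Map[String, Long]): Map[String, Long] =
    newFiles.filter { case (name, timestamp) =>
      currentFiles.getOrElse(name, -1L) < timestamp
    }

  // Record a successful fetch so the entry is skipped next time.
  def markFetched(name: String, timestamp: Long): Unit =
    currentFiles(name) = timestamp
}
```

A freshly started executor has an empty currentFiles map, so in this model every entry a task carries counts as stale and has to be fetched again, including jars that were added long ago.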

The method takes two arguments: the newly added files and the newly added jars. Following them upstream shows that they arrive in the taskDescription parameter when the executor launches a task.

  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription)
    runningTasks.put(taskDescription.taskId, tr)
    threadPool.execute(tr)
  }

The taskDescription itself is built in TaskSetManager, which passes in addedJars and addedFiles. These are the jars and files the task will later have to download.


  // SPARK-21563 make a copy of the jars/files so they are consistent across the TaskSet
  private val addedJars = HashMap[String, Long](sched.sc.addedJars.toSeq: _*)
  private val addedFiles = HashMap[String, Long](sched.sc.addedFiles.toSeq: _*)

As the code shows, both are copied from the addedFiles and addedJars fields of the job's SparkContext.

  // Used to store a URL for each static file/jar together with the file's local timestamp
  private[spark] val addedFiles = new ConcurrentHashMap[String, Long]().asScala
  private[spark] val addedJars = new ConcurrentHashMap[String, Long]().asScala
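
To make the copy semantics concrete, here is a small sketch under stated assumptions: SnapshotDemo and contextJars are illustrative stand-ins for the SparkContext-wide map, not real Spark members.

```scala
import scala.collection.mutable

// Illustrative only: contextJars stands in for SparkContext.addedJars.
object SnapshotDemo {
  val contextJars = mutable.Map[String, Long]()

  // Mirrors what TaskSetManager does at construction:
  // copy the context-wide map once so the TaskSet sees a stable view.
  def snapshot(): mutable.HashMap[String, Long] =
    mutable.HashMap[String, Long](contextJars.toSeq: _*)
}
```

A snapshot taken after a jar has been added always contains that jar, so every TaskSet created later ships it to its tasks, regardless of which session triggered the job.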

This raised a question: doesn't each session get its own SparkContext? To find out, I looked at what openSession actually does.

  override def openSession(
      protocol: TProtocolVersion,
      username: String,
      passwd: String,
      ipAddress: String,
      sessionConf: java.util.Map[String, String],
      withImpersonation: Boolean,
      delegationToken: String): SessionHandle = {
    val sessionHandle =
      super.openSession(protocol, username, passwd, ipAddress, sessionConf, withImpersonation,
        delegationToken)
    val session = super.getSession(sessionHandle)
    val ss = session.getSessionState
    val hiveConf = session.getHiveConf
    ss.initTxnMgr(hiveConf)
    val txnManager = ss.getTxnMgr

    val ctx = if (sqlContext.conf.hiveThriftServerSingleSession) {
      sqlContext
    } else {
      sqlContext.newSession()
    }
    // ... remaining code omitted
  }

The key line is sqlContext.newSession().

  /**
   * Returns a [[SQLContext]] as new session, with separated SQL configurations, temporary
   * tables, registered functions, but sharing the same `SparkContext`, cached data and
   * other things.
   *
   * @since 1.6.0
   */
  def newSession(): SQLContext = sparkSession.newSession().sqlContext

The comment here is the important part:
Returns a [[SQLContext]] as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, cached data and other things.
In other words, all JDBC sessions share the SparkContext, cached data and so on. That is enough to explain the problem.

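When SQL is executed over a JDBC connection to the Spark Thrift Server, all sessions share a single SparkContext. Once a session creates a temporary function that references a jar, the jar is added to the SparkContext permanently, and from then on the tasks of every job carry it as a dependency. Dependencies are downloaded at execution time and, as noted above, the executor's local cache is consulted first. As long as the Thrift Server is not restarted, that SparkContext keeps being shared. If the user later stops using the function and deletes the jar it depends on, then whenever an executor's local cache becomes invalid or an executor restarts, the next task tries to download the jar from the file system again, hits the error above, and fails.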
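
The whole chain can be reproduced in a toy model. This is a sketch under loud assumptions: Context, Session, Executor and fileSystem are all invented for illustration and only mimic the behaviour described above; none of it is Spark code.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

// Toy model of the failure. Every class here is illustrative; none of it is Spark code.
object SharedContextDemo {
  // Stand-in for the remote file system (e.g. HDFS): the set of paths that currently exist.
  val fileSystem = mutable.Set[String]()

  // Stand-in for the single SparkContext that all Thrift Server sessions share.
  class Context {
    val addedJars = mutable.LinkedHashMap[String, Long]()
  }

  // Each session has its own function registry but points at the shared context.
  class Session(val ctx: Context) {
    val functions = mutable.Map[String, String]()

    def createTemporaryFunction(name: String, jar: String, timestamp: Long): Unit = {
      functions(name) = jar
      ctx.addedJars(jar) = timestamp // stays in the context even after the session ends
    }
  }

  // Each executor keeps a local cache of the jars it has already fetched.
  class Executor {
    val cache = mutable.Map[String, Long]()

    def runTask(jars: Map[String, Long]): Unit =
      for ((jar, ts) <- jars if cache.getOrElse(jar, -1L) < ts) {
        if (!fileSystem.contains(jar)) {
          throw new FileNotFoundException(s"File $jar does not exist")
        }
        cache(jar) = ts
      }
  }
}
```

In this model, an executor that fetched the jar before it was deleted keeps running tasks from any session without trouble, while a new executor with an empty cache throws FileNotFoundException on its first task, even for a session that never defined the function.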

Solution

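Once we had the root cause, we asked the user not to delete the jar for now, so that the environment stays usable. That is only a stopgap, though. Looking at how to fix the problem properly, we came up with two options: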
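1. When an executor fails to download a jar, only log the failure instead of throwing. If a function later cannot find its jar at execution time, it fails with an error, probably something like a ClassNotFoundException, which tells the user to upload the jar again. This is simple to implement but carries some risk: if a jar passed through spark-submit, or added in some other way, really did fail to upload, the eventual error no longer points clearly at the cause and troubleshooting becomes harder.
2. When a session closes and when a function is dropped, remove the function's jar from the SparkContext so that other jobs no longer reference a dependency they do not need. This fixes the problem at the root, but it takes more development and testing effort.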
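
The bookkeeping behind option 2 could look roughly like the sketch below. It is hypothetical: JarRegistry and its methods are invented names, the addedJars map here only plays the role of SparkContext.addedJars, and how this would hook into SparkContext and the session manager is not shown.

```scala
import scala.collection.mutable

// Hypothetical bookkeeping for option 2. JarRegistry is an illustrative name, not a Spark class.
object JarRegistry {
  // Jars currently shipped with tasks (plays the role of SparkContext.addedJars).
  val addedJars = mutable.Map[String, Long]()
  // For each jar, the sessions whose temporary functions still depend on it.
  private val users = mutable.Map[String, mutable.Set[String]]()

  // Called when a session creates a temporary function backed by a jar.
  def register(sessionId: String, jar: String, timestamp: Long): Unit = synchronized {
    users.getOrElseUpdate(jar, mutable.Set[String]()) += sessionId
    addedJars.getOrElseUpdate(jar, timestamp)
  }

  // Called on DROP FUNCTION: release one session's claim on one jar.
  // The jar is removed only when no session uses it any more.
  def release(sessionId: String, jar: String): Unit = synchronized {
    users.get(jar).foreach { sessions =>
      sessions -= sessionId
      if (sessions.isEmpty) {
        users -= jar
        addedJars -= jar
      }
    }
  }

  // Called when a session closes: release every jar the session registered.
  def closeSession(sessionId: String): Unit = synchronized {
    val jars = users.collect {
      case (jar, sessions) if sessions.contains(sessionId) => jar
    }.toList
    jars.foreach(release(sessionId, _))
  }
}
```

The sketch tracks usage per session rather than per function, so a session that defines two functions from the same jar and drops one of them would release the jar too early. A real implementation would need finer-grained counting, plus a decision about jars that were added by other means such as spark-submit.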

Summary

That is how this odd problem was found and tracked down. I am recording it here for reference, and discussion is welcome.
