Problem description: with the code below, files added to the monitored path are never read. The debug log shows that every newly added file is being filtered out.
def run(ssc: StreamingContext, directory: String): Unit = {
  ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
      filter => defaultFilter(filter),
      newFilesOnly = false)
    .map(_._2.toString)
    .foreachRDD(rdd => {
      System.err.println(s"hahahaha: ${rdd.collect().mkString("Array(", ", ", ")")}")
      System.err.println(s"wawawawa: ${rdd.getNumPartitions}")
      rdd.foreachPartition(iter => {
        var length = 0
        System.err.println("foreachPartitionLog")
        while (iter.hasNext) {
          length += 1
          val row = iter.next()
          println(s"$length: $row")
        }
      })
    })
}
Filter log: both files are ignored because their modification times fall below the lower bound of the remember window for this batch.
2022-07-05 14:33:15.047 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: FileInputDStream.writeObject used
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Clearing checkpoint data
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Cleared checkpoint data
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Time 1657002800000 ms is valid
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Getting new files for time 1657002800000, ignoring files older than 1657002740000
filter: demo.txt
filter: demo2.txt
2022-07-05 14:33:20.019 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo.txt ignored as mod time 1656988559509 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo2.txt ignored as mod time 1657002288707 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Finding new files took 35 ms
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: # cached file times = 2
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: New files at time 1657002800000 ms:
demo2.txt is in fact the file that was just added. The reason its modification time still falls below the threshold is that my program runs on my own machine while HDFS is on a separate cluster, and the two clocks are not synchronized: the cluster's clock is about 8 minutes behind the actual time.
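To confirm the skew, a small diagnostic can compare the driver's clock against the modification time HDFS reports for the file. This is only a sketch: ClockSkewCheck is a hypothetical helper, and the path is the one from the log above (namenode host masked).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical diagnostic: print the driver clock, the HDFS-reported modification
// time, and their difference, to measure the skew between the two machines.
object ClockSkewCheck {
  def main(args: Array[String]): Unit = {
    val path = new Path("hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo2.txt")
    val fs = path.getFileSystem(new Configuration())
    val modTime = fs.getFileStatus(path).getModificationTime
    val driverNow = System.currentTimeMillis()
    println(s"driver clock      : $driverNow")
    println(s"HDFS mod time     : $modTime")
    println(s"driver - mod time : ${driverNow - modTime} ms")
  }
}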
Source code analysis:
Calling fileStream actually registers a FileInputDStream into the DStream graph. Inside that class, lastNewFileFindingTime is set from the clock's current time. Files are filtered with the method below, whose currentTime parameter is the end time of the batch interval for which the RDD is being generated, i.e. the current time:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()

    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds  // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")

    val newFileFilter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val directoryFilter = new PathFilter {
      override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    }
    val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    val newFiles = directories.flatMap(dir =>
      fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
          "Consider increasing the batch size or reducing the number of " +
          "files in the monitored directory."
      )
    }
    newFiles
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
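Plugging the numbers from the debug log into this threshold calculation makes the rejection concrete. The sketch below assumes the default remember window of 60 s (spark.streaming.fileStream.minRememberDuration left at 60s) and initialModTimeIgnoreThreshold = 0, which is what newFilesOnly = false yields:

// Worked example with the values from the log above (assumptions stated in the text).
val currentTime = 1657002800000L            // batch end time from the log
val durationToRememberMs = 60000L           // assumed default remember window: 60 s
val initialModTimeIgnoreThreshold = 0L      // newFilesOnly = false
val modTimeIgnoreThreshold =
  math.max(initialModTimeIgnoreThreshold, currentTime - durationToRememberMs)
// => 1657002740000, exactly the "ignoring files older than" value in the log.
// demo2.txt's mod time 1657002288707 is ~8.5 minutes before the batch time,
// far outside the 60 s window, so it is rejected.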
The key logic is the isNewFile check. From the time parameters handed over by findNewFiles, it is clear that the program's notion of the current time is compared against the modification time obtained from the file system fs:
private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = {
  val pathStr = path.toString
  // Reject file if it does not satisfy filter
  if (!filter(path)) {
    logDebug(s"$pathStr rejected by filter")
    return false
  }
  // Reject file if it was created before the ignore time
  val modTime = getFileModTime(path)
  if (modTime <= modTimeIgnoreThreshold) {
    // Use <= instead of < to avoid SPARK-4518
    logDebug(s"$pathStr ignored as mod time $modTime <= ignore time $modTimeIgnoreThreshold")
    return false
  }
  // Reject file if mod time > current batch time
  if (modTime > currentTime) {
    logDebug(s"$pathStr not selected as mod time $modTime > current time $currentTime")
    return false
  }
  // Reject file if it was considered earlier
  if (recentlySelectedFiles.contains(pathStr)) {
    logDebug(s"$pathStr already considered")
    return false
  }
  logDebug(s"$pathStr accepted with mod time $modTime")
  return true
}
private def getFileModTime(path: Path) = {
  fileToModTime.getOrElseUpdate(path.toString, fs.getFileStatus(path).getModificationTime())
}
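In short, the files are dropped because the 8-minute clock skew between the driver machine and the HDFS cluster pushes their modification times outside the remember window. The clean fix is to synchronize the clocks (e.g. via NTP). If that is not possible, one mitigation, sketched below under the assumption that picking up slightly older files is acceptable, is to widen the remember window via spark.streaming.fileStream.minRememberDuration (default 60s), so mod times that lag the driver clock by a few minutes still fall inside [currentTime - durationToRemember, currentTime]. The app name and the Seconds(20) batch interval are placeholders, not taken from the original job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: make the remember window larger than the observed ~8 min skew so that
// newly added files whose mod times look "old" to the driver are still selected.
val conf = new SparkConf()
  .setAppName("file-stream-demo")  // placeholder app name
  .set("spark.streaming.fileStream.minRememberDuration", "600s")  // > 8 min skew
val ssc = new StreamingContext(conf, Seconds(20))  // placeholder batch interval

Note that this only papers over the skew: a file whose reported mod time lags by more than the configured window would still be missed, so clock synchronization remains the proper fix.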