Troubleshooting notes: spark-streaming fileStream cannot read files

Problem description: with the code below, files added to the monitored path are never read. The debug logs show that every newly added file gets filtered out.

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.streaming.StreamingContext

  def run(ssc: StreamingContext, directory: String): Unit = {
    // newFilesOnly = false: also pick up files already present in the directory
    ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
      (path: Path) => defaultFilter(path),
      newFilesOnly = false)
      .map(_._2.toString) // drop the byte-offset key, keep the line text
      .foreachRDD(rdd => {
        // Debug output only: collect() pulls the entire RDD to the driver
        System.err.println(s"hahahaha: ${rdd.collect().mkString("Array(", ", ", ")")}")
        System.err.println(s"wawawawa: ${rdd.getNumPartitions}")
        rdd.foreachPartition(iter => {
          var length = 0
          System.err.println("foreachPartitionLog")
          while (iter.hasNext) {
            length += 1
            val row = iter.next()
            println(s"$length: $row")
          }
        })
      })
  }
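
For reference, a minimal driver wiring this up could look like the sketch below. The app name, master, 20 s batch interval, and the trivial defaultFilter body are assumptions for illustration; only the monitored path comes from the logs that follow.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.Seconds

  def main(args: Array[String]): Unit = {
    // Hypothetical setup: master and batch interval are illustrative choices.
    val conf = new SparkConf().setAppName("FileStreamDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(20))
    run(ssc, "hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil")
    ssc.start()
    ssc.awaitTermination()
  }

  // Stand-in for the user-defined filter referenced above: accept every path.
  def defaultFilter(path: Path): Boolean = true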

Filter log: both files are ignored because their modification times are earlier than the minimum time of the corresponding remember window:

2022-07-05 14:33:15.047 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: FileInputDStream.writeObject used
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Clearing checkpoint data
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Cleared checkpoint data
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Time 1657002800000 ms is valid
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Getting new files for time 1657002800000, ignoring files older than 1657002740000
filter: demo.txt
filter: demo2.txt
2022-07-05 14:33:20.019 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo.txt ignored as mod time 1656988559509 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo2.txt ignored as mod time 1657002288707 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Finding new files took 35 ms
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: # cached file times = 2
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: New files at time 1657002800000 ms:

In fact, demo2.txt was newly added. Its mod time still falls below the threshold because my program runs on my own machine while HDFS lives on a different cluster, and the two clocks are not synchronized: the cluster's clock is about 8 minutes behind the real time.
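
To make the failure concrete, the sketch below replays the threshold arithmetic with the numbers from the log (the 60 s remember window is inferred from the gap between 1657002800000 and 1657002740000):

  // Illustrative only: replays the comparison logged above.
  val currentTime = 1657002800000L                  // batch end time, from the driver's local clock
  val durationToRememberMs = 60000L                 // 1657002800000 - 1657002740000
  val modTimeIgnoreThreshold = currentTime - durationToRememberMs // = 1657002740000

  // demo2.txt's mod time is stamped by the cluster's clock, roughly 8 minutes behind:
  val modTime = 1657002288707L
  assert(modTime <= modTimeIgnoreThreshold)         // hence "ignored as mod time <= ignore time"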

Source-code analysis:

Calling fileStream actually registers a FileInputDStream in the DStream graph. Its lastNewFileFindingTime field is taken from the clock's current time. The class filters files with the following method, whose currentTime parameter is the end time of the RDD's batch interval, i.e., the current batch time:

  private def findNewFiles(currentTime: Long): Array[String] = {
    try {
      lastNewFileFindingTime = clock.getTimeMillis()

      // Calculate ignore threshold
      val modTimeIgnoreThreshold = math.max(
        initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
        currentTime - durationToRemember.milliseconds  // trailing end of the remember window
      )
      logDebug(s"Getting new files for time $currentTime, " +
        s"ignoring files older than $modTimeIgnoreThreshold")

      val newFileFilter = new PathFilter {
        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
      }
      val directoryFilter = new PathFilter {
        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
      }
      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
      val newFiles = directories.flatMap(dir =>
        fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
      val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
      logInfo("Finding new files took " + timeTaken + " ms")
      logDebug("# cached file times = " + fileToModTime.size)
      if (timeTaken > slideDuration.milliseconds) {
        logWarning(
          "Time taken to find new files exceeds the batch size. " +
            "Consider increasing the batch size or reducing the number of " +
            "files in the monitored directory."
        )
      }
      newFiles
    } catch {
      case e: Exception =>
        logWarning("Error finding new files", e)
        reset()
        Array.empty
    }
  }
The key logic sits in the isNewFile method. From the time parameters passed in by findNewFiles, you can see that the program's (local) current time is compared against the modification time reported by the file system fs.
  private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = {
    val pathStr = path.toString
    // Reject file if it does not satisfy filter
    if (!filter(path)) {
      logDebug(s"$pathStr rejected by filter")
      return false
    }
    // Reject file if it was created before the ignore time
    val modTime = getFileModTime(path)
    if (modTime <= modTimeIgnoreThreshold) {
      // Use <= instead of < to avoid SPARK-4518
      logDebug(s"$pathStr ignored as mod time $modTime <= ignore time $modTimeIgnoreThreshold")
      return false
    }
    // Reject file if mod time > current batch time
    if (modTime > currentTime) {
      logDebug(s"$pathStr not selected as mod time $modTime > current time $currentTime")
      return false
    }
    // Reject file if it was considered earlier
    if (recentlySelectedFiles.contains(pathStr)) {
      logDebug(s"$pathStr already considered")
      return false
    }
    logDebug(s"$pathStr accepted with mod time $modTime")
    return true
  }

  private def getFileModTime(path: Path) = {
    fileToModTime.getOrElseUpdate(path.toString, fs.getFileStatus(path).getModificationTime())
  }
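
Given that root cause, the clean fix is to synchronize the driver host and the HDFS cluster via NTP. If the clocks cannot be synchronized, a workaround is to widen the remember window so that modTimeIgnoreThreshold = currentTime - durationToRemember reaches back past the skew. A minimal sketch, assuming the spark.streaming.fileStream.minRememberDuration setting (it feeds durationToRemember in FileInputDStream and defaults to 60s):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Widen the remember window well past the observed ~8-minute skew so the
  // ignore threshold lands before the (skewed) mod times the cluster reports.
  val conf = new SparkConf()
    .setAppName("FileStreamDemo") // illustrative app name
    .set("spark.streaming.fileStream.minRememberDuration", "900s")
  val ssc = new StreamingContext(conf, Seconds(20)) // 20 s batch interval is an assumption

The trade-off is that FileInputDStream keeps more batches of file metadata (recentlySelectedFiles, fileToModTime) in memory, so the window should only be as wide as the skew requires.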