Problem description: with the code below, files added to the monitored path are never read. The debug log shows that every newly added file is being filtered out.
def run(ssc: StreamingContext, directory: String): Unit = {
  ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
      filter => defaultFilter(filter),
      newFilesOnly = false)
    .map(_._2.toString)
    .foreachRDD(rdd => {
      System.err.println(s"hahahaha: ${rdd.collect().mkString("Array(", ", ", ")")}")
      System.err.println(s"wawawawa: ${rdd.getNumPartitions}")
      rdd.foreachPartition(iter => {
        var length = 0
        System.err.println("foreachPartitionLog")
        while (iter.hasNext) {
          length += 1
          val row = iter.next()
          println(s"$length: $row")
        }
      })
    })
}
Filter log: both files are ignored because their modification times fall below the lower bound of the remember window for this batch.
2022-07-05 14:33:15.047 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: FileInputDStream.writeObject used
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Clearing checkpoint data
2022-07-05 14:33:15.941 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Cleared checkpoint data
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Time 1657002800000 ms is valid
2022-07-05 14:33:20.000 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Getting new files for time 1657002800000, ignoring files older than 1657002740000
filter: demo.txt
filter: demo2.txt
2022-07-05 14:33:20.019 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo.txt ignored as mod time 1656988559509 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo2.txt ignored as mod time 1657002288707 <= ignore time 1657002740000
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: Finding new files took 35 ms
2022-07-05 14:33:20.035 [DEBUG] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: # cached file times = 2
2022-07-05 14:33:20.035 [ INFO] {JobGenerator} apache.spark.streaming.dstream.FileInputDStream: New files at time 1657002800000 ms:
demo2.txt is in fact the file that was just added. The reason its modification time still falls below the threshold is that my program runs on my own machine while HDFS is on a separate cluster, and the two clocks are not synchronized: the cluster's clock is about 8 minutes behind the actual time.
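To confirm the skew, a small diagnostic can compare the driver's clock against the modification time HDFS reports for the file. This is only a sketch: ClockSkewCheck is a hypothetical helper, and the path is the one from the log above (namenode host masked).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical diagnostic: print the driver clock, the HDFS-reported modification
// time, and their difference, to measure the skew between the two machines.
object ClockSkewCheck {
  def main(args: Array[String]): Unit = {
    val path = new Path("hdfs://****:8020/user/flowaix/warehouse/evil_apk_analyse.db/apk_evil/demo2.txt")
    val fs = path.getFileSystem(new Configuration())
    val modTime = fs.getFileStatus(path).getModificationTime
    val driverNow = System.currentTimeMillis()
    println(s"driver clock      : $driverNow")
    println(s"HDFS mod time     : $modTime")
    println(s"driver - mod time : ${driverNow - modTime} ms")
  }
}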
Source code analysis:
Calling fileStream actually registers a FileInputDStream into the DStream graph. Inside that class, lastNewFileFindingTime is set from the clock's current time. Files are filtered with the method below, whose currentTime parameter is the end time of the batch interval for which the RDD is being generated, i.e. the current time:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()

    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds  // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")

    val newFileFilter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val directoryFilter = new PathFilter {
      override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    }
    val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    val newFiles = directories.flatMap(dir =>
      fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
          "Consider increasing the batch size or reducing the number of " +
          "files in the monitored directory."
      )
    }
    newFiles
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
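Plugging the numbers from the debug log into this threshold calculation makes the rejection concrete. The sketch below assumes the default remember window of 60 s (spark.streaming.fileStream.minRememberDuration left at 60s) and initialModTimeIgnoreThreshold = 0, which is what newFilesOnly = false yields:

// Worked example with the values from the log above (assumptions stated in the text).
val currentTime = 1657002800000L            // batch end time from the log
val durationToRememberMs = 60000L           // assumed default remember window: 60 s
val initialModTimeIgnoreThreshold = 0L      // newFilesOnly = false
val modTimeIgnoreThreshold =
  math.max(initialModTimeIgnoreThreshold, currentTime - durationToRememberMs)
// => 1657002740000, exactly the "ignoring files older than" value in the log.
// demo2.txt's mod time 1657002288707 is ~8.5 minutes before the batch time,
// far outside the 60 s window, so it is rejected.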
The key logic is the isNewFile check. From the time parameters handed over by findNewFiles, it is clear that the program's notion of the current time is compared against the modification time obtained from the file system fs:
private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = {
  val pathStr = path.toString
  // Reject file if it does not satisfy filter
  if (!filter(path)) {
    logDebug(s"$pathStr rejected by filter")
    return false
  }
  // Reject file if it was created before the ignore time
  val modTime = getFileModTime(path)
  if (modTime <= modTimeIgnoreThreshold) {
    // Use <= instead of < to avoid SPARK-4518
    logDebug(s"$pathStr ignored as mod time $modTime <= ignore time $modTimeIgnoreThreshold")
    return false
  }
  // Reject file if mod time > current batch time
  if (modTime > currentTime) {
    logDebug(s"$pathStr not selected as mod time $modTime > current time $currentTime")
    return false
  }
  // Reject file if it was considered earlier
  if (recentlySelectedFiles.contains(pathStr)) {
    logDebug(s"$pathStr already considered")
    return false
  }
  logDebug(s"$pathStr accepted with mod time $modTime")
  return true
}
private def getFileModTime(path: Path) = {
  fileToModTime.getOrElseUpdate(path.toString, fs.getFileStatus(path).getModificationTime())
}
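In short, the files are dropped because the 8-minute clock skew between the driver machine and the HDFS cluster pushes their modification times outside the remember window. The clean fix is to synchronize the clocks (e.g. via NTP). If that is not possible, one mitigation, sketched below under the assumption that picking up slightly older files is acceptable, is to widen the remember window via spark.streaming.fileStream.minRememberDuration (default 60s), so mod times that lag the driver clock by a few minutes still fall inside [currentTime - durationToRemember, currentTime]. The app name and the Seconds(20) batch interval are placeholders, not taken from the original job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: make the remember window larger than the observed ~8 min skew so that
// newly added files whose mod times look "old" to the driver are still selected.
val conf = new SparkConf()
  .setAppName("file-stream-demo")  // placeholder app name
  .set("spark.streaming.fileStream.minRememberDuration", "600s")  // > 8 min skew
val ssc = new StreamingContext(conf, Seconds(20))  // placeholder batch interval

Note that this only papers over the skew: a file whose reported mod time lags by more than the configured window would still be missed, so clock synchronization remains the proper fix.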