Spark Streaming: newFilesOnly = false does not process existing files

pioneer_LL

Last modified 2022-05-31 16:54:13

Category: spark. Tags: spark, big data

First published 2022-05-31 16:50:17

Article link: https://blog.csdn.net/pioneer_LL/article/details/125068709

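import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val sparkconf = new SparkConf().setAppName("st").setMaster("local[2]")
// Build the context from the conf above (not from an undefined sc); the batch interval is 60 seconds
val ssc = new StreamingContext(sparkconf, Seconds(60))

val path = "C:\\Users\\Desktop\\files\\*"
val lines: DStream[String] = ssc.fileStream[LongWritable, Text, TextInputFormat](path, (t: Path) => true, newFilesOnly = false).map(_._2.toString)
// Split each line on commas and count the occurrences of each token within the batch
val wc = lines.flatMap(_.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
wc.count().print() // prints the number of distinct tokens in each batch

ssc.start()
ssc.awaitTermination()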

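I wanted the job to process the files that already existed under the path as well, but setting newFilesOnly = false appeared to have no effect. After some searching and testing, I found that newFilesOnly = false only picks up files from the batch window immediately before the current one. The batch interval in the code above is 60s, so if the current time is 12:05, existing files created in (12:04, 12:05] are processed, but anything older is ignored. The reason is that spark.streaming.fileStream.minRememberDuration defaults to 60s, which limits how far back the file stream looks. The fix is to add the setting sparkconf.set("spark.streaming.fileStream.minRememberDuration", "86400s") before the StreamingContext is created. Whatever duration you specify, existing files generated within that span before the current time are processed.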
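
To make the fix concrete, here is a minimal sketch of the same job with the setting applied. It assumes the same directory and 60-second batch interval as above; the object name ProcessExistingFiles is hypothetical, and 86400s (one day) is just the value used in this post, so adjust it to cover the age of the files you need.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical object name, used only for this illustration.
object ProcessExistingFiles {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf()
      .setAppName("st")
      .setMaster("local[2]")
      // Look back one day for existing files. This has to be set
      // before the StreamingContext is created from the conf.
      .set("spark.streaming.fileStream.minRememberDuration", "86400s")

    val ssc = new StreamingContext(sparkconf, Seconds(60))

    val path = "C:\\Users\\Desktop\\files\\*"
    val lines: DStream[String] = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        path,
        (_: Path) => true,     // accept every file under the path
        newFilesOnly = false   // also consider files that already exist
      )
      .map(_._2.toString)

    // Split each line on commas and count each token within the batch.
    val wc = lines
      .flatMap(_.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the number of distinct tokens in each batch.
    wc.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this in place, the first batch should include the older files in the directory, as long as they fall within the configured duration. A larger value also means Spark keeps track of more files, so it is probably best not to set it higher than needed.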
