Flume watches all files in a directory on Linux and writes their contents to HDFS, producing multiple files whose names end with a timestamp; the data is then read in batch with Spark.
Configure flume-spooldir.conf
### define agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3

### define sources
a3.sources.r3.type = spooldir
### directory to watch
a3.sources.r3.spoolDir = /usr/local/src/apache-flume-1.6.0-bin/data
### skip files ending in .log
a3.sources.r3.ignorePattern = ^(.)*\\.log$
### suffix appended to a file once it has been fully ingested
a3.sources.r3.fileSuffix = .delete

### define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /usr/local/src/apache-flume-1.6.0-bin/data/filechannel/checkpoint
a3.channels.c3.dataDirs = /usr/local/src/apache-flume-1.6.0-bin/data/filechannel/data

### define sink
a3.sinks.k3.type = hdfs
### create a directory on HDFS named after the current date
a3.sinks.k3.hdfs.path = hdfs://master:9000/user/root/%Y%m%d
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 100
### use the agent's local clock to resolve %Y%m%d (no timestamp interceptor needed)
a3.sinks.k3.hdfs.useLocalTimeStamp = true

### bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
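To see which files the `ignorePattern` excludes, here is a small Python illustration (an assumption of this sketch: Java and Python regexes behave identically for this simple pattern). Files matching `^(.)*\.log$` are never picked up by the spooldir source:

```python
import re

# Same pattern as a3.sources.r3.ignorePattern (single backslash here:
# the doubled backslash in the .conf file is a properties-file escape).
ignore = re.compile(r"^(.)*\.log$")

print(bool(ignore.match("app.log")))        # ends in .log -> skipped
print(bool(ignore.match("FlumeData.txt")))  # ingested normally
```

This prints `True` then `False`: only the `.log` file is filtered out.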
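Once Flume has rolled files into the dated directory, Spark can read them all in one batch job. The following is a hypothetical PySpark sketch, not the original author's code: to keep it self-contained it reads a local temp directory it creates itself, whereas in production `textFile` would point at the dated HDFS path (e.g. `hdfs://master:9000/user/root/<yyyymmdd>/*`, assuming the NameNode address from the sink config):

```python
import os
import tempfile

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("flume-batch-read")
         .getOrCreate())

# Stand-in for the dated HDFS directory written by the Flume sink;
# the glob "*" picks up every rolled file in the directory at once.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "FlumeData.1700000000000"), "w") as f:
    f.write("line1\nline2\n")

rdd = spark.sparkContext.textFile(os.path.join(data_dir, "*"))
print(rdd.count())  # total number of lines across all rolled files
spark.stop()
```

Because `textFile` accepts glob patterns, all of the timestamp-suffixed files under one date directory are read as a single RDD.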