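Download the source package
flume-ng-1.6.0-cdh5.7.0-src.tar.gz
Extract it and import the project into IDEA
Locate the getMatchFiles method, which is the one we need to modify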
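/**
 * Modified from the Flume source so that file matching is recursive.
 * @param parentDir directory to scan, including all of its subdirectories
 * @param fileNamePattern regular expression applied to each file name
 * @return matching files, sorted by last-modified time
 */
private List<File> getMatchFiles(File parentDir, final Pattern fileNamePattern) {
  // Collect every file under the given directory (recursively),
  // then keep only those whose names match the regular expression.
  List<File> result = Lists.newArrayList();
  for (File f : getAllFiles(parentDir)) {
    String fileName = f.getName();
    if (fileNamePattern.matcher(fileName).matches()) {
      result.add(f);
    }
  }
  Collections.sort(result, new TailFile.CompareByLastModifiedTime());
  return result;
}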
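/**
 * New method.
 * Returns all files under the given directory, found by recursion.
 * @param parentDir directory to start from
 * @return every non-directory entry under parentDir
 */
private List<File> getAllFiles(File parentDir) {
  List<File> fileList = Lists.newArrayList();
  getAllFiles(parentDir, fileList);
  return fileList;
}
/**
 * New method.
 * Recursive helper that appends the files under parentDir to fileList.
 */
private void getAllFiles(File parentDir, List<File> fileList) {
  // listFiles() returns null if parentDir is not a directory or cannot be read.
  File[] files = parentDir.listFiles();
  if (null != files) {
    // Iterate over the array that was null-checked instead of listing the directory again.
    for (File file : files) {
      if (file.isDirectory()) {
        getAllFiles(file, fileList);
      } else {
        fileList.add(file);
      }
    }
  }
}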
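To try the recursive matching logic outside of Flume, here is a minimal standalone sketch. It is not the Flume code: the class name RecursiveMatchSketch is hypothetical, it uses the JDK's Files.walk instead of the hand-written recursion, and it assumes that sorting by last-modified time means oldest first. It also keeps only regular files, whereas the method above keeps every entry that is not a directory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical standalone class, not part of Flume.
public class RecursiveMatchSketch {

    // Walk parentDir recursively, keep regular files whose name fully matches
    // the pattern, and order them by last-modified time (assumed: oldest first).
    static List<File> getMatchFiles(File parentDir, Pattern fileNamePattern) throws IOException {
        try (Stream<Path> paths = Files.walk(parentDir.toPath())) {
            return paths
                .filter(Files::isRegularFile)
                .map(Path::toFile)
                .filter(f -> fileNamePattern.matcher(f.getName()).matches())
                .sorted(Comparator.comparingLong(File::lastModified))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small nested tree in a temporary directory.
        Path root = Files.createTempDirectory("taildir-input");
        Path nested = Files.createDirectories(root.resolve("1").resolve("2"));
        Files.write(nested.resolve("test.txt"), "hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        Files.write(root.resolve("1").resolve("test.txt"), "666\n".getBytes(StandardCharsets.UTF_8));
        Files.write(root.resolve("1").resolve("skip.log"), "ignored\n".getBytes(StandardCharsets.UTF_8));

        // Same file-name pattern as the filegroup used later in this post.
        for (File f : getMatchFiles(root.toFile(), Pattern.compile(".*.txt"))) {
            System.out.println(f);
        }
    }
}
```

Running main should list the two test.txt files and skip skip.log, because the pattern is matched against the file name only, not the full path.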
Upload to the server and compile
Upload the modified ReliableTaildirEventReader class to the path below, replacing the original file
[hadoop@hadoop001 taildir]$ ll
total 36
-rw-rw-r-- 1 hadoop hadoop 11411 Mar 24 2016 ReliableTaildirEventReader.java
-rw-rw-r-- 1 hadoop hadoop 2418 Mar 24 2016 TaildirSourceConfigurationConstants.java
-rw-rw-r-- 1 hadoop hadoop 12027 Mar 24 2016 TaildirSource.java
-rw-rw-r-- 1 hadoop hadoop 5129 Mar 24 2016 TailFile.java
[hadoop@hadoop001 taildir]$ pwd
/home/hadoop/source/flume-ng-1.6.0-cdh5.7.0/flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir
[hadoop@hadoop001 taildir]$
[hadoop@hadoop001 flume-taildir-source]$ pwd
/home/hadoop/source/flume-ng-1.6.0-cdh5.7.0/flume-ng-sources/flume-taildir-source
[hadoop@hadoop001 flume-taildir-source]$ mvn clean package
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- maven-jar-plugin:2.3.1:jar (default-jar) @ flume-taildir-source ---
[INFO] Building jar: /home/hadoop/source/flume-ng-1.6.0-cdh5.7.0/flume-ng-sources/flume-taildir-source/target/flume-taildir-source-1.6.0-cdh5.7.0.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:33 min
[INFO] Finished at: 2020-05-20T19:57:29+08:00
[INFO] Final Memory: 37M/981M
[INFO] ------------------------------------------------------------------------
[hadoop@hadoop001 flume-taildir-source]$ ll
total 4
-rw-rw-r-- 1 hadoop hadoop 1970 Mar 24 2016 pom.xml
drwxrwxr-x 4 hadoop hadoop 30 Mar 24 2016 src
drwxrwxr-x 8 hadoop hadoop 212 May 20 19:57 target
[hadoop@hadoop001 flume-taildir-source]$ cd target/
[hadoop@hadoop001 target]$ ll
total 36
drwxrwxr-x 4 hadoop hadoop 33 May 20 19:55 classes
-rw-rw-r-- 1 hadoop hadoop 31327 May 20 19:57 flume-taildir-source-1.6.0-cdh5.7.0.jar
drwxrwxr-x 4 hadoop hadoop 49 May 20 19:55 generated-sources
drwxrwxr-x 2 hadoop hadoop 28 May 20 19:57 maven-archiver
drwxrwxr-x 3 hadoop hadoop 22 May 20 19:55 maven-shared-archive-resources
drwxrwxr-x 2 hadoop hadoop 4096 May 20 19:57 surefire-reports
drwxrwxr-x 4 hadoop hadoop 33 May 20 19:56 test-classes
[hadoop@hadoop001 target]$
Copy flume-taildir-source-1.6.0-cdh5.7.0.jar from this directory into the lib directory of the Flume installation
[hadoop@hadoop001 target]$ cp flume-taildir-source-1.6.0-cdh5.7.0.jar ~/app/apache-flume-1.6.0-cdh5.7.0-bin/lib/
Create a new conf file to test TaildirSource
Here we sink directly to HDFS
# taildir-hdfs-agent.conf: A single-node Flume configuration
# Name the components on this agent
taildir-hdfs-agent.sources = taildir-source
taildir-hdfs-agent.sinks = hdfs-sink
taildir-hdfs-agent.channels = memory-channel
# Describe/configure the source
taildir-hdfs-agent.sources.taildir-source.type = TAILDIR
taildir-hdfs-agent.sources.taildir-source.filegroups = f1
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/hadoop/data/flume/taildir/input/.*.txt
taildir-hdfs-agent.sources.taildir-source.positionFile = /home/hadoop/data/flume/taildir/taildir_position/taildir_position.json
# Describe the sink
taildir-hdfs-agent.sinks.hdfs-sink.type = hdfs
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://hadoop001:9000/flume/taildir/%Y%m%d%H%M
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = leo
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 100000000
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0
# Use a channel which buffers events in memory
taildir-hdfs-agent.channels.memory-channel.type = memory
taildir-hdfs-agent.channels.memory-channel.capacity = 1000
taildir-hdfs-agent.channels.memory-channel.transactionCapacity = 100
# Bind the source and sink to the channel
taildir-hdfs-agent.sources.taildir-source.channels = memory-channel
taildir-hdfs-agent.sinks.hdfs-sink.channel = memory-channel
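The filegroups.f1 value combines a directory and a file-name regular expression in one string. The sketch below shows one way such a value can be split, on the assumption that the last path segment is treated as the regex and everything before it as the parent directory; this matches the parentDir and fileNamePattern arguments of getMatchFiles, but the class and helper names here are hypothetical and are not Flume APIs.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class FilegroupSplitSketch {

    // Assumption: the directory part is everything before the last path segment.
    static File parentDirOf(String filegroupValue) {
        return new File(filegroupValue).getParentFile();
    }

    // Assumption: the last path segment is compiled as the file-name regex.
    static Pattern fileNamePatternOf(String filegroupValue) {
        return Pattern.compile(new File(filegroupValue).getName());
    }

    public static void main(String[] args) {
        String value = "/home/hadoop/data/flume/taildir/input/.*.txt";
        Pattern p = fileNamePatternOf(value);

        System.out.println("parent dir: " + parentDirOf(value));
        System.out.println("pattern:    " + p.pattern());

        // matches() requires the whole file name to match the pattern.
        for (String name : new String[] {"test.txt", "test.txt.bak", "notestxt"}) {
            System.out.println(name + " -> " + p.matcher(name).matches());
        }
    }
}
```

Note that the dot before txt in .*.txt is unescaped, so it matches any character: test.txt matches, test.txt.bak does not, and a name such as notestxt also matches. If only a literal .txt suffix is wanted, the dot needs to be escaped, keeping in mind that a backslash is itself an escape character in a properties file.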
Start the Flume agent and test
flume-ng agent \
--name taildir-hdfs-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir-hdfs-agent.conf \
-Dflume.root.logger=INFO,console
Open a second terminal window
[hadoop@hadoop001 input]$ mkdir -p /home/hadoop/data/flume/taildir/input/1/2
[hadoop@hadoop001 input]$ echo "hello hadoop" >> /home/hadoop/data/flume/taildir/input/1/2/test.txt
[hadoop@hadoop001 input]$ echo "666" >> /home/hadoop/data/flume/taildir/input/1/test.txt
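The test succeeds: the lines from both files, including the one two directory levels down, arrive in HDFS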
[hadoop@hadoop001 input]$ hdfs dfs -ls /flume/taildir/
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2020-05-20 20:14 /flume/taildir/202005202013
[hadoop@hadoop001 input]$ hdfs dfs -ls /flume/taildir/202005202013
Found 1 items
-rw-r--r-- 1 hadoop supergroup 57 2020-05-20 20:14 /flume/taildir/202005202013/leo.1589976812778.gz
[hadoop@hadoop001 input]$ hdfs dfs -text /flume/taildir/202005202013/leo.1589976812778.gz
hello hadoop
666
[hadoop@hadoop001 input]$
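The directory name 202005202013 comes from the %Y%m%d%H%M escape in hdfs.path, evaluated against local time because hdfs.useLocalTimeStamp is true. As a small check, the sketch below formats the number embedded in the file name, 1589976812778, on the assumption that it is a timestamp in epoch milliseconds and that the server runs at UTC+08:00 (the offset shown in the Maven build log above). The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only.
public class HdfsPathEscapeSketch {

    // Java pattern equivalent of %Y%m%d%H%M: year, month, day, hour, minute.
    private static final DateTimeFormatter MINUTE_BUCKET =
        DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // Format an epoch-millisecond timestamp as a minute bucket at the given offset.
    static String minuteBucket(long epochMillis, ZoneOffset offset) {
        return MINUTE_BUCKET.withZone(offset).format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Number taken from leo.1589976812778.gz; UTC+08:00 is an assumption.
        System.out.println(minuteBucket(1589976812778L, ZoneOffset.ofHours(8)));
        // prints 202005202013
    }
}
```

Because the path contains a minute-level escape, events written in a different minute land in a different directory, so a longer test run produces several sibling directories under /flume/taildir/.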