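Flume version: 1.9.0
Scenario: the test folder contains the directories dir_a and dir_b and the file c.log; dir_a in turn contains the subdirectory dir_a_1 and the file a_01.log, and dir_a_1 may contain further subdirectories and files, and so on.
Requirement: monitor changes to every subdirectory and file under the test directory, recursively.
Flume does not ship a Source that meets this requirement, so there are two options:
① Write a custom Source
② Modify the source code of an existing Source to extend its functionality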
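1. Download the Flume source
https://github.com/apache/flume.git
2. Choose a starting point: TaildirSource
TaildirSource can monitor multiple entries under a parent directory by regex matching, and it maintains offsets internally, so it can resume from where it left off.
3. Analyze the source
TaildirSource.java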
public synchronized void start() {
  logger.info("{} TaildirSource source starting with directory: {}", getName(), filePaths);
  try {
    reader = new ReliableTaildirEventReader.Builder()
        .filePaths(filePaths)
        .headerTable(headerTable)
        .positionFilePath(positionFilePath)
        .skipToEnd(skipToEnd)
        .addByteOffset(byteOffsetHeader)
        .cachePatternMatching(cachePatternMatching)
        .annotateFileName(fileHeader)
        .fileNameHeader(fileHeaderKey)
        .build();
  } catch (IOException e) {
    throw new FlumeException("Error instantiating ReliableTaildirEventReader", e);
  }
  idleFileChecker = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactoryBuilder().setNameFormat("idleFileChecker").build());
  idleFileChecker.scheduleWithFixedDelay(new idleFileCheckerRunnable(),
      idleTimeout, checkIdleInterval, TimeUnit.MILLISECONDS);
  positionWriter = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactoryBuilder().setNameFormat("positionWriter").build());
  positionWriter.scheduleWithFixedDelay(new PositionWriterRunnable(),
      writePosInitDelay, writePosInterval, TimeUnit.MILLISECONDS);
  super.start();
  logger.debug("TaildirSource started");
  sourceCounter.start();
}
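start() is the entry point of TaildirSource. Follow ReliableTaildirEventReader.Builder().build().
Follow it into the constructor:
Further down, the constructor calls updateTailFiles(skipToEnd);
That method is documented as updating the tailFiles mapping when a new file is created or an existing file is appended to, which is exactly what we are looking for.
From there, follow the getMatchingFiles() method:
TaildirMatcher.java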
List<File> getMatchingFiles() {
  long now = TimeUnit.SECONDS.toMillis(
      TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
  long currentParentDirMTime = parentDir.lastModified();
  List<File> result;
  if (!cachePatternMatching ||
      lastSeenParentDirMTime < currentParentDirMTime ||
      !(currentParentDirMTime < lastCheckedTime)) {
    lastMatchedFiles = sortByLastModifiedTime(getMatchingFilesNoCache());
    lastSeenParentDirMTime = currentParentDirMTime;
    lastCheckedTime = now;
  }
  return lastMatchedFiles;
}

private List<File> getMatchingFilesNoCache() {
  List<File> result = Lists.newArrayList();
  try (DirectoryStream<Path> stream = Files.newDirectoryStream(parentDir.toPath(), fileFilter)) {
    for (Path entry : stream) {
      result.add(entry.toFile());
    }
  } catch (IOException e) {
    logger.error("I/O exception occurred while listing parent directory. " +
                 "Files already matched will be returned. " + parentDir.toPath(), e);
  }
  return result;
}
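The return value is lastMatchedFiles, a List containing multiple File objects.
sortByLastModifiedTime(List<File> files) only sorts the list it is given by modification time; the actual contents come from the return value of getMatchingFilesNoCache().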
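The condition that decides whether getMatchingFiles() rescans or reuses lastMatchedFiles can be pulled out into a pure function to make it easier to reason about. This is only a sketch: the class name CacheDecisionSketch and the method shouldRescan are made up for illustration and do not exist in Flume, and the timestamps are arbitrary.

```java
class CacheDecisionSketch {
    // Hypothetical helper mirroring the three-part condition in getMatchingFiles().
    // Returns true when the directory should be listed again.
    static boolean shouldRescan(boolean cachePatternMatching,
                                long lastSeenParentDirMTime,
                                long currentParentDirMTime,
                                long lastCheckedTime) {
        return !cachePatternMatching                              // caching disabled
            || lastSeenParentDirMTime < currentParentDirMTime     // directory changed since last scan
            || !(currentParentDirMTime < lastCheckedTime);        // change too recent to be sure it was seen
    }

    public static void main(String[] args) {
        // Parent mtime unchanged and older than the last check: cached list is reused.
        System.out.println(shouldRescan(true, 1000L, 1000L, 5000L));  // false
        // Parent mtime moved forward: rescan.
        System.out.println(shouldRescan(true, 1000L, 2000L, 5000L));  // true
        // Caching disabled: always rescan.
        System.out.println(shouldRescan(false, 1000L, 1000L, 5000L)); // true
    }
}
```

Note that the check consults only parentDir.lastModified(). On many filesystems, creating a file in a nested subdirectory does not change the modification time of the top-level directory, so when subdirectories are involved it may be worth testing with the TAILDIR property cachePatternMatching set to false.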
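The stock getMatchingFilesNoCache() lists only the given parent directory. The change is to add logic so that it covers the parent directory and all of its subdirectories, recursively.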
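The limitation is easy to reproduce outside Flume. The sketch below builds a small temporary tree modelled on the scenario and lists it with Files.newDirectoryStream, the same JDK call the stock method uses. The class name FlatListingDemo and the helper listDirectChildren are illustrative only, and the filter here simply excludes directories rather than applying Flume's filename regex.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class FlatListingDemo {
    // Returns the names of the non-directory entries directly under dir (no recursion).
    static List<String> listDirectChildren(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, p -> !Files.isDirectory(p))) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for the test directory from the scenario.
        Path root = Files.createTempDirectory("test");
        Files.createFile(root.resolve("c.log"));
        Path dirA = Files.createDirectory(root.resolve("dir_a"));
        Files.createFile(dirA.resolve("a_01.log"));

        // a_01.log lives one level down, so it is not returned.
        System.out.println(listDirectChildren(root)); // prints [c.log]
    }
}
```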
4. Modify the source
private List<File> getMatchingFilesNoCache() {
  List<File> result = Lists.newArrayList();
  // Start from the parent directory, then add every descendant directory.
  List<File> dirs = Lists.newArrayList();
  dirs.add(parentDir);
  getDirs(parentDir, dirs);
  // Apply the existing file filter to each directory in turn.
  for (File dir : dirs) {
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir.toPath(), fileFilter)) {
      for (Path entry : stream) {
        result.add(entry.toFile());
      }
    } catch (IOException e) {
      logger.error("I/O exception occurred while listing directory. " +
                   "Files already matched will be returned. " + dir.toPath(), e);
    }
  }
  return result;
}

/**
 * Collect all descendant directories of f into dirs.
 */
private void getDirs(File f, List<File> dirs) {
  File[] files = f.listFiles();
  if (files != null) {
    for (File file : files) {
      if (file.isDirectory()) {
        dirs.add(file);
        getDirs(file, dirs);
      }
    }
  }
}
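Before rebuilding the module, the recursive logic can be tried on its own. The following is a minimal standalone sketch of the same idea, not the Flume class itself: it uses java.util.ArrayList instead of Guava's Lists, and it assumes the file filter matches the regex against the file name only and skips directories. RecursiveMatchSketch, matchRecursively and the file deep.log are names invented for the example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RecursiveMatchSketch {
    // Collects every descendant directory of f, depth-first (same idea as getDirs above).
    static void getDirs(File f, List<File> dirs) {
        File[] files = f.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.isDirectory()) {
                    dirs.add(file);
                    getDirs(file, dirs);
                }
            }
        }
    }

    // Lists files under parentDir, at any depth, whose name matches fileNameRegex.
    static List<File> matchRecursively(File parentDir, String fileNameRegex) {
        Pattern pattern = Pattern.compile(fileNameRegex);
        // Assumed stand-in for the matcher's fileFilter: name match, directories excluded.
        DirectoryStream.Filter<Path> filter = p ->
            !Files.isDirectory(p) && pattern.matcher(p.getFileName().toString()).matches();

        List<File> dirs = new ArrayList<>();
        dirs.add(parentDir);
        getDirs(parentDir, dirs);

        List<File> result = new ArrayList<>();
        for (File dir : dirs) {
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir.toPath(), filter)) {
                for (Path entry : stream) {
                    result.add(entry.toFile());
                }
            } catch (IOException e) {
                System.err.println("Could not list " + dir + ": " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Temporary tree modelled on the scenario; deep.log is a hypothetical nested file.
        Path root = Files.createTempDirectory("test");
        Files.createFile(root.resolve("c.log"));
        Files.createDirectory(root.resolve("dir_b"));
        Path dirA1 = Files.createDirectories(root.resolve("dir_a").resolve("dir_a_1"));
        Files.createFile(root.resolve("dir_a").resolve("a_01.log"));
        Files.createFile(dirA1.resolve("deep.log"));

        System.out.println(matchRecursively(root.toFile(), ".*\\.log").size()); // prints 3
    }
}
```

One thing to keep in mind with this style of recursion: File.isDirectory() follows symbolic links, so a link that points back to an ancestor directory would make getDirs recurse without end. If the monitored tree can contain such links, some guard is probably needed.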
5. Build the modified TaildirSource module and upload flume-taildir-source-1.9.0.jar to Flume's lib directory
6. Write a conf file for testing
test.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type=TAILDIR
a1.sources.r1.filegroups=f1
a1.sources.r1.filegroups.f1=/opt/test/.*

a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
7. Start the Agent and test
bin/flume-ng agent -c conf/ -n a1 -f conf/test.conf -Dflume.root.logger=INFO,console