Exploring How TaildirSource Monitors Files

雲Miao

Last modified: 2023-12-17 17:27:57

Category: Big Data. Tags: flume

First published: 2023-12-17 17:25:18

Article link: https://blog.csdn.net/Moonlightxiang/article/details/135047242

TaildirSource Source Code Analysis and Custom Configuration

What TaildirSource Is For

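TaildirSource is a source type in Apache Flume. It watches the files under the specified directories and sends their content as events to a Flume channel.

TaildirSource tails the files that match its configured patterns, picking up both newly created files and content appended to existing ones, and hands the new content to the downstream Flume components such as interceptors, channels and sinks. It is mainly used for log collection and near-real-time data transfer, and it suits streaming data particularly well.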

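Some key features of TaildirSource:

Near-real-time monitoring: TaildirSource polls the monitored directories. When a new file is created or an existing file changes (for example, content is appended), the change is detected on the next poll and the new content is read.
Multiple files: TaildirSource can monitor many files at once, not just a single file. You specify the set of files with file path patterns; the file-name part of each pattern is a regular expression.
Position tracking: TaildirSource tracks the read position of every monitored file. It records each file's offset so that, after a power failure or a restart, it resumes from where it last stopped and reads only the content added since.
Configurable: TaildirSource offers a range of options that you can tune as needed, such as per-group headers, the batch size, the idle timeout and the interval at which positions are written back.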

Versions: the example and everything described below use Flume 1.7 and JDK 8.

TaildirSource Usage Example

Here is a usage example of TaildirSource. Its configuration file looks like this:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# describe/configure s1
a1.sources.s1.type = TAILDIR
a1.sources.s1.channels = c1
# skipToEnd is a property of the source itself, not of its channels
a1.sources.s1.skipToEnd = true
# JSON file that records the inode, the absolute path and the last read position of each tailed file, so that tailing can resume after a restart
a1.sources.s1.positionFile = ./taildir/taildir_position.json
# space-separated list of file groups to tail
a1.sources.s1.filegroups = f1
# define f1; the file-name part of the path is a regular expression
a1.sources.s1.filegroups.f1 = ./test/.*log.*
a1.sources.s1.headers.f1.headerKey1 = value1
# a1.sources.s1.filegroups.f2 = ./test/.*txt.*
# a1.sources.s1.headers.f2.headerKey1 = value2
# a1.sources.s1.headers.f2.headerKey2 = value2-2
# a1.sources.s1.fileHeader = true

# use a channel which buffers events in memory
# type: memory or file; the channel temporarily holds the events until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

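This simple example monitors every file whose name contains "log" in the test directory under the working directory. Each change is printed by the logger sink.

The effect is easy to see. We write a log file into the test directory: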

$ echo 'hello' > test/1.log

# printed on the Flume console
2023-12-16 23:35:50,351 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{headerKey1=value1} body: 68 65 6C 6C 6F                                  hello }
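
The body in the log line above is shown as a hex dump of the event bytes (68 65 6C 6C 6F) followed by the printable text. As a minimal sketch of that correspondence, the class below renders a byte array the same way. BodyHexSketch and toHex are hypothetical names chosen for illustration; this is not the logger sink's own code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not Flume's implementation)
class BodyHexSketch {
    // Render bytes as space-separated, upper-case hex pairs
    static String toHex(byte[] body) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // mask with 0xFF so negative bytes are printed as two hex digits
            sb.append(String.format("%02X", body[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```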

There is a problem, though: if we use mv to rename the file to another log file, Flume sends the same message again, as shown below.

$ mv test/1.log test/2.log

# printed again on the Flume console
2023-12-16 23:40:07,614 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{headerKey1=value1} body: 68 65 6C 6C 6F                                  hello }

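Why does this happen? As the saying goes, it is not enough to know that something is so; we should also know why. To find out, we need to dig into the source code.

TaildirSource Source Code Analysis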

TaildirSource

public class TaildirSource extends AbstractSource implements  
        PollableSource, Configurable {  // Note 1
  // cache of the inodes of the files being monitored
  private List<Long> existingInodes = new CopyOnWriteArrayList<Long>();
  // Note 2
  @Override
    public synchronized void configure(Context context) {
        logger.info("{} TaildirSource source starting custom configuration", getName());
        String fileGroups = context.getString(FILE_GROUPS);
        Preconditions.checkState(fileGroups != null, "Missing param: " + FILE_GROUPS);

        filePaths = selectByKeys(context.getSubProperties(FILE_GROUPS_PREFIX),
                fileGroups.split("\\s+"));
        Preconditions.checkState(!filePaths.isEmpty(),
                "Mapping for tailing files is empty or invalid: '" + FILE_GROUPS_PREFIX + "'");

        String homePath = System.getProperty("user.home").replace('\\', '/');
        positionFilePath = context.getString(POSITION_FILE, homePath + DEFAULT_POSITION_FILE);
        Path positionFile = Paths.get(positionFilePath);
        try {
            Files.createDirectories(positionFile.getParent());
        } catch (IOException e) {
            throw new FlumeException("Error creating positionFile parent directories", e);
        }
        headerTable = getTable(context, HEADERS_PREFIX);
        batchSize = context.getInteger(BATCH_SIZE, DEFAULT_BATCH_SIZE);
        skipToEnd = context.getBoolean(SKIP_TO_END, DEFAULT_SKIP_TO_END);
        byteOffsetHeader = context.getBoolean(BYTE_OFFSET_HEADER, DEFAULT_BYTE_OFFSET_HEADER);
        idleTimeout = context.getInteger(IDLE_TIMEOUT, DEFAULT_IDLE_TIMEOUT);
        writePosInterval = context.getInteger(WRITE_POS_INTERVAL, DEFAULT_WRITE_POS_INTERVAL);
        cachePatternMatching = context.getBoolean(CACHE_PATTERN_MATCHING,
                DEFAULT_CACHE_PATTERN_MATCHING);

        backoffSleepIncrement = context.getLong(PollableSourceConstants.BACKOFF_SLEEP_INCREMENT,
                PollableSourceConstants.DEFAULT_BACKOFF_SLEEP_INCREMENT);
        maxBackOffSleepInterval = context.getLong(PollableSourceConstants.MAX_BACKOFF_SLEEP,
                PollableSourceConstants.DEFAULT_MAX_BACKOFF_SLEEP);
        fileHeader = context.getBoolean(FILENAME_HEADER,
                DEFAULT_FILE_HEADER);
        fileHeaderKey = context.getString(FILENAME_HEADER_KEY,
                DEFAULT_FILENAME_HEADER_KEY);
        // recursion switch; defaults to false so a missing key does not yield null
        openLevel = context.getBoolean("openLevel", false);

        if (sourceCounter == null) {
            sourceCounter = new SourceCounter(getName());
        }
    }
  
  @Override
    public synchronized void start() {
        logger.info("{} TaildirSource source starting with directory: {}", getName(), filePaths);
        try {
            // Note 3
            reader = new ReliableTaildirEventReader.Builder()
                    .filePaths(filePaths)
                    .headerTable(headerTable)
                    .positionFilePath(positionFilePath)
                    .skipToEnd(skipToEnd)
                    .addByteOffset(byteOffsetHeader)
                    .cachePatternMatching(cachePatternMatching)
                    .annotateFileName(fileHeader)
                    .fileNameHeader(fileHeaderKey)
                    .openLevel(openLevel)
                    .build();
        } catch (IOException e) {
            throw new FlumeException("Error instantiating ReliableTaildirEventReader", e);
        }
        idleFileChecker = Executors.newSingleThreadScheduledExecutor(
                new ThreadFactoryBuilder().setNameFormat("idleFileChecker").build());
        idleFileChecker.scheduleWithFixedDelay(new idleFileCheckerRunnable(),
                idleTimeout, checkIdleInterval, TimeUnit.MILLISECONDS);

        positionWriter = Executors.newSingleThreadScheduledExecutor(
                new ThreadFactoryBuilder().setNameFormat("positionWriter").build());
        positionWriter.scheduleWithFixedDelay(new PositionWriterRunnable(),
                writePosInitDelay, writePosInterval, TimeUnit.MILLISECONDS);

        super.start();
        logger.debug("TaildirSource started");
        sourceCounter.start();
    }
  
  @Override
    public Status process() {
        // status returned to the polling runner
        Status status = Status.READY;
        try {
            // Note 4
            existingInodes.clear();
            existingInodes.addAll(reader.updateTailFiles());
            for (long inode : existingInodes) {
                TailFile tf = reader.getTailFiles().get(inode);
                if (tf.needTail()) {
                    tailFileProcess(tf, true);
                }
            }
            closeTailFiles();
            try {
                TimeUnit.MILLISECONDS.sleep(retryInterval);
            } catch (InterruptedException e) {
                logger.info("Interrupted while sleeping");
            }
        } catch (Throwable t) {
            logger.error("Unable to tail files", t);
            status = Status.BACKOFF;
        }
        return status;
    }
}

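These are the TaildirSource methods we mainly need to look at. Let's go through the notes one by one:

Note 1: TaildirSource implements the PollableSource interface. Under the hood, the PollableSourceRunner class keeps calling this class's process method in a loop.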

// PollingRunner, a nested class of PollableSourceRunner
@Override
public void run() {
  logger.debug("Polling runner starting. Source:{}", source);
  // keep polling until a stop is requested
  while (!shouldStop.get()) {
    counterGroup.incrementAndGet("runner.polls");
    try {
      // call the source's process method
      if (source.process().equals(PollableSource.Status.BACKOFF)) {
        counterGroup.incrementAndGet("runner.backoffs");

        Thread.sleep(Math.min(
          counterGroup.incrementAndGet("runner.backoffs.consecutive")
          * source.getBackOffSleepIncrement(), source.getMaxBackOffSleepInterval()));
      } else {
        counterGroup.set("runner.backoffs.consecutive", 0L);
      }
    }
    // omitted...
  }

  logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
}
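
The sleep in the loop above grows with the number of consecutive BACKOFF results and is capped by the source's maximum back-off interval. A minimal sketch of that arithmetic follows. BackoffSketch and backoffMillis are hypothetical names, and the 1000 ms increment and 5000 ms cap used in main are illustrative values assumed for the demo, not values read from a running agent.

```java
// Hypothetical helper, for illustration only
class BackoffSketch {
    // Sleep time after the n-th consecutive BACKOFF: grows linearly, capped at maxMillis
    static long backoffMillis(long consecutiveBackoffs, long incrementMillis, long maxMillis) {
        return Math.min(consecutiveBackoffs * incrementMillis, maxMillis);
    }

    public static void main(String[] args) {
        // assumed values: 1000 ms increment, 5000 ms cap
        for (long n = 1; n <= 7; n++) {
            System.out.println(n + " consecutive backoffs -> sleep " + backoffMillis(n, 1000L, 5000L) + " ms");
        }
    }
}
```

Because TaildirSource's process returns BACKOFF only from its catch block, this path is taken when tailing throws, while the normal pause between polls comes from the retryInterval sleep inside process itself.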

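Note 2: the first method to be called is configure (when Flume configures the component, not at class-loading time). It parses the parameters in our example.conf and stores them in the fields of the instance.
Note 3: once configure has returned, start is called. Its core is the construction of a ReliableTaildirEventReader object, which we analyze later. It then starts two single-threaded scheduled executors: idleFileChecker, which looks for files that have not been updated within idleTimeout, and positionWriter, which writes the read positions back to the position file (taildir_position.json in our example) so that reading can resume later.

Note 4: existingInodes holds the inodes of all files currently being monitored. The thread that writes back the position file uses the same list.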
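
Both executors in start rely on scheduleWithFixedDelay, which waits the given delay after one run finishes before starting the next. The following is a minimal sketch of that scheduling pattern using only the JDK. FixedDelaySketch and runPeriodically are hypothetical names, and the task here merely counts down a latch instead of checking files or writing positions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of the scheduling pattern, not TaildirSource code
class FixedDelaySketch {
    // Runs a task repeatedly on a single scheduler thread and reports whether
    // it ran at least `runs` times within timeoutMillis
    static boolean runPeriodically(int runs, long initialDelayMillis, long delayMillis, long timeoutMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch latch = new CountDownLatch(runs);
        try {
            // the delay is measured from the end of one run to the start of the next
            executor.scheduleWithFixedDelay(latch::countDown, initialDelayMillis, delayMillis,
                    TimeUnit.MILLISECONDS);
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("ran three times: " + runPeriodically(3, 5L, 5L, 2000L));
    }
}
```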

private class PositionWriterRunnable implements Runnable {
  @Override
  public void run() {
    writePosition();
  }
}

private void writePosition() {
  File file = new File(positionFilePath);
  FileWriter writer = null;
  try {
    writer = new FileWriter(file);
    // write positionFile based on existingInodes
    if (!existingInodes.isEmpty()) {
      String json = toPosInfoJson();
      writer.write(json);
    }
  } catch (Throwable t) {
    logger.error("Failed writing positionFile", t);
  } finally {
    try {
      if (writer != null) writer.close();
    } catch (IOException e) {
      logger.error("Error: " + e.getMessage(), e);
    }
  }
}
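
As the configuration comment earlier says, the position file is JSON that records the inode, the absolute path and the last position of each tailed file. The sketch below builds such a document by hand to show its shape. It assumes the keys are named inode, pos and file; PositionJsonSketch, Entry and toJson are hypothetical names and this is not the toPosInfoJson implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the position-file layout (assumed keys: inode, pos, file)
class PositionJsonSketch {
    static final class Entry {
        final long inode;
        final long pos;
        final String file;

        Entry(long inode, long pos, String file) {
            this.inode = inode;
            this.pos = pos;
            this.file = file;
        }
    }

    // Escape the two characters that would break a JSON string in a typical path
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One JSON object per tailed file, wrapped in an array
    static String toJson(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"inode\":").append(e.inode)
              .append(",\"pos\":").append(e.pos)
              .append(",\"file\":\"").append(escape(e.file)).append("\"}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<Entry>();
        // illustrative values: "hello\n" is 6 bytes, so pos would be 6 after reading it
        entries.add(new Entry(2496272L, 6L, "/tmp/test/1.log"));
        System.out.println(toJson(entries));
    }
}
```

Note that writePosition opens the FileWriter before it checks existingInodes, so an empty list leaves an empty position file rather than the previous content.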

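Every call to process clears existingInodes and then calls reader.updateTailFiles() to fetch the inodes of the files to monitor again (the no-argument overload passes false for skipToEnd). This brings us to the ReliableTaildirEventReader object, of which we only need to care about one method: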

/**
     * Update tailFiles mapping if a new file is created or appends are detected
     * to the existing file.
     */
public List<Long> updateTailFiles(boolean skipToEnd) throws IOException {
  updateTime = System.currentTimeMillis();
  List<Long> updatedInodes = Lists.newArrayList();
  // each TaildirMatcher holds the file-matching rule of one file group from our configuration file
  // a1.sources.s1.filegroups = f1 in our example therefore yields exactly one TaildirMatcher
  for (TaildirMatcher taildir : taildirCache) {
    Map<String, String> headers = headerTable.row(taildir.getFileGroup());

    for (File f : taildir.getMatchingFiles()) {
      long inode = getInode(f);
      // tailFiles holds the files currently being monitored
      TailFile tf = tailFiles.get(inode);
      // Note 5
      if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
        long startPos = skipToEnd ? f.length() : 0;
        tf = openFile(f, headers, inode, startPos);
      } else {
        boolean updated = tf.getLastUpdated() < f.lastModified() || tf.getPos() != f.length();
        if (updated) {
          // reopen the file if its handle has been closed
          if (tf.getRaf() == null) {
            tf = openFile(f, headers, inode, tf.getPos());
          }
          if (f.length() < tf.getPos()) {
            logger.info("Pos " + tf.getPos() + " is larger than file size! "
                        + "Restarting from pos 0, file: " + tf.getPath() + ", inode: " + inode);
            tf.updatePos(tf.getPath(), inode, 0);
          }
        }
        tf.setNeedTail(updated);
      }
      tailFiles.put(inode, tf);
      updatedInodes.add(inode);
    }
  }
  return updatedInodes;
}

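In this code we focus on taildir.getMatchingFiles(), which uses the rule of our file group to find the set of files that need monitoring. Let's step into this method and take a look: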

// constructor
TaildirMatcher(String fileGroup, String filePattern, boolean cachePatternMatching) {
  this.fileGroup = fileGroup;
  this.filePattern = filePattern;
  this.cachePatternMatching = cachePatternMatching;

  File f = new File(filePattern);
  this.parentDir = f.getParentFile();
  String regex = f.getName();
  final PathMatcher matcher = FS.getPathMatcher("regex:" + regex);
  // Note 4-1
  this.fileFilter = new DirectoryStream.Filter<Path>() {
    @Override
    public boolean accept(Path entry) throws IOException {
      return matcher.matches(entry.getFileName()) && !Files.isDirectory(entry);
    }
  };

  Preconditions.checkState(parentDir.exists(),
                           "Directory does not exist: " + parentDir.getAbsolutePath());
}

// get the files to monitor
List<File> getMatchingFiles() {
  long now = TimeUnit.SECONDS.toMillis(
    TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
  // Note 4-2
  long currentParentDirMTime = parentDir.lastModified();
  List<File> result;

  if (!cachePatternMatching ||
      lastSeenParentDirMTime < currentParentDirMTime ||
      !(currentParentDirMTime < lastCheckedTime)) {
    lastMatchedFiles = sortByLastModifiedTime(getMatchingFilesNoCache());
    lastSeenParentDirMTime = currentParentDirMTime;
    lastCheckedTime = now;
  }

  return lastMatchedFiles;
}

// return all matched files as a list
private List<File> getMatchingFilesNoCache() {
  List<File> result = Lists.newArrayList();
  try (DirectoryStream<Path> stream = Files.newDirectoryStream(parentDir.toPath(), fileFilter)) {
    for (Path entry : stream) {
      result.add(entry.toFile());
    }
  } catch (IOException e) {
    logger.error("I/O exception occurred while listing parent directory. " +
                 "Files already matched will be returned. " + parentDir.toPath(), e);
  }
  return result;
}
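
The filter built in the constructor combines a "regex:" PathMatcher on the file name with a directory check. The regex part can be tried in isolation with the JDK alone, as in the minimal sketch below. FileNameRegexSketch and matches are hypothetical names; the pattern .*log.* is the one from the example configuration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper that isolates the regex part of the filter
class FileNameRegexSketch {
    // True if the whole file name matches the regular expression
    static boolean matches(String regex, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + regex);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        String regex = ".*log.*";
        for (String name : new String[] {"1.log", "2.log", "catalog.txt", "notes.txt"}) {
            System.out.println(name + " -> " + matches(regex, name));
        }
    }
}
```

The pattern matches any name that contains "log", so a file such as catalog.txt is picked up too; a stricter pattern would be needed if only the .log extension is wanted.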

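Note 4-1: a filter is used here to drop every file that our filegroups rule does not match, as well as all directories.
Note 4-2: the last-modified time read here is that of the parent directory of the monitored files, i.e. the test directory under the project directory in our example.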
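
The condition in getMatchingFiles decides when the directory is listed again instead of returning the cached result. The sketch below restates that condition as a standalone predicate with the same three clauses. MatchCacheSketch and shouldRescan are hypothetical names, and the timestamps in main are made-up millisecond values.

```java
// Hypothetical restatement of the cache condition, for illustration only
class MatchCacheSketch {
    static boolean shouldRescan(boolean cachePatternMatching, long lastSeenParentDirMTime,
                                long currentParentDirMTime, long lastCheckedTime) {
        return !cachePatternMatching                               // caching disabled: always list the directory
                || lastSeenParentDirMTime < currentParentDirMTime  // directory changed since the last listing
                || !(currentParentDirMTime < lastCheckedTime);     // change may fall in the second of the last check
    }

    public static void main(String[] args) {
        System.out.println(shouldRescan(false, 2000L, 2000L, 3000L)); // caching off
        System.out.println(shouldRescan(true, 1000L, 2000L, 3000L));  // directory modified
        System.out.println(shouldRescan(true, 2000L, 2000L, 3000L));  // unchanged, cache is used
        System.out.println(shouldRescan(true, 2000L, 2000L, 2000L));  // same second, list again to be safe
    }
}
```

The third clause covers the case where the directory's modification time is not strictly older than the last check, so another change within that same second cannot be ruled out.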

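Note 5: back in ReliableTaildirEventReader's updateTailFiles method. After collecting all files to monitor, it checks whether each file's inode is already in the cache of monitored files. If it is, it then checks whether the file's path equals the path recorded in the cache. If the paths differ, the file is handled as a newly monitored file and read from the start. This is the core of the problem we are solving.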
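
The decision described in Note 5 can be modeled with a plain map from inode to the recorded path. This is a minimal sketch of the check only; NewFileCheckSketch and treatedAsNew are hypothetical names, and the inode 42 and the paths are made-up values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the "is this a new file?" check, for illustration only
class NewFileCheckSketch {
    // tracked maps inode -> absolute path recorded when the file was opened
    static boolean treatedAsNew(Map<Long, String> tracked, long inode, String currentPath) {
        String recordedPath = tracked.get(inode);
        // unknown inode, or known inode whose path has changed
        return recordedPath == null || !recordedPath.equals(currentPath);
    }

    public static void main(String[] args) {
        Map<Long, String> tracked = new HashMap<Long, String>();
        tracked.put(42L, "/work/test/1.log");
        System.out.println(treatedAsNew(tracked, 42L, "/work/test/1.log")); // same inode, same path
        System.out.println(treatedAsNew(tracked, 42L, "/work/test/2.log")); // same inode after mv
    }
}
```

The second call returns true, which is why the renamed file is opened again from position 0 and its content is sent a second time.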

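Solution: mv always changes the file path, but as long as the file stays on the same file system its inode does not change. We only need to drop the absolute-path comparison to reach our goal: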

public List<Long> updateTailFiles(boolean skipToEnd) throws IOException {
  updateTime = System.currentTimeMillis();
  List<Long> updatedInodes = Lists.newArrayList();

  for (TaildirMatcher taildir : taildirCache) {
    Map<String, String> headers = headerTable.row(taildir.getFileGroup());

    for (File f : taildir.getMatchingFiles()) {
      long inode = getInode(f);
      TailFile tf = tailFiles.get(inode);
      if (tf == null) {
        long startPos = skipToEnd ? f.length() : 0;
        tf = openFile(f, headers, inode, startPos);
      } else {
        boolean updated = tf.getLastUpdated() < f.lastModified() || tf.getPos() != f.length();
        if (updated) {
          // the absolute path is no longer compared before and after a change; only the inode is checked (see the tf == null test above)
          if (tf.getRaf() == null) {
            tf = openFile(f, headers, inode, tf.getPos());
          }
          if (f.length() < tf.getPos()) {
            logger.info("Pos " + tf.getPos() + " is larger than file size! "
                        + "Restarting from pos 0, file: " + tf.getPath() + ", inode: " + inode);
            tf.updatePos(tf.getPath(), inode, 0);
          }
        }
        tf.setNeedTail(updated);
      }
      tailFiles.put(inode, tf);
      updatedInodes.add(inode);
    }
  }
  return updatedInodes;
}
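
The fix rests on the premise that a rename keeps the inode. That premise can be checked with a small JDK-only sketch, which reads the inode through the "unix:ino" file attribute and therefore assumes a Unix-like file system (it will not work on Windows). InodeRenameSketch and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo, assumes a Unix-like file system that exposes "unix:ino"
class InodeRenameSketch {
    static long inodeOf(Path path) {
        try {
            return (long) Files.getAttribute(path, "unix:ino");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates 1.log, renames it to 2.log in the same directory and
    // reports whether the inode is the same before and after
    static boolean inodeSurvivesRename() {
        try {
            Path dir = Files.createTempDirectory("taildir-sketch");
            Path before = Files.createFile(dir.resolve("1.log"));
            long inodeBefore = inodeOf(before);
            Path after = Files.move(before, dir.resolve("2.log"));
            long inodeAfter = inodeOf(after);
            Files.delete(after);
            Files.delete(dir);
            return inodeBefore == inodeAfter;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("inode unchanged after rename: " + inodeSurvivesRename());
    }
}
```

A move across file systems is a copy followed by a delete, so the inode would change in that case and the file would still be treated as new.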

For convenience, I provide the compiled jar for download. Replace the original jar under the lib/ directory with it and it is ready to use.

Jar location: http://yunmiao-bucket.oss-cn-beijing.aliyuncs.com/jar/flume-taildir-source-1.8.0.jar

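Thoughts on TaildirSource's Lack of Recursive Monitoring

As the analysis shows, the filter in Note 4-1 only keeps the files that match the rule directly under the monitored path. What can we do if we want TaildirSource to monitor files in lower-level directories recursively?

Here is part of the code for reference: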

TaildirMatcher

// openLevel from the configuration file switches recursion on or off; it is off by default
private final boolean openLevel;

TaildirMatcher(String fileGroup, String filePattern, boolean cachePatternMatching, boolean openLevel) {
  this.fileGroup = fileGroup;
  this.filePattern = filePattern;
  this.cachePatternMatching = cachePatternMatching;
  this.openLevel = openLevel;

  File f = new File(filePattern);
  this.parentDir = f.getParentFile();
  String regex = f.getName();
  final PathMatcher matcher = FS.getPathMatcher("regex:" + regex);
  this.fileFilter = new DirectoryStream.Filter<Path>() {
    @Override
    public boolean accept(Path entry) throws IOException {
      // depending on the switch: in recursive mode sub-directories are not filtered out
      return openLevel ? Files.isDirectory(entry) || matcher.matches(entry.getFileName())
        : matcher.matches(entry.getFileName()) && !Files.isDirectory(entry);
    }
  };

  Preconditions.checkState(parentDir.exists(),
                           "Directory does not exist: " + parentDir.getAbsolutePath());
}

List<File> getMatchingFiles() {
  long now = TimeUnit.SECONDS.toMillis(
    TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
  AtomicLong currentParentDirMTime = new AtomicLong(parentDir.lastModified());
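  if (openLevel) {
    // In recursive mode, take the newest modification time of any directory in the tree:
    // a change inside a sub-directory updates that sub-directory's time, not the parent's.
    // The stream returned by Files.walk must be closed (needs java.util.stream.Stream).
    try (Stream<Path> walk = Files.walk(parentDir.toPath())) {
      walk.filter(Files::isDirectory).forEach(path ->
          currentParentDirMTime.set(Math.max(currentParentDirMTime.get(), path.toFile().lastModified())));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }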
  List<File> result;
  if (!cachePatternMatching ||
      lastSeenParentDirMTime < currentParentDirMTime.get() ||
      !(currentParentDirMTime.get() < lastCheckedTime)) {
    lastMatchedFiles = sortByLastModifiedTime(getMatchingFilesNoCache());
    lastSeenParentDirMTime = currentParentDirMTime.get();
    lastCheckedTime = now;
  }

  return lastMatchedFiles;
}

private List<File> getMatchingFilesNoCache() {
  List<File> result = Lists.newArrayList();
  try (DirectoryStream<Path> stream = Files.newDirectoryStream(parentDir.toPath(), fileFilter)) {
    for (Path entry : stream) {
      if (Files.isDirectory(entry) && openLevel) {
        // for a directory, recursively collect the files below it
        addDirectoryFile(result, entry);
      } else {
        result.add(entry.toFile());
      }
    }
  } catch (IOException e) {
    logger.error("I/O exception occurred while listing parent directory. " +
                 "Files already matched will be returned. " + parentDir.toPath(), e);
  }
  return result;
}

private void addDirectoryFile(List<File> result, Path path) {
  // Files.walk() already descends into every nested sub-directory (and also returns
  // the starting directory itself), so one pass is enough. Calling this method again
  // for each sub-directory would add the same file several times.
  // Note: files found here are not checked against the file-name regex.
  try (Stream<Path> walk = Files.walk(path)) {  // needs java.util.stream.Stream
    walk.filter(sonPath -> !Files.isDirectory(sonPath))
        .forEach(sonPath -> result.add(sonPath.toFile()));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}
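
If the file-name rule should also apply inside sub-directories, one possible variant is to walk the tree once and keep only regular files whose names match the regex. The following is a minimal, self-contained sketch of that idea under the assumption that the same "regex:" matcher is reused; RecursiveMatchSketch, matchRecursively and demoCount are hypothetical names and this is not the code shipped in the jar mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical variant: one walk over the tree, regex applied at every level
class RecursiveMatchSketch {
    static List<File> matchRecursively(Path root, String fileNameRegex) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + fileNameRegex);
        // try-with-resources closes the directory handles held by the stream
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> matcher.matches(p.getFileName()))
                       .map(Path::toFile)
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a small temporary tree, counts the matches and cleans up again
    static int demoCount() {
        try {
            Path root = Files.createTempDirectory("taildir-recursive");
            Path deep = Files.createDirectories(root.resolve("sub").resolve("deep"));
            Files.createFile(root.resolve("a.log"));
            Files.createFile(root.resolve("sub").resolve("b.log"));
            Files.createFile(root.resolve("sub").resolve("skip.txt"));
            Files.createFile(deep.resolve("c.log"));
            int count = matchRecursively(root, ".*log.*").size();
            try (Stream<Path> walk = Files.walk(root)) {
                // delete children before their parents
                walk.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("matched files: " + demoCount());
    }
}
```

Each file appears exactly once in the result because the tree is traversed a single time, and skip.txt is left out because its name does not match the pattern.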

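With the code changed, build the jar and put it into the lib directory. Once the openLevel parameter is set in the configuration file, files in the directories below the monitored path are monitored recursively. (This change is already included in the jar provided above.)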

a1.sources.s1.openLevel = true

Source code: https://github.com/moonlight2893267956/–flume1.7.git
