HOW TO MAKE FLUME'S TAILDIR SOURCE RECURSIVELY MONITOR DIRECTORIES

1. Choosing a Flume source

In production, the Taildir source is the usual choice.

# Reference configuration from the official user guide
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000


1.1 The TAILDIR source supports resuming from checkpoints

positionFile records the read offset, so even if the Flume agent dies, after a restart it resumes putting data into the channel from the last checkpoint.

taildir_position.json is flushed periodically; it is a file in JSON format that records the inode, the absolute path and the last position of each tailed file.

[root@hadoop01 flume]# tail -20f taildir_position.json 
[{"inode":103011733,"pos":39600,"file":"/var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-ruozedata001.log.out"}]tail: taildir_position.json: file truncated
[{"inode":103011733,"pos":39600,"file":"/var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-ruozedata001.log.out"}]tail: taildir_position.json: file truncated
[{"inode":103011733,"pos":39600,"file":"/var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-ruozedata001.log.out"}]

1.2 File groups are configurable, each using a regular expression to match multiple files to monitor

a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*

2. A scenario TAILDIR cannot cover

For all its strengths, the Taildir source has one imperfection: it cannot recursively monitor directories.

For example, suppose the conf file contains:
# Describe/configure the dirsource
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/hadoop-hdfs/.*log.*

With this configuration, a log file such as /var/log/hadoop-hdfs/test.log is monitored,
but files under subdirectories, such as
            /var/log/hadoop-hdfs/f1/test.log
            /var/log/hadoop-hdfs/f1/f2/test.log
            ......
            are not.
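
The root cause is that the stock matcher lists only the direct children of the parent directory. A minimal standalone sketch (hypothetical path; the glob "*log*" loosely stands in for the configured regex) shows the behavior of Files.newDirectoryStream, the NIO call the matcher uses internally:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirectoryStreamDemo {
  public static void main(String[] args) throws IOException {
    // newDirectoryStream returns only the direct children of the directory;
    // it never descends into f1/ or f1/f2/, which is exactly the limitation.
    try (DirectoryStream<Path> stream =
             Files.newDirectoryStream(Paths.get("/var/log/hadoop-hdfs"), "*log*")) {
      for (Path entry : stream) {
        System.out.println(entry); // prints /var/log/hadoop-hdfs/test.log only
      }
    }
  }
}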

3. Modify the source code so Flume can recursively monitor directories

So the only option is to modify the source code. Note that the change works with both the Apache and CDH distributions; I use apache-flume-1.7.0-bin here, but compiling the CDH source works just as well.

3.1 Layout of the Flume Taildir source code

   In the module flume-taildir-source, under
   org.apache.flume.source.taildir:
      -- ReliableTaildirEventReader  monitors the files and tracks each file's position
      -- TaildirMatcher  identifies and caches the files matched by a single file pattern for the TAILDIR source
      -- TaildirSource  the core class; its four key methods run in the order configure -> start -> process -> stop
      -- TaildirSourceConfigurationConstants  the constants class for TaildirSource configuration
      -- TailFile  the TailFile object (constructor plus getters and setters), used by the other classes


3.2 Walking through the TaildirSource core class

   TaildirSource is the core class; its lifecycle runs configure -> start -> process -> stop
   public class TaildirSource extends AbstractSource implements
PollableSource, Configurable {
        
      //2. start builds the ReliableTaildirEventReader that monitors files and tracks their positions
      @Override
      public synchronized void start() {
      
      }
      
      //4.stop
      @Override
      public synchronized void stop() {
      
      }
      
      //1. the configuration is loaded first
      @Override
      public synchronized void configure(Context context) {
      
      }
      
      //3. [the key spot for our change] the main processing loop
      @Override
      public Status process() {
        
       }

}


 TaildirSource implements PollableSource; as the PollableSource javadoc below shows,
 an external driver polls the source periodically to pull file data. Like Spark Streaming, this is micro-batching (pseudo real-time), not true streaming the way Flink is.
 
 /**
 * A {@link Source} that requires an external driver to poll to determine
 * whether there are {@linkplain Event events} that are available to ingest
 * from the source.
 *
 * @see org.apache.flume.source.EventDrivenSourceRunner
 */
public interface PollableSource extends Source {


}
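
For intuition, the driver loop looks conceptually like the sketch below (a simplified illustration; in Flume the real logic lives in org.apache.flume.source.PollableSourceRunner, and this is not its actual code):

// Simplified sketch of the external polling driver; illustrative only.
void pollLoop(PollableSource source) throws Exception {
  while (!Thread.currentThread().isInterrupted()) {
    PollableSource.Status status = source.process(); // pull a batch from the source
    if (status == PollableSource.Status.BACKOFF) {
      Thread.sleep(1000); // nothing to ingest; back off before polling again
    }
  }
}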

3.2.1 The start method uses the Builder pattern to construct a ReliableTaildirEventReader

# The Builder pattern explained
https://www.runoob.com/design-pattern/builder-pattern.html
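
As a refresher, here is a minimal standalone sketch of the pattern (a hypothetical ReaderConfig class, not Flume code): a builder accumulates optional settings through chained calls, and build() produces the finished object, which is exactly the shape of the Flume code below.

// Minimal Builder pattern sketch; ReaderConfig is hypothetical, not part of Flume.
class ReaderConfig {
  private final String positionFilePath;
  private final boolean recursive;

  private ReaderConfig(Builder b) {
    this.positionFilePath = b.positionFilePath;
    this.recursive = b.recursive;
  }

  static class Builder {
    private String positionFilePath;
    private boolean recursive;

    Builder positionFilePath(String p) { this.positionFilePath = p; return this; }
    Builder recursive(boolean r) { this.recursive = r; return this; }
    ReaderConfig build() { return new ReaderConfig(this); }
  }
}

// Usage mirrors the start() method below:
// ReaderConfig cfg = new ReaderConfig.Builder()
//     .positionFilePath("/var/log/flume/taildir_position.json")
//     .recursive(true)
//     .build();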

@Override
public synchronized void start() {
    logger.info("{} TaildirSource source starting with directory: {}", getName(), filePaths);
    try {
        reader = new ReliableTaildirEventReader.Builder()
                .filePaths(filePaths)
                .headerTable(headerTable)
                .positionFilePath(positionFilePath)
                .skipToEnd(skipToEnd)
                .addByteOffset(byteOffsetHeader)
                .cachePatternMatching(cachePatternMatching)
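                // note: this .recursive(...) call assumes a matching recursive(boolean)
                // method has also been added to ReliableTaildirEventReader.Builder as
                // part of this change; the stock 1.7.0 Builder does not have it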
                .recursive(isRecursive)
                .annotateFileName(fileHeader)
                .fileNameHeader(fileHeaderKey)
                .build();
    } catch (IOException e) {
        throw new FlumeException("Error instantiating ReliableTaildirEventReader", e);
    }
    idleFileChecker = Executors.newSingleThreadScheduledExecutor(
            new ThreadFactoryBuilder().setNameFormat("idleFileChecker").build());
    idleFileChecker.scheduleWithFixedDelay(new idleFileCheckerRunnable(),
            idleTimeout, checkIdleInterval, TimeUnit.MILLISECONDS);

    positionWriter = Executors.newSingleThreadScheduledExecutor(
            new ThreadFactoryBuilder().setNameFormat("positionWriter").build());
    positionWriter.scheduleWithFixedDelay(new PositionWriterRunnable(),
            writePosInitDelay, writePosInterval, TimeUnit.MILLISECONDS);

    super.start();
    logger.debug("TaildirSource started");
    sourceCounter.start();
}

3.2.2 [Key spot for the change] the process method

# Follow the call chain below to find the method that needs modification
the process method of org.apache.flume.source.taildir.TaildirSource
   -- reader.updateTailFiles()
        -- updateTailFiles(false)  collects the files to monitor
            -- taildir.getMatchingFiles()  fetches the files matching the configured pattern
              --  sortByLastModifiedTime(getMatchingFilesNoCache())  uses the last modified time to decide whether a file needs tailing
                 -- getMatchingFilesNoCache()     the method that needs modification
              
              
@Override
  public Status process() {
    Status status = Status.READY;
    try {
      existingInodes.clear();
      //step into updateTailFiles()
      existingInodes.addAll(reader.updateTailFiles());
      for (long inode : existingInodes) {
        TailFile tf = reader.getTailFiles().get(inode);
        if (tf.needTail()) {
          tailFileProcess(tf, true);
        }
      }
      closeTailFiles();
      try {
        TimeUnit.MILLISECONDS.sleep(retryInterval);
      } catch (InterruptedException e) {
        logger.info("Interrupted while sleeping");
      }
    } catch (Throwable t) {
      logger.error("Unable to tail files", t);
      status = Status.BACKOFF;
    }
    return status;
  }

 ==>
 From reader.updateTailFiles() we get the files to monitor; for each one, the last modified time determines whether it needs tailing, and if so the file is tailed.
 Now step into reader.updateTailFiles():
 public List<Long> updateTailFiles() throws IOException {
    return updateTailFiles(false);
 }
 
 ==>
    /**
   * Update tailFiles mapping if a new file is created or appends are detected
   * to the existing file.
   */
  public List<Long> updateTailFiles(boolean skipToEnd) throws IOException {
    updateTime = System.currentTimeMillis();
    List<Long> updatedInodes = Lists.newArrayList();
    
    for (TaildirMatcher taildir : taildirCache) {
      Map<String, String> headers = headerTable.row(taildir.getFileGroup());

       //taildir.getMatchingFiles() fetches the matched files; step into it
      for (File f : taildir.getMatchingFiles()) {
        long inode = getInode(f);
        TailFile tf = tailFiles.get(inode);
        if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
          long startPos = skipToEnd ? f.length() : 0;
          tf = openFile(f, headers, inode, startPos);
        } else {
          boolean updated = tf.getLastUpdated() < f.lastModified();
          if (updated) {
            if (tf.getRaf() == null) {
              tf = openFile(f, headers, inode, tf.getPos());
            }
            if (f.length() < tf.getPos()) {
              logger.info("Pos " + tf.getPos() + " is larger than file size! "
                  + "Restarting from pos 0, file: " + tf.getPath() + ", inode: " + inode);
              tf.updatePos(tf.getPath(), inode, 0);
            }
          }
          tf.setNeedTail(updated);
        }
        tailFiles.put(inode, tf);
        updatedInodes.add(inode);
      }
    }
    return updatedInodes;
  }
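
A side note on getInode(f): TAILDIR keys its tailFiles map by inode rather than by path, which is why positions survive renames and log rotation. The lookup is roughly the sketch below (an approximation; the real method is in ReliableTaildirEventReader, and the "unix:ino" attribute is only available on POSIX filesystems):

  private long getInode(File file) throws IOException {
    // reads the inode number via NIO; this is what lets TAILDIR keep
    // tracking a file after it is renamed during log rotation
    return (long) Files.getAttribute(file.toPath(), "unix:ino");
  }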
 
 ==>
     Iterate over the matcher built from each configured regular expression; each matcher fetches its own matching files via
     getMatchingFiles()
 
     List<File> getMatchingFiles() {
    long now = TimeUnit.SECONDS.toMillis(
        TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
    long currentParentDirMTime = parentDir.lastModified();
    List<File> result;

    // calculate matched files if
    // - we don't want to use cache (recalculate every time) OR
    // - directory was clearly updated after the last check OR
    // - last mtime change wasn't already checked for sure
    //   (system clock hasn't passed that second yet)
    if (!cachePatternMatching ||
        lastSeenParentDirMTime < currentParentDirMTime ||
        !(currentParentDirMTime < lastCheckedTime)) {
      //the last modified time determines whether a file needs tailing
      lastMatchedFiles = sortByLastModifiedTime(getMatchingFilesNoCache());
      lastSeenParentDirMTime = currentParentDirMTime;
      lastCheckedTime = now;
    }

    return lastMatchedFiles;
  }
  
  ==> We have found the method that needs modifying:
  private List<File> getMatchingFilesNoCache() {
    List<File> result = Lists.newArrayList();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(parentDir.toPath(), fileFilter)) {
      for (Path entry : stream) {
        result.add(entry.toFile());
      }
    } catch (IOException e) {
      logger.error("I/O exception occurred while listing parent directory. " +
                   "Files already matched will be returned. " + parentDir.toPath(), e);
    }
    return result;
  }

3.3 Modifying the source

3.3.1 Overload the getMatchingFilesNoCache method

Add a getMatchingFilesNoCache(boolean recursion) method to TaildirMatcher in package org.apache.flume.source.taildir:

  /**
   * Overload of getMatchingFilesNoCache.
   * @param recursion  whether to traverse subdirectories recursively
   * @return List of files matching the pattern, unsorted
   */
  private List<File> getMatchingFilesNoCache(boolean recursion) {
    System.out.println("====execution override method================");
    if (!recursion) {
      //if recursion is disabled, fall back to the original method
      return getMatchingFilesNoCache();
    }
    List<File> result = Lists.newArrayList();
    //traverse the files iteratively rather than recursively (turn the tree into a queue);
    //a recursive traversal would also work but may be less efficient, see the references.
    //An unbounded LinkedList is used rather than a bounded ArrayBlockingQueue, whose
    //add() would throw once more directories are pending than its capacity allows.
    Queue<File> dirs = new LinkedList<>();
    //enqueue parentDir
    dirs.offer(parentDir);

    //process every directory in the queue
    while (dirs.size() > 0) {
      //dequeue the next directory
      File dir = dirs.poll();
      //try-with-resources so each DirectoryStream is closed after use
      try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir.toPath(), fileFilter)) {
        for (Path entry : stream) {
          //add the matched file to the result list
          result.add(entry.toFile());
        }
      } catch (IOException e) {
        logger.error("I/O exception occurred while listing parent directory. " +
                "Files already matched will be returned. (recursion)" + parentDir.toPath(), e);
      }
      File[] dirList = dir.listFiles();
      if (dirList != null) {
        for (File file : dirList) {
          if (file.isDirectory()) {
            //directories go back on the queue for further traversal
            dirs.add(file);
          }
        }
      }
    }
    return result;
  }
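
Before rebuilding Flume, the queue-based traversal can be sanity-checked in isolation with a small standalone program (hypothetical, not part of the Flume test suite): create a nested temp directory and confirm that files at every depth are found.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical standalone check of the queue-based traversal above.
public class RecursiveWalkCheck {
  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("taildir-test");
    Files.createDirectories(root.resolve("f1/f2"));
    Files.createFile(root.resolve("a.log"));
    Files.createFile(root.resolve("f1/b.log"));
    Files.createFile(root.resolve("f1/f2/c.log"));

    Queue<File> dirs = new ArrayDeque<>();
    dirs.offer(root.toFile());
    while (!dirs.isEmpty()) {
      File dir = dirs.poll();
      File[] children = dir.listFiles();
      if (children == null) continue;
      for (File f : children) {
        if (f.isDirectory()) {
          dirs.offer(f); // descend into subdirectories later
        } else {
          System.out.println(f); // expect a.log, f1/b.log and f1/f2/c.log
        }
      }
    }
  }
}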

3.3.2 Load recursive from the configuration

(1) Add the RECURSIVE and DEFAULT_RECURSIVE constants to the TaildirSourceConfigurationConstants configuration class:
 package org.apache.flume.source.taildir;

 public class TaildirSourceConfigurationConstants {
    
  /** Whether to support recursion. */
  public static final String RECURSIVE = "recursive";
  public static final boolean DEFAULT_RECURSIVE = false;
  
  ......

}

(2) Add the isRecursive field to TaildirSource and initialize it:

public class TaildirSource extends AbstractSource implements
PollableSource, Configurable {
 
 //the new field (static, so TaildirMatcher can read it without a reference to the source)
 public static boolean isRecursive;

 ......
 
 @Override
 public synchronized void configure(Context context) {

 //initialize it from the configuration
 isRecursive = context.getBoolean(RECURSIVE, DEFAULT_RECURSIVE);
 ......
}


......


}


(3) Call the overloaded getMatchingFilesNoCache(TaildirSource.isRecursive) so the configuration takes effect:
 
 package org.apache.flume.source.taildir;
 public class TaildirMatcher {
 
  List<File> getMatchingFiles() {
    long now = TimeUnit.SECONDS.toMillis(
        TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
    long currentParentDirMTime = parentDir.lastModified();
    List<File> result;

    // calculate matched files if
    // - we don't want to use cache (recalculate every time) OR
    // - directory was clearly updated after the last check OR
    // - last mtime change wasn't already checked for sure
    //   (system clock hasn't passed that second yet)
    if (!cachePatternMatching ||
        lastSeenParentDirMTime < currentParentDirMTime ||
        !(currentParentDirMTime < lastCheckedTime)) {
      //call the overloaded getMatchingFilesNoCache(TaildirSource.isRecursive)
      lastMatchedFiles = sortByLastModifiedTime(getMatchingFilesNoCache(TaildirSource.isRecursive));
      lastSeenParentDirMTime = currentParentDirMTime;
      lastCheckedTime = now;
    }

    return lastMatchedFiles;
  }

}

4. Build and test

4.1 Packaging

   mvn clean package -DskipTests
   
   Find the built jar under flume-1.7.0\flume-ng-sources\flume-taildir-source\target
   
  and replace the original flume-taildir-source***.jar under Flume's lib directory with it.

4.2 Flume agent configuration

vim dirsource_memory_kafka.properties

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the dirsource
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/hadoop-hdfs/.*log.*
# enable recursive directory monitoring
a1.sources.r1.recursive = true


# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = PREWARNING
a1.sinks.k1.kafka.bootstrap.servers = hadoop01:9092,hadoop02:9092,hadoop03:9092
a1.sinks.k1.kafka.flumeBatchSize = 6000
a1.sinks.k1.kafka.producer.acks = all
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.keep-alive = 90
a1.channels.c1.capacity = 2000000
a1.channels.c1.transactionCapacity = 6000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

4.3 Wrap it in a startup script

   vim start_dirsource_memory_kafka.sh
   
   #!/bin/bash

    nohup  /flume/flume/bin/flume-ng agent \
    -c /flume/flume/conf \
    -f /flume/flume/conf/dirsource_memory_kafka.properties \
    -n a1 -Dflume.root.logger=INFO,console &

4.4 Run the script and verify

   (1) Check the startup log: the overloaded method's debug line appears, confirming the new code path is running


   (2) Generate some log entries
   
   [root@hadoop01 log]# echo "666" >> ./hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-hadoop01.log.out 
   
   [root@hadoop01 log]# echo  "777" >> hadoop-hdfs/test/test.log
   
   
   (3) The results show up in the Kafka consumer:
   kafka-console-consumer.sh --topic 'PREWARNING' --bootstrap-server  hadoop01:9092,hadoop02:9092,hadoop03:9092  --skip-message-on-error
   666
   777

References

In production: modifying the Flume source so the taildir source supports recursion (configurable)
https://mp.weixin.qq.com/s?__biz=MzA5ODY0NzgxNA==&mid=2247486109&idx=1&sn=67de70c93d2d7659dc7182ff20116325&chksm=908f20f4a7f8a9e2bf4ee8d54fb67cec33f53ed064044c0f7043d461cc9495da56ee3c96be06&mpshare=1&scene=1&srcid=&sharer_sharetime=1590537356240&sharer_shareid=de261d3248f7a54980859a381c24c116&key=b045cd156f7027c5ddd185d0bcc106fb6b7e18c5b379f75d9b75b9a1ff125f4eb6a4091f160cd9901b85827b2e8406ce20926ce3161441add48e4d700dca806082fd8f7bbfb0bfc612e4eb54e273b673&ascene=1&uin=MTQxODcwNzI2MA%3D%3D&devicetype=Windows+10&version=62080079&lang=zh_CN&exportkey=Abzk3oDMK0AFVaNEMYgbFGw%3D&pass_ticket=dpT1nm%2FwrkXBvC36UjY6Dz2FG3R1uOC%2Bi76nouZiPqP1ZoKA9BME%2B15kKqrmIDCu

Flume TaildirSource recursion implementation
https://blog.csdn.net/qq_38976805/article/details/93117865