Learning Flume's HDFS Sink

Preface: the HDFS sink's built-in timestamp-parsing code does not perform well; it can be sped up by modifying the source. For details see: http://www.cnblogs.com/lxf20061900/p/4014281.html

Common HDFS sink configuration properties:

  • type (default: none): The component type name; must be hdfs
  • hdfs.path (default: none): HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
  • hdfs.filePrefix (default: FlumeData): Name prefixed to files created by Flume in the HDFS directory
  • hdfs.fileSuffix (default: none): Suffix to append to the file (e.g. .avro; note that the period is not added automatically)
  • hdfs.inUsePrefix (default: none): Prefix used for temporary files that Flume is actively writing into
  • hdfs.inUseSuffix (default: .tmp): Suffix used for temporary files that Flume is actively writing into
  • hdfs.rollInterval (default: 30): Number of seconds to wait before rolling the current file (0 = never roll based on a time interval)
  • hdfs.rollSize (default: 1024): File size that triggers a roll, in bytes (0 = never roll based on file size)
  • hdfs.rollCount (default: 10): Number of events written to the file before it is rolled (0 = never roll based on the number of events)
  • hdfs.idleTimeout (default: 0): Timeout after which inactive files are closed (0 = disable automatic closing of idle files)
  • hdfs.batchSize (default: 100): Number of events written to the file before it is flushed to HDFS
  • hdfs.fileType (default: SequenceFile): File format; currently SequenceFile, DataStream or CompressedStream. (1) DataStream does not compress the output file; do not set hdfs.codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec
  • hdfs.maxOpenFiles (default: 5000): Allow only this number of open files; if the number is exceeded, the oldest file is closed
  • hdfs.callTimeout (default: 10000): Number of milliseconds allowed for HDFS operations such as open, write, flush, close; increase this if many HDFS timeouts are occurring
  • hdfs.threadsPoolSize (default: 10): Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
  • hdfs.round (default: false): Whether the timestamp should be rounded down (if true, affects all time-based escape sequences except %t)
  • hdfs.roundValue (default: 1): Rounded down to the highest multiple of this (in the unit configured with hdfs.roundUnit) that is less than the current time
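
To make these properties concrete, here is a minimal sketch of an HDFS sink definition in an agent configuration. The agent, sink and channel names (a1, k1, c1) are placeholders chosen for illustration, not values from the original post; the path is the sample path from the table above.

# Minimal HDFS sink sketch; a1, k1, c1 are placeholder names
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
a1.sinks.k1.hdfs.filePrefix = FlumeData
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.batchSize = 100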

In practice the HDFS sink is almost always used so that it rolls over to new files as it writes. The strategies the HDFS sink uses to roll files are:

  • time based
  • file size based
  • based on the number of HDFS block replicas (usually to be avoided; see section 3)
  • based on the number of events
  • based on file idle time

1. Time-based strategy

Configuration property: hdfs.rollInterval

Default: 30 seconds

Note: a value of 0 disables this strategy.

How it works: whenever org.apache.flume.sink.hdfs.BucketWriter.append opens a file, the open method is called. If hdfs.rollInterval is set, then as long as no other strategy has closed the file first, the file is closed after hdfs.rollInterval seconds.

if (rollInterval > 0) {

  Callable<Void> action = new Callable<Void>() {
    public Void call() throws Exception {
      LOG.debug("Rolling file ({}): Roll scheduled after {} sec elapsed.",
          bucketPath, rollInterval);
      try {
        // Roll the file and remove reference from sfWriters map.
        close(true);
      } catch (Throwable t) {
        LOG.error("Unexpected error", t);
      }
      return null;
    }
  };
  timedRollFuture = timedRollerPool.schedule(action, rollInterval,
      TimeUnit.SECONDS);

}
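
As a sketch of relying only on time-based rolling, the size- and count-based triggers can be disabled. The agent and sink names (a1, k1) and the one-hour interval are illustrative assumptions, not values from the original post:

# roll a new file every hour; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0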

2. File-size and event-count strategies

Configuration properties:

file size: hdfs.rollSize

event count: hdfs.rollCount

Defaults:

file size: 1024 bytes

event count: 10

Note: a value of 0 disables the corresponding strategy.

How it works: both strategies are evaluated in org.apache.flume.sink.hdfs.BucketWriter.shouldRotate; whenever doRotate ends up true, the current file is closed and the sink rolls over to the next file.

private boolean shouldRotate() {
  boolean doRotate = false;

  if (writer.isUnderReplicated()) {
    this.isUnderReplicated = true;
    doRotate = true;
  } else {
    this.isUnderReplicated = false;
  }

  if ((rollCount > 0) && (rollCount <= eventCounter)) {
    LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
    doRotate = true;
  }

  if ((rollSize > 0) && (rollSize <= processSize)) {
    LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
    doRotate = true;
  }

  return doRotate;
}

Note: if the time-based strategy and the size-based strategy are both configured, the time condition is evaluated first; the other conditions are only checked if the time has not yet elapsed.
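
For example, a sketch that rolls on file size or event count while leaving time-based rolling off; a1, k1 and the concrete thresholds (roughly 128 MB and 100000 events) are illustrative assumptions:

# roll on size or count, whichever is hit first; no time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 100000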

3. Strategy based on the number of HDFS block replicas

Configuration property: hdfs.minBlockReplicas

Default: the same as the HDFS replication factor

How it works: as the code above shows, the key method for checking the replica count is writer.isUnderReplicated(), namely:

public boolean isUnderReplicated() {
  try {
    int numBlocks = getNumCurrentReplicas();
    if (numBlocks == -1) {
      return false;
    }
    int desiredBlocks;
    if (configuredMinReplicas != null) {
      desiredBlocks = configuredMinReplicas;
    } else {
      desiredBlocks = getFsDesiredReplication();
    }
    return numBlocks < desiredBlocks;
  } catch (IllegalAccessException e) {
    logger.error("Unexpected error while checking replication factor", e);
  } catch (InvocationTargetException e) {
    logger.error("Unexpected error while checking replication factor", e);
  } catch (IllegalArgumentException e) {
    logger.error("Unexpected error while checking replication factor", e);
  }
  return false;
}

In other words, if the file currently being written has fewer replicas than hdfs.minBlockReplicas, this method returns true; in every other case it returns false. Suppose it returns true and look at what happens next.

First, the shouldRotate method shown above is guaranteed to return true. Following the call further, the code below is the key part:

if (shouldRotate()) {

  boolean doRotate = true;

  if (isUnderReplicated) {
    if (maxConsecUnderReplRotations > 0 &&
        consecutiveUnderReplRotateCount >= maxConsecUnderReplRotations) {
      doRotate = false;
      if (consecutiveUnderReplRotateCount == maxConsecUnderReplRotations) {
        LOG.error("Hit max consecutive under-replication rotations ({}); " +
            "will not continue rolling files under this path due to " +
            "under-replication", maxConsecUnderReplRotations);
      }
    } else {
      LOG.warn("Block Under-replication detected. Rotating file.");
    }
    consecutiveUnderReplRotateCount++;
  } else {
    consecutiveUnderReplRotateCount = 0;
  }

  if (doRotate) {
    close();
    open();
  }
}

Here maxConsecUnderReplRotations is hard-coded to 30. That is, after the file has been rolled 30 consecutive times because of under-replication, it stops being rolled for that reason, since doRotate is set to false. So when isUnderReplicated returns true, the files may not roll the way you expect. The way to avoid this is to set hdfs.minBlockReplicas to 1: the HDFS replication factor is always at least 1, so isUnderReplicated will then always return false. In general you want to rule this strategy out so that it does not interfere with normal file rolling.
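
A sketch of that workaround (a1 and k1 are placeholder agent and sink names):

# prevent under-replication from triggering rolls
a1.sinks.k1.hdfs.minBlockReplicas = 1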

4. Strategy based on file idle time

Configuration property: hdfs.idleTimeout

Default: 0

Note: with the default value of 0 this feature is disabled.

This strategy is simple: if the file has been idle for hdfs.idleTimeout seconds, i.e. no data has been written to it, the current file is closed and the sink rolls over to the next file.

public synchronized void flush() throws IOException, InterruptedException {
  checkAndThrowInterruptedException();
  if (!isBatchComplete()) {
    doFlush();

    if (idleTimeout > 0) {
      // if the future exists and couldn't be cancelled, that would mean it has already run
      // or been cancelled
      if (idleFuture == null || idleFuture.cancel(false)) {
        Callable<Void> idleAction = new Callable<Void>() {
          public Void call() throws Exception {
            LOG.info("Closing idle bucketWriter {} at {}", bucketPath,
                     System.currentTimeMillis());
            if (isOpen) {
              close(true);
            }
            return null;
          }
        };
        idleFuture = timedRollerPool.schedule(idleAction, idleTimeout,
            TimeUnit.SECONDS);
      }
    }
  }
}
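
A sketch for turning this on, e.g. closing a file after ten minutes without writes; a1, k1 and the 600-second value are illustrative assumptions:

# close files that receive no data for 10 minutes
a1.sinks.k1.hdfs.idleTimeout = 600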

The material on file rolling above draws on https://www.jianshu.com/p/4f43780c82e9
