Analysis of Flume's HDFSEventSink.java and the File Rolling Problem

Flume: an analysis of HDFSEventSink.java

Overview
HDFSEventSink is one of the most frequently used and most important sinks in Flume. The classes related to this sink all live in the org.apache.flume.sink.hdfs package.

HDFSEventSink walkthrough

Before HDFSEventSink can do any work, the caller has to initialize it and configure a number of important parameters; this happens mainly through the configure(Context context) method, shown below:

public void configure(Context context) {
    this.context = context;
    // HDFS directory path, required (e.g. hdfs://namenode/flume/webdata/)
    filePath = Preconditions.checkNotNull(
        context.getString("hdfs.path"), "hdfs.path is required");
    // prefix of the file names generated in the HDFS directory
    fileName = context.getString("hdfs.filePrefix", defaultFileName);
    // file suffix, e.g. .avro; usually not used
    this.suffix = context.getString("hdfs.fileSuffix", defaultSuffix);
    // prefix and suffix that mark a file as currently being written
    inUsePrefix = context.getString("hdfs.inUsePrefix", defaultInUsePrefix);
    inUseSuffix = context.getString("hdfs.inUseSuffix", defaultInUseSuffix); // default is .tmp
    String tzName = context.getString("hdfs.timeZone");
    timeZone = tzName == null ? null : TimeZone.getTimeZone(tzName);
    // rolling interval of the file currently being written; by default a new file is started every 30 seconds; 0 disables time-based rolling
    rollInterval = context.getLong("hdfs.rollInterval", defaultRollInterval);
    // roll the file when it reaches this size, in bytes; 0 disables size-based rolling
    rollSize = context.getLong("hdfs.rollSize", defaultRollSize);
    // roll the file after this many events have been written; 0 disables count-based rolling
    rollCount = context.getLong("hdfs.rollCount", defaultRollCount);
    // number of events to accumulate before flushing to HDFS
    batchSize = context.getLong("hdfs.batchSize", defaultBatchSize);
    // how long an idle file may stay open, in seconds
    idleTimeout = context.getInteger("hdfs.idleTimeout", 0);
    // compression codec, one of: gzip, bzip2, lzo, snappy
    String codecName = context.getString("hdfs.codeC");
    // file format: SequenceFile, DataStream or CompressedStream.
    // (1) DataStream does not compress the output file and must not be combined with codeC; (2) CompressedStream requires hdfs.codeC to be set to an available codec
    fileType = context.getString("hdfs.fileType", defaultFileType);
    // maximum number of open files; if it is exceeded, the oldest file is closed
    maxOpenFiles = context.getInteger("hdfs.maxOpenFiles", defaultMaxOpenFiles);
    // milliseconds allowed for HDFS operations such as open, write, flush and close; increase this if many HDFS operations are timing out
    callTimeout = context.getLong("hdfs.callTimeout", defaultCallTimeout);
    // number of threads per HDFS sink for HDFS I/O operations such as open and write
    threadsPoolSize = context.getInteger("hdfs.threadsPoolSize",
        defaultThreadPoolSize);
    // number of threads per HDFS sink for scheduling timed file rolls
    rollTimerPoolSize = context.getInteger("hdfs.rollTimerPoolSize",
        defaultRollTimerPoolSize);
    // Kerberos user principal for accessing secure HDFS
    kerbConfPrincipal = context.getString("hdfs.kerberosPrincipal", "");
    // Kerberos keytab for accessing secure HDFS
    kerbKeytab = context.getString("hdfs.kerberosKeytab", "");
    proxyUserName = context.getString("hdfs.proxyUser", "");  // proxy user

    Preconditions.checkArgument(batchSize > 0,
        "batchSize must be greater than 0");
    if (codecName == null) {  // no compression
      codeC = null;
      compType = CompressionType.NONE;
    } else {    // compress the data
      codeC = getCodec(codecName);
      // TODO : set proper compression type
      compType = CompressionType.BLOCK;
    }

    // Do not allow user to set fileType DataStream with codeC together
    // To prevent output file with compress extension (like .snappy)
    if(fileType.equalsIgnoreCase(HDFSWriterFactory.DataStreamType) // if fileType is DataStream, compression is not allowed
        && codecName != null) {
      throw new IllegalArgumentException("fileType: " + fileType +
          " which does NOT support compressed output. Please don't set codeC" +
          " or change the fileType if compressed output is desired.");
    }

    if(fileType.equalsIgnoreCase(HDFSWriterFactory.CompStreamType)) { // if fileType is a compressed type, codeC must not be null
      Preconditions.checkNotNull(codeC, "It's essential to set compress codec"
          + " when fileType is: " + fileType);
    }

    if (!authenticate()) {  // authenticate against secure HDFS
      LOG.error("Failed to authenticate!");
    }
    // whether timestamps should be rounded down (if true, affects all time-based escape sequences except %t)
    needRounding = context.getBoolean("hdfs.round", false);

    if(needRounding) {
      // The unit of the round down value - second, minute or hour.
      String unit = context.getString("hdfs.roundUnit", "second");  // rounding unit
      if (unit.equalsIgnoreCase("hour")) {
        this.roundUnit = Calendar.HOUR_OF_DAY;
      } else if (unit.equalsIgnoreCase("minute")) {
        this.roundUnit = Calendar.MINUTE;
      } else if (unit.equalsIgnoreCase("second")){
        this.roundUnit = Calendar.SECOND;
      } else {
        LOG.warn("Rounding unit is not valid, please set one of" +
            "minute, hour, or second. Rounding will be disabled");
        needRounding = false;
      }
      // Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time.
      this.roundValue = context.getInteger("hdfs.roundValue", 1);  // rounding value
      if(roundUnit == Calendar.SECOND || roundUnit == Calendar.MINUTE){ // check the value is valid for minutes/seconds: 0 < v <= 60
        Preconditions.checkArgument(roundValue > 0 && roundValue <= 60,
            "Round value" +
            "must be > 0 and <= 60");
      } else if (roundUnit == Calendar.HOUR_OF_DAY){
        Preconditions.checkArgument(roundValue > 0 && roundValue <= 24,  // check the value is valid for hours: 0 < v <= 24
            "Round value" +
            "must be > 0 and <= 24");
      }
    }

    if (sinkCounter == null) { // construct the counter
      sinkCounter = new SinkCounter(getName());
    }
  }

The most important of these parameters are (a minimal example configuration follows the list):

  • rollInterval: roll the file at a fixed time interval;
  • rollSize: roll the file once it reaches a given size;
  • rollCount: roll the file after a given number of events (lines);
  • fileType: one of SequenceFile (binary), DataStream (uncompressed), or CompressedStream (compressed).
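
To make these concrete, here is a minimal, illustrative sink configuration exercising the parameters above; the agent, sink and channel names (agent, k1, c1) and every value are assumed examples rather than anything from the original post:

    # illustrative values only; tune rollInterval/rollSize/rollCount for your workload
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/%Y%m%d
    agent.sinks.k1.hdfs.filePrefix = events
    agent.sinks.k1.hdfs.fileType = DataStream
    agent.sinks.k1.hdfs.rollInterval = 30
    agent.sinks.k1.hdfs.rollSize = 134217728
    agent.sinks.k1.hdfs.rollCount = 0
    agent.sinks.k1.hdfs.batchSize = 1000
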
Next comes the start() method.

HDFSEventSink pulls data out of the channel (an active pull) and writes it into HDFS. When it starts, it creates two thread pools, callTimeoutPool and timedRollerPool. callTimeoutPool runs the tasks that touch HDFS (open/close/append/flush/rename), invoked through callWithTimeout so that a timeout can be enforced; timedRollerPool runs the scheduled tasks that roll files. The code is as follows:

    public void start() {
        String timeoutName = "hdfs-" + getName() + "-call-runner-%d";
        callTimeoutPool = Executors.newFixedThreadPool(threadsPoolSize,
                new ThreadFactoryBuilder().setNameFormat(timeoutName).build());
    
        String rollerName = "hdfs-" + getName() + "-roll-timer-%d";
        timedRollerPool = Executors.newScheduledThreadPool(rollTimerPoolSize,
                new ThreadFactoryBuilder().setNameFormat(rollerName).build());
    
        this.sfWriters = new WriterLinkedHashMap(maxOpenFiles);
        sinkCounter.start();
        super.start();
      }

Next comes the core of the sink: the process() method.

Events travel from the channel to the HDFS sink through the sink's process() method (driven by the PollingRunner thread in SinkRunner via a SinkProcessor implementation). Each call to process() runs inside one transaction, which provides atomicity. process() calls the channel's take() method to pull events out of the channel; the maximum number of events per transaction is set by hdfs.batchSize, which defaults to 100. For each event, the following happens:

1. Build lookupPath, the complete path and name of the target file.

2. Look up or create a BucketWriter and an HDFSWriter for it. The HDFSWriter, chosen by hdfs.fileType, does the actual writing; the BucketWriter can be thought of as a wrapper around the HDFS file and the way it is written. Each lookupPath maps to one BucketWriter, and the mapping is kept in sfWriters, which stores the file-to-BucketWriter associations and is initialized in start():

this.sfWriters = new WriterLinkedHashMap( maxOpenFiles);

Its capacity is hdfs.maxOpenFiles (default 5000), the maximum number of files that may be open at the same time; a simplified sketch of this LRU-cache idea follows.
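
A minimal sketch, using only the JDK, of the idea behind WriterLinkedHashMap: an access-ordered LinkedHashMap that evicts the least recently used entry once more than maxOpenFiles are cached. The class name and generic value type here are stand-ins; the real inner class stores BucketWriter values and closes the evicted writer:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Simplified LRU cache in the spirit of HDFSEventSink's WriterLinkedHashMap.
    class OpenFileCache<V> extends LinkedHashMap<String, V> {
      private final int maxOpenFiles;

      OpenFileCache(int maxOpenFiles) {
        super(16, 0.75f, true); // accessOrder = true -> the least recently used entry is the eldest
        this.maxOpenFiles = maxOpenFiles;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // evict once the cache holds more than maxOpenFiles entries;
        // the real implementation also closes the evicted BucketWriter here
        return size() > maxOpenFiles;
      }
    }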

3. Call the BucketWriter's append method to write the data.

4. Once the number of events handled reaches hdfs.batchSize, call flush() on every BucketWriter used in the batch and commit the transaction.

5. If an exception occurs, roll back the transaction.

6. Finally, close the transaction.

One more note on the process() method:

process() returns a Status object (BACKOFF or READY) describing the state of the sink. This can be used to judge whether the sink is healthy; for example, the failover SinkProcessor relies on it to decide whether a sink can still serve traffic. A minimal sketch of how a polling loop might use this status follows.
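
A minimal sketch, not Flume's actual SinkRunner/PollingRunner code, of how a polling loop could react to this status; the types, the fixed back-off interval and the error handling are simplified assumptions:

    // Hypothetical polling loop: keep calling process() and sleep when it reports BACKOFF.
    enum Status { READY, BACKOFF }

    interface SimpleSink {
      Status process() throws Exception;
    }

    class SimplePollingRunner implements Runnable {
      private static final long BACKOFF_SLEEP_MS = 1000L; // assumed fixed back-off
      private final SimpleSink sink;

      SimplePollingRunner(SimpleSink sink) { this.sink = sink; }

      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            if (sink.process() == Status.BACKOFF) {
              Thread.sleep(BACKOFF_SLEEP_MS); // no work or a transient error: back off briefly
            }
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // stop the loop on interruption
          } catch (Exception e) {
            // a real runner would log here and back off as well
          }
        }
      }
    }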

The code for the steps above is as follows:

public Status process() throws EventDeliveryException {
    Channel channel = getChannel(); // get the channel attached to this sink
    Transaction transaction = channel.getTransaction(); // obtain a transaction from the channel
    List<BucketWriter> writers = Lists.newArrayList(); // BucketWriters touched in this transaction
    transaction.begin(); // begin the transaction
    try {
      int txnEventCount = 0;
      for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
        Event event = channel.take(); // take an event from the channel
        if (event == null) {
          break;
        }

        // reconstruct the path name by substituting place holders
        String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
            timeZone, needRounding, roundUnit, roundValue, useLocalTime); // HDFS directory after substitution
        String realName = BucketPath.escapeString(fileName, event.getHeaders(),
          timeZone, needRounding, roundUnit, roundValue, useLocalTime); // file name after substitution

        String lookupPath = realPath + DIRECTORY_DELIMITER + realName; // absolute HDFS path of the file to write
        BucketWriter bucketWriter;
        HDFSWriter hdfsWriter = null;
        // Callback to remove the reference to the bucket writer from the
        // sfWriters map so that all buffers used by the HDFS file
        // handles are garbage collected.
        WriterCallback closeCallback = new WriterCallback() {
          @Override
          public void run(String bucketPath) {
            LOG.info("Writer callback called.");
            synchronized (sfWritersLock) {
              sfWriters.remove(bucketPath);
            }
          }
        };
        synchronized (sfWritersLock) {
          bucketWriter = sfWriters.get(lookupPath);  // look up the BucketWriter for this file
          // we haven't seen this file yet, so open it and cache the handle
          if (bucketWriter == null) {
            hdfsWriter = writerFactory.getWriter(fileType); // get the HDFSWriter object
            bucketWriter = initializeBucketWriter(realPath, realName,
              lookupPath, hdfsWriter, closeCallback); // build a BucketWriter around the HDFSWriter
            sfWriters.put(lookupPath, bucketWriter); // cache the path -> BucketWriter mapping in sfWriters
          }
        }

        // track the buckets getting written in this transaction
        if (!writers.contains(bucketWriter)) {
          writers.add(bucketWriter);
        }

        // Write the data to HDFS
        try {
          bucketWriter.append(event); // write the event into the file behind this bucketWriter
        } catch (BucketClosedException ex) {
          LOG.info("Bucket was closed while trying to append, " +
            "reinitializing bucket and writing event.");
          hdfsWriter = writerFactory.getWriter(fileType);
          bucketWriter = initializeBucketWriter(realPath, realName,
            lookupPath, hdfsWriter, closeCallback);
          synchronized (sfWritersLock) {
            sfWriters.put(lookupPath, bucketWriter);
          }
          bucketWriter.append(event); // write the event into the file behind this bucketWriter
        }
      }

      if (txnEventCount == 0) { // this transaction processed no events
        sinkCounter.incrementBatchEmptyCount();
      } else if (txnEventCount == batchSize) { // a full batch of batchSize events was processed
        sinkCounter.incrementBatchCompleteCount();
      } else { // the channel held fewer than batchSize events
        sinkCounter.incrementBatchUnderflowCount();
      }

      // flush all pending buckets before committing the transaction
      for (BucketWriter bucketWriter : writers) { // flush every BucketWriter's data to HDFS
        bucketWriter.flush();
      }

      transaction.commit(); // commit the transaction

      if (txnEventCount < 1) {
        return Status.BACKOFF;
      } else {
        sinkCounter.addToEventDrainSuccessCount(txnEventCount);
        return Status.READY;
      }
    } catch (IOException eIO) {
      transaction.rollback();
      LOG.warn("HDFS IO error", eIO);
      return Status.BACKOFF;
    } catch (Throwable th) {
      transaction.rollback();
      LOG.error("process failed", th);
      if (th instanceof Error) {
        throw (Error) th;
      } else {
        throw new EventDeliveryException(th);
      }
    } finally {
      transaction.close();
    }
  }

During the append in step 3:

A. checkAndThrowInterruptedException() first checks whether the current thread has been interrupted.

B. The first time a BucketWriter runs, isOpen == false (the file is not open for writing yet), so open() is called. fullFileName is composed of a prefix, a dot and a timestamp-based counter, which means that part of the name cannot be changed; in other words, the HDFS file name cannot be fully customized unless you write your own HDFS sink. A custom suffix and compression are also mutually exclusive: if no compression is configured, a custom suffix (such as .avro) can be appended to fullFileName, otherwise only the codec's extension is appended. bucketPath is the full name of the file currently being written in HDFS and carries the in-use markers (inUsePrefix, inUseSuffix); targetPath is the full name the file will be renamed to when writing finishes, i.e. bucketPath without inUsePrefix and inUseSuffix (a small sketch of this path composition follows). The writer is then opened according to the compression settings: writer.open(bucketPath) without compression, writer.open(bucketPath, codeC, compType) with compression. Note that when Kerberos is used, Hadoop's RPC operations, including getFileSystem(), are not thread-safe, and open() may only be executed by one thread in a JVM at a time because it can otherwise deadlock; that is why open() is synchronized.
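
A minimal sketch, with assumed example values, of how bucketPath (the in-use name) and targetPath (the name used after the close-time rename) are put together; it mirrors the open() snippet quoted below:

    // Illustrative path composition; all values here are assumptions.
    class BucketPathSketch {
      public static void main(String[] args) {
        String filePath = "hdfs://namenode/flume/webdata"; // hdfs.path (assumed example)
        String fileName = "FlumeData";                     // hdfs.filePrefix
        String fileSuffix = "";                            // hdfs.fileSuffix (ignored when a codec is set)
        String codecExtension = ".snappy";                 // e.g. codeC.getDefaultExtension()
        String inUsePrefix = "";                           // hdfs.inUsePrefix
        String inUseSuffix = ".tmp";                       // hdfs.inUseSuffix
        long counter = System.currentTimeMillis();         // stands in for fileExtensionCounter

        String fullFileName = fileName + "." + counter;
        if (fileSuffix != null && !fileSuffix.isEmpty()) {
          fullFileName += fileSuffix;        // custom suffix only when no codec is configured
        } else if (codecExtension != null) {
          fullFileName += codecExtension;    // otherwise the codec's default extension is appended
        }

        String bucketPath = filePath + "/" + inUsePrefix + fullFileName + inUseSuffix; // file being written
        String targetPath = filePath + "/" + fullFileName;                             // name after the rename

        System.out.println(bucketPath); // e.g. hdfs://namenode/flume/webdata/FlumeData.1700000000000.snappy.tmp
        System.out.println(targetPath); // e.g. hdfs://namenode/flume/webdata/FlumeData.1700000000000.snappy
      }
    }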

  • Back in BucketWriter: if rollInterval (time-based rolling) is not 0, a Callable is created and stored in timedRollFuture; it closes the file rollInterval seconds later (by default a new file every 30 seconds), which is only one of the four conditions controlling file rolling. Finally, isOpen = true marks the file as open and ready to be written. The code for the above:

    synchronized (staticLock) {
      checkAndThrowInterruptedException();

        try {
          long counter = fileExtensionCounter.incrementAndGet();
    
          String fullFileName = fileName + "." + counter;
    
          if (fileSuffix != null && fileSuffix.length() > 0) {
            fullFileName += fileSuffix;
          } else if (codeC != null) {
            fullFileName += codeC.getDefaultExtension();
          }
    
          bucketPath = filePath + "/" + inUsePrefix
            + fullFileName + inUseSuffix;
          targetPath = filePath + "/" + fullFileName;
    
          LOG.info("Creating " + bucketPath);
          callWithTimeout(new CallRunner<Void>() {
            @Override
            public Void call() throws Exception {
              if (codeC == null) {
                // Need to get reference to FS using above config before underlying
                // writer does in order to avoid shutdown hook &
                // IllegalStateExceptions
                if(!mockFsInjected) {
                  fileSystem = new Path(bucketPath).getFileSystem(
                    config);
                }
                writer.open(bucketPath);    // call HDFSWriter.open to open bucketPath
              } else {
                // need to get reference to FS before writer does to
                // avoid shutdown hook
                if(!mockFsInjected) {
                  fileSystem = new Path(bucketPath).getFileSystem(
                    config);
                }
                writer.open(bucketPath, codeC, compType);
              }
              return null;
            }
          });
        } catch (Exception ex) {
          sinkCounter.incrementConnectionFailedCount();
          if (ex instanceof IOException) {
            throw (IOException) ex;
          } else {
            throw Throwables.propagate(ex);
          }
        }
      }
    

      isClosedMethod = getRefIsClosed();
      sinkCounter.incrementConnectionCreatedCount();
      resetCounters();

      // if time-based rolling is enabled, schedule the roll
      if (rollInterval > 0) {
        Callable<Void> action = new Callable<Void>() {
          public Void call() throws Exception {
            LOG.debug("Rolling file ({}): Roll scheduled after {} sec elapsed.",
                bucketPath, rollInterval);
            try {
              // Roll the file and remove reference from sfWriters map.
              close(true);
            } catch(Throwable t) {
              LOG.error("Unexpected error", t);
            }
            return null;
          }
        };
        timedRollFuture = timedRollerPool.schedule(action, rollInterval,
            TimeUnit.SECONDS);
      }
    
      isOpen = true;
    }
    

C. Back in step 3 above, shouldRotate() checks whether the HDFS file is under-replicated and whether the number of events written or the size of the file has reached the configured limits; if any one of these is true, the file can be closed. These are three of the four conditions that control file rolling. close() closes the file, cancels the two scheduled futures (timed roll and idle timeout) along with some other cleanup, and renames the file (dropping the .tmp in-use suffix); this rename is also the rolling step that matters after the process dies. After that, open() runs again as described in B above. The code for these steps:

    private boolean shouldRotate() {
      boolean doRotate = false;

      if (writer.isUnderReplicated()) {
        this.isUnderReplicated = true;
        doRotate = true;
      } else {
        this.isUnderReplicated = false;
      }
    
      if ((rollCount > 0) && (rollCount <= eventCounter)) {
        LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
        doRotate = true;
      }
    
      if ((rollSize > 0) && (rollSize <= processSize)) {
        LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
        doRotate = true;
      }
    
      return doRotate;
    }
    

    if (doRotate) { close(); open(); }

    public synchronized void close(boolean callCloseCallback)
        throws IOException, InterruptedException {
      checkAndThrowInterruptedException();
      try {
        flush();
      } catch (IOException e) {
        LOG.warn("pre-close flush failed", e);
      }
      boolean failedToClose = false;
      LOG.info("Closing {}", bucketPath);
      CallRunner<Void> closeCallRunner = createCloseCallRunner();
      if (isOpen) {
        try {
          callWithTimeout(closeCallRunner);
          sinkCounter.incrementConnectionClosedCount();
        } catch (IOException e) {
          LOG.warn("failed to close() HDFSWriter for file (" + bucketPath +
              "). Exception follows.", e);
          sinkCounter.incrementConnectionFailedCount();
          failedToClose = true;
        }
        isOpen = false;
      } else {
        LOG.info("HDFSWriter is already closed: {}", bucketPath);
      }

      // NOTE: timed rolls go through this codepath as well as other roll types
      if (timedRollFuture != null && !timedRollFuture.isDone()) {
        timedRollFuture.cancel(false); // do not cancel myself if running!
        timedRollFuture = null;
      }
    
      if (idleFuture != null && !idleFuture.isDone()) {
        idleFuture.cancel(false); // do not cancel myself if running!
        idleFuture = null;
      }
    
      if (bucketPath != null && fileSystem != null) {
        // could block or throw IOException
        try {
          renameBucket(bucketPath, targetPath, fileSystem);
        } catch(Exception e) {
          LOG.warn(
            "failed to rename() file (" + bucketPath +
            "). Exception follows.", e);
          sinkCounter.incrementConnectionFailedCount();
          final Callable<Void> scheduledRename =
                  createScheduledRenameCallable();
          timedRollerPool.schedule(scheduledRename, retryInterval,
                  TimeUnit.SECONDS);
        }
      }
      if (callCloseCallback) {
        runCloseAction();
        closed = true;
      }
    }
    
    private void renameBucket(String bucketPath,
      String targetPath, final FileSystem fs) throws IOException,
      InterruptedException {
      if(bucketPath.equals(targetPath)) {
        return;
      }
    
      final Path srcPath = new Path(bucketPath);
      final Path dstPath = new Path(targetPath);
    
      callWithTimeout(new CallRunner<Void>() {
        @Override
        public Void call() throws Exception {
          if (fs.exists(srcPath)) { // could block
            LOG.info("Renaming " + srcPath + " to " + dstPath);
            renameTries.incrementAndGet();
            fs.rename(srcPath, dstPath); // could block
          }
          return null;
        }
      });
    }
    

Note: writer.append(event) is the place where data is actually written to HDFS. Several cases must be distinguished here, because there are three kinds of writer.

  When writer is HDFSSequenceFile: append(event) first turns the event into a key and a value via serializer.serialize(e).

  (1) When serializer is HDFSWritableSerializer, the key is event.getHeaders().get("timestamp"); if there is no "timestamp" header, the current system time System.currentTimeMillis() is used. The time is wrapped in a LongWritable. The value wraps event.getBody() in a BytesWritable; the code is bytesObject.set(e.getBody(), 0, e.getBody().length).

  (2) When serializer is HDFSTextSerializer, the key is built the same way as for HDFSWritableSerializer; the value wraps event.getBody() in a Text, via textObject.set(e.getBody(), 0, e.getBody().length).

  writer.append(event) then writes the key and value with writer.append(record.getKey(), record.getValue()); a simplified sketch of this key/value construction follows.
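
A minimal sketch of that key/value construction, assuming hadoop-common on the classpath; the class name, the Map-based header stand-in and the helper methods are illustrative rather than Flume's actual serializer API:

    import java.util.Map;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;

    // Key = the "timestamp" header (or the current time), value = the event body.
    class SequenceFileRecordSketch {

      static LongWritable makeKey(Map<String, String> headers) {
        String ts = headers.get("timestamp");
        long millis = (ts != null) ? Long.parseLong(ts) : System.currentTimeMillis();
        return new LongWritable(millis); // wrap the timestamp as the record key
      }

      static BytesWritable makeValue(byte[] body) {
        BytesWritable value = new BytesWritable();
        value.set(body, 0, body.length); // mirrors bytesObject.set(e.getBody(), 0, e.getBody().length)
        return value;
      }
    }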

  When writer is HDFSDataStream: append(event) calls serializer.write(e) directly.

  (1) When serializer is BodyTextEventSerializer, its write(e) writes e.getBody() to the output stream and, depending on configuration, appends a "\n".

  (2) When serializer is HeaderAndBodyTextEventSerializer, its write(e) writes e.getHeaders() + " " (note the space) followed by e.getBody() to the output stream and, depending on configuration, appends a "\n".

  (3) When serializer is FlumeEventAvroEventSerializer, its write(e) writes the whole event to dataFileWriter.

  When writer is HDFSCompressedDataStream: append(event) first checks isFinished, i.e. whether the previous compression stage has been finished; if so, it updates the state of the compressed output stream and sets isFinished = false. The rest of the logic is the same as HDFSDataStream.append(event).

  E. Next comes some bookkeeping: processSize tracks the size of the file, eventCounter counts the events (lines) in the file, and batchCounter counts the events processed since the last flush:

    processSize += event.getBody().length;
    eventCounter++;
    batchCounter++;

  F. If the number of events processed reaches batchSize, the data is flushed to HDFS with flush(). flush() first calls writer.sync() to write the data to HDFS and then resets batchCounter to show that this batch is finished and the next one can begin. The sync behavior again depends on the writer type:

    if (batchCounter == batchSize) {
      flush();
    }

  When writer is HDFSSequenceFile: sync() calls SequenceFile.Writer.syncFs() to write the data to HDFS.

  When writer is HDFSDataStream: sync() flushes the serializer (where a flush is implemented) and then flushes and syncs the underlying output stream.

  When writer is HDFSCompressedDataStream: sync() first calls serializer.flush(); only FlumeEventAvroEventSerializer actually implements flush() (it calls dataFileWriter.flush()), while BodyTextEventSerializer and HeaderAndBodyTextEventSerializer do not implement it. It then calls outStream.flush() and outStream.sync() to push the data to HDFS.

  If idleTimeout > 0, it acts as an idle timeout for a file: once it expires the file is considered stale and must be closed (the default of 0 disables idle closing). A Callable named idleAction is built whose work is to call close(), set idleClosed = true to record that this BucketWriter was closed because it went idle, and call onIdleCallback.run(onIdleCallbackPath), which removes the BucketWriter for onIdleCallbackPath from HDFSEventSink.sfWriters, marking the file as finished. This idleAction is then scheduled on timedRollerPool to run after idleTimeout seconds; a minimal sketch of this scheduling pattern follows.
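
A minimal sketch, using only the JDK, of the idle-close scheduling pattern described above; the class name, the Runnable callback and the single-threaded pool are simplified stand-ins for BucketWriter's idleFuture/idleAction wiring:

    import java.util.concurrent.Callable;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    // After each append, (re)schedule a close that fires once the writer has been idle long enough.
    class IdleCloseSketch {
      private final ScheduledExecutorService timedRollerPool =
          Executors.newScheduledThreadPool(1);
      private final long idleTimeoutSeconds;
      private ScheduledFuture<Void> idleFuture;

      IdleCloseSketch(long idleTimeoutSeconds) {
        this.idleTimeoutSeconds = idleTimeoutSeconds;
      }

      synchronized void onAppend(Runnable closeAndDeregister) {
        if (idleTimeoutSeconds <= 0) {
          return; // idle closing disabled, as with hdfs.idleTimeout = 0
        }
        if (idleFuture != null && !idleFuture.isDone()) {
          idleFuture.cancel(false); // activity seen: push the idle deadline back
        }
        Callable<Void> idleAction = () -> {
          closeAndDeregister.run(); // close the writer and drop it from the sink's writer map
          return null;
        };
        idleFuture = timedRollerPool.schedule(idleAction, idleTimeoutSeconds, TimeUnit.SECONDS);
      }
    }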

  8. Back in HDFSEventSink.process(), the counters are updated according to the number of events handled in this transaction.

  9. The writers list is traversed and each BucketWriter is flushed to HDFS.

  10. transaction.commit(); // commit the transaction

  11. transaction.rollback(); // roll back on exception

  12. transaction.close(); // close the transaction

------------------------- The content above is my own summary; what follows is adapted from the blog posts listed in the references ----------------------------

Lastly, the stop() method.

stop() first iterates over sfWriters and closes each BucketWriter. In BucketWriter.close(), if isOpen == true the file is still open, so writer.close() is called (HDFSSequenceFile simply calls writer.close(); the other two first flush and run beforeClose, both often no-ops, and then flush, sync and close the output stream). BucketWriter.close() then cancels the two scheduled futures and performs the rename and other cleanup. After that, HDFSEventSink.stop() shuts down the two thread pools and clears its state, e.g. sfWriters.clear().

public void stop() {
    // do not constrain close() calls with a timeout
    synchronized (sfWritersLock) {
      for (Entry<String, BucketWriter> entry : sfWriters.entrySet()) {
        LOG.info("Closing {}", entry.getKey());

        try {
          entry.getValue().close();
        } catch (Exception ex) {
          LOG.warn("Exception while closing " + entry.getKey() + ". " +
                  "Exception follows.", ex);
          if (ex instanceof InterruptedException) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }

    // shut down all our thread pools
    ExecutorService toShutdown[] = {callTimeoutPool, timedRollerPool};
    for (ExecutorService execService : toShutdown) {
      execService.shutdown();
      try {
        while (execService.isTerminated() == false) {
          execService.awaitTermination(
                  Math.max(defaultCallTimeout, callTimeout), TimeUnit.MILLISECONDS);
        }
      } catch (InterruptedException ex) {
        LOG.warn("shutdown interrupted on " + execService, ex);
      }
    }

    callTimeoutPool = null;
    timedRollerPool = null;

    synchronized (sfWritersLock) {
      sfWriters.clear();
      sfWriters = null;
    }
    sinkCounter.stop();
    super.stop();
  }

References:

1. http://www.aboutyun.com/thread-8454-1-1.html

2. http://caiguangguang.blog.51cto.com/1652935/1617764

3. http://www.mamicode.com/info-detail-504303.html

4. http://www.aboutyun.com/thread-21422-1-1.html

Related questions

  1. What the hdfs.minBlockReplicas parameter does in Flume
  2. Why Flume frequently produces many small files, and how to address it

Reposted from: https://my.oschina.net/112612/blog/1554740
