flume-kafka source configuration
To sync data from Kafka, a Flume source needs the following settings:
- type: the source class, e.g. org.apache.flume.source.kafka.KafkaSource
- channels: the name(s) of the downstream channel(s) the source feeds
- kafka.topics: the Kafka topic(s) to consume
- kafka.consumer.group.id: the consumer group id used to consume those topics
- others: any Kafka consumer property can be passed through with the kafka.consumer. prefix, e.g. kafka.consumer.auto.offset.reset (demonstrated after the example below)
Below is an example Kafka source configuration:
tier1.sources = fc_source
tier1.sources.fc_source.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.fc_source.channels = fc_channel
tier1.sources.fc_source.kafka.bootstrap.servers = localhost:9092
tier1.sources.fc_source.kafka.topics = roomserverOutput
# batchSize is the maximum number of events written to the channel in one batch
tier1.sources.fc_source.batchSize = 500
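# batchDurationMillis is the maximum time (in ms) to wait before writing a batch to the channel; a batch is written when either limit is reached first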
tier1.sources.fc_source.batchDurationMillis = 2000
tier1.sources.fc_source.kafka.consumer.group.id = custom.server_event1.id
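Any other Kafka consumer setting can be passed through with the kafka.consumer. prefix. For instance, to start from the earliest offset when the group has no committed offset yet (the value here is an illustrative choice, not part of the original setup):
tier1.sources.fc_source.kafka.consumer.auto.offset.reset = earliest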
flume-channel configuration
There are many channel types to choose from, e.g. memory, file, kafka, and jdbc. This article uses the disk-backed file channel. Below is an example flume-file-channel configuration:
tier1.channels.fc_channel.type = file
tier1.channels.fc_channel.checkpointDir = /mnt/data/flume/server_log/checkpoint
tier1.channels.fc_channel.dataDirs = /mnt/data/flume/server_log/data
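The file channel's queue sizing can also be tuned. A minimal sketch with the stock defaults written out explicitly (values are illustrative; transactionCapacity must be at least the source's batchSize):
tier1.channels.fc_channel.capacity = 1000000
tier1.channels.fc_channel.transactionCapacity = 10000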
A disk-backed channel can guarantee loss-free collection. Why? After the source pulls data, events are first persisted to a local staging file. Once a threshold is reached, such as the source's batchSize cap or the batch commit time, a commit moves the data from that local file into the channel's storage directory. If the commit fails, or the local persistence itself fails, the source does not commit the Kafka offsets, so re-consumption resumes from the most recent unconsumed offset. Let's look at two pieces of code: the source committing offsets, and the channel accepting data from the source. First, the source's offset-commit code:
// Excerpt from KafkaSource.java
// Events pulled from Kafka are first buffered in eventList
if (eventList.size() > 0) {
  counter.addToKafkaEventGetTimer((System.nanoTime() - nanoBatchStartTime) / (1000 * 1000));
  counter.addToEventReceivedCount((long) eventList.size());
  // Write the batch into the channel
  getChannelProcessor().processEventBatch(eventList);
  counter.addToEventAcceptedCount(eventList.size());
  if (log.isDebugEnabled()) {
    log.debug("Wrote {} events to channel", eventList.size());
  }
  eventList.clear();
  // Only after the channel write succeeds do we prepare to commit the consumed offsets
  if (!tpAndOffsetMetadata.isEmpty()) {
    long commitStartTime = System.nanoTime();
    // Synchronously commit the current offsets back to Kafka
    consumer.commitSync(tpAndOffsetMetadata);
    long commitEndTime = System.nanoTime();
    counter.addToKafkaCommitTimer((commitEndTime - commitStartTime) / (1000 * 1000));
    tpAndOffsetMetadata.clear();
  }
  return Status.READY;
}
return Status.BACKOFF;
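Note the ordering: processEventBatch runs before commitSync, so offsets reach Kafka only after the events are safely in the channel. The same write-then-commit pattern in a standalone consumer, as a minimal sketch (the topic, group id, and the writeToChannel stand-in are hypothetical, for illustration only):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterWrite {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "custom.server_event1.id");
    // Disable auto-commit so offsets are committed only after a successful write,
    // mirroring KafkaSource's commitSync(tpAndOffsetMetadata)
    props.put("enable.auto.commit", "false");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("roomserverOutput"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
        for (ConsumerRecord<String, String> record : records) {
          writeToChannel(record.value()); // on failure this throws and nothing is committed
        }
        // Offsets are committed only after every record in the batch was written;
        // after a crash, consumption resumes from the last committed offset
        consumer.commitSync();
      }
    }
  }

  private static void writeToChannel(String event) {
    System.out.println(event); // hypothetical downstream write
  }
}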
Below is the channel code that accepts events from the source:
// Excerpt from FileChannel (the put side of a transaction)
@Override
protected void doPut(Event event) throws InterruptedException {
  ... // some code omitted
  boolean success = false;
  log.lockShared();
  try {
    FlumeEventPointer ptr = log.put(transactionID, event);
    Preconditions.checkState(putList.offer(ptr), "putList offer failed "
        + channelNameDescriptor);
    // Persist the event pointer to a local staging file first
    queue.addWithoutCommit(ptr, transactionID);
    success = true;
  } catch (IOException e) {
    channelCounter.incrementEventPutErrorCount();
    throw new ChannelException("Put failed due to IO error "
        + channelNameDescriptor, e);
  } finally {
    log.unlockShared();
    if (!success) {
      // release slot obtained in the case
      // the put fails for any reason
      queueRemaining.release();
    }
  }
}

// Persist locally without committing; inflightPuts is backed by a local file
synchronized void addWithoutCommit(FlumeEventPointer e, long transactionID) {
  inflightPuts.addEvent(transactionID, e.toLong());
}
// Invoked once the commit conditions are met
protected void doCommit() throws InterruptedException {
  int puts = putList.size();
  int takes = takeList.size();
  if (puts > 0) {
    Preconditions.checkState(takes == 0, "nonzero puts and takes "
        + channelNameDescriptor);
    log.lockShared();
    try {
      log.commitPut(transactionID);
      channelCounter.addToEventPutSuccessCount(puts);
      synchronized (queue) {
        while (!putList.isEmpty()) {
          // Move each pointer out of inflightPuts and persist it to the channel's
          // queue, which is also backed by a local file
          if (!queue.addTail(putList.removeFirst())) {
            StringBuilder msg = new StringBuilder();
            msg.append("Queue add failed, this shouldn't be able to ");
            msg.append("happen. A portion of the transaction has been ");
            msg.append("added to the queue but the remaining portion ");
            msg.append("cannot be added. Those messages will be consumed ");
            msg.append("despite this transaction failing. Please report.");
            msg.append(channelNameDescriptor);
            LOG.error(msg.toString());
            Preconditions.checkState(false, msg.toString());
          }
        }
        queue.completeTransaction(transactionID);
      }
    } catch (IOException e) {
      throw new ChannelException("Commit failed due to IO error "
          + channelNameDescriptor, e);
    } finally {
      log.unlockShared();
    }
  }
  ... // some code omitted
}
Moving data from the local staging file into the channel is transactional: if any event in a batch fails to commit, the whole batch is rolled back, which achieves exactly-once at this stage. The hand-off from channel to sink is transactional as well. With an HDFS sink, however, only at-least-once can be guaranteed, because in extreme cases (a power failure, say) data may already have been synced from the queue to HDFS while the transaction has not yet committed, so those events are replayed.
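For completeness, a minimal HDFS sink wired to this channel could look like the following sketch (the path and roll settings are illustrative assumptions, not from the original setup):
tier1.sinks = fc_sink
tier1.sinks.fc_sink.type = hdfs
tier1.sinks.fc_sink.channel = fc_channel
tier1.sinks.fc_sink.hdfs.path = /flume/server_log/%Y%m%d
tier1.sinks.fc_sink.hdfs.useLocalTimeStamp = true
tier1.sinks.fc_sink.hdfs.fileType = DataStream
# roll a new file every 5 minutes rather than by size or event count
tier1.sinks.fc_sink.hdfs.rollInterval = 300
tier1.sinks.fc_sink.hdfs.rollSize = 0
tier1.sinks.fc_sink.hdfs.rollCount = 0
If the agent dies between the sink's HDFS flush and the transaction commit, those events are written again on restart, which is exactly the at-least-once behavior described above.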