Integrating Flume with Kafka for exactly-once data collection

Flume Kafka source configuration

To pull data from Kafka, a Flume Kafka source needs the following properties:

  • type: the source type, here org.apache.flume.source.kafka.KafkaSource
  • channels: the name of the downstream channel(s) the source writes to
  • kafka.topics: the Kafka topic(s) to consume
  • kafka.consumer.group.id: the consumer group used for those topics
  • other settings: any native Kafka consumer property can be passed through with the kafka.consumer. prefix, e.g. kafka.consumer.auto.offset.reset (see the note after the example below)

Below is an example Kafka source configuration:

tier1.sources = fc_source fc_source_client
tier1.sources.fc_source.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.fc_source.channels = fc_channel
tier1.sources.fc_source.kafka.bootstrap.servers = localhost:9092
tier1.sources.fc_source.kafka.topics = roomserverOutput
# batchSize is the maximum number of events written to the channel in one batch
tier1.sources.fc_source.batchSize = 500
tier1.sources.fc_source.batchDurationMillis = 2000
tier1.sources.fc_source.kafka.consumer.group.id = custom.server_event1.id
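
As mentioned in the list above, any native Kafka consumer property can be passed through by prefixing it with kafka.consumer.; for example, the following line (an illustrative addition, not part of the original configuration) makes a brand-new consumer group start from the earliest available offset:

tier1.sources.fc_source.kafka.consumer.auto.offset.reset = earliest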

Flume channel configuration

There are many channel types to choose from, such as memory, file, kafka, and jdbc. This article uses the on-disk file channel. Below is an example Flume file channel configuration:

tier1.channels.fc_channel.type = file
tier1.channels.fc_channel.checkpointDir = /mnt/data/flume/server_log/checkpoint
tier1.channels.fc_channel.dataDirs = /mnt/data/flume/server_log/data
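
Note that the snippets above omit the agent-level channel declaration and do not set the channel's capacity limits. A minimal sketch of that missing wiring, assuming the names used above (the capacity values are illustrative defaults, not from the original post), would be:

# declare the channel at the agent level
tier1.channels = fc_channel
# transactionCapacity must be at least as large as the source batchSize (500 above)
tier1.channels.fc_channel.capacity = 1000000
tier1.channels.fc_channel.transactionCapacity = 10000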

Why can an on-disk channel guarantee collection without data loss? After the source pulls a batch, each event is first persisted to local files (the file channel's write-ahead log plus an in-flight tracking file). When a threshold is reached, such as the source's batchSize limit or the batch duration, the batch is committed and the events are moved into the channel's durable queue. If the commit fails, or if persisting to the local files fails, the source does not commit the Kafka offsets, so the next poll re-consumes from the last uncommitted offset. Let's look at two pieces of code: the source committing its offsets, and the channel-side put path. First, the source committing its offsets:

// Excerpt from KafkaSource.java
// Events pulled from Kafka are first buffered in eventList
if (eventList.size() > 0) {
  counter.addToKafkaEventGetTimer((System.nanoTime() - nanoBatchStartTime) / (1000 * 1000));
  counter.addToEventReceivedCount((long) eventList.size());
  // Write the whole batch to the channel
  getChannelProcessor().processEventBatch(eventList);
  counter.addToEventAcceptedCount(eventList.size());
  if (log.isDebugEnabled()) {
    log.debug("Wrote {} events to channel", eventList.size());
  }
  eventList.clear();
  // Only after the channel write succeeded do we commit the consumed offsets
  if (!tpAndOffsetMetadata.isEmpty()) {
    long commitStartTime = System.nanoTime();
    // Synchronously commit the offsets consumed in this batch
    consumer.commitSync(tpAndOffsetMetadata);
    long commitEndTime = System.nanoTime();
    counter.addToKafkaCommitTimer((commitEndTime - commitStartTime) / (1000 * 1000));
    tpAndOffsetMetadata.clear();
  }
  return Status.READY;
}

return Status.BACKOFF;

Below is the channel-side code that receives data from the source:

// Excerpt from FileChannel.java (the channel-side put path)
@Override
protected void doPut(Event event) throws InterruptedException {
  ... // some code omitted
  boolean success = false;
  log.lockShared();
  try {
    FlumeEventPointer ptr = log.put(transactionID, event);
    Preconditions.checkState(putList.offer(ptr), "putList offer failed "
        + channelNameDescriptor);
    // Record the event in the uncommitted (in-flight) state on local disk first
    queue.addWithoutCommit(ptr, transactionID);
    success = true;
  } catch (IOException e) {
    channelCounter.incrementEventPutErrorCount();
    throw new ChannelException("Put failed due to IO error "
        + channelNameDescriptor, e);
  } finally {
    log.unlockShared();
    if (!success) {
      // release slot obtained in the case
      // the put fails for any reason
      queueRemaining.release();
    }
  }
}

// Persist to local disk: inflightPuts is backed by a local file
synchronized void addWithoutCommit(FlumeEventPointer e, long transactionID) {
  inflightPuts.addEvent(transactionID, e.toLong());
}

// Called when the batch is committed (e.g. the source's batchSize or batch duration is reached)
protected void doCommit() throws InterruptedException {
  int puts = putList.size();
  int takes = takeList.size();
  if (puts > 0) {
    Preconditions.checkState(takes == 0, "nonzero puts and takes "
        + channelNameDescriptor);
    log.lockShared();
    try {
      log.commitPut(transactionID);
      channelCounter.addToEventPutSuccessCount(puts);
      synchronized (queue) {
        while (!putList.isEmpty()) {
          // Take each pointer out of the in-flight state and append it to the
          // channel's queue, which is also backed by a local file
          if (!queue.addTail(putList.removeFirst())) {
            StringBuilder msg = new StringBuilder();
            msg.append("Queue add failed, this shouldn't be able to ");
            msg.append("happen. A portion of the transaction has been ");
            msg.append("added to the queue but the remaining portion ");
            msg.append("cannot be added. Those messages will be consumed ");
            msg.append("despite this transaction failing. Please report.");
            msg.append(channelNameDescriptor);
            LOG.error(msg.toString());
            Preconditions.checkState(false, msg.toString());
          }
        }
        queue.completeTransaction(transactionID);
      }
    } catch (IOException e) {
      throw new ChannelException("Commit failed due to IO error "
          + channelNameDescriptor, e);
    } finally {
      log.unlockShared();
    }
  ... // some code omitted
}

Moving data from the local in-flight file into the channel's queue is transactional: if any event in the batch fails to commit, the whole transaction is rolled back, which is what gives us exactly-once delivery from source to channel. The channel-to-sink path is transactional as well. If the sink is HDFS, however, only at-least-once can be guaranteed, because in an extreme case such as a power failure the data may already have been synced to HDFS while the transaction has not yet been committed, so the batch will be replayed and written again.
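
For completeness, here is a minimal sketch of what an HDFS sink wired to this file channel could look like; the sink name, HDFS path, and roll settings below are illustrative assumptions and are not taken from the original post:

# declare the sink and bind it to the file channel (names assumed)
tier1.sinks = fc_sink
tier1.sinks.fc_sink.type = hdfs
tier1.sinks.fc_sink.channel = fc_channel
# illustrative target path and file settings
tier1.sinks.fc_sink.hdfs.path = hdfs://namenode/flume/server_log/%Y%m%d
tier1.sinks.fc_sink.hdfs.useLocalTimeStamp = true
tier1.sinks.fc_sink.hdfs.fileType = DataStream
# roll files by size and time instead of event count
tier1.sinks.fc_sink.hdfs.rollCount = 0
tier1.sinks.fc_sink.hdfs.rollSize = 134217728
tier1.sinks.fc_sink.hdfs.rollInterval = 300

With a sink like this, a crash between the HDFS flush and the transaction commit causes the batch to be replayed, so downstream consumers of the HDFS data should be prepared to deduplicate.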

Reposted from: https://my.oschina.net/osenlin/blog/3004607
