flume-kafka source configuration
To sync data from Kafka, a Flume source needs the following settings:
- type: the source class, e.g. org.apache.flume.source.kafka.KafkaSource
- channels: the name(s) of the downstream channel(s) the source feeds
- kafka.topics: the Kafka topic(s) to consume
- kafka.consumer.group.id: the consumer group id used to consume those topics
- others: any Kafka consumer property can be passed through with the kafka.consumer. prefix, e.g. kafka.consumer.auto.offset.reset (demonstrated after the example below)
Below is an example Kafka source configuration:
tier1.sources = fc_source
tier1.sources.fc_source.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.fc_source.channels = fc_channel
tier1.sources.fc_source.kafka.bootstrap.servers = localhost:9092
tier1.sources.fc_source.kafka.topics = roomserverOutput
# batchSize is the maximum number of events written to the channel in one batch
tier1.sources.fc_source.batchSize = 500
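# batchDurationMillis is the maximum time (in ms) to wait before writing a batch to the channel; a batch is written when either limit is reached first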
tier1.sources.fc_source.batchDurationMillis = 2000
tier1.sources.fc_source.kafka.consumer.group.id = custom.server_event1.id
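Any other Kafka consumer setting can be passed through with the kafka.consumer. prefix. For instance, to start from the earliest offset when the group has no committed offset yet (the value here is an illustrative choice, not part of the original setup):
tier1.sources.fc_source.kafka.consumer.auto.offset.reset = earliest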
flume-channel configuration
There are many channel types to choose from, e.g. memory, file, kafka, and jdbc. This article uses the disk-backed file channel. Below is an example flume-file-channel configuration:
tier1.channels.fc_channel.type = file
tier1.channels.fc_channel.checkpointDir = /mnt/data/flume/server_log/checkpoint
tier1.channels.fc_channel.dataDirs = /mnt/data/flume/server_log/data
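The file channel's queue sizing can also be tuned. A minimal sketch with the stock defaults written out explicitly (values are illustrative; transactionCapacity must be at least the source's batchSize):
tier1.channels.fc_channel.capacity = 1000000
tier1.channels.fc_channel.transactionCapacity = 10000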
A disk-backed channel can guarantee loss-free collection. Why? After the source pulls data, events are first persisted to a local staging file. Once a threshold is reached, such as the source's batchSize cap or the batch commit time, a commit moves the data from that local file into the channel's storage directory. If the commit fails, or the local persistence itself fails, the source does not commit the Kafka offsets, so re-consumption resumes from the most recent unconsumed offset. Let's look at two pieces of code: the source committing offsets, and the channel accepting data from the source. First, the source's offset-commit code:
// Excerpt from KafkaSource.java
// Events pulled from Kafka are first buffered in eventList
if (eventList.size() > 0) {
  counter.addToKafkaEventGetTimer((System.nanoTime() - nanoBatchStartTime) / (1000 * 1000));
  counter.addToEventReceivedCount((long) eventList.size());
  // Write the batch into the channel
  getChannelProcessor().processEventBatch(eventList);
  counter.addToEventAcceptedCount(eventList.size());
  if (log.isDebugEnabled()) {
    log.debug("Wrote {} events to channel", eventList.size());
  }
  eventList.clear();
  // Only after the channel write succeeds do we prepare to commit the consumed offsets
  if (!tpAndOffsetMetadata.isEmpty()) {
    long commitStartTime = System.nanoTime();
    // Synchronously commit the current offsets back to Kafka
    consumer.commitSync(tpAndOffsetMetadata);
    long commitEndTime = System.nanoTime();
    counter.addToKafkaCommitTimer((commitEndTime - commitStartTime) / (1000 * 1000));
    tpAndOffsetMetadata.clear();
  }
  return Status.READY;
}
return Status.BACKOFF;
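Note the ordering: processEventBatch runs before commitSync, so offsets reach Kafka only after the events are safely in the channel. The same write-then-commit pattern in a standalone consumer, as a minimal sketch (the topic, group id, and the writeToChannel stand-in are hypothetical, for illustration only):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterWrite {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "custom.server_event1.id");
    // Disable auto-commit so offsets are committed only after a successful write,
    // mirroring KafkaSource's commitSync(tpAndOffsetMetadata)
    props.put("enable.auto.commit", "false");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("roomserverOutput"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
        for (ConsumerRecord<String, String> record : records) {
          writeToChannel(record.value()); // on failure this throws and nothing is committed
        }
        // Offsets are committed only after every record in the batch was written;
        // after a crash, consumption resumes from the last committed offset
        consumer.commitSync();
      }
    }
  }

  private static void writeToChannel(String event) {
    System.out.println(event); // hypothetical downstream write
  }
}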
Below is the channel code that accepts events from the source:
// Excerpt from FileChannel (the put side of a transaction)
@Override
protected void doPut(Event event) throws InterruptedException {
  ... // some code omitted
  boolean success = false;
  log.lockShared();
  try {
    FlumeEventPointer ptr = log.put(transactionID, event);
    Preconditions.checkState(putList.offer(ptr), "putList offer failed "
        + channelNameDescriptor);
    // Persist the event pointer to a local staging file first
    queue.addWithoutCommit(ptr, transactionID);
    success = true;
  } catch (IOException e) {
    channelCounter.incrementEventPutErrorCount();
    throw new ChannelException("Put failed due to IO error "
        + channelNameDescriptor, e);
  } finally {
    log.unlockShared();
    if (!success) {
      // release slot obtained in the case
      // the put fails for any reason
      queueRemaining.release();
    }
  }
}

// Persist locally without committing; inflightPuts is backed by a local file
synchronized void addWithoutCommit(FlumeEventPointer e, long transactionID) {
  inflightPuts.addEvent(transactionID, e.toLong());
}
// Invoked once the commit conditions are met
protected void doCommit() throws InterruptedException {
  int puts = putList.size();
  int takes = takeList.size();
  if (puts > 0) {
    Preconditions.checkState(takes == 0, "nonzero puts and takes "
        + channelNameDescriptor);
    log.lockShared();
    try {
      log.commitPut(transactionID);
      channelCounter.addToEventPutSuccessCount(puts);
      synchronized (queue) {
        while (!putList.isEmpty()) {
          // Move each pointer out of inflightPuts and persist it to the channel's
          // queue, which is also backed by a local file
          if (!queue.addTail(putList.removeFirst())) {
            StringBuilder msg = new StringBuilder();
            msg.append("Queue add failed, this shouldn't be able to ");
            msg.append("happen. A portion of the transaction has been ");
            msg.append("added to the queue but the remaining portion ");
            msg.append("cannot be added. Those messages will be consumed ");
            msg.append("despite this transaction failing. Please report.");
            msg.append(channelNameDescriptor);
            LOG.error(msg.toString());
            Preconditions.checkState(false, msg.toString());
          }
        }
        queue.completeTransaction(transactionID);
      }
    } catch (IOException e) {
      throw new ChannelException("Commit failed due to IO error "
          + channelNameDescriptor, e);
    } finally {
      log.unlockShared();
    }
  }
  ... // some code omitted
}
Moving data from the local staging file into the channel is transactional: if any event in a batch fails to commit, the whole batch is rolled back, which achieves exactly-once at this stage. The hand-off from channel to sink is transactional as well. With an HDFS sink, however, only at-least-once can be guaranteed, because in extreme cases (a power failure, say) data may already have been synced from the queue to HDFS while the transaction has not yet committed, so those events are replayed.
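For completeness, a minimal HDFS sink wired to this channel could look like the following sketch (the path and roll settings are illustrative assumptions, not from the original setup):
tier1.sinks = fc_sink
tier1.sinks.fc_sink.type = hdfs
tier1.sinks.fc_sink.channel = fc_channel
tier1.sinks.fc_sink.hdfs.path = /flume/server_log/%Y%m%d
tier1.sinks.fc_sink.hdfs.useLocalTimeStamp = true
tier1.sinks.fc_sink.hdfs.fileType = DataStream
# roll a new file every 5 minutes rather than by size or event count
tier1.sinks.fc_sink.hdfs.rollInterval = 300
tier1.sinks.fc_sink.hdfs.rollSize = 0
tier1.sinks.fc_sink.hdfs.rollCount = 0
If the agent dies between the sink's HDFS flush and the transaction commit, those events are written again on restart, which is exactly the at-least-once behavior described above.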