Review(7)

longdada007

Article link: https://blog.csdn.net/qq_18522601/article/details/95333956

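17 How does Flume record the read position (pos) while extracting data? Which source should be used? Can Taildir Source watch directories recursively?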

Flume has three sources that can monitor files or directories: Exec Source, Spooling Directory Source, and Taildir Source.

Taildir Source was introduced in version 1.7 and combines the advantages of Spooling Directory Source and Exec Source.

Use cases:

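Exec Source
  Exec Source can follow a file with a command such as tail -f and forward new log lines in near real time. Its drawback is that when the agent process dies and restarts, data can be consumed again. This can be addressed by attaching a UUID to each event, or by modifying ExecSource.
Spooling Directory Source
  Spooling Directory Source watches a directory and forwards new files placed in it. Once a file has been fully ingested, it can be deleted immediately or marked as completed. It is suited to ingesting newly arrived files, but not to following files that are still being appended to. To follow appended content in real time, SpoolDirectorySource would have to be modified.
Taildir Source
  Taildir Source follows a group of files in real time and records the latest consumed position of each one, so an agent restart does not cause repeated consumption.
Flume 1.8.0 is recommended, because it fixes a Taildir Source bug that could lose data.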

######## source configuration ########
# source type
agent.sources.s1.type = TAILDIR
# location of the position (metadata) file
agent.sources.s1.positionFile = /Users/wangpei/tempData/flume/taildir_position.json
# file groups to monitor
agent.sources.s1.filegroups = f1
agent.sources.s1.filegroups.f1 = /Users/wangpei/tempData/flume/data/.*log
# add the absolute file path to each event's header
agent.sources.s1.fileHeader = true

Metadata recording the consumed position of each file
# configuration
agent.sources.s1.positionFile = /Users/wangpei/tempData/flume/taildir_position.json
# content
[
{
"inode":6028358,
"pos":144,
"file":"/Users/wangpei/tempData/flume/data/test.log"
},
{
"inode":6028612,
"pos":20,
"file":"/Users/wangpei/tempData/flume/data/test_a.log"
}
]

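As shown, taildir_position.json stores the latest consumed position of every file as a JSON array. The file is rewritten as consumption advances; the write interval is controlled by writePosInterval (3000 ms by default).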
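To make the role of the position file concrete, here is a minimal sketch of the bookkeeping it implies: one entry per inode, overwritten as consumption advances, and consulted after a restart. The class PositionFileSketch and its methods are hypothetical illustrations, not Flume's actual implementation, and the JSON rendering does no string escaping.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a Taildir-style position file (hypothetical, not Flume code). */
public class PositionFileSketch {
    /** One tracked file: inode, byte offset consumed so far, and path. */
    private static final class Entry {
        final long inode;
        final long pos;
        final String file;

        Entry(long inode, long pos, String file) {
            this.inode = inode;
            this.pos = pos;
            this.file = file;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Records (or overwrites) the latest consumed offset for an inode. */
    public void update(long inode, long pos, String file) {
        entries.removeIf(e -> e.inode == inode);
        entries.add(new Entry(inode, pos, file));
    }

    /** Offset to resume from after a restart; 0 if the inode was never seen. */
    public long resumeOffset(long inode) {
        for (Entry e : entries) {
            if (e.inode == inode) {
                return e.pos;
            }
        }
        return 0L;
    }

    /** Renders the entries as a JSON array with the fields inode, pos, file (no escaping). */
    public String toJson() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"inode\":").append(e.inode)
              .append(",\"pos\":").append(e.pos)
              .append(",\"file\":\"").append(e.file).append("\"}");
        }
        return sb.append(']').toString();
    }
}
```

Because the offset is looked up by inode, a restarted reader can seek to resumeOffset(inode) instead of re-reading the file from the beginning.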

The official release does not support recursive directories, but this can be added through custom development:

/**
 * hzy
 * Recursive variant: matches files in parentDir and in every directory below it.
 * @return all files accepted by fileFilter
 */
private List<File> getMatchingFilesNoCache2() {
    List<File> result = Lists.newArrayList();

    List<Path> paths = recurseFolder(parentDir);
    for (Path path : paths) {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(path, fileFilter)) {
            for (Path entry : stream) {
                result.add(entry.toFile());
            }
        } catch (IOException e) {
            // Log through the logger instead of printing the stack trace to stderr.
            logger.error("Failed to list directory " + path, e);
        }
    }

    String matchedFileNames = result.stream().map(r -> r.getAbsolutePath()).collect(Collectors.joining("\n"));
    logger.debug("============================matched files=======================================");
    logger.debug(matchedFileNames);
    logger.debug("=============================matched files======================================");
    return result;
}

/**
 * hzy
 * Collects root and all of its subdirectories, at any depth.
 * @param root the directory to start from
 * @return root followed by every directory below it
 */
public List<Path> recurseFolder(File root) {
    List<Path> allParentFolders = new ArrayList<>();
    allParentFolders.add(root.toPath());

    if (root.exists()) {
        File[] files = root.listFiles();
        if (null == files || files.length == 0) {
            return allParentFolders;
        } else {
            for (File subFile : files) {
                if (subFile.isDirectory()) {
                    allParentFolders.addAll(recurseFolder(subFile));
                }
            }
        }
    }
    return allParentFolders;
}
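As an alternative to hand-written recursion, the JDK's Files.walk can traverse the directory tree directly. The sketch below is a standalone illustration under that assumption; the class RecursiveMatchSketch and the method findMatching are hypothetical names, and this is not how Taildir Source itself is wired.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Standalone sketch of recursive file matching (hypothetical, not Flume code). */
public class RecursiveMatchSketch {
    /** Returns regular files under root, at any depth, whose file name matches the regex. */
    public static List<Path> findMatching(Path root, String fileNameRegex) throws IOException {
        Pattern pattern = Pattern.compile(fileNameRegex);
        // Files.walk must be closed, hence the try-with-resources.
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> pattern.matcher(p.getFileName().toString()).matches())
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

Calling findMatching(Paths.get("/some/log/root"), ".*log") with a directory of your own would return matching files from every subdirectory, which is the behavior the custom recurseFolder method adds.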

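18 Has any secondary development been done on the Flume source code?

Of the modules, flume-ng-core holds the core code, including the basic Source, Channel, and Sink implementations; flume-ng-node holds the startup code (the entry point).

Other modules likely to be needed are flume-ng-sources, flume-ng-channels, and flume-ng-sinks. These three hold the optional Flume components (those not included in flume-ng-core), some of which are very commonly used.

Flume has three main components: Source, Channel, and Sink.

Source is where the data comes from. For example, when a web server produces logs, ExecSource can run tail -F to keep watching the log file for new data and pass it to the Channel.
Channel is a buffering queue. Because the read rate and the write rate may not match, doing the work synchronously could be inefficient, so the Source writes data into the Channel queue and the Sink reads it on a separate thread.
Sink is the final destination, for example HDFS or log file output. The Sink reads data from the Channel and stores it.
At startup, all SourceRunners, Channels, and SinkRunners are started. Starting a Channel does nothing special: it initializes some state and creates counters, so the Channel plays a passive role. The important parts are SourceRunner and SinkRunner.

SourceRunner calls the Source's start method. For ExecSource, start launches one thread that keeps reading the command's standard output into a list (eventList), and another thread that periodically sends the contents of that list to the Channel in batches.
SinkRunner repeatedly calls the process method of the SinkProcessor. There are several SinkProcessor types, which decide which Sink is used (there can be more than one). Once a Sink is selected, its process method is called; that method mainly reads data from the Channel and writes it to the corresponding storage.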
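The decoupling described above (a Source thread writing into a buffering queue while a Sink thread drains it) can be sketched with a plain BlockingQueue. This is a toy model under stated assumptions: ToyPipeline, the POISON end marker, and the use of strings as events are all hypothetical and stand in for Flume's real Source, Channel, and Sink classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy Source -> Channel -> Sink pipeline (hypothetical, not Flume code). */
public class ToyPipeline {
    private static final String POISON = "__END__"; // hypothetical end-of-stream marker

    /** Pushes the lines through a bounded queue and returns what the sink thread received. */
    public static List<String> run(List<String> lines, int capacity) throws InterruptedException {
        BlockingQueue<String> channel = new LinkedBlockingQueue<>(capacity);
        List<String> stored = new ArrayList<>();

        Thread source = new Thread(() -> {
            try {
                for (String line : lines) {
                    channel.put(line); // blocks while the channel is full
                }
                channel.put(POISON);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }, "toy-source");

        Thread sink = new Thread(() -> {
            try {
                String event = channel.take(); // blocks while the channel is empty
                while (!POISON.equals(event)) {
                    stored.add(event);
                    event = channel.take();
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }, "toy-sink");

        source.start();
        sink.start();
        source.join();
        sink.join(); // join makes the sink's writes to 'stored' visible here
        return stored;
    }
}
```

With a capacity of 1 the source is forced to wait for the sink on almost every event, yet the events still arrive complete and in order, which is the point of putting a queue between the two sides.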

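Program entry point
Starting Flume can be divided into two steps:
1. Load the configuration (usually flume-conf.properties).
2. Start the components. Unless stated otherwise, "component" in this article means an object of a class that implements the LifecycleAware interface, typically a Source, Channel, or Sink.

The main function that starts Flume is in org.apache.flume.node.Application in the flume-ng-node module. It does three things:
1. Parses the command-line arguments (the parameters passed at startup) with Apache Commons CLI.
2. Chooses how to read the configuration based on those arguments. There are four ways in total, determined by whether the configuration is stored in ZooKeeper or in a local properties file, and by whether reload (automatic reloading of the configuration file) is enabled.
3. Starts the program with the resulting configuration and registers a shutdown hook.
Taking a properties file without reloading as an example, the main code is:

PropertiesFileConfigurationProvider configurationProvider =
    new PropertiesFileConfigurationProvider(agentName, configurationFile);
// Create the Application object; this initializes the component list (components) and the LifecycleSupervisor.
application = new Application();
application.handleConfigurationEvent(configurationProvider.getConfiguration());
// start() checks whether every component is started and starts any that is not.
application.start();
// Listen for process shutdown so that cleanup can run after the program is killed.
final Application appReference = application;
Runtime.getRuntime().addShutdownHook(new Thread("agent-shutdown-hook") {
    @Override
    public void run() {
        appReference.stop();
    }
});

Two places in the code above are key:

configurationProvider.getConfiguration() returns a MaterializedConfiguration object, which turns the file-based configuration into a materialized one, that is, one containing actual instances of channels, sinkRunners, and so on. It is analyzed in the "Materialized configuration" section.
handleConfigurationEvent stops all components and restarts them with the new configuration. It is analyzed in the "Restarting with a new configuration" section.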

Materialized configuration
configurationProvider.getConfiguration() mainly does two things:
1. Reads the configuration file (flume-conf.properties) and stores it in an AgentConfiguration object.

public static class AgentConfiguration {
    private final String agentName;
    private String sources;
    private String sinks;
    private String channels;
    private String sinkgroups;
    private final Map<String, ComponentConfiguration> sourceConfigMap;
    private final Map<String, ComponentConfiguration> sinkConfigMap;
    private final Map<String, ComponentConfiguration> channelConfigMap;
    private final Map<String, ComponentConfiguration> sinkgroupConfigMap;
    private Map<String, Context> sourceContextMap;
    private Map<String, Context> sinkContextMap;
    private Map<String, Context> channelContextMap;
    private Map<String, Context> sinkGroupContextMap;
    private Set<String> sinkSet;
    private Set<String> sourceSet;
    private Set<String> channelSet;
    private Set<String> sinkgroupSet;
}

At this point the configuration is still only a set of categorized, text-form entries.
2. Creates the component instances described in the configuration and adds them to a MaterializedConfiguration instance.

public interface MaterializedConfiguration {
    public void addSourceRunner(String name, SourceRunner sourceRunner);
    public void addSinkRunner(String name, SinkRunner sinkRunner);
    public void addChannel(String name, Channel channel);
    public ImmutableMap<String, SourceRunner> getSourceRunners();
    public ImmutableMap<String, SinkRunner> getSinkRunners();
    public ImmutableMap<String, Channel> getChannels();
}

Starting all components
4.2.1 Restarting with a new configuration
With the MaterializedConfiguration instance from above, the components can be started.
handleConfigurationEvent first stops all components and then starts them all.

stopAllComponents();
startAllComponents(conf); // conf is the MaterializedConfiguration from the previous section.

startAllComponents iterates over the component collections (SourceRunners, SinkRunners, Channels) and calls supervise for each one. Taking Channel as an example:

for (Entry<String, Channel> entry :
        materializedConfiguration.getChannels().entrySet()) {
    try {
        logger.info("Starting Channel " + entry.getKey());
        supervisor.supervise(entry.getValue(),
            new SupervisorPolicy.AlwaysRestartPolicy(), LifecycleState.START);
    } catch (Exception e) {
        logger.error("Error while starting {}", entry.getValue(), e);
    }
}

LifecycleSupervisor
The supervisor in the previous section is a LifecycleSupervisor object. As mentioned earlier, a LifecycleSupervisor is initialized when the Application is created; that is this supervisor. I think of it as the manager of each component's lifecycle: it continuously monitors the state of all components and, if a component is not in its desired state (desiredState), performs the state transition.

The code in the previous section calls supervisor.supervise, so let's look at that method:

public synchronized void supervise(LifecycleAware lifecycleAware,
        SupervisorPolicy policy, LifecycleState desiredState) {
    // state-checking code omitted
    Supervisoree process = new Supervisoree();
    process.status = new Status();
    process.policy = policy;
    process.status.desiredState = desiredState;
    process.status.error = false;
    MonitorRunnable monitorRunnable = new MonitorRunnable();
    monitorRunnable.lifecycleAware = lifecycleAware;
    monitorRunnable.supervisoree = process;
    monitorRunnable.monitorService = monitorService;
    supervisedProcesses.put(lifecycleAware, process);
    ScheduledFuture<?> future = monitorService.scheduleWithFixedDelay(
        monitorRunnable, 0, 3, TimeUnit.SECONDS);
    monitorFutures.put(lifecycleAware, future);
}

Because every component implements the LifecycleAware interface, supervise takes a LifecycleAware object.

The method creates a Supervisoree object, which, as the name suggests, is the object being supervised. It has the following states: IDLE, START, STOP, ERROR.
scheduleWithFixedDelay runs the monitoring task (monitorRunnable) repeatedly, with a fixed delay of 3 seconds between runs.
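The supervise pattern above (a scheduled task that keeps nudging a component toward its desired state) can be sketched in isolation. This is a simplified model, assuming a single component and no restart policy; ToySupervisor, ToyComponent, and the State enum are hypothetical stand-ins for LifecycleSupervisor, LifecycleAware, and LifecycleState.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy lifecycle supervisor (hypothetical, not Flume's LifecycleSupervisor). */
public class ToySupervisor {
    public enum State { IDLE, START, STOP, ERROR }

    /** Minimal stand-in for a LifecycleAware component. */
    public static class ToyComponent {
        public volatile State state = State.IDLE;
        public final CountDownLatch started = new CountDownLatch(1);

        public void start() {
            state = State.START;
            started.countDown();
        }

        public void stop() {
            state = State.STOP;
        }
    }

    private final ScheduledExecutorService monitorService =
            Executors.newSingleThreadScheduledExecutor();

    /** Periodically compares the actual state with the desired one and corrects it. */
    public void supervise(ToyComponent component, State desired, long delayMillis) {
        monitorService.scheduleWithFixedDelay(() -> {
            if (component.state == desired) {
                return; // already in the desired state, nothing to do
            }
            if (desired == State.START) {
                component.start();
            } else if (desired == State.STOP) {
                component.stop();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the monitoring thread. */
    public void shutdown() {
        monitorService.shutdownNow();
    }
}
```

The component never starts itself; the caller only declares the desired state and the periodic task does the rest, which is why a component that falls out of its desired state can be brought back on a later run.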

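MonitorRunnable
MonitorRunnable mainly checks the component's state and drives the transition from lifecycleState to desiredState.

switch (supervisoree.status.desiredState) {
    case START:
        try {
            lifecycleAware.start();
        } catch (Throwable e) { /* omitted */ }
        break;
    case STOP:
        try {
            lifecycleAware.stop();
        } catch (Throwable e) { /* omitted */ }
        break;
    default:
        logger.warn("I refuse to acknowledge {} as a desired state", supervisoree.status.desiredState);
}

Up to this point, we can see that the monitoring task starts and stops components by calling their own start and stop methods. As mentioned, there are three kinds of components: SourceRunner, Channel, and SinkRunner. Since Channel's start only initializes counters and does nothing substantial, the following sections cover the startup of SourceRunner (writing data from Source to Channel) and of SinkRunner (reading data from Channel and writing it to Sink).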

Writing data from Source to Channel
5.1 Source
5.1.1 SourceRunner
SourceRunner is a class dedicated to running a Source.
After the configuration is read in the "Materialized configuration" section, the concrete SourceRunner is obtained from the Source by calling SourceRunner's forSource method.

public static SourceRunner forSource(Source source) {
    SourceRunner runner = null;
    if (source instanceof PollableSource) {
        runner = new PollableSourceRunner();
        ((PollableSourceRunner) runner).setSource((PollableSource) source);
    } else if (source instanceof EventDrivenSource) {
        runner = new EventDrivenSourceRunner();
        ((EventDrivenSourceRunner) runner).setSource((EventDrivenSource) source);
    } else {
        throw new IllegalArgumentException("No known runner type for source " + source);
    }
    return runner;
}

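As shown, sources are divided into two types, each with its own runner (PollableSourceRunner, EventDrivenSourceRunner). The difference is whether an external driver is needed to fetch data: a source that needs none (it uses its own event-driven mechanism) is an EventDrivenSource, and one that must be driven externally is a PollableSource.

Common EventDrivenSources: AvroSource, ExecSource, SpoolDirectorySource.
Common PollableSources: TaildirSource, KafkaSource, JMSSource.
Taking EventDrivenSourceRunner as an example, MonitorRunnable calls its start method:

public void start() {
    Source source = getSource();
    ChannelProcessor cp = source.getChannelProcessor();
    cp.initialize(); // initializes the Interceptors
    source.start();
    lifecycleState = LifecycleState.START;
}
ChannelProcessor is an important class and is covered in detail later. Next, the Source's start method is called. In terms of the overall architecture, start implements the Source side: reading from the data source and handing events to the Channel.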

5.1.2 ExecSource
Taking ExecSource's start method as an example:

public void start() {
    executor = Executors.newSingleThreadExecutor();
    runner = new ExecRunnable(shell, command, getChannelProcessor(), sourceCounter, restart, restartThrottle, logStderr, bufferCount, batchTimeout, charset);
    runnerFuture = executor.submit(runner);
    sourceCounter.start();
    super.start();
}

It mainly starts a runner thread and initializes the counters. The actual work is in the run method of the ExecRunnable class:

public void run() {
    do {
        timedFlushService = Executors.newSingleThreadScheduledExecutor(…);
        // Start the shell command with the configured arguments.
        String[] commandArgs = command.split("\\s+");
        process = new ProcessBuilder(commandArgs).start();
        // Reader for the command's standard output (the process's input stream from Java's side).
        reader = new BufferedReader(new InputStreamReader(process.getInputStream()…));
        // Reader for the command's standard error.
        StderrReader stderrReader = new StderrReader(…);
        stderrReader.start();
        // Start a scheduled task that writes the contents of eventList to the Channel in batches.
        future = timedFlushService.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                synchronized (eventList) {
                    if (!eventList.isEmpty() && timeout()) { flushEventBatch(eventList); }
                }
            }
        }, batchTimeout, batchTimeout, TimeUnit.MILLISECONDS);
        // Read standard output line by line and append each line to eventList.
        while ((line = reader.readLine()) != null) {
            synchronized (eventList) {
                sourceCounter.incrementEventReceivedCount();
                eventList.add(EventBuilder.withBody(line.getBytes(charset)));
                // Flush eventList to the Channel once it reaches the configured size or times out.
                if (eventList.size() >= bufferCount || timeout()) { flushEventBatch(eventList); }
            }
        }
        synchronized (eventList) { if (!eventList.isEmpty()) { flushEventBatch(eventList); } }
    } while (restart); // If restart is configured, rerun the command when its process exits.
}

This method starts two readers, one for the command's standard output and one for its standard error, and writes the content of standard output into eventList.

At the same time, another thread periodically calls flushEventBatch to write the data in eventList to the Channel.

private void flushEventBatch(List<Event> eventList) {
    channelProcessor.processEventBatch(eventList); // if this throws, eventList has not been cleared yet
    sourceCounter.addToEventAcceptedCount(eventList.size());
    eventList.clear();
    lastPushToChannel = systemClock.currentTimeMillis();
}

As shown, channelProcessor.processEventBatch() is called to write to the Channel.
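The flush rule used by ExecSource (flush when the buffer reaches bufferCount, or when batchTimeout has elapsed since the last push) can be isolated into a small sketch. ToyBatcher is hypothetical: it uses strings as events, an in-memory list in place of the Channel, and an injectable clock so that the timeout branch can be exercised without sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Toy size-or-timeout batcher (hypothetical, modeled loosely on ExecSource's flush rule). */
public class ToyBatcher {
    private final int bufferCount;
    private final long batchTimeoutMillis;
    private final LongSupplier clock; // injectable clock, an assumption made for testability
    private final List<String> eventList = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>(); // stands in for the Channel
    private long lastPush;

    public ToyBatcher(int bufferCount, long batchTimeoutMillis, LongSupplier clock) {
        this.bufferCount = bufferCount;
        this.batchTimeoutMillis = batchTimeoutMillis;
        this.clock = clock;
        this.lastPush = clock.getAsLong();
    }

    /** Reader side: buffer one line, flushing when the buffer is full or overdue. */
    public synchronized void add(String line) {
        eventList.add(line);
        if (eventList.size() >= bufferCount || timeout()) {
            flush();
        }
    }

    /** Timer side: what the periodic flush task would call. */
    public synchronized void flushIfDue() {
        if (!eventList.isEmpty() && timeout()) {
            flush();
        }
    }

    private boolean timeout() {
        return clock.getAsLong() - lastPush >= batchTimeoutMillis;
    }

    private void flush() {
        flushed.add(new ArrayList<>(eventList)); // if this step threw, eventList would stay intact
        eventList.clear();
        lastPush = clock.getAsLong();
    }

    /** Batches delivered so far. */
    public synchronized List<List<String>> flushed() {
        return flushed;
    }
}
```

Both entry points synchronize on the same object, mirroring how the reader loop and the timed task in ExecRunnable both lock eventList before touching it.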

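5.2 Channel
5.2.1 ChannelProcessor
ChannelProcessor runs all interceptors and sends the data in eventList to each reqChannel and optChannel. The reqChannels and optChannels are obtained from the channelSelector.

public interface ChannelSelector extends NamedComponent, Configurable {
    public void setChannels(List<Channel> channels);
    public List<Channel> getRequiredChannels(Event event);
    public List<Channel> getOptionalChannels(Event event);
    public List<Channel> getAllChannels(); // all Channels configured for the current Source
}

To write a custom ChannelSelector, extend AbstractChannelSelector and implement getRequiredChannels and getOptionalChannels.

A reqChannel is a Channel to which delivery must be guaranteed (failures are retried); an optChannel is a Channel to which delivery is best-effort (no retry after a failure).

In the code, the difference is that for a reqChannel an exception is rethrown to the caller after the rollback, whereas for an optChannel only the rollback is performed. Note that the rollback only clears putList (see section 5.2.4). If this layer does not throw, the caller (flushEventBatch in the previous section) clears eventList, so the data that hit the exception is lost.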
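The difference in failure handling between required and optional channels can be sketched as follows. This is a toy model: ToyChannelProcessor and ToyChannel are hypothetical, transactions and rollbacks are left out, and only the propagate-versus-swallow behavior is shown.

```java
import java.util.List;

/** Toy model of required vs optional channel delivery (hypothetical, not Flume code). */
public class ToyChannelProcessor {
    /** Hypothetical minimal channel: put either succeeds or throws. */
    public interface ToyChannel {
        void put(String event);
    }

    private final List<ToyChannel> required;
    private final List<ToyChannel> optional;

    public ToyChannelProcessor(List<ToyChannel> required, List<ToyChannel> optional) {
        this.required = required;
        this.optional = optional;
    }

    /** Delivers one event to every channel. */
    public void process(String event) {
        for (ToyChannel channel : required) {
            // A failure here propagates to the caller, which still holds the event and can retry.
            channel.put(event);
        }
        for (ToyChannel channel : optional) {
            try {
                channel.put(event);
            } catch (RuntimeException ex) {
                // Swallowed: the caller sees success, so the event is lost for this channel.
            }
        }
    }
}
```

Because the caller only learns about failures on required channels, it clears its buffer after a failed optional delivery, which is exactly the data-loss case described above.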

The code that sends a single event is:

try {
    tx.begin();
    reqChannel.put(event);
    tx.commit();
} catch (Throwable t) {
    tx.rollback();
    // some code omitted
}

put calls the Channel's doPut method, and commit calls its doCommit method.
A Channel has four main methods: doPut, doTake, doCommit, and doRollback. MemoryChannel is used as the example below.

5.2.2 doPut
This method only increments a counter and adds the event to putList.

protected void doPut(Event event) throws InterruptedException {
    channelCounter.incrementEventPutAttemptCount();
    int eventByteSize = (int) Math.ceil(estimateEventSize(event) / byteCapacitySlotSize);
    if (!putList.offer(event)) {
        throw new ChannelException("");
    }
    putByteCounter += eventByteSize;
}
If an exception occurs in this method, it propagates to ChannelProcessor, which performs the rollback.

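5.2.3 doCommit
This is one of the more complex methods, because commits for both put and take go through it, so the code mixes the commit logic needed by two functions (put and take).

Considering only writing from Source to Channel, the flow is eventList->putList->queue.

Since the data has already been placed in putList, what remains is to move the data from putList into queue. That is enough for now; the next chapter looks at this method together with the take operation.

5.2.4 doRollback
As with doCommit, there are two kinds of rollback: one caused by a take and one caused by a put.

Starting with the one caused by a put, the transaction flow is:
eventList->putList->queue

When doPut or doCommit throws, execution leaves immediately, before the clearing statement runs (see the comment in the last code snippet of the "ExecSource" section), so eventList has not been cleared. putList can therefore simply be cleared, and the next loop iteration will read the data in eventList again.

Note: if part of the data has already been placed in the queue when a put commit is rolled back, could data be duplicated? Judging from the code, many checks (capacity and so on) are done before the enqueue step, the step itself only moves events from putList into the queue, and the code after it only updates counters, so in theory no exception should occur there and trigger a rollback.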
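The put side of the transaction (stage in putList, move to queue on commit, discard on rollback) can be reduced to a few lines. ToyPutTransaction is a hypothetical single-threaded sketch with no capacity checks or counters; it only shows why clearing putList is a sufficient rollback while the caller still holds its own copy of the events.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy put transaction over an in-memory queue (hypothetical, not MemoryChannel). */
public class ToyPutTransaction {
    private final Deque<String> queue = new ArrayDeque<>();   // committed events
    private final Deque<String> putList = new ArrayDeque<>(); // events staged by this transaction

    /** Stages an event; nothing is visible in the queue yet. */
    public void put(String event) {
        putList.addLast(event);
    }

    /** Moves all staged events into the queue, preserving their order. */
    public void commit() {
        while (!putList.isEmpty()) {
            queue.addLast(putList.removeFirst());
        }
    }

    /** Discards staged events; the caller is expected to resend them from its own buffer. */
    public void rollback() {
        putList.clear();
    }

    /** Snapshot of the committed events, oldest first. */
    public List<String> queued() {
        return new ArrayList<>(queue);
    }
}
```

After a rollback the queue is untouched, so resending the same batch from eventList and committing it does not create duplicates in this model.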

Reading data from Channel and writing it to Sink
6.1 Sink
The Sink side has three main steps:
1. SinkRunner repeatedly calls the SinkProcessor's process method.
2. Depending on the configured SinkProcessor, a different strategy is used to select a sink. There are three SinkProcessors; the default is DefaultSinkProcessor.
3. The selected sink's process method is called.

6.1.1 The Sink's process method
LoggerSink is used as the example. This method comes from the Sink interface; it takes data out and processes it, and rolls back on failure (the contents of takeList are returned to the queue):

public Status process() throws EventDeliveryException {
    Status result = Status.READY;
    Channel channel = getChannel();
    Transaction transaction = channel.getTransaction();
    Event event = null;
    try {
        transaction.begin();
        event = channel.take(); // take one event from the channel
        if (event != null) {
            if (logger.isInfoEnabled()) {
                // write the event to the log
                logger.info("Event: " + EventHelper.dumpEvent(event, maxBytesToLog));
            }
        } else {
            result = Status.BACKOFF;
        }
        transaction.commit(); // commit
    } catch (Exception ex) {
        transaction.rollback(); // roll back
        throw new EventDeliveryException("Failed to log event: " + event, ex);
    } finally {
        transaction.close();
    }
    return result;
}

6.2 Channel
6.2.1 doTake
This method mainly takes an event from queue and puts it into takeList.

protected Event doTake() throws InterruptedException {
    channelCounter.incrementEventTakeAttemptCount();
    // Check that takeList has capacity left; throw if it is full.
    if (takeList.remainingCapacity() == 0) {
        throw new ChannelException("");
    }
    // Try to acquire a permit for a stored event; if none is available there is nothing to take, so return.
    if (!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {
        return null;
    }
    Event event;
    synchronized (queueLock) {
        event = queue.poll(); // take one event from queue
    }
    Preconditions.checkNotNull(event, "");
    takeList.put(event); // add it to takeList
    int eventByteSize = (int) Math.ceil(estimateEventSize(event) / byteCapacitySlotSize);
    takeByteCounter += eventByteSize; // update the counter
    return event;
}

6.2.2 doCommit
As mentioned, commits for both put and take go through this method.

It has two things to do:
1. Move putList into queue; once done, the eventList->putList->queue step is complete.
2. If doTake did not fail (reaching this method implies it did not), the sink has received all the events, so takeList can be cleared, which completes the queue->takeList & sink step.

Combined, the job is to move putList into queue and then clear takeList.

protected void doCommit() throws InterruptedException {
    int remainingChange = takeList.size() - putList.size();
    if (remainingChange < 0) {
        if (!bytesRemaining.tryAcquire(putByteCounter, keepAlive, TimeUnit.SECONDS)) {
            throw new ChannelException("");
        }
        if (!queueRemaining.tryAcquire(-remainingChange, keepAlive, TimeUnit.SECONDS)) {
            bytesRemaining.release(putByteCounter);
            throw new ChannelFullException("");
        }
    }
    int puts = putList.size();
    int takes = takeList.size();
    synchronized (queueLock) {
        if (puts > 0) {
            while (!putList.isEmpty()) {
                if (!queue.offer(putList.removeFirst())) {
                    throw new RuntimeException("");
                }
            }
        }
        putList.clear();
        takeList.clear();
    }
    // the rest resets the related counters
}

The method starts by comparing the sizes of takeList and putList in order to simplify acquiring permits. The straightforward flow would be to clear takeList and release takeList.size permits, then acquire putList.size permits; here the two steps are merged into one.
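To check that merging the two permit steps gives the same result as doing them separately, the arithmetic can be tried on a plain java.util.concurrent.Semaphore. PermitSketch and its two methods are hypothetical; the release branch for a positive remainingChange is an assumption about what the omitted counter code does, and -1 is used here only as a marker for "not enough room".

```java
import java.util.concurrent.Semaphore;

/** Compares two-step and merged permit accounting (hypothetical sketch, not MemoryChannel). */
public class PermitSketch {
    /** Two-step version: release the slots freed by takes, then acquire slots for puts. */
    public static int twoStep(int free, int puts, int takes) {
        Semaphore queueRemaining = new Semaphore(free);
        queueRemaining.release(takes);
        if (!queueRemaining.tryAcquire(puts)) {
            return -1; // not enough room for the puts
        }
        return queueRemaining.availablePermits();
    }

    /** Merged version: only the net difference is acquired or released. */
    public static int merged(int free, int puts, int takes) {
        Semaphore queueRemaining = new Semaphore(free);
        int remainingChange = takes - puts;
        if (remainingChange < 0) {
            if (!queueRemaining.tryAcquire(-remainingChange)) {
                return -1; // not enough room for the net growth
            }
        } else if (remainingChange > 0) {
            queueRemaining.release(remainingChange); // assumed to happen in the omitted counter code
        }
        return queueRemaining.availablePermits();
    }
}
```

For example, with 5 free slots, 3 puts, and 1 take, both versions leave 3 free slots; the merged form simply needs one semaphore call instead of two.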

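6.2.3 doRollback
As with doCommit, there are two kinds of rollback:
- Caused by a take
The transaction flow is queue->takeList & sink, so the rollback has to put takeList back into queue.
- Caused by a put
The transaction flow is eventList->putList->queue. When doPut or doCommit throws, execution leaves immediately, before the clearing statement runs, so eventList has not been cleared. putList can therefore simply be cleared, and the next loop iteration will read the data in eventList again.

Combined into one method: put takeList back into queue, then clear putList. The code is:

protected void doRollback() {
    int takes = takeList.size();
    synchronized (queueLock) {
        Preconditions.checkState(queue.remainingCapacity() >= takeList.size(), "");
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
        putList.clear();
    }
    // the rest resets the related counters
}

Note: judging from the current code, the sink may already have received part of the data during the take. If an exception occurs at that point and takeList is returned to queue, duplicate data results.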
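The take-side rollback relies on a small ordering trick: removing from the tail of takeList and adding to the head of queue restores the original order. The sketch below shows just that; ToyTakeRollback is hypothetical, single-threaded, and ignores capacity and counters.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy take transaction showing order-preserving rollback (hypothetical, not MemoryChannel). */
public class ToyTakeRollback {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Deque<String> takeList = new ArrayDeque<>();

    public ToyTakeRollback(List<String> initial) {
        queue.addAll(initial);
    }

    /** Takes the oldest event and remembers it until commit or rollback. */
    public String take() {
        String event = queue.pollFirst();
        if (event != null) {
            takeList.addLast(event);
        }
        return event;
    }

    /** The sink succeeded: the taken events are gone for good. */
    public void commit() {
        takeList.clear();
    }

    /** The sink failed: return the taken events to the head of the queue in their original order. */
    public void rollback() {
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
    }

    /** Snapshot of the queue, oldest first. */
    public List<String> snapshot() {
        return new ArrayList<>(queue);
    }
}
```

Anything the sink already wrote before the failure is not undone by this rollback, so the same events are delivered again on the next attempt; that is the source of the duplicates mentioned in the note above.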
