A detailed analysis of the Flume-NG agent startup process.
Startup process
Flume's main function lives in Application.java. The flume-ng shell launch script starts Flume with java:
$EXEC $JAVA_HOME/bin/java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "$FLUME_CLASSPATH" -Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*
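The main function checks a series of arguments, the most important of which is no-reload-conf. The reload flag is true unless that option is given. Depending on reload, Flume decides whether to load the configuration file dynamically, and then calls start: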
List<LifecycleAware> components = Lists.newArrayList();
if (reload) {
    EventBus eventBus = new EventBus(agentName + "-event-bus");
    PollingPropertiesFileConfigurationProvider configurationProvider =
        new PollingPropertiesFileConfigurationProvider(
            agentName, configurationFile, eventBus, 30);
    components.add(configurationProvider);
    application = new Application(components);
    eventBus.register(application);
} else {
    PropertiesFileConfigurationProvider configurationProvider =
        new PropertiesFileConfigurationProvider(
            agentName, configurationFile);
    application = new Application();
    application.handleConfigurationEvent(configurationProvider
        .getConfiguration());
}
application.start();
There is a lot to explain here.
configurationProvider.getConfiguration() initializes the sources, channels, and sinks from the configuration file.
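To make the role of the flag concrete, here is a minimal sketch of how reload could be derived from the command line. It is not Flume's actual parsing code: the class ReloadFlagSketch, its method, and the exact option spellings it accepts are assumptions made for illustration. The only point it demonstrates is the inversion, namely that reloading is on unless no-reload-conf is present.

```java
import java.util.Arrays;

final class ReloadFlagSketch {

    /**
     * Returns true unless the no-reload-conf option appears in args.
     * The accepted spellings are an assumption of this sketch.
     */
    static boolean shouldReload(String[] args) {
        return Arrays.stream(args)
                .noneMatch(a -> a.equals("--no-reload-conf") || a.equals("-no-reload-conf"));
    }

    public static void main(String[] args) {
        System.out.println("reload = " + shouldReload(args));
    }
}
```

With reload true, the code above takes the EventBus branch; with reload false it loads the configuration once and calls handleConfigurationEvent directly.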
1.1 The LifecycleAware interface
A class that implements this interface has a lifecycle with a well-defined set of state transitions:
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface LifecycleAware {

    /**
     * <p>
     * Starts a service or component.
     * </p>
     * <p>
     * Implementations should determine the result of any start logic and effect
     * the return value of {@link #getLifecycleState()} accordingly.
     * </p>
     *
     * @throws LifecycleException
     * @throws InterruptedException
     */
    public void start();

    /**
     * <p>
     * Stops a service or component.
     * </p>
     * <p>
     * Implementations should determine the result of any stop logic and effect
     * the return value of {@link #getLifecycleState()} accordingly.
     * </p>
     *
     * @throws LifecycleException
     * @throws InterruptedException
     */
    public void stop();

    /**
     * <p>
     * Return the current state of the service or component.
     * </p>
     */
    public LifecycleState getLifecycleState();
}
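For example, here is the sample given in the interface's Javadoc (note that its start(Context) and stop(Context) signatures do not match the no-argument methods of the interface shown above):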
 * Example usage
 * </p>
 * <code>
 * public class MyService implements LifecycleAware {
 *
 *   private LifecycleState lifecycleState;
 *
 *   public MyService() {
 *     lifecycleState = LifecycleState.IDLE;
 *   }
 *
 *   @Override
 *   public void start(Context context) throws LifecycleException,
 *     InterruptedException {
 *
 *     ...your code does something.
 *
 *     lifecycleState = LifecycleState.START;
 *   }
 *
 *   @Override
 *   public void stop(Context context) throws LifecycleException,
 *     InterruptedException {
 *
 *     try {
 *       ...you stop services here.
 *     } catch (SomethingException) {
 *       lifecycleState = LifecycleState.ERROR;
 *     }
 *
 *     lifecycleState = LifecycleState.STOP;
 *   }
 *
 *   @Override
 *   public LifecycleState getLifecycleState() {
 *     return lifecycleState;
 *   }
 *
 * }
 * </code>
 */
Every Flume node and every Flume component (source, channel, sink) implements this interface.
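Since the Javadoc sample is only an outline, a compilable variant may help. The sketch below defines local stand-ins for LifecycleState and LifecycleAware so that it builds without Flume on the classpath; MyService is a hypothetical component, not a Flume class. Unlike the Javadoc sample, it uses the no-argument start() and stop() of the real interface, and it sets STOP inside the try block so that an ERROR state is not overwritten afterwards.

```java
final class LifecycleSketch {

    // Local stand-ins mirroring the shape of Flume's types (assumption: kept minimal on purpose).
    enum LifecycleState { IDLE, START, STOP, ERROR }

    interface LifecycleAware {
        void start();
        void stop();
        LifecycleState getLifecycleState();
    }

    /** Hypothetical component used only for illustration. */
    static final class MyService implements LifecycleAware {
        private volatile LifecycleState state = LifecycleState.IDLE;

        @Override
        public void start() {
            // Acquire resources here, then report START.
            state = LifecycleState.START;
        }

        @Override
        public void stop() {
            try {
                // Release resources here.
                state = LifecycleState.STOP;
            } catch (RuntimeException e) {
                // Report the failure instead of pretending the stop succeeded.
                state = LifecycleState.ERROR;
            }
        }

        @Override
        public LifecycleState getLifecycleState() {
            return state;
        }
    }
}
```

The state field is what a supervisor polls through getLifecycleState(), which is why each method updates it as its last step.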
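1.2 EventBus: event listening and the publish/subscribe pattern
Back to the code above: main checks whether no-reload-conf is set. If it is not set (reload is true), a new EventBus is created. EventBus is the event-handling mechanism of Guava (Google's Java library), an elegant implementation of the observer (publish/subscribe) design pattern.
With Guava, a class that wants to subscribe to events no longer has to implement a specific listener interface; it only needs to put the @Subscribe annotation on the handler method.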
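To show what annotation-driven subscription means mechanically, here is a tiny bus built only on the JDK. It is a sketch and not Guava's implementation: the Subscribe annotation, MiniBus, and ConfigListener below are all hypothetical names defined locally, and the real EventBus does considerably more (for example subscriber caching and exception handling). The idea it illustrates is that the bus finds handler methods by annotation and parameter type, so the listener needs no special interface.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class MiniBusSketch {

    /** Local stand-in for an annotation such as Guava's @Subscribe. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Subscribe {}

    /** Synchronous bus that dispatches on the runtime type of the posted event. */
    static final class MiniBus {
        private final List<Object> listeners = new ArrayList<>();

        void register(Object listener) {
            listeners.add(listener);
        }

        void post(Object event) {
            for (Object listener : listeners) {
                for (Method m : listener.getClass().getDeclaredMethods()) {
                    boolean matches = m.isAnnotationPresent(Subscribe.class)
                            && m.getParameterCount() == 1
                            && m.getParameterTypes()[0].isInstance(event);
                    if (!matches) {
                        continue;
                    }
                    try {
                        m.setAccessible(true);
                        m.invoke(listener, event);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException("handler failed: " + m, e);
                    }
                }
            }
        }
    }

    /** Listener that only carries the annotation; it implements no interface. */
    static final class ConfigListener {
        final List<String> seen = new ArrayList<>();

        @Subscribe
        void onConfig(String conf) {
            seen.add(conf);
        }
    }
}
```

Posting a String reaches onConfig, while posting an Integer is ignored because no annotated method accepts that type.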
Let's see how Flume uses it:
PollingPropertiesFileConfigurationProvider configurationProvider =
    new PollingPropertiesFileConfigurationProvider(
        agentName, configurationFile, eventBus, 30);
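The eventBus is passed to PollingPropertiesFileConfigurationProvider. On the one hand, this class extends PropertiesFileConfigurationProvider, so it is a provider of the configuration file; on the other hand, it implements LifecycleAware, so it has a lifecycle.
So what does it do during its lifecycle?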
@Override
public void start() {
    LOGGER.info("Configuration provider starting");
    Preconditions.checkState(file != null,
        "The parameter file must not be null");
    executorService = Executors.newSingleThreadScheduledExecutor(
        new ThreadFactoryBuilder().setNameFormat("conf-file-poller-%d")
            .build());
    FileWatcherRunnable fileWatcherRunnable =
        new FileWatcherRunnable(file, counterGroup);
    executorService.scheduleWithFixedDelay(fileWatcherRunnable, 0, interval,
        TimeUnit.SECONDS);
    lifecycleState = LifecycleState.START;
    LOGGER.debug("Configuration provider started");
}

@Override
public void stop() {
    LOGGER.info("Configuration provider stopping");
    executorService.shutdown();
    try {
        while (!executorService.awaitTermination(500, TimeUnit.MILLISECONDS)) {
            LOGGER.debug("Waiting for file watcher to terminate");
        }
    } catch (InterruptedException e) {
        LOGGER.debug("Interrupted while waiting for file watcher to terminate");
        Thread.currentThread().interrupt();
    }
    lifecycleState = LifecycleState.STOP;
    LOGGER.debug("Configuration provider stopped");
}
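As shown, start creates a single-threaded scheduled executor, executorService, which runs fileWatcherRunnable every 30 seconds (the interval passed to the constructor). FileWatcherRunnable watches the Flume configuration file: if the file's last-modified time has advanced, it publishes the freshly loaded configuration on the event bus, i.e. eventBus.post(getConfiguration()):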
@Override
public void run() {
    LOGGER.debug("Checking file:{} for changes", file);
    counterGroup.incrementAndGet("file.checks");
    long lastModified = file.lastModified();
    if (lastModified > lastChange) {
        LOGGER.info("Reloading configuration file:{}", file);
        counterGroup.incrementAndGet("file.loads");
        lastChange = lastModified;
        try {
            eventBus.post(getConfiguration());
        } catch (Exception e) {
            LOGGER.error("Failed to load configuration data. Exception follows.",
                e);
        } catch (NoClassDefFoundError e) {
            LOGGER.error("Failed to start agent because dependencies were not " +
                "found in classpath. Error follows.", e);
        } catch (Throwable t) {
            // caught because the caller does not handle or log Throwables
            LOGGER.error("Unhandled error", t);
        }
    }
}
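The polling logic in run() above boils down to comparing a timestamp with the last one seen. The following sketch isolates that comparison and wires it to a scheduled executor; FilePollSketch, ChangeDetector, and watch are hypothetical names, and the callback replaces the eventBus.post call. It assumes, as the code above suggests, that the remembered timestamp starts at zero, so the very first poll of an existing file also counts as a change.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class FilePollSketch {

    /** Remembers the newest modification time seen so far. */
    static final class ChangeDetector {
        private long lastChange = 0L;

        /** Returns true only if lastModified is newer than anything seen before. */
        synchronized boolean check(long lastModified) {
            if (lastModified > lastChange) {
                lastChange = lastModified;
                return true;
            }
            return false;
        }
    }

    /**
     * Polls the file with a fixed delay between runs and invokes onChange when it changed.
     * The caller is responsible for shutting the returned executor down.
     */
    static ScheduledExecutorService watch(File file, long intervalSeconds, Consumer<File> onChange) {
        ChangeDetector detector = new ChangeDetector();
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleWithFixedDelay(() -> {
            // File.lastModified() returns 0 for a missing file, so nothing fires in that case.
            if (detector.check(file.lastModified())) {
                onChange.accept(file);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
        return pool;
    }
}
```

Because only the modification time is compared, rewriting the file with identical content still triggers a reload, while a change that keeps the old timestamp would go unnoticed.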
Listeners interested in that event then handle it. Here, Flume's own Application is the listener for configuration changes:
eventBus.register(application);
// equivalent to
eventBus.register(listener);
In Application, the method annotated with @Subscribe defines how the event is handled once it occurs:
@Subscribe
public synchronized void handleConfigurationEvent(MaterializedConfiguration conf) {
    stopAllComponents();
    startAllComponents(conf);
}
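To summarize so far: if Flume is started without the no-reload-conf option, it loads the configuration file dynamically, checking it every 30 seconds and restarting all components when the file has been modified. If the option is given, the configuration is loaded only once, at startup.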
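The consequence of this design is that every configuration event is a full stop followed by a full start, never an incremental update. The sketch below mimics only that shape: RestartOnReloadSketch is a hypothetical class, a plain String stands in for the MaterializedConfiguration, and the log list replaces the real component shutdown and startup.

```java
import java.util.ArrayList;
import java.util.List;

final class RestartOnReloadSketch {

    /** Records what would be stopped and started, in order. */
    final List<String> log = new ArrayList<>();

    /** Stand-in for the currently running configuration; null before the first event. */
    private String current;

    /** Same shape as the subscriber method: stop everything old, then start everything new. */
    synchronized void handleConfigurationEvent(String conf) {
        if (current != null) {
            log.add("stop " + current);
        }
        current = conf;
        log.add("start " + conf);
    }
}
```

Even a one-line change to the configuration therefore restarts components that were not touched by the edit.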
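1.3 Before and after Application.start()
Application has two key methods: start() and handleConfigurationEvent(MaterializedConfiguration conf). handleConfigurationEvent is called directly at startup when reloading is disabled, or through the eventBus when reloading is enabled and the configuration file has changed.
The method first stops all components and then starts them all again. Flume's so-called dynamic loading is therefore not truly dynamic; it is better described as an automatic restart. The code (org.apache.flume.node.Application):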
@Subscribe
public synchronized void handleConfigurationEvent(MaterializedConfiguration conf) {
    stopAllComponents();
    startAllComponents(conf);
}
Step into stopAllComponents, which shuts down all components:
private void stopAllComponents() {
    if (this.materializedConfiguration != null) {
        logger.info("Shutting down configuration: {}", this.materializedConfiguration);
        for (Entry<String, SourceRunner> entry : this.materializedConfiguration.getSourceRunners().entrySet()) {
            try {
                logger.info("Stopping Source " + entry.getKey());
                supervisor.unsupervise(entry.getValue());
            } catch (Exception e) {
                logger.error("Error while stopping {}", entry.getValue(), e);
            }
        }
        for (Entry<String, SinkRunner> entry : this.materializedConfiguration.getSinkRunners().entrySet()) {
            try {
                logger.info("Stopping Sink " + entry.getKey());
                supervisor.unsupervise(entry.getValue());
            } catch (Exception e) {
                logger.error("Error while stopping {}", entry.getValue(), e);
            }
        }
        for (Entry<String, Channel> entry : this.materializedConfiguration.getChannels().entrySet()) {
            try {
                logger.info("Stopping Channel " + entry.getKey());
                supervisor.unsupervise(entry.getValue());
            } catch (Exception e) {
                logger.error("Error while stopping {}", entry.getValue(), e);
            }
        }
    }
    if (monitorServer != null) {
        monitorServer.stop();
    }
}
As shown, Flume stops components in the order source -> sink -> channel.
All of them are stopped through supervisor.unsupervise(entry.getValue()). Step into unsupervise:
public synchronized void unsupervise(LifecycleAware lifecycleAware) {
    Preconditions.checkState(supervisedProcesses.containsKey(lifecycleAware),
        "Unaware of " + lifecycleAware + " - can not unsupervise");
    logger.debug("Unsupervising service:{}", lifecycleAware);
    synchronized (lifecycleAware) {
        Supervisoree supervisoree = supervisedProcesses.get(lifecycleAware);
        supervisoree.status.discard = true;
        this.setDesiredState(lifecycleAware, LifecycleState.STOP);
        logger.info("Stopping component: {}", lifecycleAware);
        lifecycleAware.stop();
    }
    supervisedProcesses.remove(lifecycleAware);
    // We need to do this because a reconfiguration simply unsupervises old
    // components and supervises new ones.
    monitorFutures.get(lifecycleAware).cancel(false);
    // purges are expensive, so it is done only once every 2 hours.
    needToPurge = true;
    monitorFutures.remove(lifecycleAware);
}
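This method mainly removes the component and its monitoring task from the supervisor's in-memory maps. The call lifecycleAware.stop() runs the stop logic of the concrete component. LifecycleAware is a top-level interface that defines a component's start, stop, and current state; Flume's key components such as sources, sinks, and channels all implement it: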
public interface LifecycleAware {
    public void start();
    public void stop();
    public LifecycleState getLifecycleState();
}
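Through this interface, polymorphism lets each component run its own start and stop methods. That completes the analysis of stopping components; next comes the other method, startAllComponents: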
private void startAllComponents(MaterializedConfiguration materializedConfiguration) {
    logger.info("Starting new configuration:{}", materializedConfiguration);
    // Keep the configuration that was materialized from the configuration file
    this.materializedConfiguration = materializedConfiguration;
    // Start the channels first and wait until they are up, then the sinks, and finally the sources
    // Read the channels from materializedConfiguration
    for (Entry<String, Channel> entry : materializedConfiguration.getChannels().entrySet()) {
        try {
            logger.info("Starting Channel " + entry.getKey());
            supervisor.supervise(entry.getValue(), new SupervisorPolicy.AlwaysRestartPolicy(),
                LifecycleState.START);
        } catch (Exception e) {
            logger.error("Error while starting {}", entry.getValue(), e);
        }
    }
    /*
     * Wait for all channels to start.
     */
    for (Channel ch : materializedConfiguration.getChannels().values()) {
        while (ch.getLifecycleState() != LifecycleState.START && !supervisor.isComponentInErrorState(ch)) {
            try {
                logger.info("Waiting for channel: " + ch.getName() + " to start. Sleeping for 500 ms");
                Thread.sleep(500);
            } catch (InterruptedException e) {
                logger.error("Interrupted while waiting for channel to start.", e);
                Throwables.propagate(e);
            }
        }
    }
    for (Entry<String, SinkRunner> entry : materializedConfiguration.getSinkRunners().entrySet()) {
        try {
            logger.info("Starting Sink " + entry.getKey());
            supervisor.supervise(entry.getValue(), new SupervisorPolicy.AlwaysRestartPolicy(),
                LifecycleState.START);
        } catch (Exception e) {
            logger.error("Error while starting {}", entry.getValue(), e);
        }
    }
    for (Entry<String, SourceRunner> entry : materializedConfiguration.getSourceRunners().entrySet()) {
        try {
            logger.info("Starting Source " + entry.getKey());
            supervisor.supervise(entry.getValue(), new SupervisorPolicy.AlwaysRestartPolicy(),
                LifecycleState.START);
        } catch (Exception e) {
            logger.error("Error while starting {}", entry.getValue(), e);
        }
    }
    this.loadMonitoring();
}
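The start sequence in the method above, including the polling wait for channels, can be reduced to the following sketch. StartOrderSketch and SlowChannel are hypothetical names; the channel here simply reports IDLE for a configurable number of polls before reporting START, and the returned log stands in for the real supervise calls. A plausible reason for the ordering, which the code itself does not state, is that sinks and sources should never run against a channel that is not up yet.

```java
import java.util.ArrayList;
import java.util.List;

final class StartOrderSketch {

    enum State { IDLE, START }

    /** Hypothetical channel that needs a few polls before it reports START. */
    static final class SlowChannel {
        private int pollsUntilReady;

        SlowChannel(int pollsUntilReady) {
            this.pollsUntilReady = pollsUntilReady;
        }

        State getState() {
            return pollsUntilReady-- > 0 ? State.IDLE : State.START;
        }
    }

    /** Starts the channel, waits until it is up, then starts the sink and the source. */
    static List<String> startAll(SlowChannel channel, long pollMillis) {
        List<String> log = new ArrayList<>();
        log.add("start channel");
        while (channel.getState() != State.START) {
            log.add("wait");
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up instead of spinning forever.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for channel", e);
            }
        }
        log.add("start sink");
        log.add("start source");
        return log;
    }
}
```

Note that the real loop also exits when the supervisor marks the channel as being in the error state, which this sketch leaves out.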
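As shown, unlike the stop order, startup starts the channels first, waits for them to finish starting, then starts the sinks, and finally the sources. All three kinds of component are started through supervisor.supervise: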
// supervise puts the given component under monitoring
public synchronized void supervise(LifecycleAware lifecycleAware, SupervisorPolicy policy,
        LifecycleState desiredState) {
    if (this.monitorService.isShutdown() || this.monitorService.isTerminated()
            || this.monitorService.isTerminating()) {
        throw new FlumeException("Supervise called on " + lifecycleAware + " "
            + "after shutdown has been initiated. " + lifecycleAware + " will not" + " be started");
    }
    // A component may be supervised only once; checkState throws IllegalStateException otherwise
    Preconditions.checkState(!supervisedProcesses.containsKey(lifecycleAware),
        "Refusing to supervise " + lifecycleAware + " more than once");
    if (logger.isDebugEnabled()) {
        logger.debug("Supervising service:{} policy:{} desiredState:{}",
            new Object[] { lifecycleAware, policy, desiredState });
    }
    // Record the status information
    Supervisoree process = new Supervisoree();
    process.status = new Status();
    process.policy = policy;
    process.status.desiredState = desiredState;
    process.status.error = false;
    // MonitorRunnable is a task that periodically checks the component's state and corrects it,
    // e.g. if the component should be in START but has died, it is started again
    MonitorRunnable monitorRunnable = new MonitorRunnable();
    monitorRunnable.lifecycleAware = lifecycleAware; // the monitored object
    monitorRunnable.supervisoree = process;          // the monitored status
    monitorRunnable.monitorService = monitorService; // the monitoring thread pool
    // Add it to the map of supervised components
    supervisedProcesses.put(lifecycleAware, process);
    // Schedule the monitorRunnable, which holds the component and its status, to run every 3 seconds
    ScheduledFuture<?> future = monitorService.scheduleWithFixedDelay(monitorRunnable, 0, 3, TimeUnit.SECONDS);
    // Remember which scheduled task belongs to which LifecycleAware component
    monitorFutures.put(lifecycleAware, future);
}
Note that the call monitorService.scheduleWithFixedDelay(monitorRunnable, 0, 3, TimeUnit.SECONDS) above schedules a recurring task: monitorRunnable runs with a 3-second delay between executions. Here is its run method:
@Override
public void run() {
    logger.debug("checking process:{} supervisoree:{}", lifecycleAware, supervisoree);
    long now = System.currentTimeMillis();
    try {
        if (supervisoree.status.firstSeen == null) {
            logger.debug("first time seeing {}", lifecycleAware);
            // On the first run, record the current time (System.currentTimeMillis()) as firstSeen
            supervisoree.status.firstSeen = now;
        }
        supervisoree.status.lastSeen = now;
        synchronized (lifecycleAware) {
            // Skip components that have been discarded or are in the error state
            if (supervisoree.status.discard) {
                // Unsupervise has already been called on this.
                logger.info("Component has already been stopped {}", lifecycleAware);
                return;
            } else if (supervisoree.status.error) {
                logger.info(
                    "Component {} is in error state, and Flume will not" + "attempt to change its state",
                    lifecycleAware);
                return;
            }
            supervisoree.status.lastSeenState = lifecycleAware.getLifecycleState();
            // If the current state is not the desired one (e.g. desired START but currently STOP),
            // drive the component towards the desired state; otherwise do nothing.
            // The desired state is only ever START or STOP.
            if (!lifecycleAware.getLifecycleState().equals(supervisoree.status.desiredState)) {
                logger.debug("Want to transition {} from {} to {} (failures:{})",
                    new Object[] { lifecycleAware, supervisoree.status.lastSeenState,
                        supervisoree.status.desiredState, supervisoree.status.failures });
                switch (supervisoree.status.desiredState) {
                // Should be in START but is not: call the component's start method
                case START:
                    try {
                        lifecycleAware.start();
                    } catch (Throwable e) {
                        logger.error("Unable to start " + lifecycleAware + " - Exception follows.", e);
                        if (e instanceof Error) {
                            // This component can never recover, shut it
                            // down.
                            supervisoree.status.desiredState = LifecycleState.STOP;
                            try {
                                lifecycleAware.stop();
                                logger.warn(
                                    "Component {} stopped, since it could not be"
                                        + "successfully started due to missing dependencies",
                                    lifecycleAware);
                            } catch (Throwable e1) {
                                logger.error("Unsuccessful attempt to "
                                    + "shutdown component: {} due to missing dependencies."
                                    + " Please shutdown the agent"
                                    + "or disable this component, or the agent will be"
                                    + "in an undefined state.", e1);
                                supervisoree.status.error = true;
                                if (e1 instanceof Error) {
                                    throw (Error) e1;
                                }
                                // Set the state to stop, so that the
                                // conf poller can
                                // proceed.
                            }
                        }
                        supervisoree.status.failures++;
                    }
                    break;
                case STOP:
                    // Should be in STOP but is not: call the component's stop method
                    try {
                        lifecycleAware.stop();
                    } catch (Throwable e) {
                        logger.error("Unable to stop " + lifecycleAware + " - Exception follows.", e);
                        if (e instanceof Error) {
                            throw (Error) e;
                        }
                        supervisoree.status.failures++;
                    }
                    break;
                default:
                    logger.warn("I refuse to acknowledge {} as a desired state",
                        supervisoree.status.desiredState);
                }
                if (!supervisoree.policy.isValid(lifecycleAware, supervisoree.status)) {
                    logger.error("Policy {} of {} has been violated - supervisor should exit!",
                        supervisoree.policy, lifecycleAware);
                }
            }
        }
    } catch (Throwable t) {
        logger.error("Unexpected error", t);
    }
    logger.debug("Status check complete");
}
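Stripped of logging and error classification, the run() method above is a reconciliation loop: compare the actual state with the desired state and, if they differ, call start or stop and count failures. The sketch below shows one pass of that loop. ReconcileSketch, FlakyComponent, and reconcileOnce are hypothetical names, and the component is rigged to fail a fixed number of times so that the retry behaviour is visible.

```java
final class ReconcileSketch {

    enum State { IDLE, START, STOP, ERROR }

    /** Hypothetical component whose start() fails a given number of times before succeeding. */
    static final class FlakyComponent {
        private int failuresLeft;
        State state = State.IDLE;

        FlakyComponent(int failuresBeforeSuccess) {
            this.failuresLeft = failuresBeforeSuccess;
        }

        void start() {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IllegalStateException("not ready yet");
            }
            state = State.START;
        }

        void stop() {
            state = State.STOP;
        }
    }

    /**
     * One monitoring pass. Returns the updated failure count after trying to move
     * the component to the desired state (only START and STOP are acted upon).
     */
    static int reconcileOnce(FlakyComponent component, State desired, int failures) {
        if (component.state == desired) {
            return failures; // already in the desired state, nothing to do
        }
        try {
            if (desired == State.START) {
                component.start();
            } else if (desired == State.STOP) {
                component.stop();
            }
        } catch (RuntimeException e) {
            failures++; // the next scheduled pass will simply try again
        }
        return failures;
    }
}
```

Calling reconcileOnce repeatedly plays the role of the 3-second schedule: a component that fails twice reaches START on the third pass, and later passes leave it alone.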
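As shown, this task monitors the state of each component and, when the state is not the desired one, corrects it by starting or stopping the component. In addition, another task periodically clears scheduled tasks that are no longer needed out of the executor's queue: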
private class Purger implements Runnable {
    @Override
    public void run() {
        if (needToPurge) {
            // Remove cancelled java.util.concurrent.Future objects from the work queue (frees queue space).
            // After ScheduledFuture.cancel has been called, ScheduledThreadPoolExecutor.purge removes the cancelled tasks.
            monitorService.purge();
            needToPurge = false;
        }
    }
}
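Why a purge is needed at all can be checked with the JDK alone. The sketch below assumes the executor keeps its default remove-on-cancel policy (disabled): a cancelled task then stays in the work queue until its delay expires or purge() is called. PurgeSketch and cancelThenPurge are hypothetical names; the one-hour delay only serves to keep the task queued while the sizes are read.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PurgeSketch {

    /** Returns { queue size after cancel, queue size after purge }. */
    static int[] cancelThenPurge() {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        try {
            // A task far in the future, so it is still queued when we look at the queue.
            ScheduledFuture<?> future = pool.schedule(() -> { }, 1, TimeUnit.HOURS);
            future.cancel(false);
            int afterCancel = pool.getQueue().size(); // cancelled, but still occupying a slot
            pool.purge();
            int afterPurge = pool.getQueue().size();  // slot released
            return new int[] { afterCancel, afterPurge };
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] sizes = cancelThenPurge();
        System.out.println("after cancel: " + sizes[0] + ", after purge: " + sizes[1]);
    }
}
```

An alternative to periodic purging is setRemoveOnCancelPolicy(true) on the executor, which removes a task from the queue at cancellation time; the code shown above does not use it.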
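The component start, stop, and state-monitoring logic analyzed above lives in the LifecycleSupervisor class in the org.apache.flume.lifecycle package. This completes the analysis of how Flume components are stopped, started, and monitored.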