Preface
The previous post, a source-code walkthrough of Druid.io's index_realtime task, covered the overall flow of a running index_realtime task, but some details were not analyzed thoroughly at the source level. This post focuses on the hand off phase: what it is, and what work gets done in it.
It helps to start with the official documentation: https://druid.apache.org/docs/0.20.0/design/architecture.html#indexing-and-handoff
In short, hand off is this: when a realtime task finishes, the segment files it produced must be loaded onto historicals under the coordinator's direction. Once that happens, the realtime task ends, the new segment is served by a historical, and it becomes a historical segment.
Walking the flow with source code and logs
Straight to the point: the entire shutdown sequence of the task is shown in the figure below, including the so-called hand off phase. Using that figure together with the source and the logs, let's work through the concrete logic and flow.
First: a tranquility index_realtime task receives and processes data through an EventReceiverFirehoseFactory. The period during which data is accepted is the task's interval plus the windowPeriod. This is specific to tranquility's delivery model (tranquility pushes events to the task over HTTP), and once the window has passed, any further events are no longer processed. In the source, a scheduled thread closes the firehose created by the EventReceiverFirehoseFactory, and the task log states up front exactly when that shutdown will happen.
For example, take the task index_realtime_mv2_2020-11-13T08:00:00.000Z_0_0. Its log shows: 2020-11-13T09:00:00.000Z (the interval's end) + the 10-minute windowPeriod + tranquility's 5-minute shutdown delay = 2020-11-13T09:15:00.000Z (see io.druid.segment.realtime.firehose.TimedShutoffFirehoseFactory.TimedShutoffFirehose for the implementation).
And the task really does "shut off" right on schedule. The source:
```java
TimedShutoffFirehose(InputRowParser parser, File temporaryDirectory) throws IOException
{
  firehose = delegateFactory.connect(parser, temporaryDirectory);

  exec = Execs.scheduledSingleThreaded("timed-shutoff-firehose-%d");
  exec.schedule(
      new Runnable()
      {
        @Override
        public void run()
        {
          log.info("Closing delegate firehose.");
          shutdown = true;
          try {
            firehose.close();
          }
          catch (IOException e) {
            log.warn(e, "Failed to close delegate firehose, ignoring.");
          }
        }
      },
      shutoffTime.getMillis() - System.currentTimeMillis(),
      TimeUnit.MILLISECONDS
  );

  log.info("Firehose created, will shut down at: %s", shutoffTime);
}
```
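To make the arithmetic above concrete, here is a minimal, self-contained sketch (not Druid source; the class name is made up, and the 10-minute/5-minute values come from the example task above) of how that shutoff instant is derived:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Period;

public class ShutoffTimeSketch
{
  public static void main(String[] args)
  {
    // End of the task's hourly interval (the task covers 08:00-09:00 UTC).
    DateTime intervalEnd = new DateTime("2020-11-13T09:00:00.000Z", DateTimeZone.UTC);

    Period windowPeriod = Period.minutes(10); // tranquility windowPeriod
    Period gracePeriod = Period.minutes(5);   // tranquility's extra delay before shutoff

    // This is the instant TimedShutoffFirehose schedules its close against.
    DateTime shutoffTime = intervalEnd.plus(windowPeriod).plus(gracePeriod);
    System.out.println(shutoffTime); // 2020-11-13T09:15:00.000Z, matching the task log
  }
}
```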
Back to the task's main logic (see the previous post): io.druid.indexing.common.task.RealtimeIndexTask#run
```java
run()
{
  // ....
  try {
    toolbox.getDataSegmentServerAnnouncer().announce();
    toolbox.getDruidNodeAnnouncer().announce(discoveryDruidNode);

    plumber.startJob();

    // Set up metrics emission
    toolbox.getMonitorScheduler().addMonitor(metricsMonitor);

    // Firehose temporary directory is automatically removed when this RealtimeIndexTask completes.
    FileUtils.forceMkdir(firehoseTempDir);

    // Delay firehose connection to avoid claiming input resources while the plumber is starting up.
    final FirehoseFactory firehoseFactory = spec.getIOConfig().getFirehoseFactory();
    final boolean firehoseDrainableByClosing = isFirehoseDrainableByClosing(firehoseFactory);

    // Skip connecting firehose if we've been stopped before we got started.
    synchronized (this) {
      if (!gracefullyStopped) {
        firehose = firehoseFactory.connect(spec.getDataSchema().getParser(), firehoseTempDir);
        committerSupplier = Committers.supplierFromFirehose(firehose);
      }
    }

    // Time to read data!
    while (firehose != null && (!gracefullyStopped || firehoseDrainableByClosing) && firehose.hasMore()) {
      Plumbers.addNextRow(
          committerSupplier,
          firehose,
          plumber,
          tuningConfig.isReportParseExceptions(),
          metrics
      );
    }
  }
  catch (Throwable e) {
    normalExit = false;
    log.makeAlert(e, "Exception aborted realtime processing[%s]", dataSchema.getDataSource())
       .emit();
    throw e;
  }
  finally {
    if (normalExit) {
      try {
        // Persist if we had actually started.
        if (firehose != null) {
          log.info("Persisting remaining data.");

          final Committer committer = committerSupplier.get();
          final CountDownLatch persistLatch = new CountDownLatch(1);
          plumber.persist(
              new Committer()
              {
                @Override
                public Object getMetadata()
                {
                  return committer.getMetadata();
                }

                @Override
                public void run()
                {
                  try {
                    committer.run();
                  }
                  finally {
                    persistLatch.countDown();
                  }
                }
              }
          );
          persistLatch.await();
        }

        if (gracefullyStopped) {
          log.info("Gracefully stopping.");
        } else {
          log.info("Finishing the job.");
          synchronized (this) {
            if (gracefullyStopped) {
              // Someone called stopGracefully after we checked the flag. That's okay, just stop now.
              log.info("Gracefully stopping.");
            } else {
              finishingJob = true;
            }
          }

          if (finishingJob) {
            plumber.finishJob();
          }
        }
      }
      catch (InterruptedException e) {
        log.debug(e, "Interrupted while finishing the job");
      }
      catch (Exception e) {
        log.makeAlert(e, "Failed to finish realtime task").emit();
        throw e;
      }
      finally {
        if (firehose != null) {
          CloseQuietly.close(firehose);
        }
        toolbox.getMonitorScheduler().removeMonitor(metricsMonitor);
      }
    }

    toolbox.getDataSegmentServerAnnouncer().unannounce();
    toolbox.getDruidNodeAnnouncer().unannounce(discoveryDruidNode);
  }

  log.info("Job done!");
  return TaskStatus.success(getId());
}
```
While the firehose keeps delivering, the RealtimePlumber keeps processing data — that's the while loop above. Once the firehose is closed, control falls into the finally block; in the normal-exit path you'll see the task log line Persisting remaining data., which marks the final flush of whatever data is still sitting in memory.
Note that this final persist uses a CountDownLatch: the remaining data must be fully committed before the code moves on. The extracted snippet:
```java
final Committer committer = committerSupplier.get();
final CountDownLatch persistLatch = new CountDownLatch(1);
plumber.persist(
    new Committer()
    {
      @Override
      public Object getMetadata()
      {
        return committer.getMetadata();
      }

      @Override
      public void run()
      {
        try {
          committer.run();
        }
        finally {
          persistLatch.countDown();
        }
      }
    }
);
persistLatch.await();
```
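Stripped of the Druid types, the pattern here is simply: block the calling thread until an asynchronous commit callback has run. A minimal, runnable sketch (the names are illustrative, not Druid's):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PersistLatchSketch
{
  public static void main(String[] args) throws InterruptedException
  {
    ExecutorService persistExec = Executors.newSingleThreadExecutor();
    CountDownLatch persistLatch = new CountDownLatch(1);

    // Stand-in for plumber.persist(committer): the commit runnable executes
    // asynchronously once the in-memory data has been spilled to disk.
    persistExec.submit(() -> {
      try {
        System.out.println("persisting remaining in-memory rows ...");
      }
      finally {
        persistLatch.countDown(); // always release the waiter, even on failure
      }
    });

    persistLatch.await(); // the task thread proceeds only after the commit ran
    System.out.println("persist committed, moving on to finishJob()");
    persistExec.shutdown();
  }
}
```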
For the persist internals, see io.druid.segment.realtime.plumber.RealtimePlumber#persistHydrant and io.druid.segment.IndexMergerV9#persist; we won't expand on them here — just keep an eye on the corresponding task log lines.
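Even without tracing those methods line by line, the lifecycle they implement is worth internalizing: each persist spills the current in-memory index to an immutable on-disk chunk (a "hydrant"), and at the end all chunks for a sink are merged into one segment. A toy model of that idea, with entirely hypothetical names (a conceptual sketch, not Druid code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a sink: rows accumulate in memory, persist() spills them to an
// immutable chunk, and merge() combines all chunks into the final "segment".
public class ToySink
{
  private final List<String> inMemoryRows = new ArrayList<>();
  private final List<List<String>> persistedChunks = new ArrayList<>();

  public void add(String row)
  {
    inMemoryRows.add(row);
  }

  // Corresponds loosely to persistHydrant: spill and reset the in-memory index.
  public void persist()
  {
    persistedChunks.add(new ArrayList<>(inMemoryRows));
    inMemoryRows.clear();
  }

  // Corresponds loosely to the final merge step in persistAndMerge.
  public List<String> merge()
  {
    persist(); // flush whatever is still in memory
    List<String> merged = new ArrayList<>();
    for (List<String> chunk : persistedChunks) {
      merged.addAll(chunk);
    }
    return merged; // this is what would be pushed to deep storage
  }
}
```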
Next the code executes io.druid.segment.realtime.plumber.RealtimePlumber#finishJob, shown below. This is where the small segment files get their final merge and the hand off takes place:
```java
@Override
public void finishJob()
{
  log.info("Shutting down...");

  shuttingDown = true;

  for (final Map.Entry<Long, Sink> entry : sinks.entrySet()) {
    persistAndMerge(entry.getKey(), entry.getValue());
  }

  final long forceEndWaitTime = System.currentTimeMillis() + config.getHandoffConditionTimeout();
  while (!sinks.isEmpty()) {
    try {
      log.info(
          "Cannot shut down yet! Sinks remaining: %s",
          Joiner.on(", ").join(
              Iterables.transform(
                  sinks.values(),
                  new Function<Sink, String>()
                  {
                    @Override
                    public String apply(Sink input)
                    {
                      return input.getSegment().getIdentifier();
                    }
                  }
              )
          )
      );

      synchronized (handoffCondition) {
        while (!sinks.isEmpty()) {
          if (config.getHandoffConditionTimeout() == 0) {
            handoffCondition.wait();
          } else {
            long curr = System.currentTimeMillis();
            if (forceEndWaitTime - curr > 0) {
              handoffCondition.wait(forceEndWaitTime - curr);
            } else {
              throw new ISE(
                  "Segment handoff wait timeout. [%s] segments might not have completed handoff.",
                  sinks.size()
              );
            }
          }
        }
      }
    }
    catch (InterruptedException e) {
      throw Throwables.propagate(e);
    }
  }

  handoffNotifier.close();
  shutdownExecutors();

  stopped = true;

  if (!cleanShutdown) {
    throw new ISE("Exception occurred during persist and merge.");
  }
}
```
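Who wakes up that handoffCondition.wait()? The handoff callback registered in persistAndMerge (shown next) ends up in abandonSegment, which removes the sink and calls handoffCondition.notifyAll(). A self-contained sketch of that wait/notify handshake — the names mirror the source, but this is an illustration, not the actual Druid code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One thread (finishJob) waits on a monitor until the sink map drains; another
// thread (the handoff callback) removes sinks and notifies the waiter.
public class HandoffWaitSketch
{
  private final Map<Long, String> sinks = new ConcurrentHashMap<>();
  private final Object handoffCondition = new Object();

  public HandoffWaitSketch()
  {
    sinks.put(0L, "segment-0");
  }

  // What finishJob() does while sinks remain (the handoffConditionTimeout == 0 case).
  public void awaitHandoff() throws InterruptedException
  {
    synchronized (handoffCondition) {
      while (!sinks.isEmpty()) {
        handoffCondition.wait();
      }
    }
  }

  // What abandonSegment() does once the handoff callback fires.
  public void abandonSegment(long truncatedTime)
  {
    sinks.remove(truncatedTime);
    synchronized (handoffCondition) {
      handoffCondition.notifyAll();
    }
  }

  public static void main(String[] args) throws InterruptedException
  {
    HandoffWaitSketch sketch = new HandoffWaitSketch();
    new Thread(() -> sketch.abandonSegment(0L)).start();
    sketch.awaitHandoff(); // returns once the "handoff callback" has run
    System.out.println("all sinks handed off");
  }
}
```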
Back in finishJob, the crucial step is persistAndMerge (the core of the whole shutdown); see io.druid.segment.realtime.plumber.RealtimePlumber#persistAndMerge. The essential part:
```java
// In the mergeExecutor thread: push the merged segment to deep storage & the metadata store
log.info("Pushing [%s] to deep storage", sink.getSegment().getIdentifier());

DataSegment segment = dataSegmentPusher.push(
    mergedFile,
    sink.getSegment().withDimensions(IndexMerger.getMergedDimensionsFromQueryableIndexes(indexes)),
    false
);
log.info("Inserting [%s] to the metadata store", sink.getSegment().getIdentifier());
segmentPublisher.publishSegment(segment);

// Meanwhile, handoffNotifier registers a callback to watch for hand off completion
handoffNotifier.registerSegmentHandoffCallback(
    new SegmentDescriptor(sink.getInterval(), sink.getVersion(), config.getShardSpec().getPartitionNum()),
    mergeExecutor, new Runnable()
    {
      @Override
      public void run()
      {
        abandonSegment(sink.getInterval().getStartMillis(), sink);
        metrics.incrementHandOffCount();
      }
    }
);
```
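What does registerSegmentHandoffCallback buy us? Druid's CoordinatorBasedSegmentHandoffNotifier periodically asks the coordinator whether the segment is now served by a historical, and runs the registered callback once the answer is yes. A schematic sketch of that polling pattern (simplified; the predicate, names, and 60-second period are stand-ins, not the real coordinator client API):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class HandoffNotifierSketch
{
  private final ScheduledExecutorService scheduledExecutor =
      Executors.newSingleThreadScheduledExecutor();

  // Poll "is the segment served by a historical yet?" and fire the callback once true.
  public void registerSegmentHandoffCallback(Supplier<Boolean> servedByHistorical, Runnable callback)
  {
    AtomicBoolean done = new AtomicBoolean(false);
    scheduledExecutor.scheduleWithFixedDelay(() -> {
      if (!done.get() && servedByHistorical.get()) {
        done.set(true);
        callback.run();            // e.g. abandonSegment + incrementHandOffCount
        scheduledExecutor.shutdown();
      }
    }, 0, 60, TimeUnit.SECONDS);   // the real notifier polls on a configurable period
  }
}
```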
At this point the hand off is underway:

- During the metadata-store step, the new segment's metadata row is inserted into the DRUID_SEGMENTS table.
- The Coordinator has a scheduled task that reads segment metadata from the metadata store and coordinates loading the segments across the historicals (see io.druid.server.coordinator.helper.DruidCoordinatorRuleRunner#run).
- The Coordinator follows its own assignment logic. If it runs into trouble or gets stuck — misconfigured retention rules are a common culprit — you end up with the classic Still waiting for Handoff problem.
As the log screenshot below shows, when the Still waiting for Handoff problem occurs you will see that line logged over and over; in this screenshot it does eventually end with Segment Handoff complete.
At the end of io.druid.indexing.common.task.RealtimeIndexTask#run there are also the following two calls, the unannounce counterparts of the announce calls made at task startup (remember the peon announcing itself?). They clean up the task's ZooKeeper nodes: hand off is done, the segment now belongs to a historical, and the task is about to disappear.
```java
toolbox.getDataSegmentServerAnnouncer().unannounce();
toolbox.getDruidNodeAnnouncer().unannounce(discoveryDruidNode);
```
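Conceptually, announce creates an ephemeral znode for the task's server/node under Druid's announcement paths, and unannounce deletes it; once the znode is gone, the rest of the cluster stops routing realtime traffic to the peon. A schematic sketch with Apache Curator (the connect string and path are made up for illustration; the real paths come from Druid's druid.zk.paths.* configuration):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class AnnounceSketch
{
  public static void main(String[] args) throws Exception
  {
    CuratorFramework curator = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    curator.start();

    // Hypothetical path; Druid derives the real one from its zk config.
    String path = "/druid/announcements/peon-host:8100";

    // announce: ephemeral znode, vanishes automatically if the process dies
    curator.create().creatingParentsIfNeeded()
           .withMode(CreateMode.EPHEMERAL)
           .forPath(path, new byte[0]);

    // unannounce: explicit cleanup when the task finishes normally
    curator.delete().forPath(path);

    curator.close();
  }
}
```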
Summary
This post explained the hand off phase, walking through the flow with source code and logs. At minimum you should now understand what this phase is — and when you hit problems in practice, you'll know where to start digging.