1. The error is as follows; the relevant log output is:
This scheduler instance xxxx is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
ClusterManager: detected 1 failed or restarted instances.
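Analysis:
1. The log is printed through LocalDataSourceJobStore, but that class's source contains no such log statement. Searching up through its parent classes and interfaces leads to JobStoreSupport. The main source code is as follows: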
protected void clusterRecover(Connection conn, List<SchedulerStateRecord> failedInstances)
    throws JobPersistenceException {
    if (failedInstances.size() > 0) {
        long recoverIds = System.currentTimeMillis();
        logWarnIfNonZero(failedInstances.size(),
                "ClusterManager: detected " + failedInstances.size()
                + " failed or restarted instances.");
        // the remaining N lines of code are omitted
        // ....
    }
}
protected List<SchedulerStateRecord> findFailedInstances(Connection conn)
    throws JobPersistenceException {
    try {
        List<SchedulerStateRecord> failedInstances = new LinkedList<SchedulerStateRecord>();
        boolean foundThisScheduler = false;
        long timeNow = System.currentTimeMillis();
        List<SchedulerStateRecord> states = getDelegate().selectSchedulerStateRecords(conn, null);
        for (SchedulerStateRecord rec : states) {
            // find own record...
            if (rec.getSchedulerInstanceId().equals(getInstanceId())) {
                foundThisScheduler = true;
                if (firstCheckIn) {
                    failedInstances.add(rec);
                }
            } else {
                // find failed instances...
                if (calcFailedIfAfter(rec) < timeNow) {
                    failedInstances.add(rec);
                }
            }
        }
        // The first time through, also check for orphaned fired triggers.
        if (firstCheckIn) {
            failedInstances.addAll(findOrphanedFailedInstances(conn, states));
        }
        // If not the first time but we didn't find our own instance, then
        // (own record missing, and not the first check-in) another instance must have recovered us.
        if ((!foundThisScheduler) && (!firstCheckIn)) {
            // FUTURE_TODO: revisit when handle self-failed-out impl'ed (see FUTURE_TODO in clusterCheckIn() below)
            getLog().warn(
                "This scheduler instance (" + getInstanceId() + ") is still " +
                "active but was recovered by another instance in the cluster. " +
                "This may cause inconsistent behavior.");
        }
        return failedInstances;
    } catch (Exception e) {
        lastCheckin = System.currentTimeMillis();
        throw new JobPersistenceException("Failure identifying failed instances when checking-in: "
                + e.getMessage(), e);
    }
}
Note the calcFailedIfAfter method, which is called right under the // find failed instances... comment in findFailedInstances:
protected long calcFailedIfAfter(SchedulerStateRecord rec) {
    return rec.getCheckinTimestamp() +
        Math.max(rec.getCheckinInterval(),
                (System.currentTimeMillis() - lastCheckin)) +
        7500L;
}
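To make the arithmetic concrete, here is a minimal sketch that repeats the calcFailedIfAfter formula with the inputs passed in explicitly. The class name, the helper names, and all timestamps are hypothetical values chosen for illustration. It assumes each node stamps its check-in with its own local clock, so a peer whose clock runs 20 seconds behind writes a timestamp that looks 20 seconds older than it really is.

```java
class FailedIfAfterDemo {
    // Same formula as calcFailedIfAfter, with "now" and lastCheckin supplied by the caller.
    static long failedIfAfter(long checkinTimestamp, long checkinInterval,
                              long now, long lastCheckin) {
        return checkinTimestamp + Math.max(checkinInterval, now - lastCheckin) + 7500L;
    }

    // Mirrors the condition in findFailedInstances: calcFailedIfAfter(rec) < timeNow.
    static boolean looksFailed(long checkinTimestamp, long checkinInterval,
                               long now, long lastCheckin) {
        return failedIfAfter(checkinTimestamp, checkinInterval, now, lastCheckin) < now;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;         // hypothetical current time on the checking node (ms)
        long lastCheckin = 995_000L;   // the checking node last checked in 5 s ago
        long interval = 5_000L;        // hypothetical check-in interval (ms)

        // Peer with a synchronized clock that checked in 2 s ago:
        // 998000 + max(5000, 5000) + 7500 = 1010500, which is not < now.
        long inSync = 998_000L;
        System.out.println("in-sync peer failed? "
                + looksFailed(inSync, interval, now, lastCheckin));   // false

        // Peer that also checked in 2 s ago, but whose clock is 20 s behind:
        // 978000 + 5000 + 7500 = 990500, which is < now.
        long skewed = 978_000L;
        System.out.println("skewed peer failed?  "
                + looksFailed(skewed, interval, now, lastCheckin));   // true
    }
}
```

In this sketch the skewed peer is perfectly healthy; it is only its clock offset that pushes the computed deadline below the checking node's current time.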
Because the current machine's instance record was not found in the database, and this was not the first check-in, the following log is printed:
This scheduler instance xxxx is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
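At the same time, another node's check-in is judged to have timed out. An instance is added to failedInstances only when its check-in timestamp, plus the larger of its check-in interval and the time since this node's last check-in, plus a 7.5-second tolerance, is still earlier than the current time. When the system clocks differ by enough to exceed that margin, the condition is met. Because a node is then considered to have timed out, clusterRecover is called and prints the following log:
ClusterManager: detected 1 failed or restarted instances.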
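Both log lines can be reproduced together in a simplified model of findFailedInstances. This is only a sketch, not Quartz code: the State and Result classes, the node names, and the timestamps are hypothetical, and the orphaned-fired-triggers step of the real method is left out.

```java
import java.util.ArrayList;
import java.util.List;

class FindFailedSketch {
    // Hypothetical stand-in for SchedulerStateRecord.
    static final class State {
        final String instanceId;
        final long checkinTimestamp;
        final long checkinInterval;
        State(String instanceId, long checkinTimestamp, long checkinInterval) {
            this.instanceId = instanceId;
            this.checkinTimestamp = checkinTimestamp;
            this.checkinInterval = checkinInterval;
        }
    }

    // Hypothetical result holder: the failed ids plus the "recovered by another instance" flag.
    static final class Result {
        final List<String> failedIds = new ArrayList<>();
        boolean recoveredByOther;
    }

    static Result check(List<State> states, String selfId, boolean firstCheckIn,
                        long now, long lastCheckin) {
        Result result = new Result();
        boolean foundSelf = false;
        for (State s : states) {
            if (s.instanceId.equals(selfId)) {
                foundSelf = true;
                if (firstCheckIn) {
                    result.failedIds.add(s.instanceId);
                }
            } else {
                // Same formula as calcFailedIfAfter.
                long failedIfAfter = s.checkinTimestamp
                        + Math.max(s.checkinInterval, now - lastCheckin) + 7500L;
                if (failedIfAfter < now) {
                    result.failedIds.add(s.instanceId);
                }
            }
        }
        // Own record missing on a later check-in: the condition behind the first warning.
        result.recoveredByOther = !foundSelf && !firstCheckIn;
        return result;
    }

    public static void main(String[] args) {
        List<State> states = new ArrayList<>();
        states.add(new State("nodeB", 978_000L, 5_000L)); // stale-looking timestamp
        states.add(new State("nodeC", 998_000L, 5_000L)); // recent timestamp
        // "nodeA" (this node) has no row, and this is not its first check-in.
        Result r = check(states, "nodeA", false, 1_000_000L, 995_000L);
        if (r.recoveredByOther) {
            System.out.println("This scheduler instance (nodeA) is still active but was recovered by another instance in the cluster.");
        }
        System.out.println("ClusterManager: detected " + r.failedIds.size()
                + " failed or restarted instances.");
    }
}
```

With these inputs only nodeB falls below the threshold, so the sketch prints the warning followed by a count of 1, matching the two log lines in the report.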
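So this problem is mainly caused by the system clocks of the servers being out of sync; synchronizing the time across the servers in the cluster resolves it. I am still working through the source code, so if anything here is wrong, corrections are very welcome. Thank you!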