IndicesClusterStateService in Elasticsearch

1. Overview

IndicesClusterStateService is an implementation of the ClusterStateApplier interface: whenever a committed cluster state is applied, its ClusterStateApplier#applyClusterState is executed. As the cluster-state service for indices, it manages the indices and shards on the local node.
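
The ClusterStateApplier contract itself is a single callback:

//org.elasticsearch.cluster.ClusterStateApplier
public interface ClusterStateApplier {

    /**
     * Called when a new cluster state needs to be applied
     */
    void applyClusterState(ClusterChangedEvent event);
}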

2. applyClusterState

applyClusterState is where indices and shards are managed:

public synchronized void applyClusterState(final ClusterChangedEvent event) {
        if (lifecycle.started() == false) {
            return;
        }

        final ClusterState state = event.state();

        // we need to clean the shards and indices we have on this node, since we
        // are going to recover them again once state persistence is disabled (no master / not recovered)
        // TODO: feels hacky, a block disables state persistence, and then we clean the allocated shards, maybe another flag in blocks?
        if (state.blocks().disableStatePersistence()) {
            for (AllocatedIndex<? extends Shard> indexService : indicesService) {
                // also cleans shards
                indicesService.removeIndex(indexService.index(), NO_LONGER_ASSIGNED, "cleaning index (disabled block persistence)");
            }
            return;
        }

        updateFailedShardsCache(state);

        deleteIndices(event); // also deletes shards of deleted indices

        removeIndices(event); // also removes shards of removed indices

        failMissingShards(state);

        removeShards(state);   // removes any local shards that doesn't match what the master expects

        updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache

        createIndices(state);

        createOrUpdateShards(state);
    }

In outline, the flow is as follows: during applyClusterState, createOrUpdateShards and in turn createShard are invoked. Creating a shard is delegated to IndicesService, and ultimately IndexShard#startRecovery runs the recovery process for that specific shard, performing the recovery appropriate to its recovery source type.

The recovery source types are:

Type           | Description
EMPTY_STORE    | recover a primary as a new, empty shard
EXISTING_STORE | recover a primary from the local store
PEER           | recover a replica from the remote primary shard
SNAPSHOT       | recover from a snapshot
LOCAL_SHARDS   | recover from other shards of the source index on the local node (shrink/split/clone)
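
These are the values of the RecoverySource.Type enum, obtained in the code below via recoveryState.getRecoverySource().getType():

//org.elasticsearch.cluster.routing.RecoverySource
public enum Type {
    EMPTY_STORE,
    EXISTING_STORE,
    PEER,
    SNAPSHOT,
    LOCAL_SHARDS
}

IndexShard#startRecovery dispatches on this type:
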
switch (recoveryState.getRecoverySource().getType()) {
            case EMPTY_STORE:
            case EXISTING_STORE:
                executeRecovery("from store", recoveryState, recoveryListener, this::recoverFromStore);
                break;
            case PEER:
                try {
                    markAsRecovering("from " + recoveryState.getSourceNode(), recoveryState);
                    recoveryTargetService.startRecovery(this, recoveryState.getSourceNode(), recoveryListener);
                } catch (Exception e) {
                    failShard("corrupted preexisting index", e);
                    recoveryListener.onRecoveryFailure(recoveryState,
                        new RecoveryFailedException(recoveryState, null, e), true);
                }
                break;
            case SNAPSHOT:
                final String repo = ((SnapshotRecoverySource) recoveryState.getRecoverySource()).snapshot().getRepository();
                executeRecovery("from snapshot",
                    recoveryState, recoveryListener, l -> restoreFromRepository(repositoriesService.repository(repo), l));
                break;
            case LOCAL_SHARDS:
                final IndexMetadata indexMetadata = indexSettings().getIndexMetadata();
                final Index resizeSourceIndex = indexMetadata.getResizeSourceIndex();
                final List<IndexShard> startedShards = new ArrayList<>();
                final IndexService sourceIndexService = indicesService.indexService(resizeSourceIndex);
                final Set<ShardId> requiredShards;
                final int numShards;
                if (sourceIndexService != null) {
                    requiredShards = IndexMetadata.selectRecoverFromShards(shardId().id(),
                        sourceIndexService.getMetadata(), indexMetadata.getNumberOfShards());
                    for (IndexShard shard : sourceIndexService) {
                        if (shard.state() == IndexShardState.STARTED && requiredShards.contains(shard.shardId())) {
                            startedShards.add(shard);
                        }
                    }
                    numShards = requiredShards.size();
                } else {
                    numShards = -1;
                    requiredShards = Collections.emptySet();
                }

                if (numShards == startedShards.size()) {
                    assert requiredShards.isEmpty() == false;
                    executeRecovery("from local shards", recoveryState, recoveryListener,
                        l -> recoverFromLocalShards(mappingUpdateConsumer,
                            startedShards.stream().filter((s) -> requiredShards.contains(s.shardId())).collect(Collectors.toList()), l));
                } else {
                    final RuntimeException e;
                    if (numShards == -1) {
                        e = new IndexNotFoundException(resizeSourceIndex);
                    } else {
                        e = new IllegalStateException("not all required shards of index " + resizeSourceIndex
                            + " are started yet, expected " + numShards + " found " + startedShards.size() + " can't recover shard "
                            + shardId());
                    }
                    throw e;
                }
                break;
            default:
                throw new IllegalArgumentException("Unknown recovery source " + recoveryState.getRecoverySource());
        }

The actual recovery work is executed on the generic thread pool:

private void executeRecovery(String reason, RecoveryState recoveryState, PeerRecoveryTargetService.RecoveryListener recoveryListener,
                                 CheckedConsumer<ActionListener<Boolean>, Exception> action) {
        markAsRecovering(reason, recoveryState); // mark the shard as recovering on the cluster state thread
        threadPool.generic().execute(ActionRunnable.wrap(ActionListener.wrap(r -> {
                if (r) {
                    recoveryListener.onRecoveryDone(recoveryState, getTimestampRange());
                }
            },
            e -> recoveryListener.onRecoveryFailure(recoveryState, new RecoveryFailedException(recoveryState, null, e), true)), action));
    }

3. Primary shard recovery

A primary shard recovers from the translog: Lucene segments that have not yet been flushed to disk can be rebuilt from the translog.

Recovery goes through the following stages:

Stage        | Description
INIT         | recovery has not started yet
INDEX        | recover Lucene files and copy index data between nodes
VERIFY_INDEX | verify the index
TRANSLOG     | start the engine, replay the translog, and rebuild the Lucene index
FINALIZE     | cleanup
DONE         | finished
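
These stages are the values of RecoveryState.Stage, which the code below advances via recoveryState.setStage (a simplified excerpt; the real enum also carries a byte id for wire serialization):

//org.elasticsearch.indices.recovery.RecoveryState
public enum Stage {
    INIT, INDEX, VERIFY_INDEX, TRANSLOG, FINALIZE, DONE
}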

3.1 INIT

From the moment recovery starts, the shard is marked as being in the INIT stage; the stage is carried by the RecoveryState passed as a parameter to IndexShard#startRecovery. StoreRecovery#internalRecoverFromStore then performs the recovery.

IndexShard#prepareForIndexRecovery sets the stage to INDEX:

 public void prepareForIndexRecovery() {
        if (state != IndexShardState.RECOVERING) {
            throw new IndexShardNotRecoveringException(shardId, state);
        }
        recoveryState.setStage(RecoveryState.Stage.INDEX);
        assert currentEngineReference.get() == null;
    }

3.2 INDEX

StoreRecovery#internalRecoverFromStore reads the last committed segment info from Lucene, adds the file details to the index section of the recovery state, and marks the file details complete:

final Store store = indexShard.store();
si = store.readLastCommittedSegmentsInfo();
final RecoveryState.Index index = recoveryState.getIndex();
addRecoveredFileDetails(si, store, index);
index.setFileDetailsComplete();

3.3 VERIFY_INDEX

The INDEX in VERIFY_INDEX refers to the Lucene index. This stage verifies whether the current shard is corrupted; whether the check actually runs depends on the index.shard.check_on_startup setting, which takes the following values:

Value    | Description
false    | the default; do not check the shard for corruption when opening it
checksum | check for physical corruption
true     | check for both physical and logical corruption; this consumes a lot of memory and CPU
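
The setting is index-scoped and is typically set when the index is created. A minimal sketch (the setting name is real; the index name and the client-side request object here are just for illustration):

Settings settings = Settings.builder()
    .put("index.shard.check_on_startup", "checksum")
    .build();
CreateIndexRequest request = new CreateIndexRequest("my_index").settings(settings);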

3.4 TRANSLOG

A Lucene index consists of many segments, and a search traverses all of them. Lucene internally maintains a piece of metadata called the commit point, which describes exactly which segments the index currently contains; these segments have already been flushed from the operating system cache to disk by the fsync system call. Each commit operation flushes segments to disk to make them durable.

This stage replays the translog entries that are not yet reflected on disk. A snapshot is taken based on the information from the last commit to determine which translog operations need to be replayed; once replay finishes, the newly generated Lucene data is flushed to disk.

There are several Engine and EngineFactory implementations. Taking InternalEngineFactory and InternalEngine as the example, recoverFromTranslog recovers from the translog by internally calling recoverFromTranslogInternal:

private void recoverFromTranslogInternal(TranslogRecoveryRunner translogRecoveryRunner, long recoverUpToSeqNo) throws IOException {
        final int opsRecovered;
        final long localCheckpoint = getProcessedLocalCheckpoint();
        if (localCheckpoint < recoverUpToSeqNo) {
            try (Translog.Snapshot snapshot = translog.newSnapshot(localCheckpoint + 1, recoverUpToSeqNo)) {
                opsRecovered = translogRecoveryRunner.run(this, snapshot);
            } catch (Exception e) {
                throw new EngineException(shardId, "failed to recover from translog", e);
            }
        } else {
            opsRecovered = 0;
        }
        // flush if we recovered something or if we have references to older translogs
        // note: if opsRecovered == 0 and we have older translogs it means they are corrupted or 0 length.
        assert pendingTranslogRecovery.get() : "translogRecovery is not pending but should be";
        pendingTranslogRecovery.set(false); // we are good - now we can commit
        logger.trace(() -> new ParameterizedMessage(
                "flushing post recovery from translog: ops recovered [{}], current translog generation [{}]",
                opsRecovered, translog.currentFileGeneration()));
        flush(false, true);
        translog.trimUnreferencedReaders();
    }

translogRecoveryRunner is implemented as:

final Engine.TranslogRecoveryRunner translogRecoveryRunner = (engine, snapshot) -> {
            translogRecoveryStats.totalOperations(snapshot.totalOperations());
            translogRecoveryStats.totalOperationsOnStart(snapshot.totalOperations());
            return runTranslogRecovery(engine, snapshot, Engine.Operation.Origin.LOCAL_TRANSLOG_RECOVERY,
                translogRecoveryStats::incrementRecoveredOperations);
        };

It calls runTranslogRecovery, which iterates over all translog operations that need to be replayed and applies each write:

int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
                            Runnable onOperationRecovered) throws IOException {
        int opsRecovered = 0;
        Translog.Operation operation;
        while ((operation = snapshot.next()) != null) {
            try {
                logger.trace("[translog] recover op {}", operation);
                Engine.Result result = applyTranslogOperation(engine, operation, origin);
                switch (result.getResultType()) {
                    case FAILURE:
                        throw result.getFailure();
                    case MAPPING_UPDATE_REQUIRED:
                        throw new IllegalArgumentException("unexpected mapping update: " + result.getRequiredMappingUpdate());
                    case SUCCESS:
                        break;
                    default:
                        throw new AssertionError("Unknown result type [" + result.getResultType() + "]");
                }

                opsRecovered++;
                onOperationRecovered.run();
            } catch (Exception e) {
                // TODO: Don't enable this leniency unless users explicitly opt-in
                if (origin == Engine.Operation.Origin.LOCAL_TRANSLOG_RECOVERY && ExceptionsHelper.status(e) == RestStatus.BAD_REQUEST) {
                    // mainly for MapperParsingException and Failure to detect xcontent
                    logger.info("ignoring recovery of a corrupt translog entry", e);
                } else {
                    throw ExceptionsHelper.convertToRuntime(e);
                }
            }
        }
        return opsRecovered;
    }

applyTranslogOperation performs the actual write:

private Engine.Result applyTranslogOperation(Engine engine, Translog.Operation operation,
                                                 Engine.Operation.Origin origin) throws IOException {
        // If a translog op is replayed on the primary (eg. ccr), we need to use external instead of null for its version type.
        final VersionType versionType = (origin == Engine.Operation.Origin.PRIMARY) ? VersionType.EXTERNAL : null;
        final Engine.Result result;
        switch (operation.opType()) {
            case INDEX:
                final Translog.Index index = (Translog.Index) operation;
                // we set canHaveDuplicates to true all the time such that we de-optimze the translog case and ensure that all
                // autoGeneratedID docs that are coming from the primary are updated correctly.
                result = applyIndexOperation(engine, index.seqNo(), index.primaryTerm(), index.version(),
                    versionType, UNASSIGNED_SEQ_NO, 0, index.getAutoGeneratedIdTimestamp(), true, origin,
                    new SourceToParse(shardId.getIndexName(), index.id(), index.source(),
                        XContentHelper.xContentType(index.source()), index.routing()));
                break;
            case DELETE:
                final Translog.Delete delete = (Translog.Delete) operation;
                result = applyDeleteOperation(engine, delete.seqNo(), delete.primaryTerm(), delete.version(), delete.id(),
                    versionType, UNASSIGNED_SEQ_NO, 0, origin);
                break;
            case NO_OP:
                final Translog.NoOp noOp = (Translog.NoOp) operation;
                result = markSeqNoAsNoop(engine, noOp.seqNo(), noOp.primaryTerm(), noOp.reason(), origin);
                break;
            default:
                throw new IllegalStateException("No operation defined for [" + operation + "]");
        }
        return result;
    }

StoreRecovery#internalRecoverFromStore then calls indexShard.finalizeRecovery() to enter the FINALIZE stage.

3.5 FINALIZE

A refresh is executed, writing buffered data into segment files but not fsyncing them; the data stays in the operating system cache:

public void finalizeRecovery() {
        recoveryState().setStage(RecoveryState.Stage.FINALIZE);
        Engine engine = getEngine();
        engine.refresh("recovery_finalization");
        engine.config().setEnableGcDeletes(true);
    }

StoreRecovery#internalRecoverFromStore then calls indexShard.postRecovery to enter the DONE stage.

3.6 DONE

Before entering the DONE stage, refresh is executed once more, and then the shard state is updated:

public void postRecovery(String reason) throws IndexShardStartedException, IndexShardRelocatedException, IndexShardClosedException {
	synchronized (postRecoveryMutex) {
		// we need to refresh again to expose all operations that were index until now. Otherwise
		// we may not expose operations that were indexed with a refresh listener that was immediately
		// responded to in addRefreshListener. The refresh must happen under the same mutex used in addRefreshListener
		// and before moving this shard to POST_RECOVERY state (i.e., allow to read from this shard).
		getEngine().refresh("post_recovery");
		synchronized (mutex) {
			recoveryState.setStage(RecoveryState.Stage.DONE);
			changeState(IndexShardState.POST_RECOVERY, reason);
		}
	}
}

3.7 Handling the recovery result

After primary recovery finishes, its result is handled.

On success, IndicesClusterStateService.RecoveryListener#onRecoveryDone is executed:

public void onRecoveryDone(final RecoveryState state, ShardLongFieldRange timestampMillisFieldRange) {
	shardStateAction.shardStarted(
			shardRouting,
			primaryTerm,
			"after " + state.getRecoverySource(),
			timestampMillisFieldRange,
			SHARD_STATE_ACTION_LISTENER);
}

This sends an RPC request with action internal:cluster/shard/started to the master:

ShardStateAction#sendShardAction(SHARD_STARTED_ACTION_NAME, currentState, entry, listener)

private void sendShardAction(final String actionName, final ClusterState currentState,
                                 final TransportRequest request, final ActionListener<Void> listener) {
	ClusterStateObserver observer =
		new ClusterStateObserver(currentState, clusterService, null, logger, threadPool.getThreadContext());
	DiscoveryNode masterNode = currentState.nodes().getMasterNode();
	Predicate<ClusterState> changePredicate = MasterNodeChangePredicate.build(currentState);
	if (masterNode == null) {
		logger.warn("no master known for action [{}] for shard entry [{}]", actionName, request);
		waitForNewMasterAndRetry(actionName, observer, request, listener, changePredicate);
	} else {
		logger.debug("sending [{}] to [{}] for shard entry [{}]", actionName, masterNode.getId(), request);
		transportService.sendRequest(masterNode,
			actionName, request, new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
				@Override
				public void handleResponse(TransportResponse.Empty response) {
					listener.onResponse(null);
				}

				@Override
				public void handleException(TransportException exp) {
					if (isMasterChannelException(exp)) {
						waitForNewMasterAndRetry(actionName, observer, request, listener, changePredicate);
					} else {
						logger.warn(new ParameterizedMessage("unexpected failure while sending request [{}]" +
							" to [{}] for shard entry [{}]", actionName, masterNode, request), exp);
						listener.onFailure(exp instanceof RemoteTransportException ?
							(Exception) (exp.getCause() instanceof Exception ? exp.getCause() :
								new ElasticsearchException(exp.getCause())) : exp);
					}
				}
			});
	}
}

On failure, IndicesClusterStateService.RecoveryListener#onRecoveryFailure is executed:

public void onRecoveryFailure(RecoveryState state, RecoveryFailedException e, boolean sendShardFailure) {
	handleRecoveryFailure(shardRouting, sendShardFailure, e);
}

This calls IndicesClusterStateService#handleRecoveryFailure, whose main job is to remove the failed IndexShard and send an RPC request with action internal:cluster/shard/failure to the master:

private void failAndRemoveShard(ShardRouting shardRouting, boolean sendShardFailure, String message, @Nullable Exception failure,
                                    ClusterState state) {
	
	AllocatedIndex<? extends Shard> indexService = indicesService.indexService(shardRouting.shardId().getIndex());
	if (indexService != null) {
		Shard shard = indexService.getShardOrNull(shardRouting.shardId().id());
		if (shard != null && shard.routingEntry().isSameAllocation(shardRouting)) {
			indexService.removeShard(shardRouting.shardId().id(), message);
		}
	}
	
	if (sendShardFailure) {
		sendFailShard(shardRouting, message, failure, state);
	}
}

private void sendFailShard(ShardRouting shardRouting, String message, @Nullable Exception failure, ClusterState state) {
	failedShardsCache.put(shardRouting.shardId(), shardRouting);
	shardStateAction.localShardFailed(shardRouting, message, failure, SHARD_STATE_ACTION_LISTENER, state);
}

//ShardStateAction
public void localShardFailed(final ShardRouting shardRouting, final String message, @Nullable final Exception failure,
                                 ActionListener<Void> listener, final ClusterState currentState) {
	FailedShardEntry shardEntry = new FailedShardEntry(shardRouting.shardId(), shardRouting.allocationId().getId(),
		0L, message, failure, true);
	sendShardAction(SHARD_FAILED_ACTION_NAME, currentState, shardEntry, listener);
}

sendShardAction here is the same method already shown above for the shard-started case, invoked this time with SHARD_FAILED_ACTION_NAME.

4. Replica shard recovery

4.1 INIT

As with primary recovery, the shard is marked as being in the INIT stage from the moment recovery starts.

The replica kicks off its recovery via PeerRecoveryTargetService#startRecovery, which executes a RecoveryRunner on the generic thread pool:

//PeerRecoveryTargetService
public void startRecovery(final IndexShard indexShard, final DiscoveryNode sourceNode, final RecoveryListener listener) {
	final long recoveryId = onGoingRecoveries.startRecovery(indexShard, sourceNode, listener, recoverySettings.activityTimeout());
	threadPool.generic().execute(new RecoveryRunner(recoveryId));
}

4.2 INDEX

IndexShard#prepareForIndexRecovery sets the stage to INDEX.

The replica sends an RPC request with action internal:index/shard/recovery/start_recovery to the primary. The primary node's handling of this request is registered in PeerRecoverySourceService; the concrete handler is PeerRecoverySourceService.StartRecoveryTransportRequestHandler, which ultimately hands the work to RecoverySourceHandler#recoverToTarget.
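
The handler registered on the primary looks roughly like this (a simplified sketch from memory; recover builds a RecoverySourceHandler and calls recoverToTarget):

//PeerRecoverySourceService
class StartRecoveryTransportRequestHandler implements TransportRequestHandler<StartRecoveryRequest> {
	@Override
	public void messageReceived(final StartRecoveryRequest request, final TransportChannel channel, Task task) throws Exception {
		recover(request, new ChannelActionListener<>(channel, Actions.START_RECOVERY, request));
	}
}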

The replica waits for the primary to finish; when it handles the response, it marks the stage as DONE:

public void handleResponse(RecoveryResponse recoveryResponse) {
	final TimeValue recoveryTime = new TimeValue(timer.time());
	onGoingRecoveries.markRecoveryAsDone(recoveryId);
}

//RecoveriesCollection
public void markRecoveryAsDone(long id) {
	RecoveryTarget removed = onGoingRecoveries.remove(id);
	if (removed != null) {
		removed.markAsDone();
	}
}
//RecoveryTarget
public void markAsDone() {
	if (finished.compareAndSet(false, true)) {
		try {
			indexShard.postRecovery("peer recovery done");
		} finally {
			decRef();
		}
		listener.onRecoveryDone(state(), indexShard.getTimestampRange());
	}
}

4.3 VERIFY_INDEX

When the replica receives the primary's internal:index/shard/recovery/clean_files RPC request, it is handled by CleanFilesRequestHandler; RecoveryTarget#cleanFiles calls indexShard.maybeCheckIndex(), which enters the VERIFY_INDEX stage:

//PeerRecoveryTargetService
class CleanFilesRequestHandler implements TransportRequestHandler<RecoveryCleanFilesRequest> {

	@Override
	public void messageReceived(RecoveryCleanFilesRequest request, TransportChannel channel, Task task) throws Exception {
		try (RecoveryRef recoveryRef = onGoingRecoveries.getRecoverySafe(request.recoveryId(), request.shardId())) {
			final ActionListener<Void> listener = createOrFinishListener(recoveryRef, channel, Actions.CLEAN_FILES, request);
			if (listener == null) {
				return;
			}

			recoveryRef.target().cleanFiles(request.totalTranslogOps(), request.getGlobalCheckpoint(), request.sourceMetaSnapshot(),
				listener.delegateFailure((l, r) -> {
						Releasable reenableMonitor = recoveryRef.target().disableRecoveryMonitor();
						recoveryRef.target().indexShard().afterCleanFiles(() -> {
							reenableMonitor.close();
							l.onResponse(null);
						});
					}));
		}
	}
}

//RecoveryTarget
public void cleanFiles(int totalTranslogOps, long globalCheckpoint, Store.MetadataSnapshot sourceMetadata,
                           ActionListener<Void> listener) {
	ActionListener.completeWith(listener, () -> {
		state().getTranslog().totalOperations(totalTranslogOps);
		multiFileWriter.renameAllTempFiles();
		final Store store = store();
		store.incRef();
		try {
			store.cleanupAndVerify("recovery CleanFilesRequestHandler", sourceMetadata);
			final String translogUUID = Translog.createEmptyTranslog(
				indexShard.shardPath().resolveTranslog(), globalCheckpoint, shardId, indexShard.getPendingPrimaryTerm());
			store.associateIndexWithNewTranslog(translogUUID);

			if (indexShard.getRetentionLeases().leases().isEmpty()) {
				// a fresh shard: persist an (initially empty) retention lease file
				indexShard.persistRetentionLeases();
			}
			indexShard.maybeCheckIndex();
			state().setStage(RecoveryState.Stage.TRANSLOG);
		} finally {
			store.decRef();
		}
		return null;
	});
}
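
indexShard.maybeCheckIndex() sets the stage to VERIFY_INDEX and only performs the actual check when index.shard.check_on_startup asks for it (roughly; details vary by version):

//IndexShard
public void maybeCheckIndex() {
	recoveryState.setStage(RecoveryState.Stage.VERIFY_INDEX);
	if (Booleans.isTrue(checkIndexOnStartup) || "checksum".equals(checkIndexOnStartup)) {
		try {
			checkIndex();
		} catch (IOException ex) {
			throw new RecoveryFailedException(recoveryState, "check index failed", ex);
		}
	}
}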

4.4 TRANSLOG

At the end of RecoveryTarget#cleanFiles the recovery thus enters the TRANSLOG stage. The primary then sends an RPC request with action internal:index/shard/recovery/prepare_translog to the replica:

//RemoteRecoveryTargetHandler
public void prepareForTranslogOperations(int totalTranslogOps, ActionListener<Void> listener) {
	final String action = PeerRecoveryTargetService.Actions.PREPARE_TRANSLOG;
	final long requestSeqNo = requestSeqNoGenerator.getAndIncrement();
	final RecoveryPrepareForTranslogOperationsRequest request =
		new RecoveryPrepareForTranslogOperationsRequest(recoveryId, requestSeqNo, shardId, totalTranslogOps);
	final Writeable.Reader<TransportResponse.Empty> reader = in -> TransportResponse.Empty.INSTANCE;
	executeRetryableAction(action, request, standardTimeoutRequestOptions, listener.map(r -> null), reader);
}

After receiving the request, the replica calls IndexShard#openEngineAndSkipTranslogRecovery, which creates a new Engine and skips the engine's own translog recovery. At this point the primary's phase2 has not started yet; for the rest of the TRANSLOG stage the replica waits for the primary to send translog operations over for replay:

//RecoveryTarget
public void prepareForTranslogOperations(int totalTranslogOps, ActionListener<Void> listener) {
	ActionListener.completeWith(listener, () -> {
		state().getIndex().setFileDetailsComplete(); // ops-based recoveries don't send the file details
		state().getTranslog().totalOperations(totalTranslogOps);
		indexShard().openEngineAndSkipTranslogRecovery();
		return null;
	});
}
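
openEngineAndSkipTranslogRecovery itself looks roughly like this (a sketch from memory; details vary by version):

//IndexShard
public void openEngineAndSkipTranslogRecovery() throws IOException {
	recoveryState.validateCurrentStage(RecoveryState.Stage.TRANSLOG);
	loadGlobalCheckpointToReplicationTracker();
	innerOpenEngineAndTranslog(replicationTracker);
	getEngine().skipTranslogRecovery();
}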

The primary then sends internal:index/shard/recovery/translog_ops RPC requests; this is the primary's phase2. The replica handles them with TranslogOperationsRequestHandler, replaying the operations sent over by the primary:

class TranslogOperationsRequestHandler implements TransportRequestHandler<RecoveryTranslogOperationsRequest> {

	@Override
	public void messageReceived(final RecoveryTranslogOperationsRequest request, final TransportChannel channel,
								Task task) throws IOException {
		try (RecoveryRef recoveryRef = onGoingRecoveries.getRecoverySafe(request.recoveryId(), request.shardId())) {
			final RecoveryTarget recoveryTarget = recoveryRef.target();
			final ActionListener<Void> listener = createOrFinishListener(recoveryRef, channel, Actions.TRANSLOG_OPS, request,
				nullVal -> new RecoveryTranslogOperationsResponse(recoveryTarget.indexShard().getLocalCheckpoint()));
			if (listener == null) {
				return;
			}

			performTranslogOps(request, listener, recoveryRef);
		}
	}

	private void performTranslogOps(final RecoveryTranslogOperationsRequest request, final ActionListener<Void> listener,
									final RecoveryRef recoveryRef) {
		final RecoveryTarget recoveryTarget = recoveryRef.target();

		final ClusterStateObserver observer = new ClusterStateObserver(clusterService, null, logger, threadPool.getThreadContext());
		final Consumer<Exception> retryOnMappingException = exception -> {
			// in very rare cases a translog replay from primary is processed before a mapping update on this node
			// which causes local mapping changes since the mapping (clusterstate) might not have arrived on this node.
			logger.debug("delaying recovery due to missing mapping changes", exception);
			// we do not need to use a timeout here since the entire recovery mechanism has an inactivity protection (it will be
			// canceled)
			observer.waitForNextChange(new ClusterStateObserver.Listener() {
				@Override
				public void onNewClusterState(ClusterState state) {
					threadPool.generic().execute(ActionRunnable.wrap(listener, l -> {
						try (RecoveryRef recoveryRef = onGoingRecoveries.getRecoverySafe(request.recoveryId(), request.shardId())) {
							performTranslogOps(request, listener, recoveryRef);
						}
					}));
				}

				@Override
				public void onClusterServiceClose() {
					listener.onFailure(new ElasticsearchException(
						"cluster service was closed while waiting for mapping updates"));
				}

				@Override
				public void onTimeout(TimeValue timeout) {
					// note that we do not use a timeout (see comment above)
					listener.onFailure(new ElasticsearchTimeoutException("timed out waiting for mapping updates " +
						"(timeout [" + timeout + "])"));
				}
			});
		};
		final IndexMetadata indexMetadata = clusterService.state().metadata().index(request.shardId().getIndex());
		final long mappingVersionOnTarget = indexMetadata != null ? indexMetadata.getMappingVersion() : 0L;
		recoveryTarget.indexTranslogOperations(
			request.operations(),
			request.totalTranslogOps(),
			request.maxSeenAutoIdTimestampOnPrimary(),
			request.maxSeqNoOfUpdatesOrDeletesOnPrimary(),
			request.retentionLeases(),
			request.mappingVersionOnPrimary(),
			ActionListener.wrap(
				checkpoint -> listener.onResponse(null),
				e -> {
					// do not retry if the mapping on replica is at least as recent as the mapping
					// that the primary used to index the operations in the request.
					if (mappingVersionOnTarget < request.mappingVersionOnPrimary() && e instanceof MapperException) {
						retryOnMappingException.accept(e);
					} else {
						listener.onFailure(e);
					}
				})
		);
	}
}

4.5 FINALIZE

After the primary finishes phase2, it calls RemoteRecoveryTargetHandler#finalizeRecovery to send an RPC request with action internal:index/shard/recovery/finalize to the replica:

//RemoteRecoveryTargetHandler
public void finalizeRecovery(final long globalCheckpoint, final long trimAboveSeqNo, final ActionListener<Void> listener) {
	final String action = PeerRecoveryTargetService.Actions.FINALIZE;
	final long requestSeqNo = requestSeqNoGenerator.getAndIncrement();
	final RecoveryFinalizeRecoveryRequest request =
		new RecoveryFinalizeRecoveryRequest(recoveryId, requestSeqNo, shardId, globalCheckpoint, trimAboveSeqNo);
	final Writeable.Reader<TransportResponse.Empty> reader = in -> TransportResponse.Empty.INSTANCE;
	executeRetryableAction(action, request, TransportRequestOptions.timeout(recoverySettings.internalActionLongTimeout()),
			listener.map(r -> null), reader);
}

The replica's handler is FinalizeRecoveryRequestHandler: it first updates the global checkpoint and then performs the same finalization as the primary:

class FinalizeRecoveryRequestHandler implements TransportRequestHandler<RecoveryFinalizeRecoveryRequest> {

	@Override
	public void messageReceived(RecoveryFinalizeRecoveryRequest request, TransportChannel channel, Task task) throws Exception {
		try (RecoveryRef recoveryRef = onGoingRecoveries.getRecoverySafe(request.recoveryId(), request.shardId())) {
			final ActionListener<Void> listener = createOrFinishListener(recoveryRef, channel, Actions.FINALIZE, request);
			if (listener == null) {
				return;
			}

			recoveryRef.target().finalizeRecovery(request.globalCheckpoint(), request.trimAboveSeqNo(), listener);
		}
	}
}

//RecoveryTarget
public void finalizeRecovery(final long globalCheckpoint, final long trimAboveSeqNo, ActionListener<Void> listener) {
	ActionListener.completeWith(listener, () -> {
		indexShard.updateGlobalCheckpointOnReplica(globalCheckpoint, "finalizing recovery");
		// Persist the global checkpoint.
		indexShard.sync();
		indexShard.persistRetentionLeases();
		if (trimAboveSeqNo != SequenceNumbers.UNASSIGNED_SEQ_NO) {
			// We should erase all translog operations above trimAboveSeqNo as we have received either the same or a newer copy
			// from the recovery source in phase2. Rolling a new translog generation is not strictly required here for we won't
			// trim the current generation. It's merely to satisfy the assumption that the current generation does not have any
			// operation that would be trimmed (see TranslogWriter#assertNoSeqAbove). This assumption does not hold for peer
			// recovery because we could have received operations above startingSeqNo from the previous primary terms.
			indexShard.rollTranslogGeneration();
			// the flush or translog generation threshold can be reached after we roll a new translog
			indexShard.afterWriteOperation();
			indexShard.trimOperationOfPreviousPrimaryTerms(trimAboveSeqNo);
		}
		if (hasUncommittedOperations()) {
			indexShard.flush(new FlushRequest().force(true).waitIfOngoing(true));
		}
		indexShard.finalizeRecovery();
		return null;
	});
}

4.6 Primary-side processing during replica recovery

On the primary side, the work is divided into two phases: phase1 copies the Lucene segment files that the replica is missing, and phase2 sends the translog operations that must be replayed on the replica.
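
A hypothetical, compilable sketch of that source-side flow in RecoverySourceHandler#recoverToTarget (phase1, prepareTargetForTranslog, phase2 and finalizeRecovery are real steps in that class; the signatures and control flow below are simplified for illustration):

class RecoverySourceFlowSketch {
	void recoverToTarget() {
		// phase1: copy the Lucene segment files the replica is missing,
		// ending with the clean_files request (skipped when an
		// operation-based recovery is possible)
		phase1();
		// prepare_translog: the replica opens a new engine and skips its
		// own translog recovery
		prepareTargetForTranslog();
		// phase2: stream translog operations (translog_ops requests) so
		// that writes which happened during the file copy are replayed
		phase2();
		// finalize: send the global checkpoint; the replica trims and flushes
		finalizeRecovery();
	}

	void phase1() {}
	void prepareTargetForTranslog() {}
	void phase2() {}
	void finalizeRecovery() {}
}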
