[elasticsearch] Elasticsearch source code analysis, part 1: the replica recovery flow

莫薇

Last modified 2022-02-17 17:54:33

Category: elasticsearch. Tags: elasticsearch, search engine, big data

First published 2022-01-20 18:00:34

Article link: https://blog.csdn.net/jing_flower/article/details/122607147

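This article takes a close look at Elasticsearch's replica recovery flow. Starting from createOrUpdateShards in IndicesClusterStateService, it covers the routing table and routing nodes, then focuses on how a replica shard recovers via PEER recovery, including the roles of primary terms and sequence numbers and how the global checkpoint and local checkpoint help keep data consistent. During recovery, syncing the segment files and replaying the translog are the key steps that keep the data complete and consistent.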

es 7.7
Request flow for a replica shard. Entry point: IndicesClusterStateService.createOrUpdateShards

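routing table: GET _cluster/state/routing_table. It stores the "index -> shard" mapping, i.e. which node holds each shard of each index.

routing nodes: GET _cluster/state/routing_nodes. It stores the "node -> shard" mapping, i.e. which shards of which indices each node holds.
createOrUpdateShards first obtains the routing information for the local node (in practice its ShardRouting entries), i.e. it uses the routing nodes to get the shards assigned to the current node. For each shard it looks up the owning Index and then the IndexService, and checks whether the shard already exists locally. If it does not exist, it calls createShard; otherwise it calls updateShard. createShard mainly handles shards in the INITIALIZING state, so shard recovery also goes through this path.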
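To make the two views concrete, here is a minimal sketch that builds both of them from the same list of shard copies. The class and field names (RoutingViewsSketch, ShardCopy) are hypothetical and only model the idea; they are not the Elasticsearch RoutingTable or RoutingNodes classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model; these are not the Elasticsearch classes.
final class RoutingViewsSketch {

    /** One shard copy and the node it is assigned to. */
    static final class ShardCopy {
        final String index;
        final int shardId;
        final boolean primary;
        final String nodeId;

        ShardCopy(String index, int shardId, boolean primary, String nodeId) {
            this.index = index;
            this.shardId = shardId;
            this.primary = primary;
            this.nodeId = nodeId;
        }
    }

    /** "index -> shard copies" view, analogous to the routing table. */
    static Map<String, List<ShardCopy>> byIndex(List<ShardCopy> copies) {
        Map<String, List<ShardCopy>> view = new TreeMap<>();
        for (ShardCopy copy : copies) {
            view.computeIfAbsent(copy.index, k -> new ArrayList<>()).add(copy);
        }
        return view;
    }

    /** "node -> shard copies" view, analogous to the routing nodes. */
    static Map<String, List<ShardCopy>> byNode(List<ShardCopy> copies) {
        Map<String, List<ShardCopy>> view = new TreeMap<>();
        for (ShardCopy copy : copies) {
            view.computeIfAbsent(copy.nodeId, k -> new ArrayList<>()).add(copy);
        }
        return view;
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy("test", 0, true, "node-1"),
            new ShardCopy("test", 0, false, "node-2"),
            new ShardCopy("test", 1, true, "node-2"));
        System.out.println("indices: " + byIndex(copies).keySet());
        System.out.println("shard copies on node-2: " + byNode(copies).get("node-2").size());
    }
}
```

A node that only cares about its own work, as createOrUpdateShards does, reads the per-node view and never has to scan every index.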

org.elasticsearch.indices.cluster.IndicesClusterStateService

    private void createOrUpdateShards(final ClusterState state) {
        // If the local node is not in the routing nodes, it has no shards to create or update, so skip
        RoutingNode localRoutingNode = state.getRoutingNodes().node(state.nodes().getLocalNodeId());
        if (localRoutingNode == null) {
            return;
        }

        DiscoveryNodes nodes = state.nodes();
        RoutingTable routingTable = state.routingTable();

        // Iterate over every shard assigned to the local node
        for (final ShardRouting shardRouting : localRoutingNode) {
            ShardId shardId = shardRouting.shardId();
            if (failedShardsCache.containsKey(shardId) == false) {
                AllocatedIndex<? extends Shard> indexService = indicesService.indexService(shardId.getIndex());
                // the IndexService must already exist at this point
                assert indexService != null : "index " + shardId.getIndex() + " should have been created by createIndices";
                Shard shard = indexService.getShardOrNull(shardId.id());
                if (shard == null) {
                    // the shard does not exist locally yet: it must be INITIALIZING, so create it
                    assert shardRouting.initializing() : shardRouting + " should have been removed by failMissingShards";
                    createShard(nodes, routingTable, shardRouting, state); // entry point for replica recovery
                } else {
                    // the shard already exists: update it
                    updateShard(nodes, shardRouting, shard, routingTable, state);
                }
            }
        }
    }

IndicesClusterStateService.createShard checks the recovery source type of the shardRouting. If the recovery type is PEER, it first looks for a source node; it then calls indicesService.createShard.

The possible recovery types of a shardRouting are:

EMPTY_STORE: recovery from an empty store
EXISTING_STORE: recovery from an existing store
PEER: recovery from a primary on another node (replica shards take this path)
SNAPSHOT: recovery from a snapshot
LOCAL_SHARDS: recovery from other shards of another index on the same node
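The list above can be summarized as a small enum. This is a hypothetical mirror written for illustration (RecoveryTypeSketch and needsSourceNode are made-up names), not the Elasticsearch enum itself; it only captures the point that PEER is the one type for which createShard looks up a source node.

```java
// Hypothetical mirror of the five recovery types listed above; not the Elasticsearch enum.
enum RecoveryTypeSketch {
    EMPTY_STORE,
    EXISTING_STORE,
    PEER,
    SNAPSHOT,
    LOCAL_SHARDS;

    /** Only PEER recovery copies data from a shard copy on another node. */
    boolean needsSourceNode() {
        return this == PEER;
    }

    public static void main(String[] args) {
        for (RecoveryTypeSketch type : values()) {
            System.out.println(type + " needs source node: " + type.needsSourceNode());
        }
    }
}
```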

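A primary shard mostly recovers on its own from its translog: segments that had not yet been flushed to disk can be rebuilt from the translog.

Replica shards go through peer recovery.

Why peer?

1) If the primary and the replica started recovering at the same time, would a primary still have to be chosen? So the replica simply waits until the primary has recovered and then compares itself against the primary, which would be why it uses peer recovery?

2) Is peer the only recovery type that compares against the primary?

How does a node tell primary from replica? Is it recorded locally? (In the code below the check is shardRouting.primary(), i.e. the flag comes from the shard routing in the cluster state.)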

org.elasticsearch.indices.cluster.IndicesClusterStateService

    private void createShard(DiscoveryNodes nodes, RoutingTable routingTable, ShardRouting shardRouting, ClusterState state) {
        assert shardRouting.initializing() : "only allow shard creation for initializing shard but was " + shardRouting;

        DiscoveryNode sourceNode = null;
        if (shardRouting.recoverySource().getType() == Type.PEER) { // check the recovery type
            sourceNode = findSourceNodeForPeerRecovery(logger, routingTable, nodes, shardRouting); // find the source node
            if (sourceNode == null) {
                logger.trace("ignoring initializing shard {} - no source node can be found.", shardRouting.shardId());
                return;
            }
        }

        try {
            final long primaryTerm = state.metaData().index(shardRouting.index()).primaryTerm(shardRouting.id());
            logger.debug("{} creating shard with primary term [{}]", shardRouting.shardId(), primaryTerm);
            // holds the recovery information: current stage, primary flag, shard ID, source node, target node
            RecoveryState recoveryState = new RecoveryState(shardRouting, nodes.getLocalNode(), sourceNode);
            indicesService.createShard( // entry point
                    shardRouting,
                    recoveryState,
                    recoveryTargetService,
                    new RecoveryListener(shardRouting, primaryTerm), // callback for when the recovery finishes or fails
                    repositoriesService,     // service responsible for snapshot/restore
                    failedShardHandler,      // callback for when the shard fails
                    globalCheckpointSyncer,  // callback for when the shard needs to sync the global checkpoint
                    retentionLeaseSyncer);   // callback for when the shard needs to sync retention leases
        } catch (Exception e) {
            failAndRemoveShard(shardRouting, true, "failed to create shard", e, state);
        }
    }

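Finding the source node:

There are two cases when determining the source node. If the current shard is not a primary, the source node is the node that holds the primary shard. Otherwise, if the current shard is being relocated (moved from another node to this node), the source node is the node the data is being moved from. Once the source node is known, IndicesService.createShard is called, and inside it IndexShard.startRecovery starts the recovery.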

org.elasticsearch.indices.cluster.IndicesClusterStateService

    private static DiscoveryNode findSourceNodeForPeerRecovery(Logger logger, RoutingTable routingTable, DiscoveryNodes nodes,
                                                               ShardRouting shardRouting) {
        DiscoveryNode sourceNode = null;
        if (!shardRouting.primary()) { // the shard being recovered is a replica
            ShardRouting primary = routingTable.shardRoutingTable(shardRouting.shardId()).primaryShard();
            // only recover from an active primary; if there is none, try again on the next round
            if (primary.active()) { // the primary recovers first, replicas wait for it
                sourceNode = nodes.get(primary.currentNodeId()); // the node that holds the primary shard
                if (sourceNode == null) {
                    logger.trace("can't find replica source node because primary shard {} is assigned to an unknown node.", primary);
                }
            } else {
                logger.trace("can't find replica source node because primary shard {} is not active.", primary);
            }
        } else if (shardRouting.relocatingNodeId() != null) { // the shard is a primary that is being relocated
            sourceNode = nodes.get(shardRouting.relocatingNodeId()); // the node the shard is relocating from
            if (sourceNode == null) {
                logger.trace("can't find relocation source node for shard {} because it is assigned to an unknown node [{}].",
                    shardRouting.shardId(), shardRouting.relocatingNodeId());
            }
        } else {
            throw new IllegalStateException("trying to find source node for peer recovery when routing state means no peer recovery: " +
                shardRouting);
        }
        return sourceNode;
    }
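The decision made by findSourceNodeForPeerRecovery can be reproduced with plain Java types. The sketch below is a simplified model under stated assumptions: SourceNodeSketch, Copy and the map of live nodes are hypothetical stand-ins for ShardRouting and DiscoveryNodes, and an empty Optional plays the role of the null return value.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the source-node decision; not the Elasticsearch types.
final class SourceNodeSketch {

    /** The few routing fields the decision depends on. */
    static final class Copy {
        final boolean primary;
        final boolean active;
        final String currentNodeId;
        final String relocatingNodeId; // null when the copy is not relocating

        Copy(boolean primary, boolean active, String currentNodeId, String relocatingNodeId) {
            this.primary = primary;
            this.active = active;
            this.currentNodeId = currentNodeId;
            this.relocatingNodeId = relocatingNodeId;
        }
    }

    /**
     * @param target         the initializing copy on the local node
     * @param primaryOfShard the primary copy of the same shard
     * @param liveNodes      node id -> node name for the nodes currently in the cluster
     * @return the source node name, or empty if recovery has to wait
     */
    static Optional<String> findSource(Copy target, Copy primaryOfShard, Map<String, String> liveNodes) {
        if (!target.primary) {
            // Replica: recover from the primary, but only once the primary is active.
            if (!primaryOfShard.active) {
                return Optional.empty();
            }
            return Optional.ofNullable(liveNodes.get(primaryOfShard.currentNodeId));
        } else if (target.relocatingNodeId != null) {
            // Relocating primary: recover from the node it is moving away from.
            return Optional.ofNullable(liveNodes.get(target.relocatingNodeId));
        }
        throw new IllegalStateException("routing state means no peer recovery");
    }

    public static void main(String[] args) {
        Map<String, String> liveNodes = Map.of("n1", "node-1", "n2", "node-2");
        Copy replica = new Copy(false, false, "n2", null);
        Copy activePrimary = new Copy(true, true, "n1", null);
        Copy recoveringPrimary = new Copy(true, false, "n1", null);
        System.out.println(findSource(replica, activePrimary, liveNodes));
        System.out.println(findSource(replica, recoveringPrimary, liveNodes));
    }
}
```

When the result is empty, the caller simply returns, as createShard does, and the shard is picked up again on a later round.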

For a recovery of type PEER, the component that actually performs the recovery is PeerRecoveryTargetService.

    final AllocatedIndices<? extends Shard, ? extends AllocatedIndex<? extends Shard>> indicesService;

    T createShard(
            ShardRouting shardRouting,
            RecoveryState recoveryState,
            PeerRecoveryTargetService recoveryTargetService,             // the service that actually performs the recovery
            PeerRecoveryTargetService.RecoveryListener recoveryListener, // listener
            RepositoriesService repositoriesService,
            Consumer<IndexShard.ShardFailure> onShardFailure,
            Consumer<ShardId> globalCheckpointSyncer,
            RetentionLeaseSyncer retentionLeaseSyncer) throws IOException;

PeerRecoveryTargetService.doRecovery: the StartRecoveryRequest is sent to the source node over RPC:

org.elasticsearch.indices.recovery.PeerRecoveryTargetService

    public void doRun() {
        doRecovery(recoveryId);
    }

    private void doRecovery(final long recoveryId) {
        final StartRecoveryRequest request;
        final RecoveryState.Timer timer;
        CancellableThreads cancellableThreads;
        try (RecoveryRef recoveryRef = onGoingRecoveries.getRecovery(recoveryId)) {
            if (recoveryRef == null) {
                logger.trace("not running recovery with id [{}] - can not find it (probably finished)", recoveryId);
                return;
            }
            final RecoveryTarget recoveryTarget = recoveryRef.target();
            timer = recoveryTarget.state().getTimer();
            cancellableThreads = recoveryTarget.cancellableThreads();
            try {
                assert recoveryTarget.sourceNode() != null : "can not do a recovery without a source node";
                logger.trace("{} preparing shard for peer recovery", recoveryTarget.shardId());
                recoveryTarget.indexShard().prepareForIndexRecovery();
                // recover locally as far as possible and get the starting sequence number
                final long startingSeqNo = recoveryTarget.indexShard().recoverLocallyUpToGlobalCheckpoint();
                assert startingSeqNo == UNASSIGNED_SEQ_NO || recoveryTarget.state().getStage() == RecoveryState.Stage.TRANSLOG :
                    "unexpected recovery stage [" + recoveryTarget.state().getStage() + "] starting seqno [ " + startingSeqNo + "]";
                // wrap the metadata snapshot and related information into the request
                request = getStartRecoveryRequest(logger, clusterService.localNode(), recoveryTarget, startingSeqNo);
            } catch (final Exception e) {
                // this will be logged as warning later on...
                logger.trace("unexpected error while preparing shard for peer recovery, failing recovery", e);
                onGoingRecoveries.failRecovery(recoveryId,
                    new RecoveryFailedException(recoveryTarget.state(), "failed to prepare shard for recovery", e), true);
                return;
            }
        }
        // ... several lines omitted here (they include the definition of handleException used below)

        try {
            logger.trace("{} starting recovery from {}", request.shardId(), request.sourceNode());
            cancellableThreads.executeIO(() -> // send the start-recovery request to the source node
                // This still runs under cancellableThreads so that any blocking network call on the underlying
                // transport can be interrupted. After the move to async execution it is unclear whether this is
                // still needed, but the problems caused by a missed cancel are not worth the risk.
                transportService.submitRequest(request.sourceNode(), PeerRecoverySourceService.Actions.START_RECOVERY, request,
                    new TransportResponseHandler<RecoveryResponse>() {
                        @Override
                        public void handleResponse(RecoveryResponse recoveryResponse) {
                            final TimeValue recoveryTime = new TimeValue(timer.time());
                            // do this through ongoing recoveries to remove it from the collection
                            onGoingRecoveries.markRecoveryAsDone(recoveryId);
                            // ... several lines omitted here
                        }

                        @Override
                        public void handleException(TransportException e) {
                            handleException.accept(e);
                        }

                        @Override
                        public String executor() {
                            // we do some heavy work like refreshes in the response so fork off to the generic threadpool
                            return ThreadPool.Names.GENERIC;
                        }

                        @Override
                        public RecoveryResponse read(StreamInput in) throws IOException {
                            return new RecoveryResponse(in);
                        }
                    })
            );
        } catch (CancellableThreads.ExecutionCancelledException e) {
            logger.trace("recovery cancelled", e);
        } catch (Exception e) {
            handleException.accept(e);
        }
    }

IndexShard.java

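Primary Terms: assigned by the master node to each primary shard and incremented every time the primary of that shard changes. Their main purpose is to tell an old primary from a new one, so that only operations carrying the latest term are acted on.

Sequence Numbers: label each write operation that happens on a shard. They are assigned by the primary shard, and only to write operations. Suppose index test has two primary shards and one replica. When the sequence number of shard 0 has reached 5 and its primary goes offline, the replica is promoted to the new primary, and for subsequent writes the sequence numbers continue from 6. Shard 1 has its own independent sequence numbers.

Every time the primary forwards a write request to a replica, it includes both values.

With primary terms and sequence numbers it is, in theory, possible to detect the differences between shard copies (remove from the old primary the operations that do not exist in the new primary's history, and index the missing operations into the old primary). But when hundreds or thousands of events are indexed per second, comparing histories of millions of operations is impractical and costs a lot of storage, so ES maintains a safety marker called the global checkpoint.

First, the concept and purpose of the checkpoints:

GlobalCheckpoint: the global checkpoint is a sequence number up to which the histories of all active shard copies are known to be aligned, i.e. every operation at or below the global checkpoint is guaranteed to have been processed by all active copies. This means that when a primary fails, only the operations after the last known global checkpoint need to be compared between the new primary and the other replicas. When the old primary comes back, it uses the global checkpoint it knows about and compares itself with the new primary. That way only a small part of the operations has to be compared instead of all of them.

The primary is responsible for advancing the global checkpoint, which it does by tracking the operations completed on the replicas. Once it detects that all replicas have advanced beyond a given sequence number, it updates the global checkpoint accordingly. The shard copies do not keep track of every operation; instead each one maintains a local checkpoint.

LocalCheckpoint: the local checkpoint is also a sequence number. Every operation with a sequence number at or below it has been fully processed on that shard copy (written to Lucene and to the translog successfully).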
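The example with shard 0 can be traced in a few lines of Java. This is a simplified sketch under stated assumptions: SeqNoSketch, PrimaryCopy, ReplicaCopy and Op are hypothetical classes, sequence numbers start at 0, and the replica rejects any operation whose primary term is lower than the highest term it has seen. It only models the rule described above, not the actual replication code.

```java
// Hypothetical model of primary terms and sequence numbers; not the Elasticsearch classes.
final class SeqNoSketch {

    /** A write operation as forwarded by a primary: it carries both values. */
    static final class Op {
        final long primaryTerm;
        final long seqNo;

        Op(long primaryTerm, long seqNo) {
            this.primaryTerm = primaryTerm;
            this.seqNo = seqNo;
        }
    }

    /** The primary assigns consecutive sequence numbers under its own term. */
    static final class PrimaryCopy {
        private final long primaryTerm;
        private long nextSeqNo;

        PrimaryCopy(long primaryTerm, long maxSeqNoSoFar) {
            this.primaryTerm = primaryTerm;
            this.nextSeqNo = maxSeqNoSoFar + 1;
        }

        Op assign() {
            return new Op(primaryTerm, nextSeqNo++);
        }
    }

    /** A replica remembers the highest term it has seen and ignores older primaries. */
    static final class ReplicaCopy {
        private long highestTermSeen = 0;
        private long maxSeqNo = -1; // assumption: -1 means "no operations yet"

        boolean accept(Op op) {
            if (op.primaryTerm < highestTermSeen) {
                return false; // sent by a stale primary
            }
            highestTermSeen = op.primaryTerm;
            maxSeqNo = Math.max(maxSeqNo, op.seqNo);
            return true;
        }

        long maxSeqNo() {
            return maxSeqNo;
        }
    }

    public static void main(String[] args) {
        PrimaryCopy oldPrimary = new PrimaryCopy(1, -1);
        ReplicaCopy replica = new ReplicaCopy();
        for (int i = 0; i <= 5; i++) {
            replica.accept(oldPrimary.assign()); // sequence numbers 0..5 under term 1
        }
        // The old primary goes offline; a new primary takes over with term 2.
        PrimaryCopy newPrimary = new PrimaryCopy(2, replica.maxSeqNo());
        Op next = newPrimary.assign();
        System.out.println("term=" + next.primaryTerm + " seqNo=" + next.seqNo);
        replica.accept(next);
        // A late operation from the old primary still carries term 1.
        System.out.println("stale op accepted: " + replica.accept(oldPrimary.assign()));
    }
}
```

In the text's example the promoted replica is itself the new primary; the sketch keeps a separate replica object only so that the rejection of the stale term can be shown.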
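The relationship between the two checkpoints can be sketched as follows. This is a minimal model under stated assumptions: CheckpointSketch and SimpleLocalCheckpoint are hypothetical names, -1 stands for "no operations processed yet", and the global checkpoint is taken as the minimum of the local checkpoints of the copies being tracked. It illustrates the definitions above rather than the real tracking code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of local and global checkpoints; not the Elasticsearch classes.
final class CheckpointSketch {

    /** Tracks the highest seqNo such that every seqNo at or below it has been processed. */
    static final class SimpleLocalCheckpoint {
        private final Set<Long> processedAboveCheckpoint = new HashSet<>();
        private long checkpoint = -1; // assumption: -1 means nothing processed yet

        void markProcessed(long seqNo) {
            if (seqNo <= checkpoint) {
                return; // already covered by the checkpoint
            }
            processedAboveCheckpoint.add(seqNo);
            // Advance over every contiguous sequence number that is now complete.
            while (processedAboveCheckpoint.remove(checkpoint + 1)) {
                checkpoint++;
            }
        }

        long checkpoint() {
            return checkpoint;
        }
    }

    /** Every tracked copy has processed all operations at or below the returned value. */
    static long globalCheckpoint(List<Long> localCheckpoints) {
        if (localCheckpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one shard copy is required");
        }
        long min = Long.MAX_VALUE;
        for (long localCheckpoint : localCheckpoints) {
            min = Math.min(min, localCheckpoint);
        }
        return min;
    }

    public static void main(String[] args) {
        SimpleLocalCheckpoint replica = new SimpleLocalCheckpoint();
        replica.markProcessed(0);
        replica.markProcessed(1);
        replica.markProcessed(3); // seqNo 2 is still missing, so the checkpoint cannot pass 1
        System.out.println("local checkpoint with a gap: " + replica.checkpoint());
        replica.markProcessed(2); // the gap is filled
        System.out.println("local checkpoint after filling the gap: " + replica.checkpoint());
        System.out.println("global checkpoint: " + globalCheckpoint(List.of(5L, replica.checkpoint())));
    }
}
```

The gap case is the reason a copy cannot simply report its highest sequence number: operations can arrive out of order, and only the contiguous prefix is safe to rely on.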

    public long recoverLocallyUpToGlobalCheckpoint() {
        assert Thread.holdsLock(mutex) == false : "recover locally under mutex";
        if (state != IndexShardState.RECOVERING) {
            throw new IndexShardNotRecoveringException(shardId, state);
        }
        assert recoveryState.getStage() == RecoveryState.Stage.INDEX : "unexpected recovery stage [" + recoveryState.getStage() + "]";
        assert routingEntry().recoverySource().getType() == RecoverySource.Type.PEER : "not a peer recovery [" + routingEntry() + "]";
        final Optional<SequenceNumbers.CommitInfo> safeCommit;
        final long globalCheckpoint;
        try {
            // translog UUID stored in the user data of the last committed segments info
            final String translogUUID = store.readLastCommittedSegmentsInfo().getUserData().get(Translog.TRANSLOG_UUID_KEY);
            // read the global checkpoint persisted with that translog
            globalCheckpoint = Translog.readGlobalCheckpoint(translogConfig.getTranslogPath(), translogUUID);
            safeCommit = store.</