1. Introduction
The allocation service is responsible for shard allocation: deciding which shards live on which nodes, and which copy acts as the primary and which as replicas. The allocation process differs for newly created indices and for existing indices.
2. Basics
The allocation framework is built around ShardsAllocator, ExistingShardsAllocator and AllocationDecider.
2.1 Allocators
2.1.1 ExistingShardsAllocator
The interface defines the following methods (a condensed sketch of the interface follows the table):
| Method | Description |
| --- | --- |
| beforeAllocation | Called before allocation starts |
| afterPrimariesBeforeReplicas | Called after primaries have been allocated, before replicas are allocated |
| allocateUnassigned | Allocates an unassigned shard |
| explainUnassignedShardAllocation | Explains the allocation of an unassigned shard |
| cleanCaches | Called when the node becomes the elected master or stops being the master |
| applyStartedShards | Applies shards that have started |
| applyFailedShards | Applies shards that have failed |
| getNumberOfInFlightFetches | Returns the number of in-flight fetches |
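For reference, here is a condensed sketch of the interface. It is abridged from the 7.x source, so the exact signatures (in particular the nested UnassignedAllocationHandler) may differ slightly from upstream; the comments restate the table above.

```java
import java.util.List;

import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.AllocateUnassignedDecision;
import org.elasticsearch.cluster.routing.allocation.FailedShard;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;

public interface ExistingShardsAllocator {

    void beforeAllocation(RoutingAllocation allocation);              // hook before allocation starts

    void afterPrimariesBeforeReplicas(RoutingAllocation allocation);  // between the primary and replica rounds

    // allocate one unassigned shard; the handler either initializes it on a node or ignores it for this round
    void allocateUnassigned(ShardRouting shardRouting, RoutingAllocation allocation,
                            UnassignedAllocationHandler unassignedAllocationHandler);

    // explain (without applying) how an unassigned shard would be allocated
    AllocateUnassignedDecision explainUnassignedShardAllocation(ShardRouting unassignedShard, RoutingAllocation allocation);

    void cleanCaches();                                               // node was elected master / stopped being master

    void applyStartedShards(List<ShardRouting> startedShards, RoutingAllocation allocation);

    void applyFailedShards(List<FailedShard> failedShards, RoutingAllocation allocation);

    int getNumberOfInFlightFetches();                                 // number of async fetches still in flight

    // nested callback handed to allocateUnassigned; its initialize(...)/removeAndIgnore(...) signatures are elided here
    interface UnassignedAllocationHandler {
    }
}
```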
Its implementation classes include GatewayAllocator.
2.1.2 ShardsAllocator
The shard allocator interface; its methods are:
| Method | Description |
| --- | --- |
| allocate | Allocates shards to nodes in the cluster |
| decideShardAllocation | Returns the decision for where a shard should reside in the cluster |
Its main implementation class is BalancedShardsAllocator.
2.1.3 AllocationDecider
Allocation deciders; their methods are (a hypothetical example decider follows the table):
| Method | Description |
| --- | --- |
| canRebalance | Whether the given shard routing can be rebalanced |
| canAllocate | Whether the given shard routing can be allocated on the given node |
| canRemain | Whether the given shard routing can remain on the given node |
| shouldAutoExpandToNode | Whether the index shard should be auto-expanded onto the given node |
| canForceAllocatePrimary | Whether the given primary shard can be forcibly allocated on the given node |
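To make the decider contract concrete, below is a minimal sketch of a custom decider that vetoes allocation to nodes carrying a particular node attribute. The class name and the "cold" attribute are invented for this example; it assumes the 7.x package layout and the RoutingAllocation#decision helper.

```java
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Hypothetical decider: refuse to place shards on nodes tagged with the node attribute "cold".
public class NoColdNodesAllocationDecider extends AllocationDecider {

    private static final String NAME = "no_cold_nodes";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if ("true".equals(node.node().getAttributes().get("cold"))) {
            // decisions are built through RoutingAllocation so that explain output can be collected
            return allocation.decision(Decision.NO, NAME, "node is marked as cold");
        }
        return allocation.decision(Decision.YES, NAME, "node is not marked as cold");
    }

    @Override
    public Decision canRemain(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // apply the same rule when deciding whether an already-placed shard may stay
        return canAllocate(shardRouting, node, allocation);
    }
}
```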
The deciders fall into the following categories:
2.2 Allocation triggers
The following events trigger a reallocation via AllocationService#reroute:
- A node joins or leaves the cluster
- Recovery after restart
- Restoring a snapshot
- Index metadata changes: create, delete, close, open, settings update
- Cluster settings updates
- An explicit reroute command
- Gateway recovery
The corresponding call sites are listed below (a sketch of requesting a reroute programmatically follows the table):
| Class | Method |
| --- | --- |
| TransportClusterUpdateSettingsAction | masterOperation |
| MetadataCreateIndexService | applyCreateIndexWithTemporaryService |
| MetadataDeleteIndexService | deleteIndices |
| MetadataIndexStateService | closeIndices, onlyOpenIndex |
| MetadataUpdateSettingsService | updateSettings |
| DelayedAllocationService.DelayedRerouteTask | execute |
| AllocationService | disassociateDeadNodes |
| GatewayService.RecoverStateUpdateTask | execute |
| LocalAllocateDangledIndices.AllocateDangledRequestHandler | messageReceived |
| Node | creates the RerouteService in its constructor |
| RestoreService | restoreSnapshot |
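The RerouteService created in the Node constructor is, in 7.x, a BatchedRerouteService that wraps AllocationService#reroute and batches concurrent requests. A minimal sketch of a component asking for a reroute through it, assuming the 7.x RerouteService signature reroute(String, Priority, ActionListener&lt;ClusterState&gt;), might look like this (the caller class is hypothetical):

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.RerouteService;
import org.elasticsearch.common.Priority;

// Hypothetical caller: asks the master-side RerouteService for an asynchronous, batched reroute.
public class RerouteTrigger {

    private final RerouteService rerouteService;

    public RerouteTrigger(RerouteService rerouteService) {
        this.rerouteService = rerouteService;
    }

    public void requestReroute(String why) {
        rerouteService.reroute("example trigger: " + why, Priority.NORMAL, ActionListener.wrap(
            (ClusterState state) -> { /* reroute was submitted and the resulting state applied */ },
            e -> { /* reroute failed or was rejected, e.g. because the node is no longer master */ }
        ));
    }
}
```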
3. reroute
3.1 Overview
reroute performs two kinds of allocation:
- gatewayAllocator: allocates existing shard copies, locating them on disk
- shardsAllocator: balances the distribution of shards across the nodes
private void reroute(RoutingAllocation allocation) {
removeDelayMarkers(allocation);
allocateExistingUnassignedShards(allocation); // try to allocate existing shard copies first
shardsAllocator.allocate(allocation);
}
reroute mainly runs on the MasterService.UpdateTask thread.
3.2 Gateway trigger at cluster startup
When the GatewayService is constructed, if the discovery implementation is a Coordinator, it sets up the recovery task:
if (discovery instanceof Coordinator) {
recoveryRunnable = () ->
clusterService.submitStateUpdateTask("local-gateway-elected-state", new RecoverStateUpdateTask());
}
During the commit phase of the two-phase cluster state publication, ClusterStateListener#clusterChanged is invoked, which calls performStateRecovery:
private void performStateRecovery(final boolean enforceRecoverAfterTime, final String reason) {
if (enforceRecoverAfterTime && recoverAfterTime != null) {
if (scheduledRecovery.compareAndSet(false, true)) {
logger.info("delaying initial state recovery for [{}]. {}", recoverAfterTime, reason);
threadPool.schedule(new AbstractRunnable() {
@Override
public void onFailure(Exception e) {
logger.warn("delayed state recovery failed", e);
resetRecoveredFlags();
}
@Override
protected void doRun() {
if (recoveryInProgress.compareAndSet(false, true)) {
logger.info("recover_after_time [{}] elapsed. performing state recovery...", recoverAfterTime);
recoveryRunnable.run();
}
}
}, recoverAfterTime, ThreadPool.Names.GENERIC);
}
} else {
if (recoveryInProgress.compareAndSet(false, true)) {
threadPool.generic().execute(new AbstractRunnable() {
@Override
public void onFailure(final Exception e) {
logger.warn("state recovery failed", e);
resetRecoveredFlags();
}
@Override
protected void doRun() {
logger.debug("performing state recovery...");
recoveryRunnable.run();
}
});
}
}
}
3.3 gatewayAllocator
It contains a primary shard allocator and a replica shard allocator, PrimaryShardAllocator and ReplicaShardAllocator, both of which extend BaseGatewayShardAllocator.
3.3.1 allocateExistingUnassignedShards
- Iterate over the existing-shards allocators and call beforeAllocation
- Iterate over the unassigned primary shards and call allocateUnassigned
- Iterate over the existing-shards allocators and call afterPrimariesBeforeReplicas
- Iterate over the unassigned replica shards and call allocateUnassigned (the full flow is sketched below)
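A condensed sketch of AllocationService#allocateExistingUnassignedShards showing these four steps (simplified from the 7.x source: the initial priority sort of the unassigned shards and error handling are omitted; existingShardsAllocators and getAllocatorForShard are members of AllocationService):

```java
private void allocateExistingUnassignedShards(RoutingAllocation allocation) {
    // 1. let every existing-shards allocator prepare (e.g. kick off async shard-metadata fetches)
    for (ExistingShardsAllocator allocator : existingShardsAllocators.values()) {
        allocator.beforeAllocation(allocation);
    }

    // 2. allocate unassigned primaries first
    RoutingNodes.UnassignedShards.UnassignedIterator primaryIterator = allocation.routingNodes().unassigned().iterator();
    while (primaryIterator.hasNext()) {
        ShardRouting shardRouting = primaryIterator.next();
        if (shardRouting.primary()) {
            getAllocatorForShard(shardRouting, allocation).allocateUnassigned(shardRouting, allocation, primaryIterator);
        }
    }

    // 3. hook between the primary round and the replica round
    for (ExistingShardsAllocator allocator : existingShardsAllocators.values()) {
        allocator.afterPrimariesBeforeReplicas(allocation);
    }

    // 4. then allocate unassigned replicas
    RoutingNodes.UnassignedShards.UnassignedIterator replicaIterator = allocation.routingNodes().unassigned().iterator();
    while (replicaIterator.hasNext()) {
        ShardRouting shardRouting = replicaIterator.next();
        if (shardRouting.primary() == false) {
            getAllocatorForShard(shardRouting, allocation).allocateUnassigned(shardRouting, allocation, replicaIterator);
        }
    }
}
```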
3.3.2 PrimaryShardAllocator
allocateUnassigned is implemented in the parent class BaseGatewayShardAllocator; the actual allocation decision is delegated to makeAllocationDecision in PrimaryShardAllocator.
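A condensed sketch of that template method in BaseGatewayShardAllocator (simplified from the 7.x source; logging is omitted and the expected-shard-size handling varies across versions):

```java
public void allocateUnassigned(ShardRouting shardRouting, RoutingAllocation allocation,
                               ExistingShardsAllocator.UnassignedAllocationHandler unassignedAllocationHandler) {
    // the subclass (PrimaryShardAllocator / ReplicaShardAllocator) supplies the actual decision
    final AllocateUnassignedDecision decision = makeAllocationDecision(shardRouting, allocation, logger);

    if (decision.isDecisionTaken() == false) {
        // this allocator is not responsible for the shard, leave it for another allocator
        return;
    }

    if (decision.getAllocationDecision() == AllocationDecision.YES) {
        // mark the shard as initializing on the chosen node
        unassignedAllocationHandler.initialize(decision.getTargetNode().getId(), decision.getAllocationId(),
            getExpectedShardSize(shardRouting, allocation), allocation.changes());
    } else {
        // keep the shard unassigned but ignore it for the rest of this round
        unassignedAllocationHandler.removeAndIgnore(decision.getAllocationStatus(), allocation.changes());
    }
}
```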
If the unassigned shard's recovery source type is SNAPSHOT and the shard size is not yet available in the allocation's snapshotShardSizeInfo, the deciders alone determine whether it can be allocated. Otherwise, a fetchData request is sent to all data nodes to fetch the shard's on-disk metadata.
The methods of AsyncShardFetch are listed below (a condensed example of the caller-side pattern follows the table):
| Method | Description |
| --- | --- |
| fetchData | Fetches the metadata of a shard from all nodes |
| asyncFetch | Asynchronously fetches the metadata of the given shard from the cluster nodes |
| reroute | Abstract method; implementations use it to schedule another reroute round once fetched data has arrived |
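The typical caller-side pattern around fetchData, as in PrimaryShardAllocator#makeAllocationDecision, is to return a "still fetching" decision until the cached results are complete. The snippet below is a condensed paraphrase under that assumption, not the full method body:

```java
// Condensed caller-side pattern (paraphrased; assumes the 7.x gateway allocator APIs).
final AsyncShardFetch.FetchResult<NodeGatewayStartedShards> shardState = fetchData(unassignedShard, allocation);
if (shardState.hasData() == false) {
    // results are not in the cache yet: an async fetch has been started, so defer the decision;
    // processAsyncFetch will trigger another reroute once the node responses arrive
    allocation.setHasPendingAsyncFetch();
    return AllocateUnassignedDecision.no(UnassignedInfo.AllocationStatus.FETCHING_SHARD_DATA, null);
}
// from here on, the fetched per-node shard states feed into buildNodeShardsResult (see below)
```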
The class hierarchy:
InternalAsyncFetch extends AsyncShardFetch and implements the reroute method by calling BatchedRerouteService#reroute.
AsyncShardFetch holds a Lister as a member. For InternalPrimaryShardAllocator the lister is the listStartedShards method, which sends a TransportNodesListGatewayStartedShards.TYPE request; the transport action used to request shard metadata from the nodes is internal:gateway/local/started_shards. The sending side is implemented mainly in AsyncAction#start:
void start() {
final DiscoveryNode[] nodes = request.concreteNodes();
if (nodes.length == 0) {
// nothing to notify, so respond immediately, but always fork even if finalExecutor == SAME
final String executor = finalExecutor.equals(ThreadPool.Names.SAME) ? ThreadPool.Names.GENERIC : finalExecutor;
threadPool.executor(executor).execute(() -> newResponse(task, request, responses, listener));
return;
}
final TransportRequestOptions transportRequestOptions = TransportRequestOptions.timeout(request.timeout());
for (int i = 0; i < nodes.length; i++) {
final int idx = i;
final DiscoveryNode node = nodes[i];
final String nodeId = node.getId();
try {
TransportRequest nodeRequest = newNodeRequest(request);
if (task != null) {
nodeRequest.setParentTask(clusterService.localNode().getId(), task.getId());
}
transportService.sendRequest(node, getTransportNodeAction(node), nodeRequest, transportRequestOptions,
new TransportResponseHandler<NodeResponse>() {
@Override
public NodeResponse read(StreamInput in) throws IOException {
return newNodeResponse(in);
}
@Override
public void handleResponse(NodeResponse response) {
onOperation(idx, response);
}
@Override
public void handleException(TransportException exp) {
onFailure(idx, node.getId(), exp);
}
});
} catch (Exception e) {
onFailure(idx, nodeId, e);
}
}
}
On the receiving node, the request is handled by NodeTransportHandler, which delegates to nodeOperation:
class NodeTransportHandler implements TransportRequestHandler<NodeRequest> {
@Override
public void messageReceived(NodeRequest request, TransportChannel channel, Task task) throws Exception {
channel.sendResponse(nodeOperation(request, task));
}
}
On the requesting side, the responses are processed by processAsyncFetch: the shard-level metadata returned by each node is put into a cache so the next reroute can read it from there, and a new reroute is then triggered:
protected synchronized void processAsyncFetch(List<T> responses, List<FailedNodeException> failures, long fetchingRound) {
if (closed) {
// we are closed, no need to process this async fetch at all
logger.trace("{} ignoring fetched [{}] results, already closed", shardId, type);
return;
}
logger.trace("{} processing fetched [{}] results", shardId, type);
if (responses != null) {
for (T response : responses) {
NodeEntry<T> nodeEntry = cache.get(response.getNode().getId());
if (nodeEntry != null) {
if (nodeEntry.getFetchingRound() != fetchingRound) {
assert nodeEntry.getFetchingRound() > fetchingRound : "node entries only replaced by newer rounds";
logger.trace("{} received response for [{}] from node {} for an older fetching round (expected: {} but was: {})",
shardId, nodeEntry.getNodeId(), type, nodeEntry.getFetchingRound(), fetchingRound);
} else if (nodeEntry.isFailed()) {
logger.trace("{} node {} has failed for [{}] (failure [{}])", shardId, nodeEntry.getNodeId(), type,
nodeEntry.getFailure());
} else {
// if the entry is there, for the right fetching round and not marked as failed already, process it
logger.trace("{} marking {} as done for [{}], result is [{}]", shardId, nodeEntry.getNodeId(), type, response);
nodeEntry.doneFetching(response);
}
}
}
}
if (failures != null) {
for (FailedNodeException failure : failures) {
logger.trace("{} processing failure {} for [{}]", shardId, failure, type);
NodeEntry<T> nodeEntry = cache.get(failure.nodeId());
if (nodeEntry != null) {
if (nodeEntry.getFetchingRound() != fetchingRound) {
assert nodeEntry.getFetchingRound() > fetchingRound : "node entries only replaced by newer rounds";
logger.trace("{} received failure for [{}] from node {} for an older fetching round (expected: {} but was: {})",
shardId, nodeEntry.getNodeId(), type, nodeEntry.getFetchingRound(), fetchingRound);
} else if (nodeEntry.isFailed() == false) {
// if the entry is there, for the right fetching round and not marked as failed already, process it
Throwable unwrappedCause = ExceptionsHelper.unwrapCause(failure.getCause());
// if the request got rejected or timed out, we need to try it again next time...
if (unwrappedCause instanceof EsRejectedExecutionException ||
unwrappedCause instanceof ReceiveTimeoutTransportException ||
unwrappedCause instanceof ElasticsearchTimeoutException) {
nodeEntry.restartFetching();
} else {
logger.warn(() -> new ParameterizedMessage("{}: failed to list shard for {} on node [{}]",
shardId, type, failure.nodeId()), failure);
nodeEntry.doneFetching(failure.getCause());
}
}
}
}
}
reroute(shardId, "post_response");
}
Once the shard data has been fetched, the inSyncAllocationIds are read from the index metadata and buildNodeShardsResult builds the per-node shard results:
- For a shard whose recovery source is SNAPSHOT, the sort order is: in inSyncAllocationIds, then no store exception, then primary first
- For non-SNAPSHOT recovery sources, the sort order is: no store exception, then primary first
protected static NodeShardsResult buildNodeShardsResult(ShardRouting shard, boolean matchAnyShard,
Set<String> ignoreNodes, Set<String> inSyncAllocationIds,
FetchResult<NodeGatewayStartedShards> shardState,
Logger logger) {
List<NodeGatewayStartedShards> nodeShardStates = new ArrayList<>();
int numberOfAllocationsFound = 0;
for (NodeGatewayStartedShards nodeShardState : shardState.getData().values()) {
DiscoveryNode node = nodeShardState.getNode();
String allocationId = nodeShardState.allocationId();
if (ignoreNodes.contains(node.getId())) {
continue;
}
if (nodeShardState.storeException() == null) {
if (allocationId == null) {
logger.trace("[{}] on node [{}] has no shard state information", shard, nodeShardState.getNode());
} else {
logger.trace("[{}] on node [{}] has allocation id [{}]", shard, nodeShardState.getNode(), allocationId);
}
} else {
final String finalAllocationId = allocationId;
if (nodeShardState.storeException() instanceof ShardLockObtainFailedException) {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be " +
"opened as it's locked, treating as valid shard", shard, nodeShardState.getNode(), finalAllocationId),
nodeShardState.storeException());
} else {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be " +
"opened, treating as no allocation id", shard, nodeShardState.getNode(), finalAllocationId),
nodeShardState.storeException());
allocationId = null;
}
}
if (allocationId != null) {
assert nodeShardState.storeException() == null ||
nodeShardState.storeException() instanceof ShardLockObtainFailedException :
"only allow store that can be opened or that throws a ShardLockObtainFailedException while being opened but got a " +
"store throwing " + nodeShardState.storeException();
numberOfAllocationsFound++;
if (matchAnyShard || inSyncAllocationIds.contains(nodeShardState.allocationId())) {
nodeShardStates.add(nodeShardState);
}
}
}
final Comparator<NodeGatewayStartedShards> comparator; // allocation preference
if (matchAnyShard) {
// prefer shards with matching allocation ids
Comparator<NodeGatewayStartedShards> matchingAllocationsFirst = Comparator.comparing(
(NodeGatewayStartedShards state) -> inSyncAllocationIds.contains(state.allocationId())).reversed();
comparator = matchingAllocationsFirst.thenComparing(NO_STORE_EXCEPTION_FIRST_COMPARATOR)
.thenComparing(PRIMARY_FIRST_COMPARATOR);
} else {
comparator = NO_STORE_EXCEPTION_FIRST_COMPARATOR.thenComparing(PRIMARY_FIRST_COMPARATOR);
}
nodeShardStates.sort(comparator);
if (logger.isTraceEnabled()) {
logger.trace("{} candidates for allocation: {}", shard, nodeShardStates.stream().map(s -> s.getNode().getName())
.collect(Collectors.joining(", ")));
}
return new NodeShardsResult(nodeShardStates, numberOfAllocationsFound);
}
The allocation deciders are then applied to the candidate nodes, splitting them into three groups: yes, no and throttle.
private static NodesToAllocate buildNodesToAllocate(RoutingAllocation allocation,
List<NodeGatewayStartedShards> nodeShardStates,
ShardRouting shardRouting,
boolean forceAllocate) {
List<DecidedNode> yesNodeShards = new ArrayList<>();
List<DecidedNode> throttledNodeShards = new ArrayList<>();
List<DecidedNode> noNodeShards = new ArrayList<>();
for (NodeGatewayStartedShards nodeShardState : nodeShardStates) {
RoutingNode node = allocation.routingNodes().node(nodeShardState.getNode().getId());
if (node == null) {
continue;
}
Decision decision = forceAllocate ? allocation.deciders().canForceAllocatePrimary(shardRouting, node, allocation) :
allocation.deciders().canAllocate(shardRouting, node, allocation);
DecidedNode decidedNode = new DecidedNode(nodeShardState, decision);
if (decision.type() == Type.THROTTLE) {
throttledNodeShards.add(decidedNode);
} else if (decision.type() == Type.NO) {
noNodeShards.add(decidedNode);
} else {
yesNodeShards.add(decidedNode);
}
}
return new NodesToAllocate(Collections.unmodifiableList(yesNodeShards), Collections.unmodifiableList(throttledNodeShards),
Collections.unmodifiableList(noNodeShards));
}
3.3.3 ReplicaShardAllocator
It differs from PrimaryShardAllocator in how the allocation decision is made: before calling fetchData it first checks, via canBeAllocatedToAtLeastOneNode, whether the replica can be allocated on at least one node:
public static Tuple<Decision, Map<String, NodeAllocationResult>> canBeAllocatedToAtLeastOneNode(ShardRouting shard,
RoutingAllocation allocation) {
Decision madeDecision = Decision.NO;
final boolean explain = allocation.debugDecision();
Map<String, NodeAllocationResult> nodeDecisions = explain ? new HashMap<>() : null;
for (ObjectCursor<DiscoveryNode> cursor : allocation.nodes().getDataNodes().values()) {
RoutingNode node = allocation.routingNodes().node(cursor.value.getId());
if (node == null) {
continue;
}
// if we can't allocate it on a node, ignore it, for example, this handles
// cases for only allocating a replica after a primary
Decision decision = allocation.deciders().canAllocate(shard, node, allocation);
if (decision.type() == Decision.Type.YES && madeDecision.type() != Decision.Type.YES) {
if (explain) {
madeDecision = decision;
} else {
return Tuple.tuple(decision, null);
}
} else if (madeDecision.type() == Decision.Type.NO && decision.type() == Decision.Type.THROTTLE) {
madeDecision = decision;
}
if (explain) {
nodeDecisions.put(node.nodeId(), new NodeAllocationResult(node.node(), null, decision));
}
}
return Tuple.tuple(madeDecision, nodeDecisions);
}