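Introduction to the allocation module
Shard allocation in ES is the process of assigning shards to nodes in the cluster. Allocation decisions are made by the master node and answer two questions:
1: which shards should be allocated to which nodes
2: which shard copy is the primary and which are replicas
Shard allocation is handled by two components, allocators and deciders. The allocators look for the best nodes to host a shard, while the deciders judge whether a given allocation should actually go ahead.
For a newly created index, for example, the allocators produce the list of nodes that hold the fewest shards, and the deciders then walk through those nodes in turn and decide whether the shard may be allocated to each one.
For an existing index, primaries and replicas are treated differently. For a primary shard, the allocators look for a node that already holds the most complete copy of that shard's data.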
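The split between allocators and deciders for a new index can be sketched roughly as below. This is a simplified illustration, not Elasticsearch code: the class NewIndexAllocationSketch, its methods, and the use of plain node-name strings and BiPredicate deciders are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiPredicate;

// Hypothetical, simplified model; none of these names exist in Elasticsearch.
final class NewIndexAllocationSketch {

    // Allocator step: order node names by how many shards each node already holds (fewest first).
    // Ties are broken by node name so the result is deterministic.
    static List<String> candidatesByShardCount(Map<String, Integer> shardCountByNode) {
        List<String> nodes = new ArrayList<>(shardCountByNode.keySet());
        nodes.sort(Comparator.comparingInt((String n) -> shardCountByNode.get(n))
                .thenComparing(Comparator.<String>naturalOrder()));
        return nodes;
    }

    // Decider step: walk the candidates in order and return the first node that every decider accepts.
    static Optional<String> pickNode(String shard, List<String> candidates,
                                     List<BiPredicate<String, String>> deciders) {
        for (String node : candidates) {
            boolean allowed = true;
            for (BiPredicate<String, String> decider : deciders) {
                if (!decider.test(shard, node)) {
                    allowed = false;
                    break;
                }
            }
            if (allowed) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}
```

The allocator only ranks candidates; the deciders keep the final say, so the least-loaded node can still be skipped.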
What triggers allocation
Creating or deleting an index
Nodes joining or leaving the cluster
A manual reroute
A cluster restart, and so on
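Overview of the allocation module structure
This complex allocation process is implemented in the reroute function:
The allocationService.reroute method allocates the shards and produces a new cluster state, which the master node then broadcasts.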
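The allocator component
Allocators fall into two groups, the gateway allocator and the shards allocator
GatewayAllocator
PrimaryShardAllocator: finds the nodes that hold the most recent copy of the shard data
ReplicaShardAllocator: finds the nodes that hold a copy of the shard data
ShardsAllocator
BalancedShardsAllocator: finds the nodes that hold the fewest shards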
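The deciders component
Deciders determine whether a shard may be allocated to a node. Each decision goes through the canAllocate() method. The deciders can be grouped as follows
Load-balancing deciders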
SameShardAllocationDecider
ShardsLimitAllocationDecider
AwarenessAllocationDecider
Constraint deciders
RebalanceOnlyWhenActiveAllocationDecider
FilterAllocationDecider
index.routing.allocation.include.
index.routing.allocation.exclude.
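To make the canAllocate() idea concrete, here is a minimal sketch of a decider chain. It is not the Elasticsearch implementation: the Decider interface, the Node class, and the two example deciders are hypothetical stand-ins that only borrow the spirit of SameShardAllocationDecider and of an exclude filter. The combining rule used here (any NO wins, otherwise any THROTTLE wins, otherwise YES) is an assumption of the sketch.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a decider chain; not the Elasticsearch classes.
final class DeciderChainSketch {

    enum Decision { YES, THROTTLE, NO }

    interface Decider {
        Decision canAllocate(String shardId, Node node);
    }

    static final class Node {
        final String name;
        final Set<String> shardIds; // ids of the shards that already have a copy on this node

        Node(String name, Set<String> shardIds) {
            this.name = name;
            this.shardIds = shardIds;
        }
    }

    // In the spirit of SameShardAllocationDecider: never place two copies of one shard on the same node.
    static Decider sameShard() {
        return (shardId, node) -> node.shardIds.contains(shardId) ? Decision.NO : Decision.YES;
    }

    // In the spirit of an exclude filter: reject nodes whose name is on the exclude list.
    static Decider excludeByName(Set<String> excludedNames) {
        return (shardId, node) -> excludedNames.contains(node.name) ? Decision.NO : Decision.YES;
    }

    // Combine the individual answers: the first NO ends the walk, a THROTTLE downgrades a YES.
    static Decision decide(List<Decider> deciders, String shardId, Node node) {
        Decision result = Decision.YES;
        for (Decider decider : deciders) {
            Decision one = decider.canAllocate(shardId, node);
            if (one == Decision.NO) {
                return Decision.NO;
            }
            if (one == Decision.THROTTLE) {
                result = Decision.THROTTLE;
            }
        }
        return result;
    }
}
```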
Core allocation flow
The core logic lives in the reroute method of the AllocationService class, implemented as follows
private void reroute(RoutingAllocation allocation) {
    assert hasDeadNodes(allocation) == false : "dead nodes should be explicitly cleaned up. See disassociateDeadNodes";
    assert AutoExpandReplicas.getAutoExpandReplicaChanges(allocation.metaData(), allocation.nodes()).isEmpty() :
        "auto-expand replicas out of sync with number of nodes in the cluster";
    // check whether there are any unassigned shards
    if (allocation.routingNodes().unassigned().size() > 0) {
        removeDelayMarkers(allocation);
        // this call allocates the unassigned shards
        gatewayAllocator.allocateUnassigned(allocation);
    }
    // rebalance the cluster
    shardsAllocator.allocate(allocation);
    assert RoutingNodes.assertShardStats(allocation.routingNodes());
}
The key path then reaches the innerAllocatedUnassigned() method of the GatewayAllocator class
protected static void innerAllocatedUnassigned(RoutingAllocation allocation,
                                               PrimaryShardAllocator primaryShardAllocator,
                                               ReplicaShardAllocator replicaShardAllocator) {
    // collect the unassigned shards
    RoutingNodes.UnassignedShards unassigned = allocation.routingNodes().unassigned();
    // sort them by recovery priority
    unassigned.sort(PriorityComparator.getAllocationComparator(allocation)); // sort for priority ordering
    // allocate primaries; this enters the primary shard allocation method
    primaryShardAllocator.allocateUnassigned(allocation);
    // allocate replicas
    replicaShardAllocator.processExistingRecoveries(allocation);
    replicaShardAllocator.allocateUnassigned(allocation);
}
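The priority sort above can be illustrated with a small comparator. As far as I know, PriorityComparator orders shards by their index's priority, then by index creation date, then by index name, each in descending order. Treat that ordering as an assumption of this sketch. The UnassignedShard class and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified model; the ordering rule is an assumption, see the text above.
final class PrioritySortSketch {

    static final class UnassignedShard {
        final String index;      // index name
        final int priority;      // stand-in for the index priority setting
        final long creationDate; // stand-in for the index creation timestamp

        UnassignedShard(String index, int priority, long creationDate) {
            this.index = index;
            this.priority = priority;
            this.creationDate = creationDate;
        }
    }

    // Higher priority first, then newer index first, then index name in reverse order.
    static final Comparator<UnassignedShard> BY_RECOVERY_PRIORITY =
        Comparator.comparingInt((UnassignedShard s) -> s.priority).reversed()
            .thenComparing(Comparator.comparingLong((UnassignedShard s) -> s.creationDate).reversed())
            .thenComparing(Comparator.comparing((UnassignedShard s) -> s.index).reversed());

    // Returns the index names in the order their shards would be considered for allocation.
    static List<String> sortedIndexNames(List<UnassignedShard> shards) {
        List<UnassignedShard> copy = new ArrayList<>(shards);
        copy.sort(BY_RECOVERY_PRIORITY);
        List<String> names = new ArrayList<>();
        for (UnassignedShard shard : copy) {
            names.add(shard.index);
        }
        return names;
    }
}
```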
The primary shard allocation method, allocateUnassigned, is implemented as follows
public void allocateUnassigned(RoutingAllocation allocation) {
    final RoutingNodes routingNodes = allocation.routingNodes();
    final RoutingNodes.UnassignedShards.UnassignedIterator unassignedIterator = routingNodes.unassigned().iterator();
    // loop over the unassigned shards
    while (unassignedIterator.hasNext()) {
        final ShardRouting shard = unassignedIterator.next();
        // key step: runs the deciders for this shard to determine whether it may be allocated and to which node;
        // this method is analyzed separately at the end
        final AllocateUnassignedDecision allocateUnassignedDecision = makeAllocationDecision(shard, allocation, logger);
        if (allocateUnassignedDecision.isDecisionTaken() == false) {
            // no decision was taken by this allocator
            continue;
        }
        // the deciders allow the allocation
        if (allocateUnassignedDecision.getAllocationDecision() == AllocationDecision.YES) {
            // initialize the unassigned shard on the target node
            unassignedIterator.initialize(allocateUnassignedDecision.getTargetNode().getId(),
                allocateUnassignedDecision.getAllocationId(),
                shard.primary() ? ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE :
                    allocation.clusterInfo().getShardSize(shard, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE),
                allocation.changes());
        } else {
            unassignedIterator.removeAndIgnore(allocateUnassignedDecision.getAllocationStatus(), allocation.changes());
        }
    }
}
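The three outcomes of the loop above (no decision taken, allocate, ignore) can be sketched with plain collections. This is a simplified model rather than the real iterator: the AllocateLoopSketch class, its Decision and Result types, and the decide function passed in are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified model of the allocation loop; not the Elasticsearch classes.
final class AllocateLoopSketch {

    enum Type { NOT_TAKEN, YES, NO }

    static final class Decision {
        final Type type;
        final String targetNode; // only meaningful when type is YES

        Decision(Type type, String targetNode) {
            this.type = type;
            this.targetNode = targetNode;
        }
    }

    static final class Result {
        final Map<String, String> initialized = new LinkedHashMap<>(); // shard -> node it is initializing on
        final List<String> ignored = new ArrayList<>();                // decided, but not allocatable right now
        final List<String> stillUnassigned = new ArrayList<>();        // left for another allocator
    }

    static Result allocateUnassigned(List<String> unassigned, Function<String, Decision> decide) {
        Result result = new Result();
        for (String shard : unassigned) {
            Decision decision = decide.apply(shard);
            if (decision.type == Type.NOT_TAKEN) {
                // this allocator is not responsible for the shard; leave it in the unassigned list
                result.stillUnassigned.add(shard);
                continue;
            }
            if (decision.type == Type.YES) {
                result.initialized.put(shard, decision.targetNode);
            } else {
                // any other decision: drop the shard from this round
                result.ignored.add(shard);
            }
        }
        return result;
    }
}
```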
The flow then reaches the initializeShard() method of the RoutingNodes class, implemented as follows
public ShardRouting initializeShard(ShardRouting unassignedShard, String nodeId, @Nullable String existingAllocationId,
                                    long expectedSize, RoutingChangesObserver routingChangesObserver) {
    ensureMutable();
    assert unassignedShard.unassigned() : "expected an unassigned shard " + unassignedShard;
    // initialize the unassigned shard
    ShardRouting initializedShard = unassignedShard.initialize(nodeId, existingAllocationId, expectedSize);
    // add it to the target node's shard list
    node(nodeId).add(initializedShard);
    inactiveShardCount++;
    if (initializedShard.primary()) {
        inactivePrimaryCount++;
    }
    addRecovery(initializedShard);
    // record the shard among the assigned shards
    assignedShardsAdd(initializedShard);
    // notify the observer of the state change
    routingChangesObserver.shardInitialized(unassignedShard, initializedShard);
    return initializedShard;
}
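That completes the allocation of the primary shard; once allocation succeeds, the new cluster state is built. When makeAllocationDecision succeeds, unassignedShard.initialize() creates a new ShardRouting object,
and the related information is added to the cluster state, which is broadcast later.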
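The bookkeeping done by initializeShard() can be sketched as follows. This is a simplified model that assumes an immutable shard object: the Shard and RoutingTable classes are hypothetical, and only the node list and the two inactive counters from the listing are mirrored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of shard initialization; not the Elasticsearch classes.
final class InitializeShardSketch {

    enum State { UNASSIGNED, INITIALIZING }

    static final class Shard {
        final String id;
        final boolean primary;
        final State state;
        final String nodeId; // null while unassigned

        Shard(String id, boolean primary, State state, String nodeId) {
            this.id = id;
            this.primary = primary;
            this.state = state;
            this.nodeId = nodeId;
        }

        // Returns a new object instead of mutating this one.
        Shard initialize(String targetNodeId) {
            if (state != State.UNASSIGNED) {
                throw new IllegalStateException("expected an unassigned shard " + id);
            }
            return new Shard(id, primary, State.INITIALIZING, targetNodeId);
        }
    }

    static final class RoutingTable {
        final Map<String, List<Shard>> shardsByNode = new HashMap<>();
        int inactiveShardCount;
        int inactivePrimaryCount;

        Shard initializeShard(Shard unassigned, String nodeId) {
            Shard initialized = unassigned.initialize(nodeId);
            // add the shard to the target node's list
            shardsByNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(initialized);
            // an initializing shard is assigned but not yet active
            inactiveShardCount++;
            if (initialized.primary) {
                inactivePrimaryCount++;
            }
            return initialized;
        }
    }
}
```

Because the original object is left untouched, it and the new object can both be handed to an observer for comparison.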
Primary shard decider flow
This section analyzes the makeAllocationDecision method of the PrimaryShardAllocator class, implemented as follows
public AllocateUnassignedDecision makeAllocationDecision(final ShardRouting unassignedShard,
                                                         final RoutingAllocation allocation,
                                                         final Logger logger) {
    if (isResponsibleFor(unassignedShard) == false) {
        // this allocator is not responsible for allocating this shard
        return AllocateUnassignedDecision.NOT_TAKEN;
    }
    final boolean explain = allocation.debugDecision();
    // fetch the shard metadata
    final FetchResult<NodeGatewayStartedShards> shardState = fetchData(unassignedShard, allocation);
    if (shardState.hasData() == false) {
        allocation.setHasPendingAsyncFetch();
        List<NodeAllocationResult> nodeDecisions = null;
        if (explain) {
            // this call invokes the deciders
            nodeDecisions = buildDecisionsForAllNodes(unassignedShard, allocation);
        }
        return AllocateUnassignedDecision.no(AllocationStatus.FETCHING_SHARD_DATA, nodeDecisions);
    }
    // don't create a new IndexSetting object for every shard as this could cause a lot of garbage
    // on cluster restart if we allocate a boat load of shards
    final IndexMetaData indexMetaData = allocation.metaData().getIndexSafe(unassignedShard.index());
    final Set<String> inSyncAllocationIds = indexMetaData.inSyncAllocationIds(unassignedShard.id());
    final boolean snapshotRestore = unassignedShard.recoverySource().getType() == RecoverySource.Type.SNAPSHOT;
    assert inSyncAllocationIds.isEmpty() == false;
    // use in-sync allocation ids to select nodes
    final NodeShardsResult nodeShardsResult = buildNodeShardsResult(unassignedShard, snapshotRestore,
        allocation.getIgnoreNodes(unassignedShard.shardId()), inSyncAllocationIds, shardState, logger);
    final boolean enoughAllocationsFound = nodeShardsResult.orderedAllocationCandidates.size() > 0;
    logger.debug("[{}][{}]: found {} allocation candidates of {} based on allocation ids: [{}]", unassignedShard.index(),
        unassignedShard.id(), nodeShardsResult.orderedAllocationCandidates.size(), unassignedShard, inSyncAllocationIds);
    if (enoughAllocationsFound == false) {
        if (snapshotRestore) {
            // let BalancedShardsAllocator take care of allocating this shard
            logger.debug("[{}][{}]: missing local data, will restore from [{}]",
                unassignedShard.index(), unassignedShard.id(), unassignedShard.recoverySource());
            return AllocateUnassignedDecision.NOT_TAKEN;
        } else {
            // We have a shard that was previously allocated, but we could not find a valid shard copy to allocate the primary.
            // We could just be waiting for the node that holds the primary to start back up, in which case the allocation for
            // this shard will be picked up when the node joins and we do another allocation reroute
            logger.debug("[{}][{}]: not allocating, number_of_allocated_shards_found [{}]",
                unassignedShard.index(), unassignedShard.id(), nodeShardsResult.allocationsFound);
            return AllocateUnassignedDecision.no(AllocationStatus.NO_VALID_SHARD_COPY,
                explain ? buildNodeDecisions(null, shardState, inSyncAllocationIds) : null);
        }
    }
    NodesToAllocate nodesToAllocate = buildNodesToAllocate(
        allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, false
    );
    DiscoveryNode node = null;
    String allocationId = null;
    boolean throttled = false;
    if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
        DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
        logger.debug("[{}][{}]: allocating [{}] to [{}] on primary allocation",
            unassignedShard.index(), unassignedShard.id(), unassignedShard, decidedNode.nodeShardState.getNode());
        node = decidedNode.nodeShardState.getNode();
        allocationId = decidedNode.nodeShardState.allocationId();
    } else if (nodesToAllocate.throttleNodeShards.isEmpty() && !nodesToAllocate.noNodeShards.isEmpty()) {
        // The deciders returned a NO decision for all nodes with shard copies, so we check if primary shard
        // can be force-allocated to one of the nodes.
        nodesToAllocate = buildNodesToAllocate(allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, true);
        if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
            final DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
            final NodeGatewayStartedShards nodeShardState = decidedNode.nodeShardState;
            logger.debug("[{}][{}]: allocating [{}] to [{}] on forced primary allocation",
                unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeShardState.getNode());
            node = nodeShardState.getNode();
            allocationId = nodeShardState.allocationId();
        } else if (nodesToAllocate.throttleNodeShards.isEmpty() == false) {
            logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on forced primary allocation",
                unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
            throttled = true;
        } else {
            logger.debug("[{}][{}]: forced primary allocation denied [{}]",
                unassignedShard.index(), unassignedShard.id(), unassignedShard);
        }
    } else {
        // we are throttling this, since we are allowed to allocate to this node but there are enough allocations
        // taking place on the node currently, ignore it for now
        logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on primary allocation",
            unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
        throttled = true;
    }
    List<NodeAllocationResult> nodeResults = null;
    if (explain) {
        nodeResults = buildNodeDecisions(nodesToAllocate, shardState, inSyncAllocationIds);
    }
    if (allocation.hasPendingAsyncFetch()) {
        return AllocateUnassignedDecision.no(AllocationStatus.FETCHING_SHARD_DATA, nodeResults);
    } else if (node != null) {
        return AllocateUnassignedDecision.yes(node, allocationId, nodeResults, false);
    } else if (throttled) {
        return AllocateUnassignedDecision.throttle(nodeResults);
    } else {
        return AllocateUnassignedDecision.no(AllocationStatus.DECIDERS_NO, nodeResults, true);
    }
}
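The branching at the end of makeAllocationDecision is easier to follow once the logging and explain output are stripped away. The sketch below mirrors only that branching, under simplifying assumptions: nodes are plain strings, the deciders' answers arrive as ready-made lists, and the forced retry is represented by two extra lists. The PrimaryDecisionSketch class and its parameter names are hypothetical.

```java
import java.util.List;

// Hypothetical, simplified model of the final branching; not the Elasticsearch classes.
final class PrimaryDecisionSketch {

    enum Outcome { YES, THROTTLED, NO }

    static final class Result {
        final Outcome outcome;
        final String node; // null unless outcome is YES

        Result(Outcome outcome, String node) {
            this.outcome = outcome;
            this.node = node;
        }
    }

    // yes / throttle / no: candidate nodes grouped by the deciders' answer on the first pass.
    // forcedYes / forcedThrottle: the same grouping after retrying with forced allocation.
    static Result decide(List<String> yes, List<String> throttle, List<String> no,
                         List<String> forcedYes, List<String> forcedThrottle) {
        String node = null;
        boolean throttled = false;
        if (!yes.isEmpty()) {
            // at least one node was accepted: take the first (best ordered) candidate
            node = yes.get(0);
        } else if (throttle.isEmpty() && !no.isEmpty()) {
            // every node holding a copy said NO: check whether the primary can be force-allocated
            if (!forcedYes.isEmpty()) {
                node = forcedYes.get(0);
            } else if (!forcedThrottle.isEmpty()) {
                throttled = true;
            }
            // otherwise the forced allocation is denied as well
        } else {
            // allocation is allowed in principle, but the nodes are busy for now
            throttled = true;
        }
        if (node != null) {
            return new Result(Outcome.YES, node);
        }
        return new Result(throttled ? Outcome.THROTTLED : Outcome.NO, null);
    }
}
```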
Reference: ES source code analysis