Elasticsearch Bulk: Basic Flow (Part 2)

cigarL

Category: elasticsearch. Tags: elasticsearch

Article link: https://blog.csdn.net/weixin_43211119/article/details/103886150

1.3.2.2.1 Performing the write

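Because there is a great deal of listener, validation, and initialization code along the way, we follow the call chain straight to the part that performs the write: ReplicationOperation#execute()#perform(request) -> TransportReplicationAction#perform(Request request) -> TransportShardBulkAction#shardOperationOnPrimary(…) -> performOnPrimary(…) -> performOnPrimary(…) -> executeBulkItemRequest(…) --> executeIndexRequestOnPrimary(…) --> applyIndexOperationOnPrimary(…) --> applyIndexOperation(…) --> index(…) --> InternalEngine#index()
Note that the chain above uses two kinds of arrows, "->" and "-->". The steps from the first "-->" onward contain operations that matter and are worth studying closely, whereas the "->" steps mostly prepare for the next stage and can be skipped on a first reading.
The chain ends at index(…) --> InternalEngine#index(), which is the main write path. The engine first derives a strategy (the plan) from the index operation and then acts on that plan; for a normal write this means calling indexIntoLucene(…) and then writing the translog.
When building the plan, the engine first checks the origin of the request: is it a write on the primary shard, or another origin such as translog recovery? (Here we are writing on the primary; the other origins will be covered later.)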

if (index.origin() == Operation.Origin.PRIMARY) {
    return planIndexingAsPrimary(index);
} else {
    return planIndexingAsNonPrimary(index);
}

The external request is thereby converted into an internal operation that is safe to execute:

final IndexResult indexResult;
if (plan.earlyResultOnPreFlightError.isPresent()) {
    indexResult = plan.earlyResultOnPreFlightError.get();
    assert indexResult.getResultType() == Result.Type.FAILURE : indexResult.getResultType();
} else if (plan.indexIntoLucene || plan.addStaleOpToLucene) {
    indexResult = indexIntoLucene(index, plan);
} else {
    indexResult = new IndexResult(
            plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
}
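
The plan-then-execute branching above can be sketched in isolation. The following is a minimal illustration, not Elasticsearch code: IndexingPlanSketch, Plan, Result and the three factory methods are hypothetical names made up for this example, and the "lucene" and "skipped" strings only label which branch was taken.

```java
import java.util.Optional;

// Hypothetical, simplified stand-ins; none of these are Elasticsearch classes.
final class IndexingPlanSketch {
    enum ResultType { SUCCESS, FAILURE }

    static final class Result {
        final ResultType type;
        final String path; // label of the branch that produced this result
        Result(ResultType type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    static final class Plan {
        final Optional<Result> earlyResultOnPreFlightError;
        final boolean indexIntoLucene;

        private Plan(Optional<Result> early, boolean indexIntoLucene) {
            this.earlyResultOnPreFlightError = early;
            this.indexIntoLucene = indexIntoLucene;
        }

        // The plan already knows the operation must fail (for example a version conflict).
        static Plan failEarly(String reason) {
            return new Plan(Optional.of(new Result(ResultType.FAILURE, reason)), false);
        }

        // Normal case: the document has to be written to Lucene.
        static Plan writeToLucene() {
            return new Plan(Optional.empty(), true);
        }

        // Nothing to write; a result is built directly from the plan.
        static Plan skipLucene() {
            return new Plan(Optional.empty(), false);
        }
    }

    // Same three-way branch as the snippet above: early failure, Lucene write, or skip.
    static Result execute(Plan plan) {
        if (plan.earlyResultOnPreFlightError.isPresent()) {
            return plan.earlyResultOnPreFlightError.get();
        } else if (plan.indexIntoLucene) {
            return new Result(ResultType.SUCCESS, "lucene");
        } else {
            return new Result(ResultType.SUCCESS, "skipped");
        }
    }
}
```

Separating the decision (the plan) from the side effect (the write) keeps the checks in one place, so the write step never needs to re-validate the request.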

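Writing to Lucene, i.e. the indexIntoLucene(…) part: it first checks that the seqNo and version are greater than or equal to 0 and updates them on the operation. After that comes the Lucene write itself: IndexWriter's addDocument/addDocuments or updateDocument/updateDocuments is called to add or update documents.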

// Update path: replace the document(s) identified by uid
if (docs.size() > 1) {
    indexWriter.updateDocuments(uid, docs);
} else {
    indexWriter.updateDocument(uid, docs.get(0));
}
// Add path: append new document(s)
if (docs.size() > 1) {
    indexWriter.addDocuments(docs);
} else {
    indexWriter.addDocument(docs.get(0));
}

Next, the translog is written.

// Only write the translog if the request does not itself come from the translog
if (index.origin().isFromTranslog() == false) {
    final Translog.Location location;
    // If the operation above succeeded, append it to the translog and record its location
    if (indexResult.getResultType() == Result.Type.SUCCESS) {
        location = translog.add(new Translog.Index(index, indexResult));
    } else if (indexResult.getSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO) {
        // If the document failed to index but a seqNo was already assigned, record the operation as a "no-op"
        final NoOp noOp = new NoOp(indexResult.getSeqNo(), index.primaryTerm(), index.origin(),
            index.startTime(), indexResult.getFailure().toString());
        location = innerNoOp(noOp).getTranslogLocation();
    } else {
        location = null;
    }
    indexResult.setTranslogLocation(location);
}
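
The three outcomes of the translog step can be condensed into one small decision function. This is only a sketch under stated assumptions: TranslogDecisionSketch, Entry and entryFor are hypothetical names, and the value -2 for UNASSIGNED_SEQ_NO is an assumption about SequenceNumbers.UNASSIGNED_SEQ_NO rather than something the snippet above shows.

```java
// Hypothetical sketch of the translog decision; not Elasticsearch code.
final class TranslogDecisionSketch {
    // Assumed to mirror SequenceNumbers.UNASSIGNED_SEQ_NO.
    static final long UNASSIGNED_SEQ_NO = -2L;

    enum Entry { INDEX, NO_OP, NONE }

    static Entry entryFor(boolean fromTranslog, boolean success, long seqNo) {
        if (fromTranslog) {
            // The operation is being replayed from the translog, so it is not written again.
            return Entry.NONE;
        }
        if (success) {
            // Normal case: record the index operation.
            return Entry.INDEX;
        }
        if (seqNo != UNASSIGNED_SEQ_NO) {
            // The write failed after a seqNo was assigned: record a no-op under that seqNo.
            return Entry.NO_OP;
        }
        // The write failed before any seqNo was assigned: nothing to record.
        return Entry.NONE;
    }
}
```

A plausible reading of the no-op branch is that a sequence number, once assigned, should not leave a hole in the operation history, which would otherwise stop the checkpoints discussed in the next section from advancing past it.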

1.3.2.2.2 Updating the LocalCheckpoint

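First, a word on checkpoints, which come in two kinds: LocalCheckpoint and GlobalCheckpoint. Checkpoints were introduced in 6.x and record the latest position reached by operations. The LocalCheckpoint records how far the current shard copy has got (which operation have I executed up to?). The GlobalCheckpoint records the globally agreed position (which operation have I successfully executed up to, and does that match the primary's GlobalCheckpoint?). It is used to keep every copy consistent with the primary: if my GlobalCheckpoint and LocalCheckpoint differ and are lower than the primary's, some of my operations have been lost and must be executed again.
After each operation the primary first updates its LocalCheckpoint. Once that succeeds, if the LocalCheckpoint is greater than the GlobalCheckpoint, the operation has extended the history and the GlobalCheckpoint on the primary must be updated too. The request is then forwarded to the nodes holding the replicas, and each replica likewise updates its LocalCheckpoint after executing the operation.
Before ES 6.7, a replica's GlobalCheckpoint was only updated when the next request arrived: the replica then checked whether its current LocalCheckpoint matched the primary's GlobalCheckpoint and, if so, concluded that everything was in order and advanced its GlobalCheckpoint to the LocalCheckpoint. From 6.7 on, the replica's GlobalCheckpoint is updated as soon as the current request finishes instead of waiting for the next one. The benefit is fewer operations to replay during recovery. Suppose replica A's GlobalCheckpoint is 2 but it has already executed correctly up to 5. If recovery starts at that moment, earlier versions start from 3 and re-execute 3~5, whereas 6.7 and later do not repeat that work.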

// Check whether the LocalCheckpoint needs updating, i.e. whether the new value is greater than the current one
boolean increasedLocalCheckpoint = updateLocalCheckpoint(allocationId, cps, localCheckpoint);
// pendingInSync is a Set of allocation IDs of copies that are still waiting to be marked in-sync
boolean pending = pendingInSync.contains(allocationId);
// If this copy is pending and its localCheckpoint has reached the GlobalCheckpoint
// (Local is always updated before Global, so normally Local >= Global)
if (pending && cps.localCheckpoint >= getGlobalCheckpoint()) {
    // Remove it from the pending set
    pendingInSync.remove(allocationId);
    pending = false;
    // Mark this copy as in-sync; used when the GlobalCheckpoint is computed
    cps.inSync = true;
    replicationGroup = calculateReplicationGroup();
    notifyAllWaiters();
}
// Update the GlobalCheckpoint
if (increasedLocalCheckpoint && pending == false) {
    updateGlobalCheckpointOnPrimary();
}
/**
 * The actual GlobalCheckpoint update
 */
private synchronized void updateGlobalCheckpointOnPrimary() {
    final CheckpointState cps = checkpoints.get(shardAllocationId);
    final long globalCheckpoint = cps.globalCheckpoint;
    // Compute the GlobalCheckpoint: after validation, take the minimum localCheckpoint across the in-sync copies
    // (a running Math.min over cps.localCheckpoint that starts from Long.MAX_VALUE)
    final long computedGlobalCheckpoint = computeGlobalCheckpoint(pendingInSync, checkpoints.values(), getGlobalCheckpoint());
    assert computedGlobalCheckpoint >= globalCheckpoint : "new global checkpoint [" + computedGlobalCheckpoint +
        "] is lower than previous one [" + globalCheckpoint + "]";
    // Update only if the computed value differs from the current one (given the assertion above, that means it is greater)
    if (globalCheckpoint != computedGlobalCheckpoint) {
        cps.globalCheckpoint = computedGlobalCheckpoint;
        onGlobalCheckpointUpdated.accept(computedGlobalCheckpoint);
    }
}
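
To make the two checkpoints concrete, here is a small self-contained sketch. It is not the Elasticsearch implementation: CheckpointSketch, LocalCheckpointTracker, markCompleted and globalCheckpoint are hypothetical names, and the model assumes that a local checkpoint is the highest seqNo below which every operation has completed, and that the global checkpoint is the minimum local checkpoint over the in-sync copies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of local and global checkpoints.
final class CheckpointSketch {
    static final long NO_OPS_PERFORMED = -1L;

    /** Tracks the highest seqNo such that every operation up to and including it has completed. */
    static final class LocalCheckpointTracker {
        private final Set<Long> completedAhead = new HashSet<>();
        private long checkpoint = NO_OPS_PERFORMED;

        void markCompleted(long seqNo) {
            if (seqNo <= checkpoint) {
                return; // already covered by the checkpoint
            }
            completedAhead.add(seqNo);
            // Advance over every contiguous completed seqNo; a gap stops the advance.
            while (completedAhead.remove(checkpoint + 1)) {
                checkpoint++;
            }
        }

        long checkpoint() {
            return checkpoint;
        }
    }

    /** Minimum local checkpoint over the in-sync copies, or the fallback if there are none. */
    static long globalCheckpoint(long fallback, long... inSyncLocalCheckpoints) {
        if (inSyncLocalCheckpoints.length == 0) {
            return fallback;
        }
        long min = Long.MAX_VALUE;
        for (long localCheckpoint : inSyncLocalCheckpoints) {
            min = Math.min(min, localCheckpoint);
        }
        return min;
    }
}
```

For example, a copy that has completed operations 0, 1 and 3 has a local checkpoint of 1, because operation 2 is still missing; once 2 completes, the checkpoint jumps to 3. With a primary at 5 and a replica at 2, globalCheckpoint(-1, 5, 2) yields 2 under this model.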

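1.3.2.2.3 Forwarding the request to the replicas

Once the write on the primary succeeds, the request has to be forwarded to the replicas. The replicaRequest is taken from primaryResult after the primary write; it is not the original Request.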

private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest,
                              final long globalCheckpoint, final long maxSeqNoOfUpdatesOrDeletes) {
    replicasProxy.performOn(shard, replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes, new ActionListener<ReplicaResponse>() {
        @Override
        public void onResponse(ReplicaResponse response) {
            // On success, update the LocalCheckpoint and GlobalCheckpoint tracked for this replica
            primary.updateLocalCheckpointForShard(shard.allocationId().getId(), response.localCheckpoint());
            primary.updateGlobalCheckpointForShard(shard.allocationId().getId(), response.globalCheckpoint());
        }
        @Override
        public void onFailure(Exception replicaException) {...}
    });
}
/**
 * The actual forwarding of the request
 */
protected void sendReplicaRequest(
        final ConcreteReplicaRequest<ReplicaRequest> replicaRequest,
        final DiscoveryNode node,
        // Listener used to receive the response
        final ActionListener<ReplicationOperation.ReplicaResponse> listener) {
    final ActionListenerResponseHandler<ReplicaResponse> handler = new ActionListenerResponseHandler<>(listener, in -> {
        ReplicaResponse replicaResponse = new ReplicaResponse();
        replicaResponse.readFrom(in);
        return replicaResponse;
    });
    // Send the transport request to the target node
    transportService.sendRequest(node, transportReplicaAction, replicaRequest, transportOptions, handler);
}
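
The listener pattern used for replication can be illustrated without any transport layer. The sketch below is a simplified, synchronous stand-in: ReplicaFanOutSketch, ReplicaListener and ReplicaProxy are hypothetical names, and a real proxy would send the request over the network and invoke the listener asynchronously.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of fanning a request out to replicas; not Elasticsearch code.
final class ReplicaFanOutSketch {
    interface ReplicaListener {
        void onResponse(long localCheckpoint, long globalCheckpoint);
        void onFailure(Exception e);
    }

    interface ReplicaProxy {
        void performOn(String allocationId, String replicaRequest, ReplicaListener listener);
    }

    // Per-replica state the primary keeps, keyed by allocation ID.
    final Map<String, Long> localCheckpoints = new HashMap<>();
    final Map<String, Long> globalCheckpoints = new HashMap<>();
    final List<String> failedReplicas = new ArrayList<>();

    void replicate(List<String> replicaAllocationIds, String replicaRequest, ReplicaProxy proxy) {
        for (final String allocationId : replicaAllocationIds) {
            proxy.performOn(allocationId, replicaRequest, new ReplicaListener() {
                @Override
                public void onResponse(long localCheckpoint, long globalCheckpoint) {
                    // Success: remember both checkpoints reported by the replica.
                    localCheckpoints.put(allocationId, localCheckpoint);
                    globalCheckpoints.put(allocationId, globalCheckpoint);
                }

                @Override
                public void onFailure(Exception e) {
                    // Failure: remember the copy so that it can be reported as failed.
                    failedReplicas.add(allocationId);
                }
            });
        }
    }
}
```

Because each replica gets its own listener, one failing copy does not prevent the primary from recording the checkpoints of the copies that succeeded.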

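1.3.2.2.4 What happens when a replica fails

When the request is forwarded to a replica, a listener observes the outcome of the replica operation. On success it updates the checkpoints, builds the response, and so on, which needs no further comment. The failure path, however, is where several commonly seen problems originate. (Only the code flow is outlined here; the code itself is not traced in detail.)
When a replica write fails, an internal request, internal:cluster/shard/failure, is sent to the master. On receiving it, the master tries to match the shardId and allocationId against the routing table. If they do not match, the routing table on the primary's node is out of date; the master updates the cluster state and produces a StaleShard. If they do match, the routing information is correct but the shard cannot be written, and a FailedShard is produced. FailedShard handling follows the code flow in the original figure: an UnassignedInfo is created and the shard is added to unassigned. After 5 retries the shard stays unassigned for good.
To summarize the problems encountered:
1. During a restart with writes still flowing, the log is flooded with shard failed messages and tasks pile up. The cause is the replica write failure: failure requests keep being sent, and each one requires the master to update the cluster state (the StaleShard case above). Earlier versions determined task uniqueness incorrectly, i.e. identical requests produced distinct tasks, so the number of tasks grew abnormally. Newer versions changed the logic, but tasks are still only unique per shard, so the problem can still occur; it has merely been mitigated.
2. After backporting the real-memory circuit breaker, shards keep cycling through initialization when memory is short. This is the FailedShard case above: the replica write fails and the shard is moved to the unassigned queue. The community's answer is that ES has to keep primary and replica writes consistent, which is why it does this.
Community answer: https://discuss.elastic.co/t/why-move-shard-to-unassigned-when-the-circuit-breaker-is-open/212085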
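
The master-side handling of a shard failure described in this section can be modelled in a few lines. This is a rough sketch, not the real allocation code: ShardFailureSketch, Outcome, assign and onShardFailure are hypothetical names, the routing table is reduced to a shardId-to-allocationId map, and the limit of 5 is taken from the text (it is assumed here to correspond to the default of the index.allocation.max_retries setting).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how a master might classify a reported shard failure.
final class ShardFailureSketch {
    // Taken from the "5 retries" in the text; assumed default of index.allocation.max_retries.
    static final int MAX_RETRIES = 5;

    enum Outcome { STALE_SHARD, FAILED_SHARD_RETRY, FAILED_SHARD_STAYS_UNASSIGNED }

    private final Map<String, String> routingTable = new HashMap<>();      // shardId -> allocationId
    private final Map<String, Integer> failedAllocations = new HashMap<>(); // shardId -> failure count

    void assign(String shardId, String allocationId) {
        routingTable.put(shardId, allocationId);
    }

    Outcome onShardFailure(String shardId, String allocationId) {
        // No match: the sender's routing information is out of date.
        if (!allocationId.equals(routingTable.get(shardId))) {
            return Outcome.STALE_SHARD;
        }
        // Match: the copy really failed, so it becomes unassigned and the failure is counted.
        routingTable.remove(shardId);
        int failures = failedAllocations.merge(shardId, 1, Integer::sum);
        return failures >= MAX_RETRIES
                ? Outcome.FAILED_SHARD_STAYS_UNASSIGNED
                : Outcome.FAILED_SHARD_RETRY;
    }
}
```

In this model a report carrying an unknown allocation ID never counts against the shard, whereas five genuine failures in a row leave it unassigned, which matches the behaviour described for problem 2.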
