Elasticsearch-PEER RECOVERY (Part 1)

Latest recommended article published on 2023-09-13 14:53:26

cigarL

Category: elasticsearch  Tags: elasticsearch

Article link: https://blog.csdn.net/weixin_43211119/article/details/103886261

3. Index Recovery

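Code entry point: IndicesClusterStateService#applyClusterState
After reading the index-creation part, an obvious question remains: we only saw the index being created, so when and where is the cluster state synchronized? Stepping into IndicesClusterStateService#applyClusterState should make it clear. I'll paste the code directly: synchronizing state means going into createIndices, which sends the corresponding action through transportService to update the mapping. I won't go into that here; this chapter studies the index recovery flow.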

public synchronized void applyClusterState(final ClusterChangedEvent event) {
    // ... (earlier checks omitted)
    final ClusterState state = event.state();
    updateFailedShardsCache(state);
    deleteIndices(event); // also deletes shards of deleted indices
    removeUnallocatedIndices(event); // also removes shards of removed indices
    failMissingShards(state);
    removeShards(state);   // removes any local shards that doesn't match what the master expects
    updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache
    createIndices(state);
    createOrUpdateShards(state);
}

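As the code above shows, every time cluster information needs to be synchronized the node ends up here: the state is obtained from the event and each step runs in turn, doing work only when there is something relevant for it. Shard recovery is a create-or-update operation, so it goes through createOrUpdateShards. Let's look at the actual code flow.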

3.1 Primary shard recovery

3.2 Replica shard recovery

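First, a word on the routing table and routing nodes, which can be inspected with _cluster/state/routing_table and _cluster/state/routing_nodes. The former stores the "index -> shard" mapping, i.e. which node holds each shard of each index; the latter stores the "node -> shard" mapping, i.e. which shards of which indices each node holds.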
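
To make the two views concrete, here is a minimal sketch in plain Java. It does not use Elasticsearch's own routing classes; the class name `RoutingViewsSketch`, the nested-map representation, and the node names are all made up for illustration. It inverts an "index -> shard -> nodes" table into a "node -> shards" view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RoutingViewsSketch {

    // Inverts "index -> shard id -> nodes holding a copy" into "node -> shard copies on that node".
    static Map<String, List<String>> toRoutingNodes(Map<String, Map<Integer, List<String>>> routingTable) {
        Map<String, List<String>> routingNodes = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<String>>> index : routingTable.entrySet()) {
            for (Map.Entry<Integer, List<String>> shard : index.getValue().entrySet()) {
                for (String node : shard.getValue()) {
                    routingNodes.computeIfAbsent(node, n -> new ArrayList<>())
                        .add(index.getKey() + "[" + shard.getKey() + "]");
                }
            }
        }
        return routingNodes;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: one index "logs" with two shards, each held by two nodes.
        Map<Integer, List<String>> logs = new TreeMap<>();
        logs.put(0, List.of("node-1", "node-2"));
        logs.put(1, List.of("node-2", "node-3"));
        Map<String, Map<Integer, List<String>>> routingTable = new TreeMap<>();
        routingTable.put("logs", logs);
        System.out.println(toRoutingNodes(routingTable));
    }
}
```

Both views carry the same information; the node-oriented one is simply cheaper to consult when a node asks "which shards am I supposed to hold?".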
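The node fetches its own routing information (really the shard routing entries): through routing nodes it gets the shards assigned to the current node, resolves each shard's Index, and then gets the IndexService to check whether the shard already exists locally. If it does not, the flow goes into createShard; otherwise it goes into updateShard. createShard mainly handles shards in the INITIALIZING state, which means shard recovery also passes through createShard.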
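
The create-or-update decision described above can be sketched as follows. This is a simplified illustration, not the Elasticsearch implementation: shards are plain strings, and `CreateOrUpdateSketch` and `decide` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CreateOrUpdateSketch {

    // For every shard the routing assigns to this node: "create" it when no local
    // copy exists yet, otherwise "update" the existing copy.
    static List<String> decide(List<String> assignedShards, Set<String> localShards) {
        List<String> decisions = new ArrayList<>();
        for (String shard : assignedShards) {
            String action = localShards.contains(shard) ? "update" : "create";
            decisions.add(action + " " + shard);
        }
        return decisions;
    }

    public static void main(String[] args) {
        // logs[1] already exists locally, logs[0] was just assigned and must be recovered.
        System.out.println(decide(List.of("logs[0]", "logs[1]"), Set.of("logs[1]")));
    }
}
```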

3.2.1 INIT

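When indicesService#createShard runs, it initializes a RecoveryState (with its stage set to INIT) and passes it along as a parameter. RecoveryState holds the recovery information: the current recovery stage, whether the shard is a primary, the shard ID, the source node, the target node, and so on.
It first gets the indexService for the index and calls its createShard, which builds the shard (path, mapperService, engine, etc.). It then enters IndexShard#startRecovery, which branches on the recovery type; here the type is PEER, i.e. replica shard recovery. That leads into doRecovery, which builds a StartRecoveryRequest (shard ID, seqNo, etc.) to send to the primary shard and sets the recovery stage to INDEX.
(While building the StartRecoveryRequest) To get the seqNo, it first reads the globalCheckpoint from the translog, then lists all commits, uses the globalCheckpoint and the commits to find the safeCommit, and finally takes the checkpoint recorded in the safeCommit as the seqNo.
Note: getting the seqNo requires deciding, from the commits, which one is the safe commit. Commit information can be viewed with {indexName}/_stats?filter_path=**.commit&level=shards&pretty (ES wraps most of its data structures and exposes them through APIs, which helps in understanding the code; when you meet an unfamiliar data structure in the source, it is worth checking whether an API exposes it).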

// Branch on the recovery type
public void startRecovery() {
    switch (recoveryState.getRecoverySource().getType()) {
        case EMPTY_STORE:
        case EXISTING_STORE:
            // primary shard recovers from the local store
            recoverFromStore();
            break;
        case PEER:
            // replica shard recovers from the remote primary shard
            recoveryTargetService.startRecovery();
            break;
        case SNAPSHOT:
            // recover from a snapshot
            restoreFromRepository(repository);
            break;
        case LOCAL_SHARDS:
            // recover from shards on this node (shrink)
            recoverFromLocalShards();
            break;
        default:
            throw new IllegalArgumentException("Unknown recovery source " + recoveryState.getRecoverySource());
    }
}

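// Determine the sequence number from which recovery should start
public static long getStartingSeqNo(final RecoveryTarget recoveryTarget) {
    try {
        // the target's store (used here mainly for the directory, i.e. the path of the tlog and ckp files read below)
        final Store store = recoveryTarget.store();
        // UUID of the translog; ES identifies a translog by a unique UUID (as it does shards, indices, nodes, etc.)
        final String translogUUID = store.readLastCommittedSegmentsInfo().getUserData().get(Translog.TRANSLOG_UUID_KEY);
        // read the globalCheckpoint
        final long globalCheckpoint = Translog.readGlobalCheckpoint(recoveryTarget.translogLocation(), translogUUID);
        // list all commits
        final List<IndexCommit> existingCommits = DirectoryReader.listCommits(store.directory());
        // pick the safeCommit
        final IndexCommit safeCommit = CombinedDeletionPolicy.findSafeCommitPoint(existingCommits, globalCheckpoint);
        // load the sequence number info stored in that commit
        final SequenceNumbers.CommitInfo seqNoStats = Store.loadSeqNoInfo(safeCommit);
        // Tail of the method, sketched from the description above (an assumption; check against the 7.1 source):
        // resume right after the safe commit's local checkpoint when the commit is fully covered by the
        // global checkpoint, otherwise signal that no sequence-number-based recovery is possible.
        if (seqNoStats.maxSeqNo <= globalCheckpoint) {
            return seqNoStats.localCheckpoint + 1;
        }
        return SequenceNumbers.UNASSIGNED_SEQ_NO;
    } catch (final IOException e) {
        return SequenceNumbers.UNASSIGNED_SEQ_NO;
    }
}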
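
To illustrate how a safe commit and a starting seqNo might be derived, here is a self-contained sketch in plain Java. It is a simplification under stated assumptions, not the Elasticsearch code: each commit is modeled as a `{localCheckpoint, maxSeqNo}` pair, translog UUID checks are ignored, the newest commit whose maxSeqNo does not exceed the global checkpoint is treated as safe (falling back to the oldest commit), and `UNASSIGNED_SEQ_NO` is a local constant standing in for "no sequence-number-based recovery possible". The class and method names are hypothetical.

```java
import java.util.List;

class SafeCommitSketch {

    // Local stand-in for "cannot recover by sequence number" (an assumption of this sketch).
    static final long UNASSIGNED_SEQ_NO = -2;

    // Commits are ordered oldest first; each entry is {localCheckpoint, maxSeqNo}.
    // Walk from newest to oldest and return the first commit fully covered by the global checkpoint.
    static long[] findSafeCommit(List<long[]> commits, long globalCheckpoint) {
        for (int i = commits.size() - 1; i >= 0; i--) {
            if (commits.get(i)[1] <= globalCheckpoint) {
                return commits.get(i);
            }
        }
        return commits.get(0); // no safe commit found: fall back to the oldest one
    }

    // Recovery resumes right after the safe commit's local checkpoint, if that commit is really safe.
    static long startingSeqNo(List<long[]> commits, long globalCheckpoint) {
        long[] safe = findSafeCommit(commits, globalCheckpoint);
        return safe[1] <= globalCheckpoint ? safe[0] + 1 : UNASSIGNED_SEQ_NO;
    }

    public static void main(String[] args) {
        List<long[]> commits = List.of(new long[] {9, 9}, new long[] {19, 19}, new long[] {25, 30});
        System.out.println(startingSeqNo(commits, 22)); // second commit is the newest safe one
        System.out.println(startingSeqNo(commits, 5));  // nothing is safe
    }
}
```

The newest commit is skipped in the first call because it contains operations above the global checkpoint, which the primary may not have acknowledged.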

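I find the part that reads the translog and checkpoint files (the tlog and ckp files) rather interesting, so I'll describe it in detail:
Reading the checkpoint comes first. The checkpoint is read from the translog.ckp file, whose name is the string "translog" plus a ".ckp" suffix.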

public static final String CHECKPOINT_SUFFIX = ".ckp";
public static final String CHECKPOINT_FILE_NAME = "translog" + CHECKPOINT_SUFFIX;

static Checkpoint readCheckpoint(final Path location) throws IOException {
    return Checkpoint.read(location.resolve(CHECKPOINT_FILE_NAME));
}

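The file is validated first. The validation calls a Lucene API, the CodecUtil utility class; the method used here is checksumEntireFile. A Lucene file starts with the magic value 0x3fd76c17 and ends with a footer that begins with the bitwise complement of that magic value, which lets a reader confirm that both the start and the end of the file are intact.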

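// Validate the integrity of the file
public static long checksumEntireFile(IndexInput input) throws IOException {
  // work on a clone rather than on input itself, so that moving the file pointer does not affect later reads
  IndexInput clone = input.clone();
  // seek takes an absolute byte offset into the file, so seek(0) rewinds the clone to the start
  // (on a ChecksumIndexInput, seek can only move forward and is implemented by reading, and checksumming, the skipped bytes)
  clone.seek(0);
  ChecksumIndexInput in = new BufferedChecksumIndexInput(clone);
  // the current position must be the start of the file
  assert in.getFilePointer() == 0;
  // throw if the input is shorter than 16 bytes; why 16: the footer holds an int for the complement
  // of the magic value, an int for the algorithm ID, and a long for the CRC checksum
  if (in.length() < footerLength()) {
    throw new CorruptIndexException("misplaced codec footer (file truncated?): length=" + in.length() + " but footerLength==" + footerLength(), input);
  }
  // jump to the 16th byte from the end
  in.seek(in.length() - footerLength());
  // check that the footer is valid
  return checkFooter(in);
}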

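// Validate the footer
public static long checkFooter(ChecksumIndexInput in) throws IOException {
    // check FOOTER_MAGIC and algorithmID, i.e. that the complement of the magic value matches and the algorithm is 0
    validateFooter(in);
    // compute the CRC checksum of the input read so far
    long actualChecksum = in.getChecksum();
    // read the last 8 bytes, which hold the CRC checksum recorded in the file
    long expectedChecksum = readCRC(in);
    // throw if the CRC check fails
    if (expectedChecksum != actualChecksum) {
        throw new CorruptIndexException(...);
    }
    // the checksum is returned here, but ES does not use it; the call serves only as a validation step
    return actualChecksum;
}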
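
The footer layout and check can be reproduced with the JDK alone. The sketch below is an illustration under assumptions, not Lucene code: it uses `java.util.zip.CRC32` and big-endian `ByteBuffer` writes, computes the CRC over every byte that precedes the 8-byte CRC field, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class FooterCheckSketch {

    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int FOOTER_MAGIC = ~CODEC_MAGIC; // bitwise complement, 0xc02893e8
    static final int FOOTER_LENGTH = 16;          // int magic + int algorithm id + long CRC

    // Appends a footer to the payload: complemented magic, algorithm id 0,
    // then the CRC32 of everything written before the CRC field.
    static byte[] withFooter(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH); // big-endian by default
        buf.put(payload).putInt(FOOTER_MAGIC).putInt(0);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, buf.position());
        buf.putLong(crc.getValue());
        return buf.array();
    }

    // Returns the checksum when the footer is valid, otherwise throws.
    static long checkFooter(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            throw new IllegalStateException("misplaced footer (file truncated?)");
        }
        ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH);
        if (footer.getInt() != FOOTER_MAGIC) {
            throw new IllegalStateException("footer magic mismatch");
        }
        if (footer.getInt() != 0) {
            throw new IllegalStateException("unknown checksum algorithm");
        }
        CRC32 crc = new CRC32();
        crc.update(file, 0, file.length - Long.BYTES); // all bytes except the stored CRC
        long expected = footer.getLong();
        if (expected != crc.getValue()) {
            throw new IllegalStateException("checksum mismatch");
        }
        return expected;
    }

    public static void main(String[] args) {
        byte[] file = withFooter(new byte[] {1, 2, 3});
        System.out.println("checksum ok: " + Long.toHexString(checkFooter(file)));
        file[0] ^= 1; // flip one bit in the payload
        try {
            checkFooter(file);
        } catch (IllegalStateException e) {
            System.out.println("corruption detected: " + e.getMessage());
        }
    }
}
```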

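That completes the basic validation of the ckp file. Next the content is read: first the version number, which selects the matching data layout. ES currently maintains three versions of the format: 5.0 to 6.0, 6.0 to 6.4, and 6.4 onward. Since everything here is based on ES 7.1, the two earlier versions are not discussed.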

/**
 * Check the file header
 */
public static int checkHeader(DataInput in, String codec, int minVersion, int maxVersion) throws IOException {
    // read the first int
    final int actualHeader = in.readInt();
    // if the first int is not the expected value (the magic value 0x3fd76c17), the file is corrupted or is not a ckp-format file
    if (actualHeader != CODEC_MAGIC) {
        throw new CorruptIndexException("codec header mismatch: actual header=" + actualHeader + " vs expected header=" + CODEC_MAGIC, in);
    }
    return checkHeaderNoMagic(in, codec, minVersion, maxVersion);
}

/**
 * Check the part of the header that follows the magic value
 */
public static int checkHeaderNoMagic(DataInput in, String codec, int minVersion, int maxVersion) throws IOException {
    // read a String: a VInt (1 to 5 bytes) gives its length first, then the bytes are read
    final String actualCodec = in.readString();
    // if the string is not "ckp", validation fails
    if (!actualCodec.equals(codec)) {
        throw new CorruptIndexException("codec mismatch: actual codec=" + actualCodec + " vs expected codec=" + codec, in);
    }
    // read an int as the version number; it is invalid if below the minimum or above the maximum version
    // note that this is not an x.x.x version (such as 7.1.1) but 1, 2, or 3 for the three version ranges
    // mentioned above (5.0 to 6.0, 6.0 to 6.4, and 6.4 onward)
    final int actualVersion = in.readInt();
    if (actualVersion < minVersion) {
        throw new IndexFormatTooOldException(in, actualVersion, minVersion, maxVersion);
    }
    if (actualVersion > maxVersion) {
        throw new IndexFormatTooNewException(in, actualVersion, minVersion, maxVersion);
    }
    return actualVersion;
}
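
The "1 to 5 bytes" length prefix comes from the variable-length int encoding: seven payload bits per byte, with the high bit set on every byte except the last. A minimal sketch of that scheme in plain Java follows; `VIntSketch` and its methods are illustrative names, not Lucene's API.

```java
import java.io.ByteArrayOutputStream;

class VIntSketch {

    // Encodes an int using 7 bits per byte, low-order group first;
    // the high bit of a byte signals that more bytes follow.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a single VInt from the start of the array.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {3, 128, 300, Integer.MAX_VALUE}) {
            byte[] encoded = writeVInt(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded " + readVInt(encoded));
        }
    }
}
```

A short codec name such as "ckp" has length 3, so its length prefix fits in a single byte.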

/**
 * Checkpoint layout in the file for versions 6.4 and later
 */
static Checkpoint readCheckpointV6_4_0(final DataInput in) throws IOException {
    final long offset = in.readLong();
    final int numOps = in.readInt();
    final long generation = in.readLong();
    final long minSeqNo = in.readLong();
    final long maxSeqNo = in.readLong();
    final long globalCheckpoint = in.readLong();
    final long minTranslogGeneration = in.readLong();
    final long trimmedAboveSeqNo = in.readLong();
    return new Checkpoint(offset, numOps, generation, minSeqNo, maxSeqNo, globalCheckpoint, minTranslogGeneration, trimmedAboveSeqNo);
}

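That completes reading the checkpoint. To recap the structure of the ckp file: an int holding the magic value, then a String (length-prefixed with a VInt) holding "ckp" to identify the file type, then an int holding the version number (1, 2, or 3), then the checkpoint itself (the fields are as in the code above, so they are not repeated), and finally the footer (an int with the complement of the magic value + an int with the value 0 + a long with the CRC checksum).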
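
The layout just described can be exercised without Elasticsearch. The sketch below writes and reads the header and the eight checkpoint fields in that order using a big-endian `ByteBuffer`; the footer is left out for brevity. Big-endian byte order and the class and method names are assumptions of this illustration, not something taken from the ES source.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CkpLayoutSketch {

    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int VERSION = 3; // the "6.4 and later" layout

    // Writes magic, codec string, version and the eight checkpoint fields (footer omitted).
    static byte[] write(long offset, int numOps, long generation, long minSeqNo, long maxSeqNo,
                        long globalCheckpoint, long minTranslogGeneration, long trimmedAboveSeqNo) {
        byte[] codec = "ckp".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + codec.length + 4 + 8 + 4 + 6 * 8);
        buf.putInt(CODEC_MAGIC);
        buf.put((byte) codec.length).put(codec); // a VInt below 128 is a single byte
        buf.putInt(VERSION);
        buf.putLong(offset).putInt(numOps).putLong(generation).putLong(minSeqNo).putLong(maxSeqNo)
            .putLong(globalCheckpoint).putLong(minTranslogGeneration).putLong(trimmedAboveSeqNo);
        return buf.array();
    }

    // Validates the header, skips the fields before globalCheckpoint, and returns it.
    static long readGlobalCheckpoint(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        if (buf.getInt() != CODEC_MAGIC) {
            throw new IllegalStateException("codec header mismatch");
        }
        byte[] codec = new byte[buf.get()];
        buf.get(codec);
        if (!"ckp".equals(new String(codec, StandardCharsets.UTF_8))) {
            throw new IllegalStateException("codec mismatch");
        }
        if (buf.getInt() != VERSION) {
            throw new IllegalStateException("unsupported version");
        }
        buf.getLong(); // offset
        buf.getInt();  // numOps
        buf.getLong(); // generation
        buf.getLong(); // minSeqNo
        buf.getLong(); // maxSeqNo
        return buf.getLong();
    }

    public static void main(String[] args) {
        byte[] file = write(55L, 2, 7L, 0L, 41L, 40L, 6L, -2L);
        System.out.println(file.length + " bytes before the footer, globalCheckpoint=" + readGlobalCheckpoint(file));
    }
}
```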
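In readCheckpoint, after the checkpoint is read, the translog is read once as well, but the result is never used. The translog is opened only with the translogUUID so that its header can be checked, which guarantees that this "UUID -> checkpoint" pairing is valid. Let's see how the translog file is validated.
Starting from the checkpoint obtained above, take its generation; the translog file name is "translog-" + generation + ".tlog". The header format is similar to ckp, so I won't walk through the code: an int magic value 0x3fd76c17 + the String "translog" + an int version number (1, 2, or 3) + an int UUID length + the UUID + a long primaryTerm + an int checksum (when writing, ES simply casts the long checksum to an int; this part is not consistent with how ckp relies on Lucene).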
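
Two details from this paragraph are easy to check in isolation: how the file name is derived from the generation, and what casting a long checksum to an int does. The sketch below assumes the "translog-" file prefix and a CRC32 checksum (via `java.util.zip.CRC32`); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class TranslogHeaderSketch {

    // File name for a given generation, assuming the "translog-" prefix and ".tlog" suffix.
    static String translogFileName(long generation) {
        return "translog-" + generation + ".tlog";
    }

    // CRC32 of the header bytes, narrowed from long to int the way the text describes.
    static int truncatedChecksum(byte[] headerBytes) {
        CRC32 crc = new CRC32();
        crc.update(headerBytes, 0, headerBytes.length);
        return (int) crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(translogFileName(7));
        int stored = truncatedChecksum("123456789".getBytes(StandardCharsets.US_ASCII));
        // Masking the int back to an unsigned 32-bit value recovers the original CRC32.
        System.out.println(Long.toHexString(stored & 0xFFFFFFFFL));
    }
}
```

Because a CRC32 value only ever occupies the low 32 bits of the long, the cast changes the sign of the stored number but loses no information.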
