Primary-key id and partial update of documents in ES

georgesnoopy

Tags: elasticsearch, lucene, primary-key id, partial update

Article link: https://blog.csdn.net/sinat_14913533/article/details/91991707

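Many articles that introduce Elasticsearch (ES) mention Lucene's shortcomings. Two of them confused me:

1. A Lucene document has no globally unique primary-key id.

2. Lucene does not support updates.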

Confusion 1: no primary-key id.

A search through Lucene's search API takes two steps: 1. get the ids of the matching documents; 2. fetch each document's details by its document id.

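    // Step 1: run the query and collect the top 10 hits.
    TopDocs topDocs = searcher.search(queryParser.parse(gap), 10);    // #1
    System.out.println(topDocs.totalHits);
    for (ScoreDoc doc : topDocs.scoreDocs) {
        int docId = doc.doc;                       // the document number of this hit
        System.out.println(docId);
        // Step 2: load the stored fields of the hit by its document number.
        Document document = searcher.doc(docId);
        // System.out.println(document);
        System.out.println(document.get("title"));
    }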

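If that isn't a doc id, what is? It uniquely retrieves the details of the corresponding document, so why would it not count as unique? Very confusing.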

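First, let me pin down what qualifies as a primary-key id (unofficial, my own opinion): 1. unique; 2. stable. Unique is easy: every document's id must be different, nothing more to say. Stable means that if id=12345 is the primary-key id of document1, then as long as document1 has not been deleted, id=12345 always maps to document1 and never to any other document.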

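The docId here (an imprecise name; the official documentation calls it the document number) is unique only within a segment, not globally. The number an IndexSearcher hands back is the segment-local number plus that segment's base offset, so it is meaningful only within the reader's current view of the index.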

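In other words, this number changes as segments merge, so it cannot serve as a global primary key. That may make things even more puzzling: if the search has just finished step one (#1 in the code) and the docId then changes, how do I still fetch the document's details? Wouldn't everything get mixed up? It doesn't, because the search scope is fixed when the IndexSearcher is initialized: it sees the snapshot of the index at that moment, and later changes to the index are not visible through that IndexSearcher. (When using ES, don't casually turn a SearchRequest into a singleton.)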
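To make the renumbering concrete, here is a minimal sketch of a toy model, not Lucene's actual code. It assumes a segment is just a list of titles in which `null` marks a deleted slot, and that the index-wide number is the segment's base offset plus the local position. The class `DocNumberDemo` and its methods `numberOf` and `merge` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a "segment" is a list of titles, null marks a deleted slot.
class DocNumberDemo {

    /** Returns the index-wide number (segment base + local position) of a title, or -1. */
    static int numberOf(List<List<String>> segments, String title) {
        int base = 0; // offset of the current segment in the index-wide numbering
        for (List<String> segment : segments) {
            for (int local = 0; local < segment.size(); local++) {
                if (title.equals(segment.get(local))) {
                    return base + local;
                }
            }
            base += segment.size();
        }
        return -1;
    }

    /** Merges all segments into one and drops deleted slots, so later documents shift down. */
    static List<List<String>> merge(List<List<String>> segments) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc != null) {
                    merged.add(doc);
                }
            }
        }
        List<List<String>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", null, "b"), // segment 0 with one deleted slot
                Arrays.asList("c", "d"));      // segment 1
        System.out.println(numberOf(segments, "c"));        // prints 3
        System.out.println(numberOf(merge(segments), "c")); // prints 2
    }
}
```

The same document "c" is number 3 before the merge and number 2 after it, which is exactly why such a number is unique at a given moment but not stable enough to be a primary key.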

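Seen from another angle, Lucene's API offers no interface meant for fetching a document by a docId that you kept from some earlier time, and no interface for updating by docId.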

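Nitpicker: I found an interface that fetches a document by docId. Here it is (in IndexReader; IndexSearcher's doc() method ultimately calls it). But isn't this just the interface used in step two of a search? Without step one, could you call it at all? You would have no idea what the docId is.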

/**
   * Returns the stored fields of the <code>n</code><sup>th</sup>
   * <code>Document</code> in this index.  This is just
   * sugar for using {@link DocumentStoredFieldVisitor}.
   * <p>
   * <b>NOTE:</b> for performance reasons, this method does not check if the
   * requested document is deleted, and therefore asking for a deleted document
   * may yield unspecified results. Usually this is not required, however you
   * can test if the doc is deleted by checking the {@link
   * Bits} returned from {@link MultiFields#getLiveDocs}.
   *
   * <b>NOTE:</b> only the content of a field is returned,
   * if that field was stored during indexing.  Metadata
   * like boost, omitNorm, IndexOptions, tokenized, etc.,
   * are not preserved.
   * 
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  // TODO: we need a separate StoredField, so that the
  // Document returned here contains that class not
  // IndexableField
  public final Document document(int docID) throws IOException {
    final DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor();
    document(docID, visitor);
    return visitor.getDocument();
  }

Confusion 2: Lucene does not support partial updates

Nitpicker: found an update interface in IndexWriter.

/**
   * Updates a document by first deleting the document(s)
   * containing <code>term</code> and then adding the new
   * document.  The delete and then add are atomic as seen
   * by a reader on the same index (flush may happen only after
   * the add).
   *
   * @return The <a href="#sequence_number">sequence number</a>
   * for this operation
   *
   * @param term the term to identify the document(s) to be
   * deleted
   * @param doc the document to be added
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  public long updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
    ensureOpen();
    try {
      boolean success = false;
      try {
        long seqNo = docWriter.updateDocument(doc, analyzer, term);
        if (seqNo < 0) {
          seqNo = - seqNo;
          processEvents(true, false);
        }
        success = true;
        return seqNo;
      } finally {
        if (!success) {
          if (infoStream.isEnabled("IW")) {
            infoStream.message("IW", "hit exception updating document");
          }
        }
      }
    } catch (AbortingException | VirtualMachineError tragedy) {
      tragicEvent(tragedy, "updateDocument");

      // dead code but javac disagrees:
      return -1;
    }
  }

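Reading the interface description closely, it can indeed "update", given a Term plus the fields of the new document. So it is really a term query plus update. And it is not an update at all but a re-create: it deletes every document that matches the term, then adds the one document you passed in. Natively it cannot do a partial update the way MySQL's UPDATE does.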
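The delete-then-add behaviour described in the Javadoc can be sketched with a toy in-memory index. This is an illustration under simplifying assumptions, not Lucene's implementation: documents are plain string maps, a "term" is an exact field/value match, and `ToyIndex` with its methods is a hypothetical class invented here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: documents are field->value maps, a term is an exact field/value match.
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // defensive copy
    }

    /** Delete-then-add: removes every document matching the term, then adds newDoc as a whole. */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        addDocument(newDoc);
    }

    List<Map<String, String>> docs() {
        return docs;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> original = new HashMap<>();
        original.put("id", "1");
        original.put("title", "old title");
        original.put("body", "long text");
        index.addDocument(original);

        // The caller only wants to change the title, but must supply the whole document.
        Map<String, String> replacement = new HashMap<>();
        replacement.put("id", "1");
        replacement.put("title", "new title");
        index.updateDocument("id", "1", replacement);

        System.out.println(index.docs().size());                     // prints 1
        System.out.println(index.docs().get(0).containsKey("body")); // prints false
    }
}
```

The `body` field is gone after the call, because nothing from the old document is carried over. A caller who wants a partial update has to bring the unchanged fields along.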

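From this angle, there are two reasons Lucene cannot support partial updates: 1. there is no globally unique primary-key id; 2. Lucene does not keep the complete original content of a document (there is a stored attribute, but it is per field and optional, so it cannot guarantee that the original document is recoverable). Without the original data, a partial update is impossible.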

ES solves this by adding the _id and _source fields.

ES's approach

Every document has two fields by default: _id and _source.

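_id is unique within an index (not_analyzed). If the client does not specify one, ES generates one. If the client specifies an _id that already exists, a plain index request replaces the existing document; only a create request (op_type=create) is rejected with an error.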

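_source (stored=true) stores all of the document's field values in this field, as JSON.

Note: every operation that relies on the document's original content, such as partial update, reindexing, and scripts, requires that the _source field has not been disabled. Otherwise you will run into problems.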

With the description above, there is little left to say about how these two fields make partial updates possible.
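For completeness, the idea can be sketched as: look up the stored original by _id, merge the changed fields into it, and index the merged document again as a whole. The following is a minimal toy model under that assumption, not ES's actual implementation; `ToySourceStore` and its methods are hypothetical names, and the map value plays the role of _source.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: the key plays the role of _id, the value plays the role of _source.
class ToySourceStore {
    private final Map<String, Map<String, Object>> sourceById = new HashMap<>();

    /** Indexes a whole document; an existing document with the same id is replaced. */
    void index(String id, Map<String, Object> source) {
        sourceById.put(id, new HashMap<>(source));
    }

    /** Partial update: read the original, merge the changes, index the merged whole again. */
    boolean partialUpdate(String id, Map<String, Object> changes) {
        Map<String, Object> current = sourceById.get(id);
        if (current == null) {
            return false; // nothing to update without the original content
        }
        Map<String, Object> merged = new HashMap<>(current);
        merged.putAll(changes);
        index(id, merged);
        return true;
    }

    Map<String, Object> get(String id) {
        return sourceById.get(id);
    }

    public static void main(String[] args) {
        ToySourceStore store = new ToySourceStore();
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "old title");
        doc.put("views", 10);
        store.index("1", doc);

        Map<String, Object> changes = new HashMap<>();
        changes.put("title", "new title");
        store.partialUpdate("1", changes);

        System.out.println(store.get("1").get("title")); // prints new title
        System.out.println(store.get("1").get("views")); // prints 10
    }
}
```

Both ingredients are needed: the stable id finds the document, and the stored original supplies the fields the caller did not send. Remove either one and `partialUpdate` has nothing to work with.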

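Note: I am new to Elasticsearch. Everything above is simply the confusion I ran into while learning, and the "nitpicker" is what I was really thinking at the time. If anything is wrong or imprecise, I hope the experts will leave a comment and correct me. I will keep learning.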
