Iteration in Lucene

chuanyangwang

Last modified 2022-01-24 19:00:41


First published 2022-01-24 14:20:26

Article link: https://blog.csdn.net/chuanyangwang/article/details/122666695


Iterating over increasing doc IDs

org.apache.lucene.search.DocIdSetIterator

/**
 * This abstract class defines methods to iterate over a set of non-decreasing doc ids. Note that
 * this class assumes it iterates on doc Ids, and therefore {@link #NO_MORE_DOCS} is set to {@value
 * #NO_MORE_DOCS} in order to be used as a sentinel object. Implementations of this class are
 * expected to consider {@link Integer#MAX_VALUE} as an invalid value.
 */
public abstract class DocIdSetIterator {

}
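
The class body is elided above, so as a rough illustration of the contract the Javadoc describes, here is a minimal sketch. It assumes a hypothetical stand-in class, SimpleDocIterator, backed by a sorted int array; it is not Lucene's DocIdSetIterator. It shows the usual loop that runs until the NO_MORE_DOCS sentinel (Integer.MAX_VALUE) comes back, plus a naive linear advance.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the iteration contract; NOT Lucene's DocIdSetIterator. */
class SimpleDocIterator {
  /** Sentinel returned once the iterator is exhausted. */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // assumed sorted, strictly increasing doc IDs
  private int idx = -1;     // -1 means "not positioned yet"

  SimpleDocIterator(int... sortedDocs) {
    this.docs = sortedDocs;
  }

  /** Current doc ID: -1 before iteration starts, NO_MORE_DOCS once exhausted. */
  int docID() {
    if (idx < 0) {
      return -1;
    }
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  /** Steps to the next doc ID, or NO_MORE_DOCS when there is none. */
  int nextDoc() {
    if (idx < docs.length) {
      idx++;
    }
    return docID();
  }

  /** Moves to the first doc ID >= target that lies beyond the current one (linear scan). */
  int advance(int target) {
    int doc;
    do {
      doc = nextDoc();
    } while (doc < target);
    return doc;
  }

  /** Typical consumer loop: keep calling nextDoc() until the sentinel appears. */
  static List<Integer> collect(SimpleDocIterator it) {
    List<Integer> out = new ArrayList<>();
    for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
      out.add(doc);
    }
    return out;
  }
}
```

Because the sentinel is the largest int, the loop condition in advance never needs a separate exhaustion check, which is presumably why implementations are expected to treat Integer.MAX_VALUE as an invalid doc ID.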

Iteration utility used during merging

org.apache.lucene.index.DocIDMerger

/**
 * Utility class to help merging documents from sub-readers according to either simple concatenated
 * (unsorted) order, or by a specified index-time sort, skipping deleted documents and remapping
 * non-deleted documents.
 */
public abstract class DocIDMerger<T extends DocIDMerger.Sub> {

}

org.apache.lucene.index.DocIDMerger.Sub

  /** Represents one sub-reader being merged */
  public abstract static class Sub {
    /** Mapped doc ID */
    public int mappedDocID;

    /** Map from old to new doc IDs */
    public final MergeState.DocMap docMap;

    /** Sole constructor */
    protected Sub(MergeState.DocMap docMap) {
      this.docMap = docMap;
    }

    /**
     * Returns the next document ID from this sub reader, and {@link DocIdSetIterator#NO_MORE_DOCS}
     * when done
     */
    public abstract int nextDoc() throws IOException;

    /**
     * Like {@link #nextDoc()} but skips over unmapped docs and returns the next mapped doc ID, or
     * {@link DocIdSetIterator#NO_MORE_DOCS} when exhausted. This method sets {@link #mappedDocID}
     * as a side effect.
     */
    public final int nextMappedDoc() throws IOException {
      while (true) {
        int doc = nextDoc();
        if (doc == NO_MORE_DOCS) {
          return this.mappedDocID = NO_MORE_DOCS;
        }
        int mappedDoc = docMap.get(doc);
        if (mappedDoc != -1) {
          return this.mappedDocID = mappedDoc;
        }
      }
    }
  }
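
To make the skip-and-remap loop in nextMappedDoc() concrete, here is a minimal sketch under stated assumptions: MappedSubSketch is a hypothetical stand-in, the doc map is a plain int array in which -1 marks a deleted document, and the sub-reader simply yields doc IDs 0..n-1. It is not Lucene's DocIDMerger.Sub or MergeState.DocMap.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the skip-and-remap loop; NOT Lucene's DocIDMerger.Sub. */
class MappedSubSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docMap; // old doc ID -> new doc ID; -1 marks a deleted document
  private int doc = -1;       // current old doc ID, -1 before iteration starts
  int mappedDocID = -1;       // set as a side effect of nextMappedDoc()

  MappedSubSketch(int[] docMap) {
    this.docMap = docMap;
  }

  /** Yields old doc IDs 0, 1, 2, ... and then NO_MORE_DOCS. */
  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc;
    }
    doc++;
    if (doc >= docMap.length) {
      doc = NO_MORE_DOCS;
    }
    return doc;
  }

  /** Same shape as the loop above: skip unmapped docs, return the next mapped doc ID. */
  int nextMappedDoc() {
    while (true) {
      int d = nextDoc();
      if (d == NO_MORE_DOCS) {
        return mappedDocID = NO_MORE_DOCS;
      }
      int mapped = docMap[d];
      if (mapped != -1) {
        return mappedDocID = mapped;
      }
    }
  }

  /** Collects every mapped doc ID until the sub is exhausted. */
  static List<Integer> drain(MappedSubSketch sub) {
    List<Integer> out = new ArrayList<>();
    for (int m = sub.nextMappedDoc(); m != NO_MORE_DOCS; m = sub.nextMappedDoc()) {
      out.add(m);
    }
    return out;
  }
}
```

With a map of {10, -1, 11, -1}, old documents 1 and 3 are treated as deleted, so only the remapped IDs 10 and 11 come out.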

Iterating over the term dictionary

/** A simple iterator interface for {@link BytesRef} iteration. */
public interface BytesRefIterator {

  /**
   * Increments the iteration to the next {@link BytesRef} in the iterator. Returns the resulting
   * {@link BytesRef} or <code>null</code> if the end of the iterator is reached. The returned
   * BytesRef may be re-used across calls to next. After this method returns null, do not call it
   * again: the results are undefined.
   *
   * @return the next {@link BytesRef} in the iterator or <code>null</code> if the end of the
   *     iterator is reached.
   * @throws IOException If there is a low-level I/O error.
   */
  BytesRef next() throws IOException;

  /** Singleton BytesRefIterator that iterates over 0 BytesRefs. */
  BytesRefIterator EMPTY = () -> null;
}
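
The Javadoc's note that the returned BytesRef "may be re-used across calls to next" is easy to overlook. The sketch below illustrates it with hypothetical stand-in types (Ref and Iter, not Lucene's BytesRef or BytesRefIterator): the iterator hands back the same mutable object every time, so a caller that wants to keep a value has to copy it before calling next() again, and the loop stops at the first null.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in showing the reuse contract; NOT Lucene's BytesRef/BytesRefIterator. */
class ReusedRefSketch {
  /** Mutable byte slice, loosely modeled on BytesRef. */
  static final class Ref {
    byte[] bytes = new byte[0];
    int length;
  }

  /** Hypothetical iterator that returns the same Ref instance on every call. */
  static final class Iter {
    private final String[] terms;
    private final Ref shared = new Ref();
    private int idx;

    Iter(String... terms) {
      this.terms = terms;
    }

    /** Returns the shared Ref refilled with the next term, or null at the end. */
    Ref next() {
      if (idx == terms.length) {
        return null;
      }
      byte[] utf8 = terms[idx++].getBytes(StandardCharsets.UTF_8);
      shared.bytes = utf8;
      shared.length = utf8.length;
      return shared;
    }
  }

  /** Decodes the bytes currently held by the Ref. */
  static String decode(Ref ref) {
    return new String(ref.bytes, 0, ref.length, StandardCharsets.UTF_8);
  }

  /** Safe consumer loop: copy each value out before advancing, stop at null. */
  static List<String> readAll(Iter it) {
    List<String> out = new ArrayList<>();
    for (Ref ref = it.next(); ref != null; ref = it.next()) {
      byte[] copy = Arrays.copyOf(ref.bytes, ref.length);
      out.add(new String(copy, StandardCharsets.UTF_8));
    }
    return out;
  }
}
```

Holding on to the Ref itself instead of a copy would show the latest term after the next call, which is the pitfall the contract warns about.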

/**
 * Iterator to seek ({@link #seekCeil(BytesRef)}, {@link #seekExact(BytesRef)}) or step through
 * ({@link #next}) terms to obtain frequency information ({@link #docFreq}) or a {@link
 * PostingsEnum} for the current term ({@link #postings}).
 *
 * <p>Term enumerations are always ordered by BytesRef.compareTo, which is Unicode sort order if the
 * terms are UTF-8 bytes. Each term in the enumeration is greater than the one before it.
 *
 * <p>The TermsEnum is unpositioned when you first obtain it and you must first successfully call
 * {@link #next} or one of the <code>seek</code> methods.
 *
 * @lucene.experimental
 */
public abstract class TermsEnum implements BytesRefIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected TermsEnum() {}

  /** Returns the related attributes. */
  public abstract AttributeSource attributes();

  /** Represents returned result from {@link #seekCeil}. */
  public enum SeekStatus {
    /** The term was not found, and the end of iteration was hit. */
    END,
    /** The precise term was found. */
    FOUND,
    /** A different term was found after the requested term */
    NOT_FOUND
  };

  /**
   * Attempts to seek to the exact term, returning true if the term is found. If this returns false,
   * the enum is unpositioned. For some codecs, seekExact may be substantially faster than {@link
   * #seekCeil}.
   *
   * @return true if the term is found; return false if the enum is unpositioned.
   */
  public abstract boolean seekExact(BytesRef text) throws IOException;

  /**
   * Seeks to the specified term, if it exists, or to the next (ceiling) term. Returns SeekStatus to
   * indicate whether exact term was found, a different term was found, or EOF was hit. The target
   * term may be before or after the current term. If this returns SeekStatus.END, the enum is
   * unpositioned.
   */
  public abstract SeekStatus seekCeil(BytesRef text) throws IOException;

  /**
   * Seeks to the specified term by ordinal (position) as previously returned by {@link #ord}. The
   * target ord may be before or after the current ord, and must be within bounds.
   */
  public abstract void seekExact(long ord) throws IOException;

  /**
   * Expert: Seeks a specific position by {@link TermState} previously obtained from {@link
   * #termState()}. Callers should maintain the {@link TermState} to use this method. Low-level
   * implementations may position the TermsEnum without re-seeking the term dictionary.
   *
   * <p>Seeking by {@link TermState} should only be used iff the state was obtained from the same
   * {@link TermsEnum} instance.
   *
   * <p>NOTE: Using this method with an incompatible {@link TermState} might leave this {@link
   * TermsEnum} in undefined state. On a segment level {@link TermState} instances are compatible
   * only iff the source and the target {@link TermsEnum} operate on the same field. If operating on
   * segment level, TermState instances must not be used across segments.
   *
   * <p>NOTE: A seek by {@link TermState} might not restore the {@link AttributeSource}'s state.
   * {@link AttributeSource} states must be maintained separately if this method is used.
   *
   * @param term the term the TermState corresponds to
   * @param state the {@link TermState}
   */
  public abstract void seekExact(BytesRef term, TermState state) throws IOException;

  /** Returns current term. Do not call this when the enum is unpositioned. */
  public abstract BytesRef term() throws IOException;

  /**
   * Returns ordinal position for current term. This is an optional method (the codec may throw
   * {@link UnsupportedOperationException}). Do not call this when the enum is unpositioned.
   */
  public abstract long ord() throws IOException;

  /**
   * Returns the number of documents containing the current term. Do not call this when the enum is
   * unpositioned. {@link SeekStatus#END}.
   */
  public abstract int docFreq() throws IOException;

  /**
   * Returns the total number of occurrences of this term across all documents (the sum of the
   * freq() for each doc that has this term). Note that, like other term measures, this measure does
   * not take deleted documents into account.
   */
  public abstract long totalTermFreq() throws IOException;

  /**
   * Get {@link PostingsEnum} for the current term. Do not call this when the enum is unpositioned.
   * This method will not return null.
   *
   * <p><b>NOTE</b>: the returned iterator may return deleted documents, so deleted documents have
   * to be checked on top of the {@link PostingsEnum}.
   *
   * <p>Use this method if you only require documents and frequencies, and do not need any proximity
   * data. This method is equivalent to {@link #postings(PostingsEnum, int) postings(reuse,
   * PostingsEnum.FREQS)}
   *
   * @param reuse pass a prior PostingsEnum for possible reuse
   * @see #postings(PostingsEnum, int)
   */
  public final PostingsEnum postings(PostingsEnum reuse) throws IOException {
    return postings(reuse, PostingsEnum.FREQS);
  }

  /**
   * Get {@link PostingsEnum} for the current term, with control over whether freqs, positions,
   * offsets or payloads are required. Do not call this when the enum is unpositioned. This method
   * will not return null.
   *
   * <p><b>NOTE</b>: the returned iterator may return deleted documents, so deleted documents have
   * to be checked on top of the {@link PostingsEnum}.
   *
   * @param reuse pass a prior PostingsEnum for possible reuse
   * @param flags specifies which optional per-document values you require; see {@link
   *     PostingsEnum#FREQS}
   */
  public abstract PostingsEnum postings(PostingsEnum reuse, int flags) throws IOException;

  /**
   * Return a {@link ImpactsEnum}.
   *
   * @see #postings(PostingsEnum, int)
   */
  public abstract ImpactsEnum impacts(int flags) throws IOException;

  /**
   * Expert: Returns the TermsEnums internal state to position the TermsEnum without re-seeking the
   * term dictionary.
   *
   * <p>NOTE: A seek by {@link TermState} might not capture the {@link AttributeSource}'s state.
   * Callers must maintain the {@link AttributeSource} states separately
   *
   * @see TermState
   * @see #seekExact(BytesRef, TermState)
   */
  public abstract TermState termState() throws IOException;

  /**
   * An empty TermsEnum for quickly returning an empty instance e.g. in {@link
   * org.apache.lucene.search.MultiTermQuery}
   *
   * <p><em>Please note:</em> This enum should be unmodifiable, but it is currently possible to add
   * Attributes to it. This should not be a problem, as the enum is always empty and the existence
   * of unused Attributes does not matter.
   */
  public static final TermsEnum EMPTY =
      new TermsEnum() {

        private AttributeSource atts = null;

        @Override
        public SeekStatus seekCeil(BytesRef term) {
          return SeekStatus.END;
        }

        @Override
        public void seekExact(long ord) {}

        @Override
        public BytesRef term() {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public int docFreq() {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public long totalTermFreq() {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public long ord() {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public PostingsEnum postings(PostingsEnum reuse, int flags) {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public ImpactsEnum impacts(int flags) throws IOException {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public BytesRef next() {
          return null;
        }

        @Override // make it synchronized here, to prevent double lazy init
        public synchronized AttributeSource attributes() {
          if (atts == null) {
            atts = new AttributeSource();
          }
          return atts;
        }

        @Override
        public boolean seekExact(BytesRef text) throws IOException {
          return seekCeil(text) == SeekStatus.FOUND;
        }

        @Override
        public TermState termState() {
          throw new IllegalStateException("this method should never be called");
        }

        @Override
        public void seekExact(BytesRef term, TermState state) {
          throw new IllegalStateException("this method should never be called");
        }
      };
}
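
The three SeekStatus outcomes are easier to see on a tiny example. The sketch below is a hypothetical stand-in, SeekCeilSketch, that keeps its terms in a sorted String array and uses binary search; it is not Lucene's TermsEnum. It also assumes ASCII terms, because String.compareTo orders by UTF-16 code units rather than by the UTF-8 byte order the Javadoc describes.

```java
import java.util.Arrays;

/** Simplified stand-in over a sorted String array; NOT Lucene's TermsEnum. */
class SeekCeilSketch {
  enum SeekStatus { END, FOUND, NOT_FOUND }

  private final String[] terms; // assumed sorted ascending, no duplicates
  private int pos = -1;         // -1 means unpositioned

  SeekCeilSketch(String... sortedTerms) {
    this.terms = sortedTerms;
  }

  /** Positions on target if present, otherwise on the next larger term, otherwise nowhere. */
  SeekStatus seekCeil(String target) {
    int i = Arrays.binarySearch(terms, target);
    if (i >= 0) {
      pos = i;
      return SeekStatus.FOUND;
    }
    int insertion = -i - 1; // index of the first term greater than target
    if (insertion == terms.length) {
      pos = -1; // ran off the end: unpositioned
      return SeekStatus.END;
    }
    pos = insertion;
    return SeekStatus.NOT_FOUND;
  }

  /** Positions on target only if it exists; otherwise leaves the enum unpositioned. */
  boolean seekExact(String target) {
    int i = Arrays.binarySearch(terms, target);
    pos = i >= 0 ? i : -1;
    return i >= 0;
  }

  /** Current term; must not be called while unpositioned. */
  String term() {
    if (pos < 0) {
      throw new IllegalStateException("enum is unpositioned");
    }
    return terms[pos];
  }
}
```

For the terms apple, banana, cherry: seeking banana reports FOUND, seeking blueberry reports NOT_FOUND and lands on cherry, and seeking zebra reports END and leaves the enum unpositioned.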

[LUCENE-10010] Should we have a NFA Query? - ASF JIRA: https://issues.apache.org/jira/browse/LUCENE-10010

[LUCENE-9570] Review code diffs after automatic formatting and correct problems before it is applied - ASF JIRA: https://issues.apache.org/jira/browse/LUCENE-9570

https://swtch.com/~rsc/regexp/regexp1.html
