org.apache.lucene.analysis (Part 3)

Sir yes sir

Original link: https://lucene.apache.org/core/8_8_2/core/org/apache/lucene/analysis/package-summary.html

Field Section Boundaries

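When document.add(field) is called multiple times for the same field name, we can say that each such call creates a new section for that field in that document. In fact, tokenStream(field, reader) is called separately for each of these so-called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to cross the boundaries between these "sections" seamlessly. In other words, if a certain field "f" is added like this: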

document.add(new Field("f", "first ends", ...));
document.add(new Field("f", "starts two", ...));
indexWriter.addDocument(document);

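Then a phrase search for "ends starts" would find that document. Where desired, this behavior can be changed by introducing a "position gap" between consecutive field "sections", simply by overriding Analyzer.getPositionIncrementGap(fieldName):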

Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
Analyzer myAnalyzer = new StandardAnalyzer(matchVersion) {
  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 10; // gap added between consecutive values of the same field
  }
};
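To make the effect of the gap concrete, here is a minimal sketch in plain Java that models how positions accumulate across several values of one field. It is a simplified model and not Lucene code: the class PositionGapSketch and its positions method are hypothetical names, whitespace splitting stands in for a real tokenizer, and it assumes every token has a position increment of 1 and that the gap is added after each value.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model (not Lucene code) of token positions across values of one field. */
final class PositionGapSketch {

    /** Returns the position of every whitespace-separated token, in order, across all values. */
    static List<Integer> positions(List<String> values, int positionIncrementGap) {
        List<Integer> result = new ArrayList<>();
        int position = -1; // the first token, with increment 1, lands on position 0
        for (String value : values) {
            for (String token : value.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // a blank value contributes no tokens
                }
                position += 1; // assumed default position increment
                result.add(position);
            }
            position += positionIncrementGap; // assumed: gap applied after each value
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = List.of("first ends", "starts two");
        System.out.println(positions(values, 0));
        System.out.println(positions(values, 10));
    }
}
```

With a gap of 0 the model yields positions 0, 1, 2, 3, so "ends" and "starts" are adjacent and the phrase "ends starts" can match. With a gap of 10 it yields 0, 1, 12, 13, so the two tokens are no longer adjacent and an exact phrase match across the boundary fails.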

End of Input Cleanup

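At the end of each field, Lucene calls TokenStream.end(). The components of the token stream (the tokenizer and the token filters) must put accurate values into the token attributes to reflect the situation at the end of the field. The Offset attribute must contain the final offset (the total number of characters processed) in both start and end. Attributes such as PositionLength must be correct.
The base method TokenStream.end() sets PositionIncrement to 0, which is required. Other components must override this method to fix up the other attributes.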

Token Position Increments

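By default, TokenStream sets the position increment of every token to 1. This means that the position stored in the index for a token is one more than that of the previous token. Recall that phrase and proximity searches rely on position information.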

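If the selected analyzer filters out the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue" and "sky" are indexed, with position("sky") = 3 + position("blue"). A phrase query "blue is the sky" would now find that document, because the same analyzer filters the same stop words out of the query. But the phrase query "blue sky" would not find that document, because in the query the position increment between "blue" and "sky" is only 1.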

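If this behavior does not fit the application's needs, the query parser needs to be configured to ignore position increments when generating phrase queries.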
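The "blue is the sky" example can be replayed with a small plain-Java model. This is only a sketch under stated assumptions, not Lucene code: StopWordPositionsSketch, Tok, analyze, and phraseMatches are hypothetical names, whitespace splitting stands in for the tokenizer, and the matcher models an exact phrase query (no slop) by requiring that the relative positions of the query tokens recur in the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified model (not Lucene code) of stop-word removal that preserves position gaps. */
final class StopWordPositionsSketch {

    /** A token that survived filtering, with its absolute position. */
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Splits on whitespace, drops stop words, and carries their increments to the next kept token. */
    static List<Tok> analyze(String text, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        int position = -1;
        int extraIncrement = 0; // increments of dropped tokens not yet applied
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (stopWords.contains(term)) {
                extraIncrement += 1; // remember the dropped token's increment
                continue;
            }
            position += 1 + extraIncrement;
            extraIncrement = 0;
            out.add(new Tok(term, position));
        }
        return out;
    }

    /** Exact phrase match: the query's relative positions must recur somewhere in the document. */
    static boolean phraseMatches(List<Tok> doc, List<Tok> query) {
        if (query.isEmpty()) {
            return false;
        }
        Tok first = query.get(0);
        for (Tok start : doc) {
            if (!start.term.equals(first.term)) {
                continue;
            }
            int shift = start.position - first.position;
            boolean all = true;
            for (Tok q : query) {
                boolean found = false;
                for (Tok d : doc) {
                    if (d.term.equals(q.term) && d.position == q.position + shift) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stops = Set.of("is", "the");
        List<Tok> doc = analyze("blue is the sky", stops);
        System.out.println(phraseMatches(doc, analyze("blue is the sky", stops)));
        System.out.println(phraseMatches(doc, analyze("blue sky", stops)));
    }
}
```

In this model "sky" ends up three positions after "blue" in the document. The query "blue is the sky" produces the same relative positions and matches, while "blue sky" places the two terms one position apart and does not.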

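Note that a filter that removes tokens must increase the position increment of the next token it keeps, so that it does not generate a corrupt token stream. Here is the logic StopFilter uses to increment positions when filtering out tokens: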

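// someAnalyzer and stopWords are assumed to be defined by the enclosing class.
public TokenStream tokenStream(final String fieldName, Reader reader) {
  final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
  // Passing ts to the constructor makes this stream share ts's attribute instances,
  // so termAtt and posIncrAtt below see the token that ts has just produced.
  TokenStream res = new TokenStream(ts) {
    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
      int extraIncrement = 0; // increments of the tokens dropped since the last kept token
      while (true) {
        boolean hasNext = ts.incrementToken();
        if (hasNext) {
          if (stopWords.contains(termAtt.toString())) {
            extraIncrement += posIncrAtt.getPositionIncrement(); // filter this word
            continue;
          }
          if (extraIncrement > 0) {
            // Add the dropped tokens' increments to the token that is kept.
            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
          }
        }
        return hasNext;
      }
    }
  };
  return res;
}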

A few more use cases for modifying position increments are:

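Inhibiting phrase and proximity matches across sentence boundaries: for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
Injecting synonyms: synonyms of a token should be created at the same position as the original token. The output order of the original token and the injected synonyms is undefined, as long as they all leave from the same position. As a result, all synonyms of a token are considered to appear at exactly the same position as that token, and phrase and proximity searches see them there. For multi-token synonyms to work correctly, you should use SynonymGraphFilter at search time only.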
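The synonym case above can be illustrated with a plain-Java model in which an injected synonym shares the position of its original token, which corresponds to a position increment of 0. This is a sketch under stated assumptions, not Lucene code: SynonymPositionsSketch, termsByPosition, and phraseMatches are hypothetical names, only single-token synonyms are modeled (at most one per term), and the multi-token case that SynonymGraphFilter handles is out of scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model (not Lucene code) of single-token synonyms stacked on one position. */
final class SynonymPositionsSketch {

    /** Element i holds every term at position i: the original token plus its synonym, if any. */
    static List<Set<String>> termsByPosition(String text, Map<String, String> synonyms) {
        List<Set<String>> positions = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> atPosition = new TreeSet<>();
            atPosition.add(term); // original token: position increment 1
            String synonym = synonyms.get(term);
            if (synonym != null) {
                atPosition.add(synonym); // injected synonym: position increment 0
            }
            positions.add(atPosition);
        }
        return positions;
    }

    /** Exact phrase match over consecutive positions; any term at a position may satisfy it. */
    static boolean phraseMatches(List<Set<String>> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            boolean all = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!doc.get(start + i).contains(phrase.get(i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> doc = termsByPosition("fast car", Map.of("fast", "quick"));
        System.out.println(phraseMatches(doc, List.of("quick", "car")));
        System.out.println(phraseMatches(doc, List.of("fast", "car")));
    }
}
```

Because "quick" sits at the same position as "fast", both the phrase "quick car" and the phrase "fast car" match the document in this model.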