Solr: An Overview of Analyzers

Wild__Child

Published 2021-10-10 21:38:12


Tags: solr

Article link: https://blog.csdn.net/Wild__Child/article/details/120690550


2021SC@SDUSC

Overview
Code analysis
- Code overview
- Imports in the TokenStream abstract class

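Overview

First, we need to be clear that Solr is a search server built on top of Apache Lucene. Lucene is effectively Solr's foundation, and the Solr source tree contains the Lucene source code, so this post starts with the code in Lucene's analysis package.

token, term & analyzer

Before reading the code, we have to understand three concepts: token, term, and analyzer. Without them, the code that follows is hard to understand.

Official definitions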

Term:

A Term represents a word from text. This is the unit of search.

It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in, an interned string.

Note that terms may represent more than words from text fields, but also things like dates, email addresses, urls, etc.

Token:

A Token is an occurrence of a term from the text of a field. It consists of a term’s text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC display, etc.

The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type “eos”. The default token type is “word”.

A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.

Analyzer:

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

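PS: the English definitions above are quoted verbatim.

The official definitions mostly describe the properties of these three concepts and say little about how they relate to each other, so below I explain them with an example and some easier definitions (corrections are welcome).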

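Further explanation

An Analyzer is a lexical analyzer: the component that splits text into words.

The analyzer is one of the most important parts of Lucene and Solr. It is needed both when building the index and when searching. It splits the words in a piece of text according to a set of rules and so determines what goes into the index. Different languages need different analyzers to get good results.

Here is a simple example, which also introduces the concept of a type, to help explain the analyzer and the two concepts above: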

I come from China so I come from CHN.

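type: the class of all tokens that consist of the same character sequence. (This is the information-retrieval sense of the word; it is not the token's type string from the official definition above.)

An analyzer might split the sentence above into
"I", "come", "from", "China", "so", "I", "come", "from", "CHN".

In other words, an analyzer splits a given character sequence into a series of subsequences, which we call tokens. Note that identical subsequences are still separate token objects, so this text yields 9 tokens and 6 types: "I", "come", "from", "China", "so", "CHN".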
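The count of 9 tokens and 6 types can be checked with a few lines of plain Java. This is only a sketch: the class TokenTypeDemo and its naive whitespace splitting are my own illustration, not Lucene code, and a real analyzer applies far more careful rules.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TokenTypeDemo {
    // Naive tokenizer for illustration only: drops basic punctuation, then splits on whitespace.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[.,!?]", " ").trim();
        return Arrays.asList(cleaned.split("\\s+"));
    }

    // A type is a class of identical tokens, so the distinct token strings are the types.
    static Set<String> types(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("I come from China so I come from CHN.");
        Set<String> types = types(tokens);
        System.out.println(tokens.size() + " tokens: " + tokens);
        System.out.println(types.size() + " types: " + types);
    }
}
```

Every occurrence stays in the token list, while the set collapses repeated strings, which is exactly the token/type distinction.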

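Next, a more approachable definition of term:
A term is a (possibly normalized) type that is included in the IR system's dictionary.
(Definition from Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan)

"Normalization" here can be understood, roughly, as merging types with similar meaning into one equivalence class (in practice it is not this simple). In the example, "China" and "CHN" can both be mapped to either "China" or "CHN" after normalization, so a search for "China" also returns documents that contain "CHN". We therefore end up with 5 terms.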
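To make the step from 6 types to 5 terms concrete, here is a minimal sketch in plain Java. The EQUIVALENTS table is a hypothetical hand-written mapping that I chose for this example; real systems derive such classes from case folding, stemming, synonym lists and so on.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TermNormalizationDemo {
    // Hypothetical equivalence table: maps a type to the representative of its class.
    static final Map<String, String> EQUIVALENTS = Map.of("CHN", "China");

    // Normalizes every token and keeps the distinct results, i.e. the terms.
    static Set<String> terms(List<String> tokens) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : tokens) {
            terms.add(EQUIVALENTS.getOrDefault(token, token));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("I", "come", "from", "China", "so", "I", "come", "from", "CHN");
        Set<String> terms = terms(tokens);
        System.out.println(terms.size() + " terms: " + terms);
    }
}
```

Because the same mapping is applied to the query, a search for "CHN" and a search for "China" look up the same term.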

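Of course, a token is not limited to an English word. If the character sequence is Chinese, a token may be a single character.

For English, whitespace often does the splitting for us, but a name like "San Francisco" is split incorrectly if we cut on whitespace alone. This is why there are many kinds of analyzers for different languages and situations.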

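Summary

Combining the official definitions with the others:

token: produced during analysis; it is a subsequence of the character sequence and an independent object. It holds the term's text, the start and end offsets, and a type string.

term: a possibly normalized type and the unit of search. It holds the text as a string and the name of the field in which that text occurs.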
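The two summaries can be modeled as small data classes. This is a sketch under my own naming (TokenTermSketch, Term, Token are hypothetical and deliberately simpler than Lucene's classes); it only shows which pieces of data each concept carries.

```java
import java.util.Objects;

final class TokenTermSketch {
    /** Hypothetical stand-in for a term: the text plus the name of the field it occurs in. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Term)) return false;
            Term that = (Term) other;
            return field.equals(that.field) && text.equals(that.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    /** Hypothetical stand-in for a token: one occurrence of a term's text in the source. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;
        final String type;

        Token(String text, int startOffset, int endOffset, String type) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.type = type;
        }

        Term toTerm(String field) {
            return new Term(field, text);
        }
    }

    public static void main(String[] args) {
        // "I come from China so I come from CHN." contains "come" twice.
        Token first = new Token("come", 2, 6, "word");
        Token second = new Token("come", 23, 27, "word");
        // Two distinct occurrences, one term within the same field.
        System.out.println(first.toTerm("body").equals(second.toTerm("body")));
        // The same text in another field is a different term.
        System.out.println(first.toTerm("body").equals(first.toTerm("title")));
    }
}
```

The two tokens differ in their offsets but map to the same term, while the field name alone is enough to make two terms different.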

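Analyzer-related classes

With the concepts covered, let's look at the code.

The class for analyzers is Analyzer, the abstract parent of all analyzers. The workflow is: the character stream first goes through a Tokenizer, which breaks the text into tokens, and then through one or more TokenFilters, which filter them.

Analysis is driven by the Analyzer class, which does its work through TokenStream and its two subclasses, TokenFilter and Tokenizer.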
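The Tokenizer-then-TokenFilter flow can be sketched without Lucene at all. The code below is not Lucene's API: AnalysisPipelineSketch and its methods are hypothetical names that only mimic the roles, and the stop-word set is an assumption chosen for the example sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisPipelineSketch {
    // Tokenizer role: break the character stream into raw tokens.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // TokenFilter role: rewrite each token (here: lowercase it).
    static List<String> lowerCaseFilter(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String token : input) {
            output.add(token.toLowerCase(Locale.ROOT));
        }
        return output;
    }

    // TokenFilter role: drop tokens (here: stop words).
    static List<String> stopFilter(List<String> input, Set<String> stopWords) {
        List<String> output = new ArrayList<>();
        for (String token : input) {
            if (!stopWords.contains(token)) {
                output.add(token);
            }
        }
        return output;
    }

    // Analyzer role: fix the policy by wiring one tokenizer to a chain of filters.
    static List<String> analyze(String text) {
        List<String> raw = whitespaceTokenize(text);
        return stopFilter(lowerCaseFilter(raw), Set.of("so", "from"));
    }

    public static void main(String[] args) {
        System.out.println(analyze("I come from China so I come from CHN"));
    }
}
```

Each stage consumes the output of the previous one, which is the same shape as a Tokenizer wrapped by successive TokenFilters. Lucene streams tokens one at a time instead of building lists, but the order of the stages is the same.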

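Image source: https://blog.csdn.net/weixin_30537451/article/details/98805734

This is the directory structure of the analysis subfolder.
[Figure: directory listing of the analysis subfolder]
The listing shows that the analysis package is made up of analysis subfolders for many different languages.

Let's open the en (English) subfolder.
[Figure: contents of the en subfolder]
package-info:

Its content is short and clear: it states that this subfolder contains the lexical analyzers for English text.

This is the analysis package in core. Its directory structure is shown below; it contains the commonly used standard analyzer as well as the Analyzer class, the Tokenizer class, and so on. These are the classes this post focuses on.
[Figure: directory listing of the analysis package in core]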

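Code analysis

Code overview

The abstract class Analyzer calls methods of many other classes, so we set it aside for now.

This is the abstract class Tokenizer, which extends TokenStream: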

public abstract class Tokenizer extends TokenStream

This is the abstract class TokenFilter, which also extends TokenStream:

public abstract class TokenFilter extends TokenStream

Since both of them extend it, we turn to the abstract class TokenStream first.

Imports in the TokenStream abstract class

These are the imports in the TokenStream class.

import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;

import org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;

The java.* classes are skipped.

1. PackedTokenAttributeImpl.java
It is the default implementation of the common attributes used by Lucene.

  /*
    Several fields are declared here. Note that
    public static final String DEFAULT_TYPE = "word";
    comes from the TypeAttribute interface. It is the token's type,
    and its default value is "word".
  */
  private int startOffset,endOffset;
  private String type = DEFAULT_TYPE;
  private int positionIncrement = 1;
  private int positionLength = 1;
  private int termFrequency = 1;
  
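  /*
    PackedTokenAttributeImpl implements the PositionIncrementAttribute interface,
    which declares setPositionIncrement for setting the position increment.
    The position increment is the distance between the current token and the
    previous token in the same TokenStream, so the position of the current token
    is the previous token's position plus positionIncrement.
    Its default value is 1, and it is used in phrase searching.

    PackedTokenAttributeImpl overrides the method: a negative increment throws
    an IllegalArgumentException; otherwise the value is assigned to the field.

    For usage examples, see:
    https://lucene.apache.org/core/3_6_0/api/core/index.html?org/apache/lucene/analysis/tokenattributes/PositionLengthAttribute.html
  */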
  @Override
  public void setPositionIncrement(int positionIncrement) {
    if (positionIncrement < 0) {
      throw new IllegalArgumentException("Increment must be zero or greater: " + positionIncrement);
    }
    this.positionIncrement = positionIncrement;
  }

  /*
    getPositionIncrement() simply returns the field.
  */
  @Override
  public int getPositionIncrement() {
    return positionIncrement;
  }
  
  /*
     The official description of positionLength:
     The positionLength determines how many positions this token spans.
     Very few analyzer components actually produce this attribute, and indexing ignores it,
     but it's useful to express the graph structure naturally produced by decompounding,
     word splitting/joining, synonym filtering, etc.
     The default value is one.

     In other words, positionLength is the number of positions a token spans.

     Putting this together, my own reading is that a position behaves like
     a slot of length 1, and a token sits in one slot.
     positionIncrement is then the distance between two such slots.
  */
  @Override
  public void setPositionLength(int positionLength) {
    if (positionLength < 1) {
      throw new IllegalArgumentException("Position length must be 1 or greater: got " + positionLength);
    }
    this.positionLength = positionLength;
  }
  
  @Override
  public int getPositionLength() {
    return positionLength;
  }

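  /*
    The next three methods come from the OffsetAttribute interface,
    which is documented as:
    The start and end character offset of a Token.

    startOffset() returns the starting offset: the position in the source text
    of the first character that corresponds to this token.

    Note that endOffset - startOffset is not necessarily equal to termText.length(),
    because the term text may have been altered by a filter (a stemmer, for example).
  */
  @Override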
  public final int startOffset() {
    return startOffset;
  }

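  /*
    The Javadoc for the end offset reads:
    Returns this Token's ending offset,
    one greater than the position of the last character
    corresponding to this token in the source text.
    The length of the token in the source text is
    (<code>endOffset()</code> - {@link #startOffset()}).
    @see #setOffset(int, int)

    That is, the end offset is one greater than the position of the token's
    last character in the source text.

    The difference of 1 means the offsets form a half-open range
    [startOffset, endOffset), the same convention as String.substring(begin, end).

    With that convention, the token's length in the source text is simply
    endOffset - startOffset.
  */
  @Override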
  public final int endOffset() {
    return endOffset;
  }
  
  /*
    Throws an IllegalArgumentException if the start offset is negative,
    or if the end offset is less than the start offset.
  */
  @Override
  public void setOffset(int startOffset, int endOffset) {
    if (startOffset < 0 || endOffset < startOffset) {
      throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset; got "
          + "startOffset=" + startOffset + ",endOffset=" + endOffset);
    }
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }
  
  /*
    A Token's lexical type. The Default value is "word".
    These methods are declared in the TypeAttribute interface.
    They expose the token's type, which defaults to "word".
  */
  @Override
  public final String type() {
    return type;
  }
  @Override
  public final void setType(String type) {
    this.type = type;
  }

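  /*
    This method has no comment in this class; it implements the method
    declared in the TermFrequencyAttribute interface, which is documented as:
    Sets the custom term frequency of a term within one document.
    If this attribute is present in your analysis chain for a given field,
    that field must be indexed with
    {@link IndexOptions#DOCS_AND_FREQS}.
    In other words, it sets how many times the term occurs within one document.
    A value below 1 throws an IllegalArgumentException.
  */
  @Override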
  public final void setTermFrequency(int termFrequency) {
    if (termFrequency < 1) {
      throw new IllegalArgumentException("Term frequency must be 1 or greater; got " + termFrequency);
    }
    this.termFrequency = termFrequency;
  }

  @Override
  public final int getTermFrequency() {
    return termFrequency;
  }
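
To tie the attributes above together, here is a small sketch in plain Java of how position increments accumulate into positions and how the half-open offsets map back to the source text. PositionOffsetSketch is a hypothetical helper, not part of Lucene. Starting the running position at -1, so that the first token with increment 1 lands on position 0, is my assumption about how the indexer counts.

```java
import java.util.Arrays;

final class PositionOffsetSketch {
    // position(i) = position(i - 1) + increment(i); the running value starts at -1 (assumption).
    static int[] positions(int[] increments) {
        int[] positions = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            if (increments[i] < 0) {
                throw new IllegalArgumentException(
                    "Increment must be zero or greater: " + increments[i]);
            }
            current += increments[i];
            positions[i] = current;
        }
        return positions;
    }

    // Offsets form a half-open range [startOffset, endOffset), like String.substring.
    static String sourceText(String text, int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative, and endOffset must be >= startOffset");
        }
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        // "I come from China" with the stop word "from" removed:
        // the token after the gap carries increment 2.
        System.out.println(Arrays.toString(positions(new int[] {1, 1, 2})));
        // An increment of 0 stacks a token (a synonym, say) on the previous position.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 1})));
        // Offsets 2..6 cover "come"; its length is 6 - 2 = 4.
        System.out.println(sourceText("I come from China", 2, 6));
    }
}
```

An increment of 2 leaves a gap where a removed token used to be, which keeps a phrase query from treating the neighbours as adjacent, and an increment of 0 puts two tokens on the same position.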

To be covered in later updates:
2. Document.java
3. Field.java
4. IndexWriter.java
5. Attribute.java
6. AttributeFactory.java
7. AttributeImpl.java
8. AttributeSource.java
