The core classes of Lucene's analysis chain are Analyzer, TokenStream, Tokenizer, and TokenFilter.
Analyzer
Lucene's built-in analyzers include StandardAnalyzer, StopAnalyzer, SimpleAnalyzer, and WhitespaceAnalyzer.
TokenStream
The stream an analyzer produces once it has processed the text. It stores all of the per-token information, and the token units are read back out through the TokenStream.
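For example, here is a minimal consumption sketch (assuming a Lucene 5/6-era API; the field name "content" and the sample text are made-up values): obtain a TokenStream from an Analyzer, then advance it with incrementToken() while reading the CharTermAttribute.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // the field name "content" and the text are arbitrary sample values
        try (TokenStream ts = analyzer.tokenStream("content", "Lucene is a full-text search library")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // must be called before incrementToken()
            while (ts.incrementToken()) {    // advance to the next token unit
                System.out.println(term.toString());
            }
            ts.end();                        // record the final offset state
        }
        analyzer.close();
    }
}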
Tokenizer
Mainly responsible for receiving a character stream (a Reader) and breaking it into tokens. Some implementation classes are listed below; a usage sketch follows the list.
KeywordTokenizer
StandardTokenizer
CharTokenizer
|----WhitespaceTokenizer
|----LetterTokenizer
|----LowerCaseTokenizer
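A rough sketch of driving a Tokenizer directly (Lucene 5/6-era API assumed; the sample text is made up): the Reader is handed over with setReader(), after which the Tokenizer is consumed like any other TokenStream.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
    public static void main(String[] args) throws IOException {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("hello lucene tokenizer"));  // hand the Reader to the Tokenizer
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());    // prints: hello / lucene / tokenizer
        }
        tokenizer.end();
        tokenizer.close();
    }
}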
TokenFilter
Applies all kinds of filtering to the token units produced by the tokenizer.
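A minimal sketch of where a TokenFilter sits in the chain (Lucene 5/6-era API assumed; the class name MyAnalyzer is made up): createComponents() wires a Tokenizer to one or more filters and returns both as TokenStreamComponents.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();                       // produces the raw tokens
        TokenStream sink = new LowerCaseFilter(source);                     // filter 1: lowercase each token
        sink = new StopFilter(sink, StopAnalyzer.ENGLISH_STOP_WORDS_SET);   // filter 2: drop English stop words
        return new TokenStreamComponents(source, sink);
    }
}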
Let's look at the official definition of Analyzer:
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting
index terms from text.
First, for efficiency, a TokenStream can be reused within the same thread; the reuse policy is specified when the Analyzer is constructed.
private final ReuseStrategy reuseStrategy;
......
public Analyzer(ReuseStrategy reuseStrategy) {
    this.reuseStrategy = reuseStrategy;
}
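For instance, a subclass that wants per-field caching instead of the default can pass one of the two strategies Analyzer ships with, GLOBAL_REUSE_STRATEGY (the default) or PER_FIELD_REUSE_STRATEGY. A rough sketch (the class name PerFieldAnalyzer is made up):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class PerFieldAnalyzer extends Analyzer {
    public PerFieldAnalyzer() {
        // GLOBAL_REUSE_STRATEGY (the default) caches one components instance per thread;
        // PER_FIELD_REUSE_STRATEGY caches one instance per thread and field
        super(Analyzer.PER_FIELD_REUSE_STRATEGY);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        return new TokenStreamComponents(source);
    }
}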
ReuseStrategy works with two parameters: through the analyzer it obtains the reused TokenStreamComponents, which wraps the TokenStream together with its Tokenizer, while fieldName names the field whose components are being reused:
analyzer: Analyzer from which to get the reused components.
fieldName: Name of the field whose reusable TokenStreamComponents are to be retrieved.
Now let's look at the TokenStream abstract class. It is essentially an iterator over tokens:
TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.
Lucene contains several kinds of TokenStream, for example (a minimal hand-rolled one follows the list):
- LegacyNumericTokenStream: indexing numeric values
- BinaryTokenStream: returns a BytesRef as a single token
- BytesRefIteratorTokenStream
- CannedBinaryTokenStream: TokenStream from a canned list of binary (BytesRef-based) tokens.
- CannedTokenStream: TokenStream from a canned list of Tokens.
- CompletionTokenStream: converts a provided token stream to an automaton
- EmptyTokenStream: an always exhausted token stream.
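As a minimal hand-rolled illustration of the contract these classes share (the class name FixedTokenStream is made up), a TokenStream only has to populate its attributes in incrementToken() and return false once it is exhausted:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class FixedTokenStream extends TokenStream {
    private final String[] terms;
    private int index = 0;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public FixedTokenStream(String... terms) {
        this.terms = terms;
    }

    @Override
    public boolean incrementToken() {
        if (index >= terms.length) {
            return false;                        // no more tokens, stream exhausted
        }
        clearAttributes();                       // clear attributes left over from the previous token
        termAtt.setEmpty().append(terms[index++]);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        index = 0;                               // allow the stream to be consumed again
    }
}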
Tokenizer is itself a kind of TokenStream. Lucene ships with the following tokenizers; a small comparison sketch follows the list:
- CharTokenizer
- LetterTokenizer
- LowerCaseTokenizer
- WhitespaceTokenizer
- SmartChineseAnalyzer/HMMChineseTokenizer
- CJKTokenizer
- EdgeNGramTokenizer
- KeywordTokenizer
- NGramTokenizer
- SentenceTokenizer
- StandardTokenizer
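As a quick comparison sketch (Lucene 5/6-era API assumed; the sample strings are made up), the same helper run over a StandardTokenizer and an NGramTokenizer shows how different the resulting token units are:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CompareTokenizers {
    static void dump(Tokenizer tokenizer, String text) throws IOException {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        tokenizer.end();
        tokenizer.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        dump(new StandardTokenizer(), "full-text search");   // word-level tokens: [full] [text] [search]
        dump(new NGramTokenizer(2, 3), "search");             // character n-grams: [se] [sea] [ea] [ear] ...
    }
}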