TokenStream API
The `Attribute` interface and the `AttributeSource` class are the foundation on which Lucene's flexible indexing is built. An `AttributeSource` holds a list of `Attribute`s and exposes `get` and `add` methods to access them, with at most one instance per `Attribute` type. `TokenStream` extends `AttributeSource`, so `Attribute`s can be added to a `TokenStream`; `TokenFilter` in turn extends `TokenStream`.
Class | Description |
---|---|
Attribute | Base interface; carries one specific piece of information about a text token |
CharTermAttribute | The term text of the token |
OffsetAttribute | The start and end character offsets of the token |
PositionIncrementAttribute | The position of the current token relative to the previous token in the stream |
PositionLengthAttribute | The number of positions the token spans |
PayloadAttribute | An optional per-token payload, which can influence search scoring |
TypeAttribute | The token type, "word" by default |
FlagsAttribute | Flags passed from one stage of the tokenizer chain to another |
KeywordAttribute | Marks whether a token is a keyword |
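As a quick illustration of the "at most one instance per `Attribute` type" rule, the sketch below (assuming Lucene 8.x on the classpath; the class name `AttributeSourceDemo` is ours, not part of Lucene) requests the same `Attribute` class twice and reads two attributes in one pass over a `WhitespaceTokenizer`:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class AttributeSourceDemo {

  /** Tokenizes text on whitespace, returning "term:start-end" strings. */
  static List<String> tokenize(String text) throws IOException {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(text));

    CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
    // addAttribute is idempotent: asking again for the same Attribute type
    // returns the very same instance, never a second copy.
    if (termAtt != tokenizer.addAttribute(CharTermAttribute.class)) {
      throw new AssertionError("expected a single shared instance per Attribute type");
    }
    OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);

    List<String> tokens = new ArrayList<>();
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Both attribute instances are updated in place on each increment
      tokens.add(termAtt.toString() + ":" + offsetAtt.startOffset() + "-" + offsetAtt.endOffset());
    }
    tokenizer.end();
    tokenizer.close();
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(tokenize("hello world"));
  }
}
```

Note that the consumer keeps a single reference to each attribute; the stream mutates those instances in place as it advances.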
TokenStream workflow
- Instantiate a `TokenStream` and add the desired `Attribute`s to it.
- Call `reset()`. This must happen before the first `incrementToken()` call and restores the stream to a clean initial state; a subclass overriding `reset()` must call the superclass implementation first.
- The consumer calls `incrementToken()` to read each token's `Attribute` values, until it returns false (the stream is exhausted).
- Call `end()` so that end-of-stream operations are performed; internally this invokes `end()` on every `Attribute` held by the `AttributeSource`.
- After the stream has been consumed, call `close()` to release the underlying Reader.
A code example
```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

  private final Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new WhitespaceTokenizer());
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the TokenStream API";

    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);

    // Step 1: obtain a TokenStream; in this example it is a Tokenizer,
    // namely the WhitespaceTokenizer
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // Step 2: add and return the Attribute we are interested in.
    // Because TokenStream defines its own AttributeFactory, the termAtt
    // created here is actually of type PackedTokenAttributeImpl.
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      // Step 3: reset the stream before consuming it
      stream.reset();

      // Step 4: iterate over the tokens, printing the Attribute of each,
      // until incrementToken() returns false (stream exhausted)
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
      }

      // Step 5: perform end-of-stream work
      stream.end();
    } finally {
      // Step 6: release resources
      stream.close();
    }
  }
}
```
TokenFilter usage
In the example above we saw `This is a demo of the TokenStream API` split into 8 tokens. What if we want an extra filtering layer on top, say keeping only tokens that are at least 3 characters long? That is where `TokenFilter` comes in.
`TokenFilter` extends `TokenStream` and also takes a `TokenStream` as its constructor argument (in practice the argument is more often a `Tokenizer`). It follows the decorator pattern: it wraps the `TokenStream` passed to its constructor and enhances its behavior.
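As a minimal sketch of this decorator pattern (assuming Lucene 8.x; the class `UpperCaseTokenFilter` is our own illustration, not a Lucene class), a custom filter only needs to delegate `incrementToken()` to the wrapped stream and modify the attributes in place:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** A TokenFilter that upper-cases every term produced by the wrapped stream. */
public final class UpperCaseTokenFilter extends TokenFilter {

  // Attributes are shared with the wrapped stream via the common AttributeSource
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public UpperCaseTokenFilter(TokenStream input) {
    super(input); // decorate the wrapped TokenStream
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Delegate to the wrapped stream first; stop when it is exhausted
    if (!input.incrementToken()) {
      return false;
    }
    // Modify the term text in place
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      buffer[i] = Character.toUpperCase(buffer[i]);
    }
    return true;
  }
}
```

A filter that drops tokens (like `LengthFilter` below) instead loops, calling `input.incrementToken()` until it finds a token it wants to keep; Lucene provides `FilteringTokenFilter` as a base class for that case.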
```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyLengthFilterAnalyzer extends Analyzer {

  private final Version matchVersion;

  public MyLengthFilterAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    WhitespaceTokenizer source = new WhitespaceTokenizer();
    // Keep only tokens that are at least 3 characters long
    LengthFilter result = new LengthFilter(source, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, result);
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the TokenStream API";

    Version matchVersion = Version.LUCENE_8_8_2;
    MyLengthFilterAnalyzer analyzer = new MyLengthFilterAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      stream.reset();
      // print all tokens and their attributes until the stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
        System.out.println(stream.reflectAsString(true));
        System.out.println("------");
      }
      stream.end();
    } finally {
      stream.close();
    }
  }
}
```
Output (line breaks inserted manually for readability). Note that `demo` carries a positionIncrement of 3: the filtered-out tokens `is` and `a` still occupy positions, and the filter accumulates them into the increment of the next surviving token.
```text
This
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=This,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[54 68 69 73],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=0,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=4,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
demo
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=demo,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[64 65 6d 6f],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=10,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=14,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=3,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
the
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=the,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[74 68 65],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=18,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=21,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=2,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
TokenStream
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=TokenStream,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[54 6f 6b 65 6e 53 74 72 65 61 6d],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=22,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=33,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
API
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=API,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[41 50 49],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=34,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=37,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
```