TokenStream API
The `Attribute` interface and the `AttributeSource` class are the foundation on which Lucene's flexible indexing is built. An `AttributeSource` holds a list of `Attribute`s and exposes `get` and `add` methods to access them, with at most one instance per `Attribute` type. `TokenStream` extends `AttributeSource`, so `Attribute`s can be added to a `TokenStream`; `TokenFilter` in turn extends `TokenStream`.
Class | Description |
---|---|
Attribute | Base interface; carries one specific piece of information about a text token |
CharTermAttribute | The term text of the token |
OffsetAttribute | The start and end character offsets of the token |
PositionIncrementAttribute | The position of the current token relative to the previous token in the stream |
PositionLengthAttribute | The number of positions the token spans |
PayloadAttribute | An optional per-token payload, which can influence search scoring |
TypeAttribute | The token type, "word" by default |
FlagsAttribute | Flags passed from one stage of the tokenizer chain to another |
KeywordAttribute | Marks whether a token is a keyword |
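As a quick illustration of the "at most one instance per `Attribute` type" rule, the sketch below (assuming Lucene 8.x on the classpath; the class name `AttributeSourceDemo` is ours, not part of Lucene) requests the same `Attribute` class twice and reads two attributes in one pass over a `WhitespaceTokenizer`:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class AttributeSourceDemo {

  /** Tokenizes text on whitespace, returning "term:start-end" strings. */
  static List<String> tokenize(String text) throws IOException {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(text));

    CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
    // addAttribute is idempotent: asking again for the same Attribute type
    // returns the very same instance, never a second copy.
    if (termAtt != tokenizer.addAttribute(CharTermAttribute.class)) {
      throw new AssertionError("expected a single shared instance per Attribute type");
    }
    OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);

    List<String> tokens = new ArrayList<>();
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Both attribute instances are updated in place on each increment
      tokens.add(termAtt.toString() + ":" + offsetAtt.startOffset() + "-" + offsetAtt.endOffset());
    }
    tokenizer.end();
    tokenizer.close();
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(tokenize("hello world"));
  }
}
```

Note that the consumer keeps a single reference to each attribute; the stream mutates those instances in place as it advances.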
TokenStream workflow
- Instantiate a `TokenStream` and add the desired `Attribute`s to it.
- Call `reset()`. This must happen before the first `incrementToken()` call and restores the stream to a clean initial state; a subclass overriding `reset()` must call the superclass implementation first.
- The consumer calls `incrementToken()` to read each token's `Attribute` values, until it returns false (the stream is exhausted).
- Call `end()` so that end-of-stream operations are performed; internally this invokes `end()` on every `Attribute` held by the `AttributeSource`.
- After the stream has been consumed, call `close()` to release the underlying Reader.
A code example
```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

  private final Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new WhitespaceTokenizer());
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the TokenStream API";

    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);

    // Step 1: obtain a TokenStream; in this example it is a Tokenizer,
    // namely the WhitespaceTokenizer
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // Step 2: add and return the Attribute we are interested in.
    // Because TokenStream defines its own AttributeFactory, the termAtt
    // created here is actually of type PackedTokenAttributeImpl.
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      // Step 3: reset the stream before consuming it
      stream.reset();

      // Step 4: iterate over the tokens, printing the Attribute of each,
      // until incrementToken() returns false (stream exhausted)
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
      }

      // Step 5: perform end-of-stream work
      stream.end();
    } finally {
      // Step 6: release resources
      stream.close();
    }
  }
}
```
TokenFilter usage
In the example above we saw `This is a demo of the TokenStream API` split into 8 tokens. What if we want an extra filtering layer on top, say keeping only tokens that are at least 3 characters long? That is where `TokenFilter` comes in.
`TokenFilter` extends `TokenStream` and also takes a `TokenStream` as its constructor argument (in practice the argument is more often a `Tokenizer`). It follows the decorator pattern: it wraps the `TokenStream` passed to its constructor and enhances its behavior.
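As a minimal sketch of this decorator pattern (assuming Lucene 8.x; the class `UpperCaseTokenFilter` is our own illustration, not a Lucene class), a custom filter only needs to delegate `incrementToken()` to the wrapped stream and modify the attributes in place:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** A TokenFilter that upper-cases every term produced by the wrapped stream. */
public final class UpperCaseTokenFilter extends TokenFilter {

  // Attributes are shared with the wrapped stream via the common AttributeSource
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public UpperCaseTokenFilter(TokenStream input) {
    super(input); // decorate the wrapped TokenStream
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Delegate to the wrapped stream first; stop when it is exhausted
    if (!input.incrementToken()) {
      return false;
    }
    // Modify the term text in place
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      buffer[i] = Character.toUpperCase(buffer[i]);
    }
    return true;
  }
}
```

A filter that drops tokens (like `LengthFilter` below) instead loops, calling `input.incrementToken()` until it finds a token it wants to keep; Lucene provides `FilteringTokenFilter` as a base class for that case.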
```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MyLengthFilterAnalyzer extends Analyzer {

  private final Version matchVersion;

  public MyLengthFilterAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    WhitespaceTokenizer source = new WhitespaceTokenizer();
    // Keep only tokens that are at least 3 characters long
    LengthFilter result = new LengthFilter(source, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, result);
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the TokenStream API";

    Version matchVersion = Version.LUCENE_8_8_2;
    MyLengthFilterAnalyzer analyzer = new MyLengthFilterAnalyzer(matchVersion);
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the CharTermAttribute from the TokenStream
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

    try {
      stream.reset();
      // print all tokens and their attributes until the stream is exhausted
      while (stream.incrementToken()) {
        System.out.println(termAtt.toString());
        System.out.println(stream.reflectAsString(true));
        System.out.println("------");
      }
      stream.end();
    } finally {
      stream.close();
    }
  }
}
```
Output (line breaks inserted manually for readability). Note that `demo` carries a positionIncrement of 3: the filtered-out tokens `is` and `a` still occupy positions, and the filter accumulates them into the increment of the next surviving token.
```text
This
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=This,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[54 68 69 73],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=0,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=4,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
demo
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=demo,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[64 65 6d 6f],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=10,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=14,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=3,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
the
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=the,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[74 68 65],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=18,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=21,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=2,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
TokenStream
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=TokenStream,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[54 6f 6b 65 6e 53 74 72 65 61 6d],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=22,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=33,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
API
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=API,
org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[41 50 49],
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=34,
org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=37,
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,
org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength=1,
org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=word,
org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency=1
------
```