Understanding the Analyzer in Lucene

dejing6575

Original link: http://www.cnblogs.com/weiyinfu/p/6538024.html

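When learning a library, it is best to go to the official site. Many libraries change their APIs so drastically that the tutorials you find on blogs are already out of date.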

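The principle behind Lucene is nothing more than an "index": trading space for time. But Lucene has pushed this idea to the extreme, so anyone who writes an inverted index after it is really just doing it for practice.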

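One of Lucene's important modules is the analysis module. It preprocesses the input text, handling chores such as splitting it into tokens and removing stop words (for example "的" and "着").
Each token this module produces acts like a key. Conceptually, you hash the token and put the document id into the corresponding bucket. (This is a mental model only; Lucene's actual term dictionary is more elaborate than a plain hash table.)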
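To make the "token as key, document id in the bucket" picture concrete, here is a minimal sketch of a toy inverted index using only the JDK. It is not how Lucene is implemented; the class name `ToyInvertedIndex`, its methods, and the two-word stop list are all made up for illustration, and the tokens are assumed to have been produced by some analyzer already.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each token to the sorted set of ids of the documents containing it.
// Illustration only; Lucene's real index structures are far more sophisticated.
public class ToyInvertedIndex {
    // Hypothetical stop-word list, taken from the two examples in the text.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("的", "着"));

    // token -> ids of the documents that contain it (the "bucket" for that key)
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index one document, given the tokens an analyzer produced for it.
    public void add(int docId, List<String> tokens) {
        for (String token : tokens) {
            if (STOP_WORDS.contains(token)) {
                continue; // stop words are dropped before indexing
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the ids of all documents containing the token (empty set if none).
    public SortedSet<Integer> search(String token) {
        SortedSet<Integer> hits = postings.get(token);
        if (hits == null) {
            return new TreeSet<>();
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, Arrays.asList("我", "是", "中国", "人"));
        index.add(2, Arrays.asList("中国", "的", "首都"));
        System.out.println(index.search("中国")); // ids of documents containing "中国"
        System.out.println(index.search("的"));   // stop word, so nothing was indexed
    }
}
```

The lookup is a single map access per token, which is the "space for time" trade: all the work is done once at indexing time.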

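There are three Analyzers for Chinese:

ChineseAnalyzer (in the analyzers/cn package): each Chinese character is one token.
CJKAnalyzer (in the analysis/cjk package): every two adjacent Chinese characters form one token.
SmartChineseAnalyzer (in the analyzers/smartcn package): each word is one token.

Of these three, only CJKAnalyzer is part of the Lucene standard library; the other two require extra dependencies.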

Their output looks like this:
Example phrase: "我是中国人"

ChineseAnalyzer: 我－是－中－国－人
CJKAnalyzer: 我是－是中－中国－国人
SmartChineseAnalyzer: 我－是－中国－人
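The first two lines above are just character unigrams and overlapping bigrams, which a few lines of plain Java can reproduce. This is only a sketch of the idea, not the analyzers' actual code: the class `NGramSketch` and its `ngrams` method are hypothetical, and the sketch assumes the input contains only Chinese characters from the Basic Multilingual Plane (real analyzers also deal with punctuation, Latin text, and so on). The third line cannot be reproduced this way, because it needs real word segmentation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of unigram / bigram tokenization over a string of Chinese characters.
public class NGramSketch {
    // Return every run of n consecutive characters, sliding one character at a time.
    public static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        // n = 1: one character per token, like the ChineseAnalyzer line
        System.out.println(String.join("－", ngrams(phrase, 1)));
        // n = 2: overlapping pairs of characters, like the CJKAnalyzer line
        System.out.println(String.join("－", ngrams(phrase, 2)));
    }
}
```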

Clearly, ChineseAnalyzer and CJKAnalyzer are too crude to be useful; only SmartChineseAnalyzer comes with real word segmentation.

The code below shows how to create an Analyzer, get a TokenStream from it, and read the tokens out of that stream.

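import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer; // needs the smartcn module on the classpath
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class AnalyzerDemo { // class name is arbitrary
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SmartChineseAnalyzer(); // or any other analyzer
        TokenStream ts = analyzer.tokenStream("myfield", new StringReader("床前明月光，疑是地上霜。举头望明月，低头思故乡。"));
        // The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
        //   and pass the resulting Reader to the Tokenizer.
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

        try {
            ts.reset(); // Resets this stream to the beginning. (Required)
            while (ts.incrementToken()) {
                // Use AttributeSource.reflectAsString(boolean)
                // for token stream debugging.
                System.out.println("token: " + ts.reflectAsString(true));

                System.out.println("token start offset: " + offsetAtt.startOffset());
                System.out.println("  token end offset: " + offsetAtt.endOffset());
            }
            ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
        } finally {
            ts.close(); // Release resources associated with this stream.
            analyzer.close(); // The analyzer holds resources too.
        }
    }
}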

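The following attributes can be added to a TokenStream object to get extra information about each token. Add them before calling reset(), then read them after each incrementToken() call.

// Start and end character offsets of the token in the original text
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
// Position of the token relative to the previous one (normally 1)
PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
// The text of the token itself
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// The lexical type of the token, as assigned by the tokenizer
TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);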

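The content above comes from the official Lucene documentation, but as far as I can tell ChineseAnalyzer is no longer there, since it was hardly different from StandardAnalyzer.
So even official documentation suffers from version problems.
The lesson: take what you read with a grain of salt, don't get hung up on every detail, and don't trust the documentation blindly.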

Reposted from: https://www.cnblogs.com/weiyinfu/p/6538024.html