Using and Comparing the Built-in Analyzers in Lucene 4.0

enjoyinwind

Article link: https://blog.csdn.net/enjoyinwind/article/details/8277551

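I have been reading about Lucene recently. The examples in the book are easy to follow and the code is easy to understand, but Lucene 4.0 as downloaded from the official site differs from the earlier versions in several places.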

package org.apache.lucene.demo;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

	private static final String[] examples = {"The quick brown fox jumped over the lazy dog",
		"xyz&corporation - xyz@example.com"};
	
	private static final Analyzer[] analyzers = {
		new WhitespaceAnalyzer(Version.LUCENE_40),
		new SimpleAnalyzer(Version.LUCENE_40),
		new KeywordAnalyzer(),
		new StopAnalyzer(Version.LUCENE_40),
		new StandardAnalyzer(Version.LUCENE_40)
	};
	
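	/**
	 * Analyzes each command-line argument, or the built-in examples when none are given.
	 *
	 * @param args optional strings to analyze
	 * @throws IOException if a token stream cannot be read
	 */
	public static void main(String[] args) throws IOException {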
        String[] strings = examples;
        
        if(args.length > 0){
        	strings = args;
        }
        for(String text : strings){
        	analyze(text);
        }
	}
	
	public static void analyze(String text) throws IOException{
		for(Analyzer analyzer : analyzers){
			System.out.println(" "+analyzer.getClass().getSimpleName() + " ");
			AnalyzerUtils.displayTokes(analyzer, text);
			System.out.println();
		}
	}

}

The AnalyzerUtils class is very simple: it just prints some attributes of the tokens each analyzer produces.

package org.apache.lucene.demo;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class AnalyzerUtils {
    public static void displayTokes(Analyzer analyzer,String text) throws IOException{
    	displayTokes(analyzer.tokenStream("contents", new StringReader(text)));
    }
    
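    public static void displayTokes(TokenStream tokenStream) throws IOException{
    	 CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    	 tokenStream.reset();  // Required: without this call the loop throws java.lang.ArrayIndexOutOfBoundsException
    	 while(tokenStream.incrementToken()){
    		 System.out.print("["+termAttribute.toString()+"]");
    	 }
    	 tokenStream.end();    // Signal that the end of the stream has been reached
    	 tokenStream.close();  // Release the resources held by the stream
    }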
}

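The call to tokenStream.reset(); cannot be omitted. There is not much material online about using Lucene 4.0, and since I am new to it and have not yet read its source code, I do not know the underlying reason. I found the call in the official API documentation, tried it, and the program ran successfully.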

The output also differs from what the book describes. Running the code produces:

WhitespaceAnalyzer
[The][quick][brown][fox][jumped][over][the][lazy][dog]
SimpleAnalyzer
[the][quick][brown][fox][jumped][over][the][lazy][dog]
KeywordAnalyzer
[The quick brown fox jumped over the lazy dog]
StopAnalyzer
[quick][brown][fox][jumped][over][lazy][dog]
StandardAnalyzer
[quick][brown][fox][jumped][over][lazy][dog]
WhitespaceAnalyzer
[xyz&corporation][-][xyz@example.com]
SimpleAnalyzer
[xyz][corporation][xyz][example][com]
KeywordAnalyzer
[xyz&corporation - xyz@example.com]
StopAnalyzer
[xyz][corporation][xyz][example][com]
StandardAnalyzer
[xyz][corporation][xyz][example.com]
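
To make the differences in the output above easier to see, here is a minimal plain-Java sketch that imitates each analyzer's visible behavior. It is not Lucene code and not how Lucene implements these analyzers: the class name AnalyzerSketch, its method names, and the short stop-word list are all assumptions made for illustration, and the standardLike method is only a rough stand-in for the word-break rules StandardAnalyzer follows (it keeps a '.' when it sits between letters or digits).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java approximations of the five analyzers; this is NOT Lucene code.
public class AnalyzerSketch {
    // Assumption: a small illustrative stop-word list, not Lucene's actual set.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // WhitespaceAnalyzer-like: split on whitespace only, keep case and punctuation.
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Collects lowercased runs of word characters. In "standard-like" mode digits
    // count as word characters and a '.' between two of them stays in the token.
    private static List<String> scan(String text, boolean standardLike) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean wordChar = standardLike ? Character.isLetterOrDigit(c) : Character.isLetter(c);
            boolean innerDot = standardLike && c == '.' && current.length() > 0
                    && i + 1 < text.length() && Character.isLetterOrDigit(text.charAt(i + 1));
            if (wordChar || innerDot) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Drops every token that appears in the illustrative stop-word list.
    private static List<String> withoutStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    public static List<String> simple(String text) {
        return scan(text, false);
    }

    // KeywordAnalyzer-like: the whole input is a single token.
    public static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // StopAnalyzer-like: SimpleAnalyzer behavior plus stop-word removal.
    public static List<String> stop(String text) {
        return withoutStopWords(scan(text, false));
    }

    // Rough StandardAnalyzer stand-in: keeps "example.com" together, splits at '&' and '@'.
    public static List<String> standardLike(String text) {
        return withoutStopWords(scan(text, true));
    }

    public static void main(String[] args) {
        String text = "xyz&corporation - xyz@example.com";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(keyword(text));
        System.out.println(stop(text));
        System.out.println(standardLike(text));
    }
}
```

For the two sample strings in this post the sketch yields the same tokens as the listing above, which shows where the analyzers part ways: only the whitespace and keyword variants preserve case and punctuation, and only the standard-like variant keeps example.com as one token.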

The book describes the results as:

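On inspection, the main discrepancy with the book is StandardAnalyzer: it does not recognize the e-mail address as a single token, and it does not segment Chinese well, splitting text character by character so that every character becomes its own token. For example, analyzing “我们是中国人” gives [我:0->1:<IDEOGRAPHIC>][们:1->2:<IDEOGRAPHIC>][是:2->3:<IDEOGRAPHIC>][中:3->4:<IDEOGRAPHIC>][国:4->5:<IDEOGRAPHIC>][人:5->6:<IDEOGRAPHIC>].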
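
The character-by-character result can be reproduced without Lucene to see what "one character, one token" means in terms of offsets. The following is a minimal sketch in plain Java, not Lucene's tokenizer: the class name UnigramSketch and the method unigrams are hypothetical, and the <IDEOGRAPHIC> label is simply copied from the output quoted above. The sample phrase is written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of per-character segmentation of ideographs; NOT Lucene code.
public class UnigramSketch {
    // Emits one "char:start->end:<IDEOGRAPHIC>" entry per ideographic code point.
    // Offsets are counted in Java chars, so a supplementary code point spans two.
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int next = i + Character.charCount(codePoint);
            if (Character.isIdeographic(codePoint)) {
                String term = new String(Character.toChars(codePoint));
                tokens.add(term + ":" + i + "->" + next + ":<IDEOGRAPHIC>");
            }
            i = next;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The six-character sample phrase from the text, as Unicode escapes.
        String text = "\u6211\u4eec\u662f\u4e2d\u56fd\u4eba";
        StringBuilder line = new StringBuilder();
        for (String token : unigrams(text)) {
            line.append('[').append(token).append(']');
        }
        System.out.println(line);
    }
}
```

Because every token covers exactly one character, a word such as 中国 is never emitted as a unit, which is why this kind of segmentation works poorly for Chinese search.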

P.S. I am new to this and would welcome discussion, so that we can learn from each other.