How Chinese analysis in Lucene 3.4 differs from the 2.x versions:
The 2.x versions relied on third-party jars such as je.analyzer.jar.
In 3.4, the jars needed for Chinese analysis are bundled with the distribution itself:
- StandardAnalyzer: Index unigrams (individual Chinese characters) as tokens.
- CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
- SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.
Example phrase: "我是中国人"
- StandardAnalyzer: 我-是-中-国-人
- CJKAnalyzer: 我是-是中-中国-国人
- SmartChineseAnalyzer: 我-是-中国-人
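The unigram and bigram token shapes above are purely mechanical, so they can be reproduced without Lucene. The sketch below (class and method names are my own, not Lucene API) shows how StandardAnalyzer-style unigrams and CJKAnalyzer-style bigrams are derived from the example phrase; SmartChineseAnalyzer's dictionary-based word segmentation cannot be simulated this simply.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokenShapes {

    // One token per character, like StandardAnalyzer does for Chinese text.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping pairs of adjacent characters, like CJKAnalyzer.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        // These characters are all in the BMP, so char-based indexing is safe here.
        System.out.println(String.join("-", unigrams(phrase))); // 我-是-中-国-人
        System.out.println(String.join("-", bigrams(phrase)));  // 我是-是中-中国-国人
    }
}
```

Note that bigrams overlap: each character (except the first and last) appears in two tokens, which is why CJKAnalyzer produces n-1 tokens for n characters.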
The lines above are the tokenizer output; the analyzers are constructed as follows:
Analyzer analyzer4 = new SimpleAnalyzer(Version.LUCENE_34);        // letter-based, no special CJK handling
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);       // unigrams
Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_34);           // bigrams (analyzers/cjk)
Analyzer analyzer2 = new SmartChineseAnalyzer(Version.LUCENE_34);  // word segmentation (analyzers/smartcn)