As we know, Lucene essentially does two things: building an index and searching it. A key component in both is the analyzer, which I already touched on at http://xdwangiflytek.iteye.com/blog/1389308. To recap: an analyzer cuts up text, splitting it by its tokenization rules into the smallest units that get indexed (terms, i.e. keywords). Analyzers are used both when building the index and when searching, and to guarantee that searches return the right results, the same analyzer should be used for both.
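The "same analyzer on both sides" rule is easy to see with a toy example that has nothing to do with Lucene itself: if index-time tokenization lowercases terms but query-time tokenization does not, the query term can never match the indexed term. A minimal sketch (plain Java, with made-up helper names, not Lucene API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SameAnalyzerDemo {

    // A stand-in "analyzer": split on whitespace and lowercase each token
    static List<String> lowercaseAnalyzer(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // A different "analyzer": split on whitespace but keep the original case
    static List<String> caseSensitiveAnalyzer(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        // Index time: the stored terms are whatever the index-time analyzer produced
        Set<String> index = new HashSet<String>(lowercaseAnalyzer("Hello World"));

        // Same analyzer at query time -> the query term matches an indexed term
        String sameTerm = lowercaseAnalyzer("Hello").get(0);
        System.out.println(index.contains(sameTerm));   // true

        // Different analyzer at query time -> "Hello" != "hello", no match
        String otherTerm = caseSensitiveAnalyzer("Hello").get(0);
        System.out.println(index.contains(otherTerm));  // false
    }
}
```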
We deal with two kinds of analysis here, English and Chinese; English is comparatively simple.
Straight to the code, then.
AnalyzerDemo.java:
package com.iflytek.lucene;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 * @author xudongwang 2012-2-4
 *
 * Email:xdwangiflytek@gmail.com
 */
public class AnalyzerDemo {

    public void analyze(Analyzer analyzer, String text) throws Exception {
        System.out.println("-----------------------> Analyzer: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        // The attribute view is updated in place on every incrementToken() call
        CharTermAttribute termAtt = tokenStream.getAttribute(CharTermAttribute.class);
        // TypeAttribute typeAtt = tokenStream.getAttribute(TypeAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(termAtt.toString());
            // System.out.println(typeAtt.type());
        }
    }

    public static void main(String[] args) throws Exception {
        AnalyzerDemo demo = new AnalyzerDemo();

        System.out.println("----------------> English test");
        String enText = "Hello, my name is wang xudong, I in iteye blog address is xdwangiflytek.iteye.com";
        System.out.println(enText);

        System.out.println("Tokenized by StandardAnalyzer:");
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer, enText);

        System.out.println("Tokenized by SimpleAnalyzer:");
        Analyzer analyzer2 = new SimpleAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer2, enText);
        System.out.println("The output shows that StandardAnalyzer does not split on '.', while SimpleAnalyzer does");
        System.out.println();

        System.out.println("----------------> Chinese test");
        String znText = "我叫王旭东";
        System.out.println(znText);

        System.out.println("Tokenized by StandardAnalyzer:");
        // The output shows it treats every single character as a term, which is clearly inefficient
        demo.analyze(analyzer, znText);

        System.out.println("Tokenized by CJKAnalyzer (bigram segmentation):");
        Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_35);
        demo.analyze(analyzer3, znText);
    }
}
Output:
----------------> English test
Hello, my name is wang xudong, I in iteye blog address is xdwangiflytek.iteye.com
Tokenized by StandardAnalyzer:
-----------------------> Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
hello
my
name
wang
xudong
i
iteye
blog
address
xdwangiflytek.iteye.com
Tokenized by SimpleAnalyzer:
-----------------------> Analyzer: class org.apache.lucene.analysis.SimpleAnalyzer
hello
my
name
is
wang
xudong
i
in
iteye
blog
address
is
xdwangiflytek
iteye
com
The output shows that StandardAnalyzer does not split on '.', while SimpleAnalyzer does
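The difference comes from the tokenizers underneath: SimpleAnalyzer is essentially a letter tokenizer plus lowercasing, so every non-letter character (including '.') is a token boundary, while StandardAnalyzer's grammar recognizes things like host names as single tokens. The letter-splitting behavior can be approximated in a few lines (a sketch of the principle, not the actual Lucene implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {

    // Approximation of SimpleAnalyzer: tokens are maximal runs of letters, lowercased
    static List<String> simpleTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "." is not a letter, so the host name is split into three tokens
        System.out.println(simpleTokenize("xdwangiflytek.iteye.com"));
        // [xdwangiflytek, iteye, com]
    }
}
```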
----------------> Chinese test
我叫王旭东
Tokenized by StandardAnalyzer:
-----------------------> Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
我
叫
王
旭
东
Tokenized by CJKAnalyzer (bigram segmentation):
-----------------------> Analyzer: class org.apache.lucene.analysis.cjk.CJKAnalyzer
我叫
叫王
王旭
旭东
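CJKAnalyzer's output above follows a simple rule: emit every pair of adjacent CJK characters, which is exactly what "二分法" (bigram) segmentation means. The principle (again just a sketch, not the real CJKAnalyzer code) looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Emit every pair of adjacent characters, CJKAnalyzer-style
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("我叫王旭东"));
        // [我叫, 叫王, 王旭, 旭东]
    }
}
```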
Of the segmentation approaches above, the best for Chinese is still semantic (dictionary-based) segmentation, i.e. the well-known one from the Chinese Academy of Sciences.
I will cover concrete Chinese analyzers in a dedicated post later; this is just a first look.