Analyzers: Using a Chinese Analyzer, Extending the Dictionary, and Stop Words


彷徨的石头


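1. Common Chinese analyzers include the JE ("极易") analyzer (MMAnalyzer), the Paoding ("庖丁分词") analyzer (PaodingAnalyzer), IKAnalyzer, and others. Of these, MMAnalyzer and PaodingAnalyzer do not support Lucene 3.0 or later.

They are all used in much the same way. When constructing the analyzer:

Analyzer analyzer = new [My]Analyzer();

2. Only IKAnalyzer is demonstrated here, because at the time of writing it is the only one of the three that supports Lucene 3.0 and later.

First, add IKAnalyzer3.2.0Stable.jar to the classpath.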

Sample code:


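import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerTest
{
    @Test
    public void test() throws Exception
    {
        String text = "An IndexWriter creates and maintains an index.";
        /* Standard analyzer: splits Chinese text into single characters */
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        testAnalyzer(analyzer, text);

        String text2 = "测试中文环境下的信息检索";
        testAnalyzer(new IKAnalyzer(), text2); // IKAnalyzer: dictionary-based segmentation
    }

    /**
     * Tokenizes the given text with the given analyzer and prints each term.
     *
     * @param analyzer the analyzer to use
     * @param text the text to tokenize
     * @throws Exception if reading the token stream fails
     */
    private void testAnalyzer(Analyzer analyzer, String text) throws Exception
    {
        System.out.println("Analyzer in use: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        // addAttribute returns the stream's shared attribute instance, so fetch it once
        TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken())
        {
            System.out.println(termAttribute.term());
        }
    }
}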


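3. Extending the dictionary: in many cases we need a custom dictionary. For example, we may want the analyzer to recognize a company name such as "XXX 公司" and emit it as a single term instead of breaking it into pieces.

IKAnalyzer makes this easy.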
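To see why a dictionary entry changes the output, here is a toy forward-maximum-matching segmenter in plain Java. It is only a sketch under simplifying assumptions and is not IKAnalyzer's actual algorithm; the class name ToyDictionarySegmenter and the three starter dictionary words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy segmenter; it only illustrates dictionary-driven matching.
class ToyDictionarySegmenter {

    /** At each position, takes the longest dictionary word, else one character. */
    static List<String> segment(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a known word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("北京", "科技", "有限公司"));
        String text = "北京XXXX科技有限公司";

        // Without the custom entry the name is broken into pieces.
        System.out.println(segment(text, dictionary)); // [北京, X, X, X, X, 科技, 有限公司]

        // With the custom entry the whole name comes out as one term.
        dictionary.add("北京XXXX科技有限公司");
        System.out.println(segment(text, dictionary)); // [北京XXXX科技有限公司]
    }
}
```

A real analyzer is far more sophisticated, but the principle carries over: a word can only come out as a single term if the dictionary knows about it.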

Create IKAnalyzer.cfg.xml (IKAnalyzer 3.x looks for it at the root of the classpath):


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="ext_dict">/mydict.dic</entry>
</properties>


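Explanation:

<entry key="ext_dict">/mydict.dic</entry> registers an extension dictionary of our own, named mydict.dic.

So we need to create a text file named mydict.dic (the .dic extension is not required).

Put the following in this text file: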

北京XXXX科技有限公司

This adds one entry.

To add more entries, put each one on its own line:

词汇一

词汇二

词汇三

Note that this file must be saved in UTF-8 encoding.
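If you generate the dictionary from code rather than in an editor, the encoding is easy to pin down. The sketch below uses only the JDK: it writes one entry per line as UTF-8 and reads the file back with a strict UTF-8 decoder. The class name ExtDictWriter and the temp-file location are assumptions for the demo; in a real project the file would go wherever the ext_dict entry points.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of IKAnalyzer.
class ExtDictWriter {

    /** Writes one dictionary entry per line, encoded as UTF-8. */
    static void write(Path file, List<String> entries) {
        try {
            Files.write(file, entries, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the entries back; fails if the file is not valid UTF-8. */
    static List<String> read(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temp file keeps the demo self-contained.
        Path file = Files.createTempFile("mydict", ".dic");
        write(file, Arrays.asList("北京XXXX科技有限公司", "词汇一", "词汇二"));
        System.out.println(read(file));
        Files.delete(file);
    }
}
```

Reading the file back this way is also a quick check for a dictionary that was saved in another encoding by mistake, since malformed input makes the read fail instead of silently producing garbage.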

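4. Stop words:

Some words occur very frequently in text but contribute almost nothing to the information it carries, for example "a", "an", "the", and "of" in English, or "的", "了", and "着" in Chinese, as well as punctuation marks. Such words are called stop words.

After text is tokenized, stop words are usually filtered out and are not indexed. At search time, any stop words in the user's query are filtered out as well, because the query string goes through the same analysis.

Excluding stop words speeds up index building and reduces the size of the index files.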
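The filtering step itself is simple. Here is a minimal plain-Java sketch of the idea, applied to both a document and a query; the class and method names are hypothetical, and in a real setup the analyzer does this internally rather than your own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of stop-word filtering; not Lucene or IKAnalyzer code.
class StopWordSketch {

    /** Returns the tokens that are not stop words, in their original order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

        // Index time: only the informative tokens reach the index.
        List<String> docTokens = Arrays.asList("the", "history", "of", "an", "index");
        System.out.println(removeStopWords(docTokens, stopWords)); // [history, index]

        // Query time: the same filter runs, so the query still matches.
        List<String> queryTokens = Arrays.asList("history", "of", "the", "index");
        System.out.println(removeStopWords(queryTokens, stopWords)); // [history, index]
    }
}
```

Because the same filter runs on both sides, dropping the stop words costs nothing in matching: the terms that remain in the query are exactly the kind that were indexed.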

Customizing stop words in IKAnalyzer is just as easy and works much like configuring the extension dictionary. Simply add the following entry to IKAnalyzer.cfg.xml:

<entry key="ext_stopwords">/ext_stopword.dic</entry>

This entry likewise points to a text file, /ext_stopword.dic (any extension will do), in the following format:

也

了

仍

从
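Both entries live inside the same <properties> element, and the file uses the standard Java XML properties format (note the properties.dtd DOCTYPE). As a quick sanity check, the JDK's java.util.Properties can parse it and show which paths it declares. This is only a sketch for inspecting the file, not a description of how IKAnalyzer loads it; the class name IkConfigCheck and the inlined sample string are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper for inspecting an IKAnalyzer.cfg.xml-style file.
class IkConfigCheck {

    /** Sample configuration combining the two entries discussed in this post. */
    static final String SAMPLE_CONFIG =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "    <entry key=\"ext_dict\">/mydict.dic</entry>\n"
            + "    <entry key=\"ext_stopwords\">/ext_stopword.dic</entry>\n"
            + "</properties>\n";

    /** Parses XML in the Java properties format into a Properties object. */
    static Properties parse(String xml) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE_CONFIG);
        System.out.println(props.getProperty("ext_dict"));      // /mydict.dic
        System.out.println(props.getProperty("ext_stopwords")); // /ext_stopword.dic
    }
}
```

If a key comes back as null here, the entry is misspelled or sits outside the <properties> element, which is worth ruling out before debugging the analyzer itself.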

This article comes from a CSDN blog. Please credit the source when reposting: http://blog.csdn.net/wenlin56/archive/2010/12/13/6074124.aspx
