Solr Study Notes (2): Integrating the Chinese Tokenizer mmseg4j-1.9.1 into Solr 4.7.2

拾叶者说

Article link: https://blog.csdn.net/heikafei888/article/details/52526077

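mmseg4j supports max-word segmentation and is an excellent Chinese tokenizer. It implements Chih-Hao Tsai's MMSeg algorithm ( http://technology.chtsai.org/mmseg/ ) (I don't understand the details myself, so search for them if you are curious), and it provides a Lucene Analyzer and a Solr TokenizerFactory so that it is easy to use in both Lucene and Solr.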
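To give a rough feel for what dictionary-based segmentation does, here is a minimal Python sketch of plain forward maximum matching, the idea behind MMSeg's "simple" mode as described on the page linked above. This is a toy illustration, not mmseg4j's actual code: the function name, the `max_len` limit, and the four-word dictionary are all made up for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        # A single character is always accepted so the loop keeps moving forward.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini dictionary, for illustration only.
TOY_DICT = {"研究", "研究生", "生命", "起源"}

print(forward_max_match("研究生命起源", TOY_DICT))
# → ['研究生', '命', '起源']
```

The greedy match grabs "研究生" first and leaves a stray "命" behind. MMSeg's complex rules look at chunks of up to three words instead of one word at a time, which is meant to resolve exactly this kind of ambiguity.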

Integrating mmseg4j into Solr takes the following steps:

1. Download and unzip mmseg4j-1.9.1.zip, then copy all the jar files under dist into solr/WEB-INF/lib on your application server. There are three jar files: mmseg4j-analysis-1.9.1.jar, mmseg4j-core-1.9.1.jar, and mmseg4j-solr-1.9.1.jar.

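2. For versions before mmseg4j-1.9.0, you need to copy the data directory into solr_home/solr (at the same level as the cores) and rename it to dic. (I am using 1.9.1 and have not looked into earlier versions, so I won't say more about them.)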

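3. From mmseg4j-1.9.0 onward, including the mmseg4j-1.9.1 used here, the dicPath parameter is optional: without it, the words.dic bundled inside the mmseg4j-core jar (mmseg4j-core-1.9.1.jar for this version) is used. Add the following configuration to schema.xml: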

<fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="dic"/>
  </analyzer>
</fieldType>
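A typo in this XML is easy to miss until Solr refuses to start, so it can help to sanity-check the snippet first. The following is a minimal Python sketch, assuming the three fieldType definitions above are pasted into a string. The `<types>` wrapper is only there so the parser sees a single root element, and the helper name `mmseg_field_types` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The fieldType definitions from this step, wrapped in one root element for parsing.
SCHEMA_SNIPPET = """
<types>
  <fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="dic"/>
    </analyzer>
  </fieldType>
</types>
"""

# The three mode values used in the configuration above.
VALID_MODES = {"simple", "complex", "max-word"}


def mmseg_field_types(xml_text):
    """Return {fieldType name: mode} for every fieldType that uses MMSegTokenizerFactory."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    result = {}
    for field_type in root.iter("fieldType"):
        for tokenizer in field_type.iter("tokenizer"):
            if tokenizer.get("class", "").endswith("MMSegTokenizerFactory"):
                result[field_type.get("name")] = tokenizer.get("mode")
    return result


modes = mmseg_field_types(SCHEMA_SNIPPET)
for name, mode in modes.items():
    status = "ok" if mode in VALID_MODES else "unexpected mode"
    print(f"{name}: {mode} ({status})")
```

This only checks that the XML parses and that each mode is one of the three values used here. It says nothing about whether Solr can actually load the tokenizer class.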

4. Reference the mmseg4j tokenizer

To use the mmseg4j field types, just add the following field definitions to the same schema.xml:

<field name="mmseg4j_complex_name" type="text_mmseg4j_complex" indexed="true" stored="true"/>
<field name="mmseg4j_maxword_name" type="text_mmseg4j_maxword" indexed="true" stored="true"/>
<field name="mmseg4j_simple_name" type="text_mmseg4j_simple" indexed="true" stored="true"/>

5. With the steps above done, mmseg4j is configured in Solr.
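One way to check the result is to ask Solr how a piece of text is tokenized by each field type. The Python sketch below only builds the request URLs. It assumes the field analysis handler is registered at /analysis/field in solrconfig.xml (worth checking in your own setup), and the base URL with port 8080 and a core named collection1 is a placeholder, not something from my installation. The "Analysis" page in the Solr admin UI is another way to do the same check.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_type, text):
    """Build a field-analysis request URL for one field type and one sample text."""
    params = urlencode({
        "analysis.fieldtype": field_type,   # the fieldType name from schema.xml
        "analysis.fieldvalue": text,        # sample text to tokenize
        "wt": "json",
    })
    return f"{base_url}/analysis/field?{params}"


# Assumption: Tomcat on port 8080 and a core called collection1; adjust to your setup.
SOLR_CORE_URL = "http://localhost:8080/solr/collection1"

for field_type in ("text_mmseg4j_complex", "text_mmseg4j_maxword", "text_mmseg4j_simple"):
    print(field_analysis_url(SOLR_CORE_URL, field_type, "研究生命起源"))
```

Opening each printed URL in a browser, or fetching it with any HTTP client, should show the tokens produced by that mode, which makes it easy to compare the three side by side.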

I found these instructions here.

If nothing went wrong along the way, everything should work now.
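If Solr instead reports a class-loading error for the tokenizer factory at startup, one plausible cause is that the jars from step 1 never reached WEB-INF/lib. Here is a minimal Python sketch for checking that. The jar names are the three listed in step 1; the directory path is a placeholder, and `missing_jars` is a helper name made up for this example.

```python
from pathlib import Path

# The three jars listed in step 1.
EXPECTED_JARS = (
    "mmseg4j-analysis-1.9.1.jar",
    "mmseg4j-core-1.9.1.jar",
    "mmseg4j-solr-1.9.1.jar",
)


def missing_jars(lib_dir, expected=EXPECTED_JARS):
    """Return the expected jar names that are not present as files in lib_dir."""
    lib = Path(lib_dir)
    return [name for name in expected if not (lib / name).is_file()]


# Placeholder path; point this at the WEB-INF/lib of your own Solr webapp.
LIB_DIR = "/path/to/tomcat/webapps/solr/WEB-INF/lib"

missing = missing_jars(LIB_DIR)
print("all mmseg4j jars found" if not missing else f"missing: {missing}")
```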