Solr 4.4 + mmseg4j-1.9.1 Chinese word segmentation

tiankong6622

Category: Search engines. Tags: solr 4.4, lucene

Article link: https://blog.csdn.net/tiankong6622/article/details/84607804

1. For the Solr setup itself, see solr4.4.0配置笔记.txt (the Solr 4.4.0 configuration notes).

2. Download mmseg4j-1.9.1 from http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.zip
mmseg4j 1.8.3 supports only the Lucene 2.9/3.0 API and Solr 1.4; nothing else changed.
mmseg4j 1.8.5 supports Lucene 3.1 and Solr 3.1.
mmseg4j 1.9.0 supports Lucene 4.0 and Solr 4.0.

3. Create two folders, lib and dic, under E:\private_project\solr\solr_home\solr.

4. Unzip mmseg4j-1.9.1.zip and copy the 3 jars in mmseg4j-1.9.1\dist into the newly created lib folder, i.e. E:\private_project\solr\solr_home\solr\lib.

5. Unpack mmseg4j-core-1.9.1.jar from mmseg4j-1.9.1\dist and copy its 3 *.dic files into E:\private_project\solr\solr_home\solr\dic.
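Steps 3 to 5 can also be scripted. The following is a minimal sketch, not part of mmseg4j: the function name install_mmseg4j is hypothetical, and it assumes the dictionary files can be recognized by their .dic suffix inside the jar whose name starts with mmseg4j-core (a jar is an ordinary zip archive).

```python
import shutil
import zipfile
from pathlib import Path


def install_mmseg4j(dist_dir, solr_home):
    """Copy the jars from dist_dir into <solr_home>/lib and
    unpack every *.dic file of the core jar into <solr_home>/dic.

    Returns (copied jar names, extracted dictionary names).
    """
    dist_dir, solr_home = Path(dist_dir), Path(solr_home)
    lib_dir, dic_dir = solr_home / "lib", solr_home / "dic"
    # Step 3: create the two folders (no error if they already exist).
    lib_dir.mkdir(parents=True, exist_ok=True)
    dic_dir.mkdir(parents=True, exist_ok=True)

    # Step 4: copy all jars found in the dist folder.
    jars = sorted(dist_dir.glob("*.jar"))
    for jar in jars:
        shutil.copy2(jar, lib_dir / jar.name)

    # Step 5: pull the dictionaries out of the core jar.
    extracted = []
    for jar in jars:
        if not jar.name.startswith("mmseg4j-core"):
            continue
        with zipfile.ZipFile(jar) as archive:
            for member in archive.namelist():
                if member.endswith(".dic"):
                    # Drop any folder prefix inside the jar.
                    target = dic_dir / Path(member).name
                    target.write_bytes(archive.read(member))
                    extracted.append(target.name)
    return [jar.name for jar in jars], sorted(extracted)
```

With the paths used in this article, the call would look like install_mmseg4j(r"mmseg4j-1.9.1\dist", r"E:\private_project\solr\solr_home\solr"), assuming the zip was unpacked into the current directory.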

6. Edit schema.xml under E:\private_project\solr\solr_home\solr\collection1\conf and add the following in the appropriate places:

   <field name="simple" type="textSimple" indexed="true" stored="true"/>
   <field name="complex" type="textComplex" indexed="true" stored="true"/>
   <field name="MaxWord" type="textMaxWord" indexed="true" stored="true"/>

   <copyField source="simple" dest="text"/>
   <copyField source="complex" dest="text"/>
   <copyField source="MaxWord" dest="text"/>

   <fieldType name="textComplex" class="solr.TextField">
      <analyzer>
         <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
      </analyzer>
   </fieldType>
   <fieldType name="textMaxWord" class="solr.TextField">
      <analyzer>
         <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
      </analyzer>
   </fieldType>
   <fieldType name="textSimple" class="solr.TextField">
      <analyzer>
         <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="E:/private_project/solr/solr_home/solr/dic"/>
      </analyzer>
   </fieldType>

Note: dicPath is the directory where your *.dic files are stored.
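Before restarting the server, it can help to check that every field in the schema.xml additions points at a field type that really uses the mmseg4j tokenizer. The sketch below is only an illustration: the helper name tokenizer_modes is hypothetical, and the embedded fragment is a shortened copy of the step 6 configuration wrapped in a root element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the schema.xml additions, wrapped in one root element.
DIC = "E:/private_project/solr/solr_home/solr/dic"
FACTORY = "com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
SCHEMA_FRAGMENT = f"""
<schema>
  <field name="simple" type="textSimple" indexed="true" stored="true"/>
  <field name="complex" type="textComplex" indexed="true" stored="true"/>
  <field name="MaxWord" type="textMaxWord" indexed="true" stored="true"/>
  <fieldType name="textComplex" class="solr.TextField">
    <analyzer><tokenizer class="{FACTORY}" mode="complex" dicPath="{DIC}"/></analyzer>
  </fieldType>
  <fieldType name="textMaxWord" class="solr.TextField">
    <analyzer><tokenizer class="{FACTORY}" mode="max-word" dicPath="{DIC}"/></analyzer>
  </fieldType>
  <fieldType name="textSimple" class="solr.TextField">
    <analyzer><tokenizer class="{FACTORY}" mode="simple" dicPath="{DIC}"/></analyzer>
  </fieldType>
</schema>
"""


def tokenizer_modes(schema_xml):
    """Map each field name to the mmseg4j mode of its field type.

    Fields whose type does not use MMSegTokenizerFactory are skipped.
    """
    root = ET.fromstring(schema_xml)
    mode_by_type = {}
    for field_type in root.iter("fieldType"):
        tokenizer = field_type.find("./analyzer/tokenizer")
        if tokenizer is None:
            continue
        if tokenizer.get("class", "").endswith("MMSegTokenizerFactory"):
            mode_by_type[field_type.get("name")] = tokenizer.get("mode")
    return {
        field.get("name"): mode_by_type[field.get("type")]
        for field in root.iter("field")
        if field.get("type") in mode_by_type
    }


print(tokenizer_modes(SCHEMA_FRAGMENT))
```

A field missing from the result means its type attribute does not match any of the three mmseg4j field types, which is usually a typo in the name.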

7. Edit solrconfig.xml under E:\private_project\solr\solr_home\solr\collection1\conf and add the following in the appropriate place:
<lib dir="E:/private_project/solr/solr_home/solr/lib" regex=".*\.jar" />
Note: dir is the directory holding the 3 jars copied from mmseg4j-1.9.1\dist.
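To see which files the regex attribute would pick up, the pattern can be tried against a list of file names. This is a rough sketch that assumes Solr matches the pattern against the whole file name; Python's re module stands in for Java's regex engine, which behaves the same for a pattern this simple. The function name and the sample file names are hypothetical.

```python
import re

LIB_REGEX = r".*\.jar"  # same pattern as the regex attribute of <lib>


def jars_matched(filenames, pattern=LIB_REGEX):
    """Return the file names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in filenames if compiled.fullmatch(name)]


# Hypothetical directory listing: only the real jar should be selected.
print(jars_matched(["mmseg4j-core-1.9.1.jar", "words.dic", "notes.jar.txt"]))
```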

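8. Restart Tomcat and open http://localhost:8080/solr/#/collection1/analysis. Under "Analyse Fieldname / FieldType:", select MaxWord, then enter "solr是一个伟大的开源的搜索引擎" ("Solr is a great open-source search engine") in the text box under "Field Value (Index)" to see the segmentation result.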
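The same check can be made without the admin page by calling Solr's field analysis request handler. The sketch below only builds the request URL; it assumes the handler is registered at /analysis/field in solrconfig.xml (as in the stock example configuration) and that Solr runs at the address used in step 8. The function name analysis_url is hypothetical.

```python
from urllib.parse import urlencode


def analysis_url(text, field_name="MaxWord",
                 base="http://localhost:8080/solr", core="collection1"):
    """Build a field-analysis URL for the given text and field name."""
    params = {
        "analysis.fieldname": field_name,  # field defined in schema.xml
        "analysis.fieldvalue": text,       # text to run through the analyzer
        "wt": "json",                      # ask for a JSON response
    }
    return f"{base}/{core}/analysis/field?{urlencode(params)}"


print(analysis_url("solr是一个伟大的开源的搜索引擎"))
```

Opening the printed URL in a browser, or fetching it with urllib.request.urlopen, should return the token list produced by the analyzer of that field, provided the handler is enabled.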

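The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex additionally filters candidates with four rules. According to the official description, the correct word identification rate reaches 98.41%. mmseg4j implements both methods.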
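Forward maximum matching, the idea behind the Simple method, can be sketched in a few lines. This is an illustration with a toy dictionary, not mmseg4j's implementation; the function name and the sample words are assumptions, and the four Complex rules are not covered.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
# prints ['研究生', '命', '起源']
```

The greedy choice takes 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the better reading. Ambiguities of this kind are what the extra rules in Complex are meant to resolve.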
