1: schema.xml:
<!-- Chinese word segmentation with mmseg4j -->
<fieldtype name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="text_mmseg4j_maxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
2: solrconfig.xml:
<!-- mmseg4j reload words handler -->
<requestHandler name="/mmseg4j/reloadwords" class="com.chenlb.mmseg4j.solr.MMseg4jHandler">
  <lst name="defaults">
    <str name="dicPath">/data1/SolrCloud/WordsConf/mmseg4j/words</str>
    <str name="check">true</str>
    <str name="reload">true</str>
  </lst>
</requestHandler>
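3: Place the following in the /data1/SolrCloud/WordsConf/mmseg4j/words directory:
3.1: chars.dic, units.dic, and words.dic from mmseg4j-core-1.10.0.jar. These three are the official dictionaries; you can edit them to override the official configuration, or leave them unchanged.
3.2: Your own dictionary files, UTF-8 encoded, with file names that start with "words" and end with ".dic". One word per line. If the file is UTF-8 with a BOM, leave the first line empty.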
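The file rules in 3.2 can be sketched in Python as follows. This is a minimal illustration, not part of mmseg4j: the function name `write_words_dic` and the sample file name `words-custom.dic` are made up for this example. It writes UTF-8 without a BOM, so no empty first line is needed.

```python
import os
import tempfile


def write_words_dic(directory, filename, words):
    """Write one word per line to a UTF-8 (no BOM) dictionary file.

    Hypothetical helper for illustration; not an mmseg4j API.
    """
    # Step 3.2 requires the name to start with "words" and end with ".dic".
    if not (filename.startswith("words") and filename.endswith(".dic")):
        raise ValueError("dictionary file name must look like words*.dic")
    # Drop blank entries and duplicates while keeping the original order.
    cleaned = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    path = os.path.join(directory, filename)
    # encoding="utf-8" writes no BOM, so the first line can hold a word.
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for word in cleaned:
            fh.write(word + "\n")
    return path


if __name__ == "__main__":
    # Demo against a temporary directory instead of the real dicPath.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_words_dic(tmp, "words-custom.dic", ["云计算", "大数据"]))
```

In a real setup the directory argument would be the dicPath from schema.xml, and the file would have to exist on every node before the reload in step 4.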
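4: Reloading the Chinese segmentation dictionary files: the request below targets a single node. With multiple nodes or SolrCloud, the same request must be sent to every node before the change takes effect on all of them (the node list can be read from ZooKeeper):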
http://172.28.4.83:11010/solr/common_shard1_1_replica3/mmseg4j/reloadwords
= base path: http://172.28.4.83:11010/solr/common_shard1_1_replica3
+
handler path: /mmseg4j/reloadwords
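Sending the reload request to every node can be sketched in Python as follows. This is only a sketch under assumptions: the helper names `build_reload_url` and `reload_words` are made up for this example, and the list of core base paths has to be supplied by hand (for instance, collected from ZooKeeper or the Solr admin UI). Only the standard library is used.

```python
import urllib.request

# Handler path registered in solrconfig.xml (step 2).
HANDLER_PATH = "/mmseg4j/reloadwords"


def build_reload_url(base_path):
    """Join a core's base path and the handler path into one URL."""
    return base_path.rstrip("/") + HANDLER_PATH


def reload_words(base_paths, timeout=10):
    """Request the reload handler on every core.

    Returns a dict mapping each URL to its HTTP status code,
    or to the error text if the request failed.
    """
    results = {}
    for base in base_paths:
        url = build_reload_url(base)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[url] = str(exc)
    return results
```

A call such as `reload_words(["http://172.28.4.83:11010/solr/common_shard1_1_replica3"])` would cover the single node shown above; the base paths of the other replicas would be added to the list.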
5: If a node has reloaded the dictionary but the change has not taken effect, run the following reload command:
curl 'http://172.28.4.83:11010/solr/admin/collections?action=RELOAD&name=common'