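1. Download mmseg4j-1.9.1 and unzip mmseg4j-1.9.1.zip, then copy all the jar files under dist into solr/WEB-INF/lib on your application server.
There are three jar files: mmseg4j-analysis-1.9.1.jar, mmseg4j-core-1.9.1.jar, and mmseg4j-solr-1.9.1.jar.
Incidentally, for versions before mmseg4j-1.9.0 you also need to copy the data directory into solr_home/solr (at the same level as the cores) and rename it to dic; the relative dicPath="../dic" used below points at that directory, one level above the core.
Go into the core that should use the mmseg4j tokenizer (here the collection1 core that ships with Solr), open collection1/conf/schema.xml in an editor, and add the following: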
    <fieldType name="text_mmseg4j" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="../dic"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="../dic"/>
        </analyzer>
    </fieldType>
2. From mmseg4j-1.9.0 onward (including the mmseg4j-1.9.1 used in this example), the dicPath parameter is optional: without it, the words.dic bundled in the mmseg4j-core jar is used. Add the following to schema.xml:
    <fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
        </analyzer>
    </fieldType>
    <fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
        </analyzer>
    </fieldType>
    <fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="dic"/>
        </analyzer>
    </fieldType>
3. Reference the mmseg4j tokenizers
To use the mmseg4j field types, add field definitions such as the following to the same schema.xml:
    <field name="mmseg4j_complex_name" type="text_mmseg4j_complex" indexed="true" stored="true"/>
    <field name="mmseg4j_maxword_name" type="text_mmseg4j_maxword" indexed="true" stored="true"/>
    <field name="mmseg4j_simple_name" type="text_mmseg4j_simple" indexed="true" stored="true"/>
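With the steps above, the mmseg4j tokenizer is configured in Solr.
Note: the exception described below has already been fixed in the files available from http://pan.baidu.com/s/1pJCyAd9.
You can now open the Solr Admin page and run an analysis. However, entering Chinese text (华南理工大学) and clicking "Analyse Values" produces the following error: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
The error is caused by a bug in the mmseg4j source. Unpack the mmseg4j-analysis-1.9.1.zip downloaded above, open the class MMSegTokenizer.java in the mmseg4j-analysis directory, and modify its reset() method by adding the line marked by the comment below: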
    public void reset() throws IOException {
        super.reset(); // add this line
        // ... rest of the original method body unchanged ...
    }
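The error message itself points at a subclass that does not call super.reset(). The following is a minimal, self-contained sketch of that failure mode. It deliberately uses simplified stand-in classes (`SketchTokenizer`, `BrokenTokenizer`, and `FixedTokenizer` are hypothetical names, not Lucene or mmseg4j classes) and only assumes that the base class activates the new input inside its own reset(), which is what the error text suggests.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ResetContractDemo {

    /** Simplified stand-in for a tokenizer base class (hypothetical, not the real Lucene API). */
    static abstract class SketchTokenizer {
        // Sentinel reader that fails on every read until reset() swaps in the real input.
        private static final Reader ILLEGAL_STATE_READER = new Reader() {
            @Override
            public int read(char[] buf, int off, int len) {
                throw new IllegalStateException(
                        "TokenStream contract violation: reset()/close() call missing");
            }

            @Override
            public void close() {
            }
        };

        protected Reader input = ILLEGAL_STATE_READER;
        private Reader inputPending = ILLEGAL_STATE_READER;

        /** Stores the new input; it only becomes readable after reset(). */
        final void setReader(Reader reader) {
            inputPending = reader;
        }

        /** Base reset: makes the pending reader the active input. */
        void reset() throws IOException {
            input = inputPending;
            inputPending = ILLEGAL_STATE_READER;
        }

        /** Reads one character from the active input. */
        int readChar() throws IOException {
            return input.read();
        }
    }

    /** Overrides reset() but forgets super.reset(), so the input stays the sentinel. */
    static class BrokenTokenizer extends SketchTokenizer {
        @Override
        void reset() throws IOException {
            // subclass-specific state would be reset here; super.reset() is missing
        }
    }

    /** Same override with the one-line fix. */
    static class FixedTokenizer extends SketchTokenizer {
        @Override
        void reset() throws IOException {
            super.reset(); // the added line
        }
    }

    /** Returns true if the tokenizer can read its input after setReader() + reset(). */
    static boolean canConsume(SketchTokenizer tokenizer, String text) {
        try {
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            return tokenizer.readChar() != -1;
        } catch (IllegalStateException e) {
            return false; // the contract check fired
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("broken tokenizer readable: " + canConsume(new BrokenTokenizer(), "solr"));
        System.out.println("fixed tokenizer readable: " + canConsume(new FixedTokenizer(), "solr"));
    }
}
```

In this model the broken subclass never gets a usable input, so the first read fails, while the fixed one reads normally. The real classes are more involved, but the role of the super.reset() call is the same.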
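After making the change, run mvn clean package -DskipTests to build a new mmseg4j-analysis-1.9.1.jar, and use it to replace the mmseg4j-analysis-1.9.1.jar under WEB-INF/lib of the Solr webapp in Tomcat.
Restart Tomcat, open the Solr Admin page, and enter Chinese text under "Analysis"; the analysis now completes successfully.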
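The same check can also be scripted instead of clicking through the Admin page. The sketch below assumes several things not stated above: that Tomcat serves Solr at http://localhost:8080/solr, that the core is collection1, and that a FieldAnalysisRequestHandler is registered at /analysis/field in solrconfig.xml (as in the stock Solr 4 example configuration). The class and method names are made up for illustration; adjust the URL to your own setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class AnalysisCheck {

    /** Builds a field-analysis URL for the given core base URL, field type, and text. */
    static String analysisUrl(String baseUrl, String fieldType, String text) {
        try {
            return baseUrl + "/analysis/field?wt=json"
                    + "&analysis.fieldtype=" + URLEncoder.encode(fieldType, "UTF-8")
                    + "&analysis.fieldvalue=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Fetches the URL and returns the response body as a string. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the core; change host, port, and core name as needed.
        String base = "http://localhost:8080/solr/collection1";
        // "\u534e\u5357\u7406\u5de5\u5927\u5b66" is the sample text 华南理工大学, escaped to stay encoding-safe.
        String text = "\u534e\u5357\u7406\u5de5\u5927\u5b66";
        System.out.println(fetch(analysisUrl(base, "text_mmseg4j_complex", text)));
    }
}
```

If the tokenizer is working, the response should list the tokens produced for the text. An HTTP error response makes fetch throw an IOException.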