First, a comparison of the performance of several Chinese word segmenters:
Based on this comparison, in my experience ansj delivers the best segmentation performance and the results that best match real-world text.
1. Download nlp_lang from http://maven.ansj.org/org/nlpcn/nlp-lang/0.3/ to obtain nlp-lang-0.3.jar.
2. Download ansj_seg from http://maven.ansj.org/org/ansj/ansj_seg/
Alternatively, download the source from https://github.com/NLPchina/ansj_seg , unpack it, and run: mvn clean install -DskipTests=true (the generated jar is placed under the target directory), producing ansj_seg-2.0.8.jar.
3. Build ansj_lucene4_plug
Enter the directory ansj_seg-master\plug\ansj_lucene4_plug inside the unpacked source and run mvn clean install -DskipTests=true , producing ansj_lucene4_plug-2.0.2.jar.
Note: a Solr segmentation plugin mainly implements two classes, a TokenizerFactory and a Tokenizer. The former is what Solr's schema.xml configuration invokes: it reads the settings from the XML file and returns the corresponding Tokenizer. The latter receives the character stream Solr passes in, invokes the segmenter, and produces the final stream of segmented terms. The Ansj project ships a Lucene plugin that contains an Analyzer implementation and a Tokenizer implementation. Since Solr is built on Lucene, a Solr TokenizerFactory plays the role of a Lucene Analyzer, and the Tokenizer class can be reused as-is.
4. Rewrite the TokenizerFactory
package com.iscas.AnsjTokenizerFactory;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class AnsjTokenizerFactory extends TokenizerFactory {

    private boolean pstemming;
    private boolean isQuery;
    private String stopwordsDir;
    public Set<String> filter;

    public AnsjTokenizerFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        // isQuery defaults to true (precise, query-time segmentation);
        // pstemming defaults to false (no English stemming)
        isQuery = getBoolean(args, "isQuery", true);
        pstemming = getBoolean(args, "pstemming", false);
        stopwordsDir = get(args, "words");
        addStopwords(stopwordsDir);
    }

    // Load the stopword list (one word per line, UTF-8) into the filter set
    private void addStopwords(String dir) {
        if (dir == null) {
            System.out.println("no stopwords dir");
            return;
        }
        System.out.println("stopwords: " + dir);
        filter = new HashSet<String>();
        File file = new File(dir);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            String word;
            while ((word = br.readLine()) != null) {
                filter.add(word);
            }
        } catch (IOException e) {
            System.out.println("failed to read stopword file: " + e.getMessage());
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        if (isQuery) {
            // query time: precise segmentation
            return new AnsjTokenizer(new ToAnalysis(new BufferedReader(input)), input, filter, pstemming);
        } else {
            // index time: less precise segmentation that produces more terms
            return new AnsjTokenizer(new IndexAnalysis(new BufferedReader(input)), input, filter, pstemming);
        }
    }
}
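The addStopwords helper above reads one stopword per line in UTF-8 and collects the words into a Set. The same loading logic can be exercised with only the JDK, independent of the plugin; the class name and the sample words below are illustrative only:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordLoadDemo {
    // Read a UTF-8 stopword file, one word per line, into a Set
    static Set<String> loadStopwords(Path file) throws IOException {
        Set<String> filter = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                filter.add(word);
            }
        }
        return filter;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stopword file created on the fly for the demo
        Path tmp = Files.createTempFile("stopwords", ".dic");
        Files.write(tmp, List.of("的", "了", "是"), StandardCharsets.UTF_8);
        Set<String> filter = loadStopwords(tmp);
        System.out.println(filter.size());
        System.out.println(filter.contains("的"));
        Files.deleteIfExists(tmp);
    }
}
```

Trimming and skipping blank lines makes the loader tolerant of trailing whitespace or empty lines in hand-edited dictionary files, which the original readLine loop would silently add to the filter set.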
The project directory structure is as follows:
Export the project as AnsjTokenizerFactory.jar (the lib entries do not need to be checked when exporting).
isQuery
selects the segmentation strategy: the precise mode needed at query time, or the less precise index-time mode that produces more terms; the corresponding analyzer is invoked accordingly.
pstemming
a parameter provided by the original author that controls whether English words are normalized (plurals, third-person forms, and so on).
words
the path to the stopword file.
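The three parameters map onto the constructor shown earlier, where absent attributes fall back to defaults (isQuery=true, pstemming=false, words unset). The default handling can be mimicked with plain JDK map lookups; the helper below is an illustration, not the Lucene API itself:

```java
import java.util.HashMap;
import java.util.Map;

public class ArgDefaultsDemo {
    // Mimics TokenizerFactory.getBoolean: use the default when the key is absent
    static boolean getBoolean(Map<String, String> args, String key, boolean dflt) {
        String v = args.get(key);
        return v == null ? dflt : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        // Attributes as they would appear on the <tokenizer .../> element
        Map<String, String> conf = new HashMap<>();
        conf.put("isQuery", "false");   // index-time analyzer
        System.out.println(getBoolean(conf, "isQuery", true));    // explicit: false
        System.out.println(getBoolean(conf, "pstemming", false)); // absent: default false
        System.out.println(conf.get("words"));                    // absent: null
    }
}
```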
5. Copy the jar files
Place the four jars generated above into %TOMCAT_HOME%/webapps\solr\WEB-INF\lib.
6. Copy library.properties
Copy library.properties into %TOMCAT_HOME%/webapps\solr\WEB-INF\classes; the modified content is as follows:
#redress dic file path
ambiguityLibrary=F:/solr_dic/ansj/ansj-library/ambiguity.dic
#path of userLibrary this is default library
userLibrary=F:/solr_dic/ansj/ansj-library/
#set real name
isRealName=true
Here userLibrary may be configured either as a specific file or as a directory; in the latter case the system automatically scans the directory for .dic files, as shown in the figure below:
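Since library.properties is loaded as a plain Java properties file, a quick JDK-only check confirms that the keys parse as expected (the content string below mirrors the file above; the paths are examples):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class LibraryPropsDemo {
    public static void main(String[] args) throws IOException {
        // Same keys as the library.properties shown above; paths are examples
        String content =
                "ambiguityLibrary=F:/solr_dic/ansj/ansj-library/ambiguity.dic\n"
              + "userLibrary=F:/solr_dic/ansj/ansj-library/\n"
              + "isRealName=true\n";
        Properties p = new Properties();
        p.load(new StringReader(content));
        System.out.println(p.getProperty("userLibrary"));
        System.out.println(p.getProperty("isRealName"));
    }
}
```

Note that Properties splits each line at the first unescaped `=`, so the drive-letter colon in values like `F:/...` is preserved intact.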
7. Modify schema.xml
<fieldType name="text_ansj" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.iscas.AnsjTokenizerFactory.AnsjTokenizerFactory" isQuery="false" pstemming="true" words="F:/solr_dic/ansj/ansj-stopword/stopwords.dic"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.iscas.AnsjTokenizerFactory.AnsjTokenizerFactory" isQuery="true" pstemming="true" words="F:/solr_dic/ansj/ansj-stopword/stopwords.dic"/>
  </analyzer>
</fieldType>
Note that the index analyzer sets isQuery="false" (IndexAnalysis, more terms), while the query analyzer sets isQuery="true" (the precise ToAnalysis mode).
The full content is shown in the figure below:
8. Restart Tomcat and test
The test results are as follows:
Resource download: http://download.csdn.net/detail/allthesametome/8904845
References:
http://iamyida.iteye.com/blog/2220833
http://segmentfault.com/a/1190000000418637
http://www.cnblogs.com/likehua/p/4481219.html