Installing the ansj_seg Tokenizer in Solr

First, a performance comparison of the major Chinese word segmenters (comparison chart not reproduced here):

Based on that comparison, ansj offers, in my view, the best segmentation performance and the most practical results.


1. Download nlp-lang from http://maven.ansj.org/org/nlpcn/nlp-lang/0.3/ to obtain nlp-lang-0.3.jar.


2. Download ansj_seg from http://maven.ansj.org/org/ansj/ansj_seg/

Alternatively, build it from source at https://github.com/NLPchina/ansj_seg : after downloading and extracting, run mvn clean install -DskipTests=true (the jar is produced under the target directory), which yields ansj_seg-2.0.8.jar. If you prefer Maven dependencies over manual downloads, see the snippet below.
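If you would rather let Maven fetch both artifacts, the coordinates implied by the repository paths above should be as follows (a sketch; verify the group IDs and versions against the repository):

<dependency>
    <groupId>org.nlpcn</groupId>
    <artifactId>nlp-lang</artifactId>
    <version>0.3</version>
</dependency>
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>2.0.8</version>
</dependency>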


3. Build ansj_lucene4_plug

Go into the extracted directory ansj_seg-master\plug\ansj_lucene4_plug and run mvn clean install -DskipTests=true to produce ansj_lucene4_plug-2.0.2.jar.

Note: a Solr segmentation plugin mainly consists of a TokenizerFactory class and a Tokenizer class. The factory is invoked through Solr's schema.xml configuration: it reads the attributes set in the XML and returns the corresponding Tokenizer. The Tokenizer receives the data stream from Solr, invokes the segmenter, and emits the resulting stream of segmented terms. The ansj project already ships a Lucene plugin containing Analyzer and Tokenizer implementations; since Solr is built on Lucene, Solr's TokenizerFactory plays the role of Lucene's Analyzer, and the Tokenizer class can be reused as-is.


4. Write a custom TokenizerFactory

package com.iscas.AnsjTokenizerFactory;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class AnsjTokenizerFactory extends TokenizerFactory {
    private boolean pstemming;
    private boolean isQuery;
    private String stopwordsDir;
    public Set<String> filter;

    public AnsjTokenizerFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        isQuery = getBoolean(args, "isQuery", true);
        pstemming = getBoolean(args, "pstemming", false);
        stopwordsDir = get(args,"words");
        addStopwords(stopwordsDir);
    }
    // Load the stopword list (one word per line, UTF-8) into the filter set
    private void addStopwords(String dir) {
        if (dir == null) {
            System.out.println("no stopwords dir");
            return;
        }
        System.out.println("stopwords: " + dir);
        filter = new HashSet<String>();
        // try-with-resources ensures the reader is closed even on error
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(dir)), "UTF-8"))) {
            String word;
            while ((word = br.readLine()) != null) {
                filter.add(word);
            }
        } catch (FileNotFoundException e) {
            System.out.println("No stopword file found");
        } catch (IOException e) {
            System.out.println("stopword file io exception");
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        if (isQuery) {
            // Query time: ToAnalysis gives precise segmentation
            return new AnsjTokenizer(new ToAnalysis(new BufferedReader(input)), input, filter, pstemming);
        } else {
            // Index time: IndexAnalysis emits more terms for better recall
            return new AnsjTokenizer(new IndexAnalysis(new BufferedReader(input)), input, filter, pstemming);
        }
    }
}
The project structure is straightforward: a single AnsjTokenizerFactory class (structure screenshot omitted).

Export the project as AnsjTokenizerFactory.jar; there is no need to include the lib folder in the export.


isQuery selects the segmentation strategy: the precise segmentation needed at query time, or the less precise but more prolific segmentation needed when building the index. Depending on the value, a different segmenter is invoked (see the sketch after these parameter notes).

pstemming is a parameter provided by the original author that controls whether English words are normalized for plurals, third-person forms, and so on.

words is the path to the stopword file; as the addStopwords code above shows, it is read as a UTF-8 text file with one stopword per line.
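To see what the two strategies actually produce, here is a minimal standalone sketch, assuming the ansj_seg 2.x API in which ToAnalysis.parse and IndexAnalysis.parse take a String and return the segmented terms:

import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SegmentDemo {
    public static void main(String[] args) {
        String text = "中文分词是搜索引擎的基础";
        // Precise segmentation: the strategy used at query time
        System.out.println(ToAnalysis.parse(text));
        // Exhaustive segmentation: the strategy used at index time (more, overlapping terms)
        System.out.println(IndexAnalysis.parse(text));
    }
}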

5. Copy the jar files

Copy the four jars produced above (nlp-lang-0.3.jar, ansj_seg-2.0.8.jar, ansj_lucene4_plug-2.0.2.jar, and AnsjTokenizerFactory.jar) into %TOMCAT_HOME%/webapps/solr/WEB-INF/lib.


6. Copy library.properties

Copy library.properties into %TOMCAT_HOME%/webapps/solr/WEB-INF/classes. The modified contents are as follows:

#redress dic file path
ambiguityLibrary=F:/solr_dic/ansj/ansj-library/ambiguity.dic
#path of userLibrary this is default library
userLibrary=F:/solr_dic/ansj/ansj-library/
#set real name
isRealName=true
userLibrary can point either to a specific file or to a directory; in the directory case, the .dic files under it are scanned automatically.
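For reference, a user dictionary .dic file holds one entry per line. To the best of my understanding of ansj's default userDefine format (verify against your ansj version), each line is tab-separated into word, part-of-speech tag, and frequency, for example:

自然语言处理	userDefine	1000
中文分词	userDefine	1000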



7. Modify schema.xml

<fieldType name="text_ansj" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="com.iscas.AnsjTokenizerFactory.AnsjTokenizerFactory" isQuery="false" pstemming="true" words="F:/solr_dic/ansj/ansj-stopword/stopwords.dic"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.iscas.AnsjTokenizerFactory.AnsjTokenizerFactory" isQuery="true" pstemming="true" words="F:/solr_dic/ansj/ansj-stopword/stopwords.dic"/>
    </analyzer>
</fieldType>

Note that the index analyzer sets isQuery="false" (IndexAnalysis, more terms) while the query analyzer sets isQuery="true" (the precise ToAnalysis), matching the parameter semantics described above.
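To put the new type to use, reference it from a field definition in the same schema.xml; a minimal sketch (the field name content here is just an example):

<field name="content" type="text_ansj" indexed="true" stored="true"/>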


8. Restart Tomcat and test

Restart Tomcat and verify the segmentation on Solr's Analysis page (result screenshot omitted).

Resource download: http://download.csdn.net/detail/allthesametome/8904845


References:

http://iamyida.iteye.com/blog/2220833

http://segmentfault.com/a/1190000000418637

http://www.cnblogs.com/likehua/p/4481219.html
