Completed the integration of Paoding with Lucene.

 Understood the basic principles of Paoding (庖丁) word segmentation; compiled the source code, building with ANT:
E:/workspace/searchengine/paoding-analysis-2.0.4-beta
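Rebuilding Paoding from source should just be a matter of running ANT from the checkout directory (assuming the distribution ships its own build.xml; adjust the path to your checkout):

E:/workspace/searchengine/paoding-analysis-2.0.4-beta>ant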


Completed integrating the Chinese word segmenter into Solr.
  Notes:
  1) Change Solr's Tomcat connector so it accepts UTF-8:
  <Connector port="8080" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>
  2) Switch the PHP pages to UTF-8 by adding header("Content-Type: text/html; charset=UTF-8"); at the top of the script.
  3) Modify schema.xml (a fuller sketch of this fieldType follows the list):
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
  <!-- tokenizer class="solr.WhitespaceTokenizerFactory"/-->
        <!-- add Chinese word segmentation -->
        <tokenizer class="solr.ChineseTokenizerFactory" mode="most-words" />
        .....
  4) When adding Paoding, set the PAODING_DIC_HOME environment variable to point at the /dic directory (example after this list).
  5) Write a class, org.apache.solr.analysis.ChineseTokenizerFactory, that wraps the Paoding tokenizer; see Appendix 1 below.
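For reference, note 4's variable as set on this machine (the path matches the dictionary directory shown in the log below):

set PAODING_DIC_HOME=E:/workspace/searchengine/paoding-analysis-2.0.4-beta/dic

And a fuller version of note 3's fieldType. The query-side analyzer and the LowerCaseFilterFactory entries are assumptions modeled on the stock Solr example schema, not taken from these notes:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ChineseTokenizerFactory" mode="most-words"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ChineseTokenizerFactory" mode="most-words"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>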

-------------------20080730---------
Completed the integration of Lucene with Paoding.

E:/workspace/searchengine/indexdata>java org.apache.lucene.demo.IndexFiles %HERITRIX_HOME%/jobs/default3-20080730031933671/mirror/news.21315.com
 
  info = config paoding analysis from: E:/workspace/searchengine/luncene/test/lucene-2.3.2/bin/paoding-analysis.properties;E:/workspace/searchengine/luncene/test/lucene-2.3.2/file:/E:/workspace/searchengine/paoding-analysis-2.0.4-beta/paoding-analysis.jar!/paoding-analysis-default.properties;E:/workspace/searchengine/luncene/test/lucene-2.3.2/file:/E:/workspace/searchengine/paoding-analysis-2.0.4-beta/paoding-analysis.jar!/paoding-analyzer.properties;E:/workspace/searchengine/luncene/test/lucene-2.3.2/file:/E:/workspace/searchengine/paoding-analysis-2.0.4-beta/paoding-analysis.jar!/paoding-dic-home.properties;E:/workspace/searchengine/paoding-analysis-2.0.4-beta/dic/paoding-dic-names.properties;E:/workspace/searchengine/luncene/test/lucene-2.3.2/file:/E:/workspace/searchengine/paoding-analysis-2.0.4-beta/paoding-analysis.jar!/paoding-knives.properties;E:/workspace/searchengine/luncene/test/lucene-2.3.2/file:/E:/workspace/searchengine/paoding-analysis-2.0.4-beta/paoding-analysis.jar!/paoding-knives-user.properties
info = add knike: net.paoding.analysis.knife.CJKKnife
info = add knike: net.paoding.analysis.knife.LetterKnife
info = add knike: net.paoding.analysis.knife.NumberKnife
info = add knike: net.paoding.analysis.knife.CJKKnife
info = add knike: net.paoding.analysis.knife.LetterKnife
info = add knike: net.paoding.analysis.knife.NumberKnife
info = loading dictionaries from E:/workspace/searchengine/paoding-analysis-2.0.4-beta/dic
info = loaded success!
Indexing to directory 'index'...
adding E:/workspace/searchengine/heritrix/heritrix-1.14.0/target/heritrix-1.14.0/bin/heritrix-1.14.0/jobs/default3-20080730031933671/mirror/news.21315.com/2008/application/x-shockwave-flash
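The stock IndexFiles demo uses StandardAnalyzer; to produce the run above, the analyzer handed to IndexWriter has to be swapped for PaodingAnalyzer. A minimal sketch against the Lucene 2.3.x API (the class name and field values here are illustrative, not from the demo source):

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PaodingIndexDemo {
    public static void main(String[] args) throws Exception {
        // PaodingAnalyzer locates its dictionaries via PAODING_DIC_HOME (note 4 above)
        Analyzer analyzer = new PaodingAnalyzer();
        // Lucene 2.3.x constructor: IndexWriter(String path, Analyzer a, boolean create)
        IndexWriter writer = new IndexWriter("index", analyzer, true);

        Document doc = new Document();
        doc.add(new Field("contents", "庖丁解牛中文分词",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}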


Appendix 1: wrapper for the Paoding tokenizer

package org.apache.solr.analysis;

import java.io.Reader;
import java.util.Map;

import net.paoding.analysis.analyzer.PaodingTokenizer;
import net.paoding.analysis.analyzer.TokenCollector;
import net.paoding.analysis.analyzer.impl.MaxWordLengthTokenCollector;
import net.paoding.analysis.analyzer.impl.MostWordsTokenCollector;
import net.paoding.analysis.knife.PaodingMaker;

import org.apache.lucene.analysis.TokenStream;

/**
 * Wrapper around the Paoding tokenizer for Solr.
 * @author tufei
 */
public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    /** most-words: emit every plausible word (finer-grained). */
    public static final String MOST_WORDS_MODE = "most-words";

    /** max-word-length: prefer the longest match (coarser-grained). */
    public static final String MAX_WORD_LENGTH_MODE = "max-word-length";

    private String mode = null;

    public void setMode(String mode) {
        if (mode == null || MOST_WORDS_MODE.equalsIgnoreCase(mode)
                || "default".equalsIgnoreCase(mode)) {
            this.mode = MOST_WORDS_MODE;
        } else if (MAX_WORD_LENGTH_MODE.equalsIgnoreCase(mode)) {
            this.mode = MAX_WORD_LENGTH_MODE;
        } else {
            throw new IllegalArgumentException("Irregular Mode args setting: " + mode);
        }
    }

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        setMode(args.get("mode"));
    }

    public TokenStream create(Reader input) {
        return new PaodingTokenizer(input, PaodingMaker.make(),
                createTokenCollector());
    }

    private TokenCollector createTokenCollector() {
        if (MOST_WORDS_MODE.equals(mode))
            return new MostWordsTokenCollector();
        if (MAX_WORD_LENGTH_MODE.equals(mode))
            return new MaxWordLengthTokenCollector();
        throw new Error("never happened"); // setMode guarantees one of the two modes
    }
}
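To sanity-check the factory outside of Solr, something like the following can be used. It is written against the Lucene 2.3.x Token API (TokenStream.next() returning Token, Token.termText()); the test string is arbitrary:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class FactoryTest {
    public static void main(String[] args) throws Exception {
        ChineseTokenizerFactory factory = new ChineseTokenizerFactory();
        Map<String, String> initArgs = new HashMap<String, String>();
        initArgs.put("mode", "most-words");
        factory.init(initArgs);

        TokenStream stream = factory.create(new StringReader("中文分词测试"));
        // Lucene 2.3.x: next() returns the next Token, or null at end of stream
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText());
        }
    }
}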
