Today someone mentioned that with paoding 2.0.4 (possibly the final release) under Solr 1.4, highlighting does not work correctly. This problem is neither hard nor trivial; when I set this up myself, it also took some testing before I found the right parameters. Since the question has come up, I'll write the steps down briefly here so I don't forget them later.
Enough chatter, let's get to work.
Paoding development had already stopped before Solr 1.4 was released, so the Paoding tokenizer was never designed to integrate with Solr 1.4. Fortunately, someone has already paved a fairly smooth road for us: Bory.Chan wrapped up a factory class that makes it reasonably straightforward to get Solr 1.4 and paoding to cooperate. I won't paste the factory's source code here; out of respect for the author, here is the link instead: http://blog.chenlb.com/2009/12/solr-1-4-with-paoding.html
Download the factory class and compile it.
Then edit solr/conf/schema.xml and find this section:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
Change it to the following:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.chenlb.solr.paoding.PaodingTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.chenlb.solr.paoding.PaodingTokenizerFactory" mode="max-word-length"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
Note that in Bory.Chan's example, mode="max-word-length" is set on the "index" analyzer and left unset on the "query" analyzer. In my own testing, I found that swapping the two (no mode at index time, max-word-length at query time) is what produces good highlighting results.
One more thing: this factory class only distinguishes between "max-word-length" and anything that is not "max-word-length", and "max-word-length" is an easy string to misspell. So I reworked PaodingTokenizerFactory to take a boolean parameter instead.
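As a quick illustration of why a boolean is more forgiving, here is a minimal self-contained sketch of the kind of argument handling a Solr factory can do; the class name BooleanArgDemo and the attribute name isMaxWordLength are my own choices for this demo:

```java
import java.util.HashMap;
import java.util.Map;

public class BooleanArgDemo {

    // schema.xml attributes reach a Solr factory as a Map<String, String>.
    // Boolean.parseBoolean returns true only for the (case-insensitive)
    // string "true"; null or any misspelling yields false, so a bad value
    // quietly falls back to the default most-words mode instead of being
    // compared against an easily mistyped literal like "max-word-length".
    static boolean readMaxWordLength(Map<String, String> args) {
        return Boolean.parseBoolean(args.get("isMaxWordLength"));
    }

    public static void main(String[] argv) {
        Map<String, String> args = new HashMap<String, String>();
        args.put("isMaxWordLength", "true");
        System.out.println(readMaxWordLength(args)); // prints "true"

        args.put("isMaxWordLength", "max-word-lenght"); // a typo
        System.out.println(readMaxWordLength(args)); // prints "false"
    }
}
```

With this style of parsing, a typo degrades to the default mode rather than silently selecting the wrong one.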
import java.io.Reader;
import java.util.Map;

import net.paoding.analysis.analyzer.TokenCollector;
import net.paoding.analysis.analyzer.impl.MaxWordLengthTokenCollector;
import net.paoding.analysis.analyzer.impl.MostWordsTokenCollector;
import net.paoding.analysis.knife.PaodingMaker;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

import cn.com.pansky.otp3.lucene.analyzer.paoding.SolrPaodingTokenizer;

/**
 * Tokenizer factory based on the Paoding analyzer.
 * @author http://blog.chenlb.com/
 * @Modified by BrokenStone 2010-08-03
 */
public class PaodingTokenizerFactory extends BaseTokenizerFactory {

    public static final String MOST_WORDS_MODE = "most-words";           // most-words splitting, the default mode
    public static final String MAX_WORD_LENGTH_MODE = "max-word-length"; // max-word-length splitting

    // Whether to tokenize in "max-word-length" mode; defaults to "most-words" mode.
    private boolean isMaxWordLength = false;

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        // Read the boolean attribute from schema.xml ("isMaxWordLength" is the
        // attribute name I use here). Boolean.parseBoolean returns false for
        // null or anything other than "true", so a missing or misspelled
        // value falls back to the most-words default.
        isMaxWordLength = Boolean.parseBoolean(args.get("isMaxWordLength"));
    }

    public boolean isMaxWordLength() {
        return isMaxWordLength;
    }

    public void setMaxWordLength(boolean isMaxWordLength) {
        this.isMaxWordLength = isMaxWordLength;
    }

    public Tokenizer create(Reader input) {
        return new SolrPaodingTokenizer(input, PaodingMaker.make(), createTokenCollector());
    }

    private TokenCollector createTokenCollector() {
        if (isMaxWordLength) {
            return new MaxWordLengthTokenCollector();
        }
        return new MostWordsTokenCollector();
    }
}
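With the boolean variant, the tokenizer entries in schema.xml might look like the sketch below. This assumes the factory reads an attribute named isMaxWordLength in its init(Map) callback, and "your.package" stands in for wherever you put the modified class; adjust both to match your own build:

```xml
<analyzer type="index">
  <!-- no attribute: falls back to the default most-words mode -->
  <tokenizer class="your.package.PaodingTokenizerFactory"/>
</analyzer>
<analyzer type="query">
  <!-- a boolean flag instead of the error-prone mode string -->
  <tokenizer class="your.package.PaodingTokenizerFactory" isMaxWordLength="true"/>
</analyzer>
```

This keeps the index/query mode split from my test results above while removing the chance of silently mistyping "max-word-length".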