Java Solr tokenizers: writing a custom Solr tokenizer | 学步园

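Today I needed to index a field whose values are separated by commas. I could not find a ready-made comma tokenizer, so I looked at how the whitespace tokenizer factory, WhitespaceTokenizerFactory, is written in the source and wrote a comma tokenizer modeled on it:

package com.besttone.analyzer;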

import java.io.Reader;
import java.util.Map;

import org.apache.solr.analysis.BaseTokenizerFactory;

/** Factory that lets Solr create CommaTokenizer instances. */
public class CommaTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        // Checks that luceneMatchVersion has been set.
        assureMatchVersion();
    }

    public CommaTokenizer create(Reader input) {
        return new CommaTokenizer(luceneMatchVersion, input);
    }
}

package com.besttone.analyzer;

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;

/** A tokenizer that splits text at commas. */
public class CommaTokenizer extends CharTokenizer {

    /**
     * Construct a new CommaTokenizer.
     *
     * @param matchVersion
     *            Lucene version to match; see {@link Version}
     * @param in
     *            the input to split up into tokens
     */
    public CommaTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    /**
     * Construct a new CommaTokenizer using a given {@link AttributeSource}.
     *
     * @param matchVersion
     *            Lucene version to match; see {@link Version}
     * @param source
     *            the attribute source to use for this
     *            {@link org.apache.lucene.analysis.Tokenizer}
     * @param in
     *            the input to split up into tokens
     */
    public CommaTokenizer(Version matchVersion, AttributeSource source,
            Reader in) {
        super(matchVersion, source, in);
    }

    /**
     * Construct a new CommaTokenizer using a given
     * {@link org.apache.lucene.util.AttributeSource.AttributeFactory}.
     *
     * @param matchVersion
     *            Lucene version to match; see {@link Version}
     * @param factory
     *            the attribute factory to use for this
     *            {@link org.apache.lucene.analysis.Tokenizer}
     * @param in
     *            the input to split up into tokens
     */
    public CommaTokenizer(Version matchVersion, AttributeFactory factory,
            Reader in) {
        super(matchVersion, factory, in);
    }

    /**
     * Construct a new CommaTokenizer.
     *
     * @deprecated use {@link #CommaTokenizer(Version, Reader)} instead. This
     *             will be removed in Lucene 4.0.
     */
    @Deprecated
    public CommaTokenizer(Reader in) {
        super(in);
    }

    /**
     * Construct a new CommaTokenizer using a given {@link AttributeSource}.
     *
     * @deprecated use {@link #CommaTokenizer(Version, AttributeSource, Reader)}
     *             instead. This will be removed in Lucene 4.0.
     */
    @Deprecated
    public CommaTokenizer(AttributeSource source, Reader in) {
        super(source, in);
    }

    /**
     * Construct a new CommaTokenizer using a given
     * {@link org.apache.lucene.util.AttributeSource.AttributeFactory}.
     *
     * @deprecated use
     *             {@link #CommaTokenizer(Version, AttributeSource.AttributeFactory, Reader)}
     *             instead. This will be removed in Lucene 4.0.
     */
    @Deprecated
    public CommaTokenizer(AttributeFactory factory, Reader in) {
        super(factory, in);
    }

    /**
     * Collects every character except the comma, so the input is split at
     * each comma.
     */
    @Override
    protected boolean isTokenChar(int c) {
        // 44 is the code point of ','
        return c != 44;
    }
}

It is really quite simple: extend the generic character tokenizer CharTokenizer (a Lucene class that ships with Solr) and implement your own isTokenChar method:

protected boolean isTokenChar(int c) {
    // 44 is the code point of ','
    return c != 44;
}

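The method checks whether the character equals 44: if it does, it returns false, otherwise it returns true. Returning false tells the tokenizer to split at that character. 44 is the ASCII code of the comma; the code of a, for example, is 97. If you do not know the code of a character, you can look it up like this: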

char[] c = new char[]{'a', ',', 'b'};
int code = Character.codePointAt(c, 1); // 44

This returns the ASCII code of the character at index 1 of the char array, which is 44 here.
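As a small, hedged illustration of the lookup described above: the sketch below prints the codes of a couple of characters and shows that a char literal can be compared with the int argument of isTokenChar directly. The class name CodePointDemo and the helper codePointOf are made up for this example and are not part of the tokenizer.

```java
public class CodePointDemo {

    // Hypothetical helper: returns the code point of the first character of s.
    public static int codePointOf(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(codePointOf(",")); // prints 44
        System.out.println(codePointOf("a")); // prints 97

        // A char literal widens to int, so ',' can stand in for the number 44.
        int c = 44;
        System.out.println(c == ',');         // prints true
    }
}
```

Writing the comparison with the literal ',' instead of 44 behaves the same and saves looking the number up at all.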

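Then package the classes into a jar and put it under solr_home/lib. Another location works too, as long as you configure the lib path in solrconfig.xml or the sharedLib path in solr.xml. Either way, Solr has to load the jar at startup.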
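The configuration itself is not shown in this post, so the fragment below is only a sketch of what it might look like. The directory /path/to/custom-jars, the field type name text_comma and the field name tags are placeholders chosen for illustration; only the factory class name comes from the code above.

```xml
<!-- solrconfig.xml: only needed if the jar is NOT in solr_home/lib.
     The directory is a placeholder. -->
<lib dir="/path/to/custom-jars" />

<!-- schema.xml: a field type that uses the custom tokenizer factory.
     "text_comma" and "tags" are hypothetical names. -->
<fieldType name="text_comma" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.besttone.analyzer.CommaTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="tags" type="text_comma" indexed="true" stored="true"/>
```

Because the factory calls assureMatchVersion(), the Solr configuration also needs a luceneMatchVersion setting that matches your installation.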

After that, you can check how the field is tokenized on the analysis page of the Solr admin console.
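To get a rough idea of what the analysis page should show, the splitting rule can be simulated in plain Java without Lucene on the classpath. This is a simplified stand-in written for this post, not Lucene's CharTokenizer: it applies the same isTokenChar rule and skips empty tokens, but ignores details such as offsets and the maximum token length. The class name CommaSplitDemo and the sample input are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaSplitDemo {

    // Same rule as CommaTokenizer.isTokenChar: everything except ',' is part of a token.
    public static boolean isTokenChar(int c) {
        return c != ',';
    }

    // Collects runs of token characters; a comma ends the current run.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // Delimiter reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("java,solr,,custom tokenizer"));
        // prints [java, solr, custom tokenizer]
    }
}
```

Note that whitespace is not a delimiter here, so "custom tokenizer" stays a single token, and the two consecutive commas do not produce an empty token.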
