To prioritize finishing my graduation project, I postponed out-of-vocabulary (new-word) recognition in the tokenizer. Since the tokenizer already meets practical needs, I instead wrote the corresponding segmentation interface to replace Lucene's Analyzer, making Lucene better suited to the Chinese-text requirements of the project.
1. Using Lucene's StandardAnalyzer as a reference, implement Lucene's Analyzer abstract class. The whole process is a typical template-method implementation: you only need to override the abstract method
protected TokenStreamComponents createComponents(String fieldName, Reader reader)
The implementation is:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

/**
 * Bridges the SkyLight Chinese tokenizer to Lucene.
 *
 * @author zel
 */
public class SkyLightAnalyzer extends Analyzer {

    private Version matchVersion;

    public SkyLightAnalyzer(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final SkyLightTokenizer tokenizer = new SkyLightTokenizer(matchVersion, reader);
        return new TokenStreamComponents(tokenizer);
    }
}
2. An Analyzer is an analyzer, not a tokenizer in the strict sense; many people conflate the two, either out of imprecision or because they don't know the actual meaning.
An analyzer consists of a tokenizer and filters, and effectively forms a tokenization pipeline, layer upon layer.
Lucene's StandardAnalyzer uses StandardTokenizer as its tokenizer, which extends the abstract Tokenizer class. Our tokenizer's Tokenizer is similar to it.
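The tokenizer-plus-filters pipeline idea can be sketched in plain Java without any Lucene dependency. This is a toy illustration only: the names `tokenize`, `lowerCaseFilter`, and `stopFilter` are made up for this sketch and are not Lucene's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of "analyzer = tokenizer + filters": a tokenizer produces
// the initial token stream, and each filter transforms it, layer by layer.
public class PipelineDemo {

    // The "tokenizer": turns raw text into a stream of tokens.
    static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.split("\\s+")));
    }

    // A "filter": transforms an existing token stream.
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    // Another "filter": drops stop words from the stream.
    static List<String> stopFilter(List<String> tokens, List<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stopWords.contains(t)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        // The "analyzer" is just the composed pipeline.
        List<String> tokens = stopFilter(
                lowerCaseFilter(tokenize("The Quick Brown Fox")),
                Arrays.asList("the"));
        System.out.println(tokens); // [quick, brown, fox]
    }
}
```

In Lucene itself the composition is done inside `createComponents`, where the returned TokenStreamComponents chains the tokenizer with any filters; the structure is the same as this sketch, just expressed through the TokenStream API.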
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

/**
 * The actual tokenizer. A "token" here corresponds to what we previously
 * called a TermUnit, i.e. one segmentation unit.
 *
 * @author zel
 */
public class SkyLightTokenizer extends Tokenizer {

    protected SkyLightTokenizer(Reader input) {
        super(input);
    }

    public SkyLightTokenizer(Version matchVersion, Reader input) {
        super(input);
        init(matchVersion);
    }

    private final void init(Version matchVersion) {
        if (matchVersion.onOrAfter(Version.LUCENE_40)) {
            this.scanner = new SkyLightTokenizerImpl(null);
        }
    }

    private int skippedPositions = 0;
    private SkyLightTokenizerInterface scanner;

    // this tokenizer generates three attributes:
    // term offset, positionIncrement and type
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        TermUnit termUnit = scanner.getNextToken();
        if (termUnit == null) {
            // End of input: no more tokens to fetch
            return false;
        }
        skippedPositions = this.scanner.getTokenLength();
        posIncrAtt.setPositionIncrement(1);
        scanner.setTermUnitValueToTextAttr(termAtt, termUnit);
        offsetAtt.setOffset(termUnit.getOffset(), termUnit.getOffset() + skippedPositions);
        return true;
    }

    @Override
    public final void end() throws IOException {
        super.end();
        // set final offset
        int finalOffset = correctOffset(scanner.getCurrentPos() + scanner.getTokenLength());
        offsetAtt.setOffset(finalOffset, finalOffset);
        // adjust any skipped tokens
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + scanner.getTokenLength());
    }

    @Override
    public void reset() throws IOException {
        scanner.reset(input);
        skippedPositions = 0;
    }
}
Lucene's original StandardTokenizer is somewhat more complex than the code above, but the only real difference lies in the segmentation method.
3. The code in step 2 follows Lucene's style: it does not perform the actual segmentation itself, but wraps the segmentation results into the corresponding objects. In Lucene, the real segmentation happens in StandardTokenizerImpl, and we do the same.
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Stream-style implementation class.
 *
 * @author zel
 */
public class SkyLightTokenizerImpl implements SkyLightTokenizerInterface {

    private Reader reader;
    private int offset;
    private int length;

    static com.zel.core.analyzer.StandardAnalyzer analyzer = null;
    private TermUnitStream termUnitStream = null;

    static {
        analyzer = new StandardAnalyzer();
    }

    public SkyLightTokenizerImpl(Reader in) {
        this.reader = in;
    }

    @Override
    public TermUnit getNextToken() throws IOException {
        TermUnit termUnit = termUnitStream.getNextTermUnit();
        if (termUnit == null) {
            this.offset = this.offset + this.length;
            this.length = 0;
            return null;
        } else {
            this.offset = termUnit.getOffset();
            this.length = termUnit.getLength();
        }
        return termUnit;
    }

    @Override
    public void setTermUnitValueToTextAttr(CharTermAttribute t, TermUnit termUnit) {
        t.copyBuffer(termUnit.getValue().toCharArray(), 0, termUnit.getLength());
    }

    @Override
    public void reset(Reader reader) {
        this.reader = reader;
        // After the stream is reset, segment the input immediately
        termUnitStream = analyzer.getSplitResult(reader);
    }

    @Override
    public int getCurrentPos() {
        return offset;
    }

    @Override
    public int getTokenLength() {
        return length;
    }
}
4. Test cases for the segmentation.
4.1 Indexing and search must both switch to the replacement analyzer; otherwise index-time and query-time results will not be consistent.
4.2 Import the tokenizer jar into the Lucene 4.5 distribution, then change only the analyzer-initialization code in IndexFiles.java and SearchFiles.java from the bundled demo package.
4.3 Add some Chinese text to the files to be indexed. After indexing, search for a word such as "成功": after QueryParser.parse, the query term comes out as "成功" rather than "成 功", which shows the segmentation worked.
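The consistency requirement in 4.1 can be illustrated without Lucene at all. This toy sketch (names `charTokenize`, `wordTokenize`, and `matches` are invented for illustration, not any Lucene API) shows why mismatched index-time and query-time tokenization makes a term like "成功" unfindable:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of analyzer consistency: if indexing splits text
// per character but the query is kept as a whole word, the query
// token never appears in the index, so the search fails.
public class ConsistencyDemo {

    // Splits into single characters, like indexing without a real segmenter.
    static List<String> charTokenize(String text) {
        return Arrays.asList(text.split(""));
    }

    // Treats the whole input as one token, like a dictionary segmenter would.
    static List<String> wordTokenize(String text) {
        return Arrays.asList(text);
    }

    // A match requires every query token to exist among the indexed tokens.
    static boolean matches(List<String> indexTokens, List<String> queryTokens) {
        return indexTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        List<String> indexed = charTokenize("成功");   // [成, 功]
        // Mismatched analyzers: the query token "成功" is absent from the index.
        System.out.println(matches(indexed, wordTokenize("成功"))); // false
        // The same analyzer on both sides: the query matches.
        System.out.println(matches(indexed, charTokenize("成功"))); // true
    }
}
```

This is exactly why both IndexFiles.java and SearchFiles.java must be switched to the same analyzer: the tokens written to the index and the tokens produced from the query string have to come from the same segmentation.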
I encourage anyone with similar needs to try this as well; it is quite interesting, and a good way to learn from excellent open-source software how to design interfaces, components, and freely replaceable modules.