This uses the Lucene 3.0.2 core and highlighter packages. The main code is as follows:
Highlighter highlighter = new Highlighter(
        new SimpleHTMLFormatter("<font color=\"red\">", "</font>"),
        new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(50));
TermPositionVector termFreqVector =
        (TermPositionVector) reader.getTermFreqVector(id, fieldName);
TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);
String content = hitDoc.get(fieldName);
String result = highlighter.getBestFragments(tokenStream, content, 5, "...");
Testing revealed:

Query that highlights correctly: 复件 索引
Title: 复件 (12) 索引测试新建文档1.txt
Token dump: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Highlighted result: 复件 (12) 索引测试新建文档1.txt

Query that highlights incorrectly: 索引 文档
Example 1 title: 索引测试新建文档1.txt
Token dump: [(1,8,9), (1.txt,8,13), (文档,6,8), (新建,4,6), (测试,2,4), (索引,0,2), (txt,10,13)]
Highlighted result: 索引测试新建文档1.txt
Example 2 title: 复件 (12) 索引测试新建文档1.txt
Token dump: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Highlighted result: 复件 (12) 索引测试新建文档1.txt
Stepping through the highlighter source in the debugger turned up the following:
for (boolean next = tokenStream.incrementToken();
     next && (offsetAtt.startOffset() < maxDocCharsToAnalyze);
     next = tokenStream.incrementToken())
{
    if ((offsetAtt.endOffset() > text.length())
        || (offsetAtt.startOffset() > text.length()))
    {
        throw new InvalidTokenOffsetsException("Token " + termAtt.term()
            + " exceeds length of provided text sized " + text.length());
    }
    if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct()))
    {
        // the current token is distinct from previous tokens -
        // markup the cached token group info
        startOffset = tokenGroup.matchStartOffset;
        endOffset = tokenGroup.matchEndOffset;
        tokenText = text.substring(startOffset, endOffset);
        String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
        // store any whitespace etc from between this and last group
        if (startOffset > lastEndOffset)
            newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
        newText.append(markedUpText);
        lastEndOffset = Math.max(endOffset, lastEndOffset);
        tokenGroup.clear();
        // check if current token marks the start of a new fragment
        if (textFragmenter.isNewFragment())
        {
            currentFrag.setScore(fragmentScorer.getFragmentScore());
            // record stats for a new fragment
            currentFrag.textEndPos = newText.length();
            currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
            fragmentScorer.startFragment(currentFrag);
            docFrags.add(currentFrag);
        }
    }
    tokenGroup.addToken(fragmentScorer.getTokenScore());
    // if (lastEndOffset > maxDocBytesToAnalyze)
    // {
    //     break;
    // }
}
currentFrag.setScore(fragmentScorer.getFragmentScore());
if (tokenGroup.numTokens > 0)
{
    // flush the accumulated text (same code as in above loop)
    startOffset = tokenGroup.matchStartOffset;
    endOffset = tokenGroup.matchEndOffset;
    tokenText = text.substring(startOffset, endOffset);
    String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
    // store any whitespace etc from between this and last group
    if (startOffset > lastEndOffset)
        newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
    newText.append(markedUpText);
    lastEndOffset = Math.max(lastEndOffset, endOffset);
}
The cause: the highlighter groups tokens by offset. A token starts a new highlight group only when its start offset is at or past the current group's maximum end offset; when tokens arrive out of offset order, every later token with a smaller start offset is merged into the same group instead of being highlighted separately. At flush time, everything from the earliest matched term's start offset to the latest matched term's end offset then gets wrapped in a single highlight.
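This grouping rule can be reproduced with a small standalone simulation (my own sketch of the rule, not Lucene code; the offsets come from the token dumps above, and an int match flag stands in for a positive token score):

```java
import java.util.*;

public class GroupingSim {
    // Each token: {startOffset, endOffset, isQueryMatch (1 or 0)}.
    // Mimics the TokenGroup rule: a token is "distinct" only if its start
    // offset is at or past the group's current maximum end offset;
    // otherwise it merges into the group.
    // Returns the "start-end" match spans the highlighter would wrap.
    static List<String> highlightSpans(int[][] tokens) {
        List<String> spans = new ArrayList<String>();
        int groupEnd = 0, matchStart = -1, matchEnd = -1, numTokens = 0;
        for (int[] t : tokens) {
            if (numTokens > 0 && t[0] >= groupEnd) {       // distinct: flush group
                if (matchStart >= 0) spans.add(matchStart + "-" + matchEnd);
                matchStart = -1; matchEnd = -1; numTokens = 0;
            }
            groupEnd = (numTokens == 0) ? t[1] : Math.max(groupEnd, t[1]);
            numTokens++;
            if (t[2] == 1) {                               // query term: widen match span
                matchStart = (matchStart < 0) ? t[0] : Math.min(matchStart, t[0]);
                matchEnd = Math.max(matchEnd, t[1]);
            }
        }
        if (numTokens > 0 && matchStart >= 0) spans.add(matchStart + "-" + matchEnd);
        return spans;
    }

    public static void main(String[] args) {
        // "索引测试新建文档1.txt" with query terms 索引 and 文档, in term-vector order:
        int[][] unordered = {{8,9,0},{8,13,0},{6,8,1},{4,6,0},{2,4,0},{0,2,1},{10,13,0}};
        // The same tokens sorted by start offset:
        int[][] ordered   = {{0,2,1},{2,4,0},{4,6,0},{6,8,1},{8,9,0},{8,13,0},{10,13,0}};
        System.out.println(highlightSpans(unordered)); // prints [0-8]
        System.out.println(highlightSpans(ordered));   // prints [0-2, 6-8]
    }
}
```

With the unordered stream everything collapses into one group and the span 0-8 (索引测试新建文档) is highlighted as a block; with the sorted stream, 索引 (0-2) and 文档 (6-8) are highlighted separately, which is exactly the behavior difference observed above.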
Analysis:

Query that highlights correctly: 复件 索引
Token dump: [(复件,0,2), (12,4,6), (1,16,17), (1.txt,16,21), (文档,14,16), (新建,12,14), (测试,10,12), (索引,8,10), (txt,18,21)]
Title: 复件 (12) 索引测试新建文档1.txt
In this dump 复件 (0,2) is immediately followed by 12 (4,6), whose start offset is past the group's end offset, so 复件 is flushed and highlighted on its own; the remaining out-of-order tokens collapse into one group whose only matched term is 索引 (8,10), so that match span stays tight. In the failing query (索引 文档), both matched terms 索引 (0,2) and 文档 (6,8) fall inside the single merged group, so the whole range 0-8 gets highlighted as one block.
That left two options: patch the highlighter class, or sort the tokens by position before handing them to it. A look at the API turned up:
public static TokenStream getTokenStream(TermPositionVector tpv,
boolean tokenPositionsGuaranteedContiguous)
tokenPositionsGuaranteedContiguous: the caller promises the token positions are contiguous, i.e. guaranteed to run consecutively. (My English is terrible, so that took a moment to decode, haha.)
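My reading of what that flag enables (an assumption based on the name and the observed behavior, not on the Lucene source): when every token carries a position and the positions run 0, 1, 2, ..., the stream can be rebuilt in left-to-right order by slotting each term into an array indexed by its position, no matter what order the term vector enumerates terms in. A sketch:

```java
import java.util.*;

public class PositionReorder {
    // Rebuild left-to-right token order from contiguous positions.
    // terms[i] occurred at position positions[i]; positions cover 0..n-1.
    static String[] orderByPosition(String[] terms, int[] positions) {
        String[] ordered = new String[terms.length];
        for (int i = 0; i < terms.length; i++) {
            ordered[positions[i]] = terms[i];
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Terms in the order a term vector might enumerate them:
        String[] terms = {"文档", "新建", "测试", "索引"};
        int[] positions = {3, 2, 1, 0};
        System.out.println(Arrays.toString(orderByPosition(terms, positions)));
        // prints [索引, 测试, 新建, 文档]
    }
}
```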
Changing

TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector);

to

TokenStream tokenStream = TokenSources.getTokenStream(termFreqVector, true);

fixed it.
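For completeness, the other option considered above, sorting the tokens by offset before highlighting, would look roughly like this (a sketch with a hypothetical minimal token type, not Lucene's Token class):

```java
import java.util.*;

public class SortByOffset {
    // Hypothetical minimal token; the real Lucene Token carries more state.
    static final class Tok {
        final String term;
        final int start, end;
        Tok(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    // Sort by start offset, breaking ties by end offset, so the highlighter
    // sees tokens strictly left to right.
    static void sortByOffset(List<Tok> toks) {
        Collections.sort(toks, new Comparator<Tok>() {
            public int compare(Tok a, Tok b) {
                return (a.start != b.start) ? a.start - b.start : a.end - b.end;
            }
        });
    }

    public static void main(String[] args) {
        List<Tok> toks = new ArrayList<Tok>(Arrays.asList(
                new Tok("1", 8, 9), new Tok("文档", 6, 8),
                new Tok("索引", 0, 2), new Tok("1.txt", 8, 13)));
        sortByOffset(toks);
        StringBuilder order = new StringBuilder();
        for (Tok t : toks) order.append(t.term).append(' ');
        System.out.println(order.toString().trim());
        // prints 索引 文档 1 1.txt
    }
}
```

Passing true to getTokenStream is the simpler fix here, since it avoids touching the token stream by hand; the sort would only be needed if position information were unavailable.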