Four Ways to Change Index Scoring

zhyf918

Lucene indexing: (2)

(6) float coord(int overlap, int maxOverlap)

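A search may contain several query terms, and a document may contain more than one of them. This factor means that the more query terms a document contains, the higher it scores.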

public void TestCoord() throws Exception {
    MySimilarity sim = new MySimilarity();
    File indexDir = new File("TestCoord");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    // Document 0 contains both query terms once.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);
    // Document 1 contains only "common", three times.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    // Ten more documents contain "world", so that term appears in many documents.
    for (int i = 0; i < 10; i++) {
        Document doc3 = new Document();
        Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);
        doc3.add(f3);
        writer.addDocument(doc3);
    }
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common world");
    TopDocs docs = searcher.search(query, 2);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}

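// Extends DefaultSimilarity so that only coord needs to be overridden;
// Similarity itself declares further abstract methods.
class MySimilarity extends DefaultSimilarity {

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }

}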

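In the example above, when coord returns 1 and therefore has no effect, the first document (docid 0) contains both query terms, common and world. However, world appears in so many documents that it contributes little, while the second document (docid 1) contains common three times, so the second document scores higher: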

docid : 1 score : 1.9059997
docid : 0 score : 1.2936771

When coord takes effect, the first document scores higher because it contains both query terms:

class MySimilarity extends DefaultSimilarity {

    @Override
    public float coord(int overlap, int maxOverlap) {
        // Fraction of the query terms that the document matches.
        return overlap / (float) maxOverlap;
    }

}

docid : 0 score : 1.2936771
docid : 1 score : 0.95299983
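
The effect of coord on these two results can be checked with a little arithmetic. The sketch below is plain Java, not Lucene code: the class name `CoordSketch` and the method `applyCoord` are made up for illustration, and it assumes that coord simply multiplies the score summed over the matching terms.

```java
// Plain-Java sketch of the coord factor; not part of Lucene.
class CoordSketch {
    // Same formula as the coord override above: matched terms / query terms.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Assumption: coord scales the score summed over the matching terms.
    static float applyCoord(float summedScore, int overlap, int maxOverlap) {
        return summedScore * coord(overlap, maxOverlap);
    }

    public static void main(String[] args) {
        // docid 0 matches both "common" and "world": coord = 2/2.
        System.out.println("docid 0: " + applyCoord(1.2936771f, 2, 2));
        // docid 1 matches only "common": coord = 1/2.
        System.out.println("docid 1: " + applyCoord(1.9059997f, 1, 2));
    }
}
```

Feeding in the scores from the run where coord returned 1, docid 0 keeps its score and docid 1 is halved to roughly 0.953, which matches the reordering shown above.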

(7) float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length)

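Since Lucene introduced payloads, you can store your own information in the index and use it to influence Lucene's scoring.

Definition of payload

As we know, an index is stored as an inverted index: for each term, it keeps a list of the documents that contain the term. To speed up queries, this list is usually stored as a skip list.

Payload information is stored in the postings list together with the document number, and is mostly used for information tied to each document. This information could also be kept in a stored field, and functionally the two are largely equivalent. When there is a lot of information to store, however, putting it in the postings list and taking advantage of the skip list can greatly improve search speed.

The payload storage layout is shown in the figure below:
[Figure: payload storage layout in the postings list]

From this definition we can see that a payload can store information related not only to the document but also to the query term. For example, if a term is special in a particular document, a payload can be stored after that term's position information for that document, so that the document gets a higher score when that term is searched.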
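
To make the layout concrete, here is a toy in-memory model of a postings list whose positions carry payload bytes. It is only a sketch: `PostingsSketch`, `Posting`, and `Occurrence` are hypothetical names, it assumes documents are added in increasing docid order, and Lucene's real on-disk format (with skip lists and compression) is far more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of term -> postings -> positions with payloads; not Lucene's format.
class PostingsSketch {
    // One occurrence of a term in a document: its position plus a payload.
    static class Occurrence {
        final int position;
        final byte[] payload;
        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    // All occurrences of one term inside one document.
    static class Posting {
        final int docId;
        final List<Occurrence> occurrences = new ArrayList<Occurrence>();
        Posting(int docId) { this.docId = docId; }
    }

    private final Map<String, List<Posting>> index = new HashMap<String, List<Posting>>();

    // Record that term occurs in docId at position with the given payload.
    // Assumption: calls for a given term arrive in increasing docId order.
    void add(String term, int docId, int position, byte[] payload) {
        List<Posting> postings = index.get(term);
        if (postings == null) {
            postings = new ArrayList<Posting>();
            index.put(term, postings);
        }
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || last.docId != docId) {
            last = new Posting(docId);
            postings.add(last);
        }
        last.occurrences.add(new Occurrence(position, payload));
    }

    // Payload of the first occurrence of term in docId, or null if absent.
    byte[] firstPayload(String term, int docId) {
        List<Posting> postings = index.get(term);
        if (postings == null) return null;
        for (Posting p : postings) {
            if (p.docId == docId && !p.occurrences.isEmpty()) {
                return p.occurrences.get(0).payload;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PostingsSketch sketch = new PostingsSketch();
        sketch.add("hello", 0, 1, new byte[] {0}); // ordinary occurrence
        sketch.add("hello", 1, 1, new byte[] {1}); // specially marked occurrence
        System.out.println(sketch.firstPayload("hello", 1)[0]);
    }
}
```

The point of the model is that the payload hangs off a (term, document, position) triple, which is why it can express something about a term in a document rather than about the document as a whole.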

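To use payloads to influence a query, you need to do the following. The example below stores 1 in the payload for terms marked with <b></b>, and 0 otherwise:

First, implement your own Analyzer so that payload information is put into each Token: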

class BoldAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new BoldFilter(result);
        return result;
    }

}

class BoldFilter extends TokenFilter {
    public static int IS_NOT_BOLD = 0;
    public static int IS_BOLD = 1;

    private TermAttribute termAtt;
    private PayloadAttribute payloadAtt;

    protected BoldFilter(TokenStream input) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
        payloadAtt = addAttribute(PayloadAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {

            final char[] buffer = termAtt.termBuffer();
            final int length = termAtt.termLength();

            String tokenstring = new String(buffer, 0, length);
            if (tokenstring.startsWith("<b>") && tokenstring.endsWith("</b>")) {
                // Strip the markup from the term and flag it as bold in the payload.
                tokenstring = tokenstring.replace("<b>", "");
                tokenstring = tokenstring.replace("</b>", "");
                termAtt.setTermBuffer(tokenstring);
                payloadAtt.setPayload(new Payload(int2bytes(IS_BOLD)));
            } else {
                payloadAtt.setPayload(new Payload(int2bytes(IS_NOT_BOLD)));
            }
            return true;
        } else
            return false;
    }

    // Decodes the first four bytes of b as a big-endian int.
    public static int bytes2int(byte[] b) {
        int mask = 0xff;
        int temp = 0;
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res <<= 8;
            temp = b[i] & mask;
            res |= temp;
        }
        return res;
    }

    // Encodes num as four big-endian bytes.
    public static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }

}
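
scorePayload receives an offset and a length along with the payload array, so a decoder that honors them is a little safer than one that always reads from index 0. The following is a minimal sketch in plain Java: `PayloadCodecSketch` and its methods are illustrative names, and it assumes the same 4-byte big-endian encoding that int2bytes produces.

```java
// Sketch: 4-byte big-endian int codec that respects offset and length.
class PayloadCodecSketch {
    // Same encoding as int2bytes above.
    static byte[] encode(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }

    // Reads four bytes starting at offset instead of assuming index 0.
    static int decode(byte[] data, int offset, int length) {
        if (length < 4 || offset < 0 || offset + 4 > data.length) {
            throw new IllegalArgumentException("need 4 payload bytes at offset " + offset);
        }
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res = (res << 8) | (data[offset + i] & 0xff);
        }
        return res;
    }

    public static void main(String[] args) {
        // Place the encoded flag in the middle of a larger buffer.
        byte[] buffer = new byte[6];
        System.arraycopy(encode(1), 0, buffer, 2, 4);
        System.out.println(decode(buffer, 2, 4));
    }
}
```

With such a helper, a scorePayload override could pass its own offset and length straight through rather than relying on the payload starting at the beginning of the array.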

Then implement your own Similarity, which reads the information from the payload and scores accordingly.

class PayloadSimilarity extends DefaultSimilarity {

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
        } else {
            System.out.println("It is not a bold char.");
        }
        return 1;
    }
}

Finally, at query time you must use a PayloadXXXQuery (PayloadTermQuery here; in Lucene 2.4.1, BoostingTermQuery), otherwise scorePayload has no effect.

public void testPayloadScore() throws Exception {
    PayloadSimilarity sim = new PayloadSimilarity();
    File indexDir = new File("TestPayloadScore");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new BoldAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
    // Document 0: "hello" is plain text.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);
    // Document 1: "hello" is marked as bold.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common <b>hello</b> world", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    PayloadTermQuery query = new PayloadTermQuery(new Term("contents", "hello"), new MaxPayloadFunction());
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}

If scorePayload always returns 1, the result is as follows and the payload has no effect.

It is not a bold char.
It is a bold char.
docid : 0 score : 0.2101998
docid : 1 score : 0.2101998

If scorePayload is written as follows:

class PayloadSimilarity extends DefaultSimilarity {

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
            return 10;
        } else {
            System.out.println("It is not a bold char.");
            return 1;
        }
    }
}

then the result is as follows. Both documents contain hello, but the one in which it is bold gets the higher score:

It is not a bold char.
It is a bold char.
docid : 1 score : 2.101998
docid : 0 score : 0.2101998
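
The tenfold difference follows from how the payload factor is combined with the ordinary term score. The sketch below models this in plain Java under two assumptions: the query keeps the largest per-occurrence payload factor (the role MaxPayloadFunction plays in the test above), and that factor multiplies the base score. `PayloadScoreSketch` and its methods are hypothetical names, not Lucene classes.

```java
// Plain-Java model of payload-weighted scoring; not Lucene code.
class PayloadScoreSketch {
    static final int IS_NOT_BOLD = 0;
    static final int IS_BOLD = 1;

    // Mirrors the scorePayload override above: a bold occurrence counts ten times.
    static float payloadFactor(int flag) {
        return flag == IS_BOLD ? 10f : 1f;
    }

    // Assumption: keep the largest factor over all occurrences of the term
    // in the document, then multiply it into the ordinary term score.
    static float score(float baseScore, int[] flags) {
        float factor = 1f; // no payload seen: leave the score unchanged
        for (int i = 0; i < flags.length; i++) {
            float f = payloadFactor(flags[i]);
            factor = (i == 0) ? f : Math.max(factor, f);
        }
        return baseScore * factor;
    }

    public static void main(String[] args) {
        float base = 0.2101998f; // score both documents got when the factor was 1
        System.out.println("docid 0: " + score(base, new int[] {IS_NOT_BOLD}));
        System.out.println("docid 1: " + score(base, new int[] {IS_BOLD}));
    }
}
```

Under these assumptions the plain document keeps 0.2101998 and the bold one is scaled to about 2.10, in line with the output above.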

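Extending and implementing your own collector

The methods above cover every variable in Lucene's score formula. If that still does not meet your needs, you can extend and implement your own collector.

In Lucene 2.4, HitCollector has the method public abstract void collect(int doc, float score), which collects the search results.

TopDocCollector implements it as follows: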

public void collect(int doc, float score) {
    if (score > 0.0f) {
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}

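This method inserts the docid and score into a PriorityQueue so that the highest-scoring documents are returned first.

We can extend HitCollector and modify the score in this method before inserting it into the PriorityQueue, or insert it into a data structure of our own.

For example, suppose we store the mapping from docid to document creation time somewhere else, and we want documents created within the last day to score highest, those within the last week to come next, and those older than a month to score very low.

We can modify it like this: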

public static long millisecondsOneDay = 24L * 3600L * 1000L;
public static long millisecondsOneWeek = 7L * 24L * 3600L * 1000L;
public static long millisecondsOneMonth = 30L * 24L * 3600L * 1000L;

public void collect(int doc, float score) {
    if (score > 0.0f) {
        // getTimeByDocId is a helper you supply; the comparisons below
        // treat its result as the document's age in milliseconds.
        long time = getTimeByDocId(doc);

        // float literals are required: float * double does not fit in a float.
        if (time < millisecondsOneDay) {
            score = score * 1.0f;
        } else if (time < millisecondsOneWeek) {
            score = score * 0.8f;
        } else if (time < millisecondsOneMonth) {
            score = score * 0.3f;
        } else {
            score = score * 0.1f;
        }

        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
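
The same idea can be tried outside Lucene. Below is a self-contained sketch that keeps the top N hits in a `java.util.PriorityQueue` after scaling each score by document age. `TimeDecayCollectorSketch`, `Hit`, and the `ageMillis` argument are assumptions made for illustration; in the code above the age would come from getTimeByDocId.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java sketch of a top-N collector with age-based score decay.
class TimeDecayCollectorSketch {
    static final long ONE_DAY = 24L * 3600L * 1000L;
    static final long ONE_WEEK = 7L * ONE_DAY;
    static final long ONE_MONTH = 30L * ONE_DAY;

    static class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
    };

    private final int topN; // must be at least 1
    private final PriorityQueue<Hit> queue; // min-heap: weakest kept hit on top
    int totalHits = 0;

    TimeDecayCollectorSketch(int topN) {
        this.topN = topN;
        this.queue = new PriorityQueue<Hit>(topN, BY_SCORE);
    }

    // Same tiers as the article: a day, a week, a month, older.
    static float decay(long ageMillis) {
        if (ageMillis < ONE_DAY) return 1.0f;
        if (ageMillis < ONE_WEEK) return 0.8f;
        if (ageMillis < ONE_MONTH) return 0.3f;
        return 0.1f;
    }

    void collect(int doc, float score, long ageMillis) {
        if (score <= 0.0f) return;
        totalHits++;
        float adjusted = score * decay(ageMillis);
        if (queue.size() < topN) {
            queue.add(new Hit(doc, adjusted));
        } else if (adjusted > queue.peek().score) {
            queue.poll(); // evict the weakest hit
            queue.add(new Hit(doc, adjusted));
        }
    }

    // Collected hits, best first.
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<Hit>(queue);
        Collections.sort(hits, Collections.reverseOrder(BY_SCORE));
        return hits;
    }

    public static void main(String[] args) {
        TimeDecayCollectorSketch collector = new TimeDecayCollectorSketch(2);
        collector.collect(0, 1.0f, 40L * ONE_DAY); // old document, strong raw score
        collector.collect(1, 0.5f, 1000L);         // fresh document
        collector.collect(2, 0.9f, 3L * ONE_DAY);  // a few days old
        for (Hit hit : collector.topHits()) {
            System.out.println("docid : " + hit.doc + " score : " + hit.score);
        }
    }
}
```

In this run the month-old document drops out of the top two even though its raw score is the highest, which is the behavior the modified collect method is after.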

In Lucene 3.0, the Collector method is void collect(int doc), and TopScoreDocCollector implements it as follows:

public void collect(int doc) throws IOException {
    float score = scorer.score();
    totalHits++;
    if (score <= pqTop.score) {
        return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pqTop = pq.updateTop();
}

Its scoring can be influenced in the same way.
