Lucene - Custom Sorting with CustomScoreQuery

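Some scenarios call for custom sorting (neither sorting on a single-valued field nor sorting by text relevance). Besides writing your own Collector or Weight, you can use CustomScoreQuery.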

Scenario: sort by the number of tags in the tag field (the more tags the field contains, the higher the score).
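
To make the target ordering concrete before bringing Lucene in, here is a minimal plain-Java sketch of the ranking the scenario asks for. The class and method names (TagCountRanking, countDistinctTags, rank) are hypothetical and are not part of Lucene; the sketch only models the expected result for the three sample tag values used in the test further down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: models the desired ranking only.
class TagCountRanking {

    // Counts the distinct whitespace-separated tags in a field value.
    static int countDistinctTags(String tagField) {
        Set<String> tags = new HashSet<>();
        for (String token : tagField.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tags.add(token);
            }
        }
        return tags.size();
    }

    // Returns a copy of the input sorted so the value with the most tags comes first.
    static List<String> rank(List<String> tagFields) {
        List<String> sorted = new ArrayList<>(tagFields);
        sorted.sort((a, b) -> Integer.compare(countDistinctTags(b), countDistinctTags(a)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("fox1 fox2 fox3 ", "fox1", "fox1 fox2 fox3 fox4");
        for (String tagField : rank(docs)) {
            System.out.println(countDistinctTags(tagField) + " -> " + tagField);
        }
    }
}
```

The Lucene code in the rest of the post aims at the same ordering, but computes the count inside the scoring phase of the search.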

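import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CustomScoreTest {
    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        IndexWriter writer = new IndexWriter(dir, conf);
        // Indexed, stored, and with term vectors (the custom scorer reads them)
        FieldType type1 = new FieldType();
        type1.setIndexed(true);
        type1.setStored(true);
        type1.setStoreTermVectors(true);
        // Document 1: three tags
        Document doc1 = new Document();
        Field field1 = new Field("f1", "fox", type1);
        doc1.add(field1);
        Field field2 = new Field("tag", "fox1 fox2 fox3 ", type1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Document 2: one tag (the Field instances are reused with new values)
        field1.setStringValue("fox");
        field2.setStringValue("fox1");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Document 3: four tags
        field1.setStringValue("fox");
        field2.setStringValue("fox1 fox2 fox3 fox4");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Make the documents visible to a new reader
        writer.commit();
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new MatchAllDocsQuery();
        CountingQuery customQuery = new CountingQuery(query);
        int n = 10;
        // Baseline run: search with the plain query, so hits come back in default order.
        // Pass customQuery instead of query to rank by tag count.
        TopDocs tds = searcher.search(query, n);
        ScoreDoc[] sds = tds.scoreDocs;
        for (ScoreDoc sd : sds) {
            System.out.println(searcher.doc(sd.doc));
        }
    }
}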

Test output:

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>

 

Custom scoring:

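import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    // Returns the provider that rescores the hits of one index segment
    @Override
    protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider(context, "tag");
    }
}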

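import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queries.CustomScoreProvider;

public class CountingQueryScoreProvider extends CustomScoreProvider {

    // Name of the field whose terms are counted
    private final String field;

    public CountingQueryScoreProvider(AtomicReaderContext context, String field) {
        super(context);
        this.field = field;
    }

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        IndexReader r = context.reader();
        // Term vector of this document's field; null if none was stored
        Terms tv = r.getTermVector(doc, field);
        int numTerms = 0;
        if (tv != null) {
            // The enumeration visits each distinct term once
            TermsEnum termsEnum = tv.iterator(null);
            while (termsEnum.next() != null) {
                numTerms++;
            }
        }
        // The new score is the number of distinct terms in the field
        return (float) numTerms;
    }

}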

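Usage (search with the wrapping query instead of the plain one):

CountingQuery customQuery = new CountingQuery(query);
TopDocs tds = searcher.search(customQuery, n);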

The test output is as follows:

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>

 

//-----------------------

weight/score/similarity

collector

 

Main reference

http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Snapshot:

One item stands out on that list as a little low-level but not quite as bad as building a custom Lucene query: CustomScoreQuery. When you implement your own Lucene query, you’re taking control of two things:

1. Matching – what documents should be included in the search results
2. Scoring – what score should be assigned to a document (and therefore what order should they appear in)

Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.
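
The split between matching and scoring can be illustrated with a toy model in plain Java. Nothing here is Lucene API: RescoreModel, Rescorer, and Hit are hypothetical names, and the fixed sub-query score of 1.0 is an assumption made to keep the sketch small. It only shows the idea that one component selects the documents while a separate hook decides their order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: these types are hypothetical and are not Lucene classes.
class RescoreModel {

    // Mirrors the shape of a rescoring hook: the doc plus the wrapped query's score.
    interface Rescorer {
        float customScore(String doc, float subQueryScore);
    }

    // A (document, score) pair.
    static final class Hit {
        final String doc;
        final float score;

        Hit(String doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Matching decides which docs appear; the rescorer alone decides their order.
    static List<Hit> search(List<String> docs, Predicate<String> matches, Rescorer rescorer) {
        List<Hit> hits = new ArrayList<>();
        for (String doc : docs) {
            if (matches.test(doc)) {
                float subQueryScore = 1.0f; // assumed stand-in for the wrapped query's score
                hits.add(new Hit(doc, rescorer.customScore(doc, subQueryScore)));
            }
        }
        hits.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("star-trek", "star-trek klingon borg", "star-wars jedi");
        // Match on the tag, then score by the number of tags.
        List<Hit> hits = search(docs,
                doc -> Arrays.asList(doc.split("\\s+")).contains("star-trek"),
                (doc, subQueryScore) -> doc.split("\\s+").length);
        for (Hit hit : hits) {
            System.out.println(hit.score + " " + hit.doc);
        }
    }
}
```

In this model the rescorer ignores subQueryScore entirely, which is the same choice the counting example in this post makes; a rescorer could just as well combine the two values.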

For example, let’s say you’re searching our favorite dataset – SciFi Stackexchange, a Q&A site dedicated to nerdy SciFi and Fantasy questions. The posts on the site are tagged by topic: “star-trek”, “star-wars”, etc. Let’s say for whatever reason we want to search for a tag and order the results by the number of tags, such that questions with the most tags are sorted to the top.

In this example, a simple TermQuery could be sufficient for matching. To identify the questions tagged Star Trek with Lucene, you’d simply run the following query:

Term termToSearch = new Term("tag", "star-trek");
TermQuery starTrekQ = new TermQuery(termToSearch);
// Fetch the top 10 hits (the count is arbitrary)
TopDocs hits = searcher.search(starTrekQ, 10);

If we examined the order of the results of this search, they’d come back in default TF-IDF order.

With CustomScoreQuery, we can intercept the matching query and assign a new score to it thus altering the order.

Step 1 Override CustomScoreQuery To Create Our Own Custom Scored Query Class:

(note this code can be found in this github repo)

public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    protected CustomScoreProvider getCustomScoreProvider(
            AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider("tag", context);
    }
}

Notice the “getCustomScoreProvider” method: this is where we’ll return an object that provides the magic we need. It takes an AtomicReaderContext, which is a wrapper on an IndexReader. If you recall, this hooks us into all the data structures available for scoring a document: Lucene’s inverted index, term vectors, etc.
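
One consequence worth spelling out: an AtomicReaderContext describes a single index segment, so the document ID handed to the provider is relative to that segment, and lookups should go through that segment's reader (as the code in this post does with context.reader()). The following plain-Java sketch models the relationship between segment-local and index-wide IDs. Segment and its members are hypothetical stand-ins, not Lucene classes.

```java
// Toy model only: Segment is a hypothetical stand-in for one index segment.
class SegmentDocIds {

    static final class Segment {
        final int docBase;   // index-wide ID of this segment's first document
        final String[] tags; // tag field value, indexed by segment-local doc ID

        Segment(int docBase, String[] tags) {
            this.docBase = docBase;
            this.tags = tags;
        }

        // Looks a document up by its segment-local ID, as a per-segment scorer would.
        String tagsOf(int localDoc) {
            return tags[localDoc];
        }

        // Converts a segment-local ID to an index-wide ID.
        int toGlobal(int localDoc) {
            return docBase + localDoc;
        }
    }

    public static void main(String[] args) {
        Segment first = new Segment(0, new String[] {"fox1 fox2 fox3", "fox1"});
        Segment second = new Segment(first.tags.length, new String[] {"fox1 fox2 fox3 fox4"});
        // Local doc 0 of the second segment is index-wide doc 2.
        System.out.println(second.toGlobal(0) + ": " + second.tagsOf(0));
    }
}
```

Using a segment-local ID against an index-wide reader would read the wrong document whenever the index has more than one segment, which is why the provider keeps hold of the context it was created with.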

Step 2 Create CustomScoreProvider

The real magic happens in CustomScoreProvider. This is where we’ll rescore the document. I’ll show you a boilerplate implementation before we dig in:

public class CountingQueryScoreProvider extends CustomScoreProvider {

    String _field;

    public CountingQueryScoreProvider(String field, AtomicReaderContext context) {
        super(context);
        _field = field;
    }

    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        return (float)(1.0f);
    }
}

This CustomScoreProvider rescores all documents by returning a 1.0 score for them, thus negating their default relevancy sort order.

Step 3 Implement Rescoring

With term vectors enabled for our field, we can simply loop through and count the terms in the field:

public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
    IndexReader r = context.reader();
    Terms tv = r.getTermVector(doc, _field);
    int numTerms = 0;
    // tv is null when the document has no term vector for the field
    if (tv != null) {
        TermsEnum termsEnum = tv.iterator(null);
        while (termsEnum.next() != null) {
            numTerms++;
        }
    }
    return (float) numTerms;
}
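
One subtlety of the counting loop in Step 3: a term vector lists each distinct term once, so a tag that appears twice in the field is counted a single time. The plain-Java sketch below contrasts the two possible counts. It is a toy model in which a sorted map of term to frequency stands in for a term vector; TermVectorCounts and its methods are hypothetical names, not Lucene API.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a sorted map of term -> frequency stands in for a term vector.
class TermVectorCounts {

    // Builds the term -> frequency map for a whitespace-tokenized field value.
    static Map<String, Integer> termVector(String fieldValue) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // What the counting loop measures: one increment per distinct term.
    static int distinctTerms(Map<String, Integer> tv) {
        return tv.size();
    }

    // Alternative measure: total token occurrences, summing each term's frequency.
    static int totalTokens(Map<String, Integer> tv) {
        int total = 0;
        for (int freq : tv.values()) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> tv = termVector("star-trek star-trek klingon");
        System.out.println("distinct = " + distinctTerms(tv) + ", total = " + totalTokens(tv));
    }
}
```

For a tag field, where each tag normally appears once per document, the two counts agree, so the distinction only matters if the same scoring idea is applied to free-text fields.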

 


And there you have it, we’ve overridden the score of another query! If you’d like to see a full example, see my “lucene-query-example” repository that has this as well as my custom Lucene query examples.

CustomScoreQuery Vs A Full-Blown Custom Query

Creating a CustomScoreQuery is a much easier thing to do than implementing a complete query. There are A LOT of ins-and-outs for implementing a full-blown Lucene query. So when creating a custom matching behavior isn’t important and you’re only rescoring another Lucene query, CustomScoreQuery is a clear winner. Considering how frequently Lucene based technologies are used for “fuzzy” analytics, I can see using CustomScoreQuery a lot when the regular tricks don’t pan out.
