Project2--A Brief Analysis of Lucene's Ranking Mechanism

wbia2010lkl

Category: Lucene. Tags: lucene, float, query, document, cache, byte

Article link: https://blog.csdn.net/wbia2010lkl/article/details/6033301

1. Principle

First, Lucene uses the vector space model (VSM) for retrieval.

Second, Lucene scores documents according to the following formula:

score(q,d) = coord(q,d) × queryNorm(q) × ∑ over t in q of ( tf(t in d) × idf(t)² × t.getBoost() × norm(t,d) )

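Here coord rewards documents that match more of the query: the more search terms a document contains, the higher its score.

queryNorm is a normalization factor computed from the sum of the squared weights of the query terms (not a variance). It is the same for every document under a given query, so it has no effect on the ranking.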
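To make the formula concrete, here is a minimal sketch that evaluates it for a toy query. It is not Lucene's actual code path: the class `ScoreFormulaSketch`, the holder `TermStat`, and all the numbers are hypothetical, a single `norm` value is assumed for the whole document, and the tf/idf definitions are the DefaultSimilarity ones quoted in section 2.

```java
import java.util.Arrays;
import java.util.List;

/** Toy walk-through of the scoring formula; a sketch, not Lucene's real code path. */
final class ScoreFormulaSketch {

    /** Hypothetical holder for one query term's statistics against one document. */
    static final class TermStat {
        final float freq;   // occurrences of the term in the document (0 if absent)
        final int docFreq;  // number of documents that contain the term
        final float boost;  // the term's boost, t.getBoost()

        TermStat(float freq, int docFreq, float boost) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.boost = boost;
        }
    }

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** score = coord * queryNorm * sum(tf * idf^2 * boost * norm) over matching terms. */
    static float score(List<TermStat> queryTerms, int numDocs, float norm) {
        float sumOfSquaredWeights = 0f;
        float sum = 0f;
        int matched = 0;
        for (TermStat t : queryTerms) {
            float idfValue = idf(t.docFreq, numDocs);
            float queryWeight = idfValue * t.boost;
            sumOfSquaredWeights += queryWeight * queryWeight;
            if (t.freq > 0) {
                matched++;
                sum += tf(t.freq) * idfValue * idfValue * t.boost * norm;
            }
        }
        // coord: fraction of the query terms that the document contains.
        float coord = matched / (float) queryTerms.size();
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Document A contains both query terms; document B contains only the first.
        List<TermStat> docA = Arrays.asList(new TermStat(2, 9, 1f), new TermStat(1, 99, 1f));
        List<TermStat> docB = Arrays.asList(new TermStat(2, 9, 1f), new TermStat(0, 99, 1f));
        System.out.println("A: " + score(docA, numDocs, 1f));
        System.out.println("B: " + score(docB, numDocs, 1f));
    }
}
```

Document A should score higher than B for two reasons: its sum has one more term, and its coord factor is 2/2 rather than 1/2.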

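2. How each part is computed

a. tf and idf

tf is how often a term occurs in a document (its term frequency). idf (inverse document frequency) is derived from the number of documents that contain the term: the fewer documents contain it, the larger its idf.

In the DefaultSimilarity class: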

    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

This returns the tf value.

    public float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

This yields the idf value.
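A few sample values help show how these two functions behave. The sketch below copies the two formulas into a standalone class (`TfIdfSketch` is a made-up name, and the document counts are invented for illustration), so it runs without Lucene on the classpath.

```java
/** Standalone copy of the two formulas above, for experimenting without Lucene. */
final class TfIdfSketch {

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // tf grows with the square root of the frequency: 4 occurrences give tf = 2.
        System.out.println("tf(1) = " + tf(1));
        System.out.println("tf(4) = " + tf(4));

        // A rare term (in 9 of 1000 documents) gets a larger idf than a common one.
        System.out.println("idf(9, 1000)   = " + idf(9, 1000));
        System.out.println("idf(99, 1000)  = " + idf(99, 1000));
        // When docFreq + 1 == numDocs the logarithm is 0 and idf is exactly 1.
        System.out.println("idf(999, 1000) = " + idf(999, 1000));
    }
}
```

The square root dampens repeated occurrences of a term, and the `+ 1` in the denominator avoids a division by zero for a term that occurs in no document.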

b. queryNorm

It is computed as follows:

    public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
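The claim that queryNorm does not change the ranking can be checked with a small sketch. The class name `QueryNormSketch` and the raw scores are hypothetical; only the `queryNorm` formula itself comes from the text above.

```java
/** Shows that queryNorm scales every document's score by the same factor. */
final class QueryNormSketch {

    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Hypothetical un-normalized scores of two documents for the same query.
        float rawA = 3.0f;
        float rawB = 1.5f;

        // For example, query term weights 3 and 4 give 3*3 + 4*4 = 25.
        float norm = queryNorm(25.0f);

        // Both scores shrink by the same factor, so A still ranks above B.
        System.out.println(rawA * norm > rawB * norm);

        // A sum of 0 gives 1/0 = Infinity, a degenerate case the caller has to handle.
        System.out.println(Float.isInfinite(queryNorm(0.0f)));
    }
}
```

Because the factor depends only on the query, it matters when comparing scores across different queries, not when ordering documents within one result list.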

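3. The scoring process (Score)

In the VSM, the angle between two vectors measures the similarity between the query and a document. In Lucene, the weight of each term in these vectors, which feeds the dot product, is represented by a Weight:

weight = tf × idf

Weight is an abstract class in Lucene. The abstract class Query has a method weight, which takes a Searcher as its parameter and returns a Weight: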

    public Weight weight(Searcher searcher) throws IOException {
        Query query = searcher.rewrite(this);
        Weight weight = query.createWeight(searcher);
        float sum = weight.sumOfSquaredWeights();
        float norm = getSimilarity(searcher).queryNorm(sum);
        // Fall back to 1.0 if the norm is degenerate (e.g. the sum is 0).
        if (Float.isInfinite(norm) || Float.isNaN(norm))
            norm = 1.0f;
        weight.normalize(norm);
        return weight;
    }

TermQuery, a subclass of Query, has an inner class TermWeight that extends Weight.

In this class, after a series of computations, the normalize method finally produces the value:

    public void normalize(float queryNorm) {
        this.queryNorm = queryNorm;
        queryWeight *= queryNorm;      // normalize query weight
        value = queryWeight * idf;     // idf for document
    }
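The "series of computations" can be sketched for a single term as follows. This is a simplified standalone class written from my reading of the Lucene 2.x/3.x sources, not the real TermWeight: the name `TermWeightSketch` and the helper `prepare` are hypothetical, and the assumption that `sumOfSquaredWeights()` sets `queryWeight = idf * boost` should be checked against the Lucene version in use.

```java
/** Simplified stand-in for TermQuery's TermWeight; a sketch under stated assumptions. */
final class TermWeightSketch {
    private final float idf;
    private final float boost;
    private float queryWeight;
    private float value;

    TermWeightSketch(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    /** Assumed behavior: the query weight is idf * boost, and its square is returned. */
    float sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    /** Same two steps as the normalize method shown above. */
    void normalize(float queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf;
    }

    float getValue() {
        return value;
    }

    /** Hypothetical helper that mirrors the steps of Query.weight(Searcher) for one term. */
    static TermWeightSketch prepare(float idf, float boost) {
        TermWeightSketch w = new TermWeightSketch(idf, boost);
        float sum = w.sumOfSquaredWeights();
        float norm = (float) (1.0 / Math.sqrt(sum));
        if (Float.isInfinite(norm) || Float.isNaN(norm)) {
            norm = 1.0f;
        }
        w.normalize(norm);
        return w;
    }

    public static void main(String[] args) {
        // value = idf * boost * queryNorm * idf, i.e. idf squared times boost times queryNorm.
        System.out.println(prepare(2.0f, 1.5f).getValue());
    }
}
```

Under these assumptions the resulting value is idf² × boost × queryNorm, which is exactly the part of the scoring formula that does not depend on the document. For a single-term query the boost cancels against queryNorm, so the value reduces to idf.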

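The final score is obtained in the TermScorer class.

Its constructor fills the scoreCache array, which holds precomputed scores for small term frequencies. The code is as follows: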

    TermScorer(Weight weight, TermDocs td, Similarity similarity, byte[] norms) {
        super(similarity);
        this.weight = weight;
        this.termDocs = td;
        this.norms = norms;
        this.weightValue = weight.getValue();

        // Precompute tf(i) * weightValue for small term frequencies.
        for (int i = 0; i < SCORE_CACHE_SIZE; i++)
            scoreCache[i] = getSimilarity().tf(i) * weightValue;
    }
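How such a cache might be used at scoring time is sketched below. This is an illustration, not the real TermScorer: the class `ScoreCacheSketch` is hypothetical, the cache size of 32 is an assumption, and the document norm is passed in as a plain float, whereas Lucene stores it as an encoded byte in the `norms` array.

```java
/** Sketch of a tf score cache; names and the cache size are assumptions. */
final class ScoreCacheSketch {
    static final int SCORE_CACHE_SIZE = 32; // assumed size, check your Lucene version

    private final float weightValue;
    private final float[] scoreCache = new float[SCORE_CACHE_SIZE];

    ScoreCacheSketch(float weightValue) {
        this.weightValue = weightValue;
        // Same precomputation as in the constructor above.
        for (int i = 0; i < SCORE_CACHE_SIZE; i++) {
            scoreCache[i] = tf(i) * weightValue;
        }
    }

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Looks up small frequencies in the cache and computes large ones directly. */
    float score(int freq, float norm) {
        float raw = freq < SCORE_CACHE_SIZE
                ? scoreCache[freq]
                : tf(freq) * weightValue;
        return raw * norm;
    }

    public static void main(String[] args) {
        ScoreCacheSketch scorer = new ScoreCacheSketch(2.0f);
        System.out.println(scorer.score(4, 1.0f));   // served from the cache
        System.out.println(scorer.score(100, 0.5f)); // computed on the fly
    }
}
```

Most terms occur only a handful of times in a document, so caching the first few values avoids a square root per scored document.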

4. Summary

This is roughly how Lucene's scoring mechanism works. For the details, see

http://topic.csdn.net/u/20100308/21/3386acef-d853-4738-9941-2a8b0ee157ca.html
