Starting with version 4.0, Lucene provides multiple scoring models, including TF-IDF, BM25, and DFR. The default implementation is still the classic TF-IDF model. This post walks through the formulas involved in a Solr edismax query.
tf(float freq): term frequency. freq is the number of times the term occurs in the given field of the document. Default: Math.sqrt(freq).
idf(long docFreq, long numDocs): inverse document frequency. docFreq is the number of documents the term occurs in; numDocs is the total number of documents. Default: Math.log(numDocs/(double)(docFreq+1)) + 1.0.
queryNorm(float sumOfSquaredWeights): does not affect ranking, since every matching document is multiplied by the same factor; it only makes scores from different queries comparable. It does matter when comparing across different queries or different indexes (e.g. distributed search). Default: (float)(1.0 / Math.sqrt(sumOfSquaredWeights)).
lengthNorm(): computed at index time and written to the .nrm file. Default: 1.0 / Math.sqrt(numTerms), where numTerms is the total number of terms in the field; the value is stored in a single byte.
queryBoost: the boost set on the query.
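The defaults above correspond to Lucene's DefaultSimilarity. As a minimal standalone sketch (an illustrative restatement of the formulas, not Lucene code):

```java
// Illustrative sketch of DefaultSimilarity's formulas (not Lucene code).
public class TfIdfFormulas {

    // tf: square root of the in-field term frequency
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: log(numDocs / (docFreq + 1)) + 1
    public static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // queryNorm: 1 / sqrt(sumOfSquaredWeights); the same for every matched doc
    public static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // lengthNorm: fieldBoost / sqrt(numTerms), before the 1-byte encoding
    public static float lengthNorm(float fieldBoost, int numTerms) {
        return fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
    }
}
```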
Lucene lets you set boosts both at index time and dynamically at query time. At index time there are two options:
Field.setBoost: field-level boost
Document.setBoost: document-level boost. It was removed in Lucene 4; Solr 4 keeps the parameter but actually implements it via Field.setBoost. If both Field.setBoost and Document.setBoost are set, the field's final boost is Field.setBoost * Document.setBoost; see DocumentBuilder.toDocument in Solr for the details.
A boost set via Field.setBoost takes effect through lengthNorm():
public float lengthNorm(FieldInvertState state) {
  final int numTerms; // number of terms produced by analysis; state.getBoost() is the boost set at index time
  if (discountOverlaps) // whether to discount overlapping tokens (position increment 0)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}
Lucene stores the lengthNorm value in a single byte, keeping only 3 mantissa bits, so the smallest step is 0.125 and at most 256 distinct values can be represented. A boost of 20 and a boost of 23 encode to the same byte, so precision is lost.
See SmallFloat for how the norm encoding works.
/** Cache of decoded bytes. */
private static final float[] NORM_TABLE = new float[256];
static {
  for (int i = 0; i < 256; i++) {
    NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte)i);
  }
}
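The 3-mantissa-bit / 5-exponent-bit encoding can be sketched as follows (a re-implementation of SmallFloat's byte315 scheme for illustration; use Lucene's org.apache.lucene.util.SmallFloat in real code). It makes the precision loss concrete: 20 and 23 encode to the same byte, and both decode back to 20.0:

```java
// Sketch of SmallFloat's byte315 encoding: 3 mantissa bits, 5 exponent
// bits, exponent zero-point 15.
public class Byte315 {

    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3); // keep sign, exponent, top 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to the largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24; // restore the exponent zero-point
        return Float.intBitsToFloat(bits);
    }
}
```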
The main scoring flow in edismax:
search(leafContexts, createNormalizedWeight(wrapFilter(query, filter)), results)
  createNormalizedWeight()
  DisjunctionMaxWeight.scorer()
  DisjunctionMaxScorer.score() ---- returns the final score
createNormalizedWeight:
public Weight createNormalizedWeight(Query query) throws IOException {
  query = rewrite(query); // DisjunctionMaxQuery
  Weight weight = query.createWeight(this); // DisjunctionMaxWeight; computes idf and each sub-weight's queryWeight
  float v = weight.getValueForNormalization(); // returns the argument for queryNorm
  float norm = getSimilarity().queryNorm(v); // computes queryNorm
  if (Float.isInfinite(norm) || Float.isNaN(norm)) {
    norm = 1.0f;
  }
  weight.normalize(norm, 1.0f); // computes each sub-weight's weightValue
  return weight;
}
getValueForNormalization:
/**
 * Computes the argument for queryNorm(float sumOfSquaredWeights).
 *
 * @return the value of sumOfSquaredWeights
 */
public float getValueForNormalization() throws IOException {
  /**
   * sum: total over all sub-weights; max: the largest sub-weight value
   */
  float max = 0.0f, sum = 0.0f;
  for (Weight currentWeight : weights) {
    float sub = currentWeight.getValueForNormalization(); // sub = queryWeight * queryWeight
    sum += sub;
    max = Math.max(max, sub);
  }
  float boost = getBoost(); // 1.0f; no boost is set on the DisjunctionMaxQuery itself, only on its sub-queries
  return (((sum - max) * tieBreakerMultiplier * tieBreakerMultiplier) + max) * boost * boost; // the sumOfSquaredWeights formula
}
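Note that with the default tieBreakerMultiplier of 0, the formula collapses to max alone: only the largest sub-weight contributes to normalization, unlike a BooleanQuery, which sums all of them. A numeric sketch of the loop above, with made-up queryWeight values:

```java
// Numeric sketch of DisjunctionMaxWeight.getValueForNormalization with
// hypothetical sub-weight values (queryWeight = idf * queryBoost).
public class DisMaxNorm {

    public static float sumOfSquaredWeights(float[] queryWeights, float tie) {
        float max = 0f, sum = 0f;
        for (float qw : queryWeights) {
            float sub = qw * qw; // each sub-weight's getValueForNormalization()
            sum += sub;
            max = Math.max(max, sub);
        }
        // top-level boost of the DisjunctionMaxQuery is 1.0, so it is omitted
        return ((sum - max) * tie * tie) + max;
    }
}
```

With tie=0 only the maximum (9) survives; with tie=1 the result equals the plain sum (13), i.e. the BooleanQuery behavior.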
DisjunctionMaxWeight.scorer():
public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
    boolean topScorer, Bits acceptDocs) throws IOException {
  Scorer[] scorers = new Scorer[weights.size()];
  int idx = 0;
  for (Weight w : weights) {
    // we will advance() subscorers
    Scorer subScorer = w.scorer(context, true, false, acceptDocs); // each sub-weight's Scorer
    if (subScorer != null) {
      scorers[idx++] = subScorer; // collect into the scorer array
    }
  }
  if (idx == 0) return null; // all scorers did not have documents
  DisjunctionMaxScorer result = new DisjunctionMaxScorer(this, tieBreakerMultiplier, scorers, idx);
  return result; // return the DisjunctionMaxScorer
}
DisjunctionMaxScorer.score():
public float score() throws IOException {
  int doc = subScorers[0].docID(); // the current doc id
  scoreSum = scoreMax = subScorers[0].score(); // score of the first sub-scorer
  int size = numScorers; // number of sub-scorers to evaluate
  /**
   * Walk subScorers[] computing scoreSum and scoreMax;
   * scoreMax is the largest score among the sub-scorers.
   */
  scoreAll(1, size, doc);
  scoreAll(2, size, doc);
  /**
   * Return the final score.
   *
   * tieBreakerMultiplier: defaults to 0; set it in Solr via the tie parameter.
   * Suppose there are two fields, content and title, and the q terms match in both:
   *        content  title  scoreMax  scoreSum
   * doc1:  0.1      0.5    0.5       0.6
   * doc2:  0.5      0.3    0.5       0.8
   *
   * With the defaults doc1 and doc2 score the same, but in practice we would
   * rather see doc2 rank higher. Setting tieBreakerMultiplier to 0.1 achieves
   * that; the new scores become:
   * score(doc1) = 0.51
   * score(doc2) = 0.53
   */
  return scoreMax + (scoreSum - scoreMax) * tieBreakerMultiplier;
}
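The doc1/doc2 numbers from the comment can be verified directly with the final combination formula:

```java
// The DisMax tie-breaker combination: max plus tie-discounted remainder.
public class TieBreakerExample {

    public static float score(float scoreSum, float scoreMax, float tie) {
        return scoreMax + (scoreSum - scoreMax) * tie;
    }
}
```

With tie=0 both documents score 0.5; with tie=0.1, doc1 (sum 0.6) scores 0.51 and doc2 (sum 0.8) scores 0.53, so the document matching well in both fields wins.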
DisjunctionMaxScorer.scoreAll:
private void scoreAll(int root, int size, int doc) throws IOException {
  if (root < size && subScorers[root].docID() == doc) {
    float sub = subScorers[root].score(); // the tf-based sub-score; for equal freq the remaining factors are norms.get(doc) and weightValue
    scoreSum += sub;
    scoreMax = Math.max(scoreMax, sub); // scoreMax is the largest sub-scorer score
    scoreAll((root<<1)+1, size, doc);
    scoreAll((root<<1)+2, size, doc);
  }
}
The final TF-IDF score is computed in ExactTFIDFDocScorer.score:
@Override
public float score(int doc, int freq) {
  final float raw = tf(freq)*weightValue; // compute tf(f)*weight
  // decodeNormValue((byte)norms.get(doc)) is the lengthNorm value
  return norms == null ? raw : raw * decodeNormValue((byte)norms.get(doc)); // normalize for field
}
To summarize, the whole computation in pseudocode:
for (Weight currentWeight : weights) {
  queryWeight = idf * queryBoost // before normalize
  float sub = queryWeight * queryWeight
  sum += sub;
  max = Math.max(max, sub);
}
sumOfSquaredWeights = (((sum - max) * tieBreakerMultiplier * tieBreakerMultiplier) + max) * 1.0 * 1.0;
queryNorm = (float)(1.0 / Math.sqrt(sumOfSquaredWeights))
for (Scorer subScorer : scorers) {
  queryWeight = idf * queryBoost * queryNorm * 1.0 // after normalize
  weightValue = queryWeight * idf
  if (omitNorms) {
    subscore = tf(freq) * weightValue
  } else {
    subscore = tf(freq) * weightValue * lengthNorm
  }
  scoreSum += subscore;
  scoreMax = Math.max(scoreMax, subscore);
}
The final score is: scoreMax + (scoreSum - scoreMax) * tieBreakerMultiplier
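The pipeline above can be put together in one self-contained sketch for a single term matching several fields. All inputs (idf, boosts, freq, lengthNorm values) are hypothetical, and norms are applied as plain floats rather than through the 1-byte encoding:

```java
// End-to-end sketch of the edismax scoring pipeline for one term hitting
// n fields. Arrays are indexed per field; all values are illustrative.
public class EdismaxScoreSketch {

    public static float score(float[] idf, float[] queryBoost,
                              float[] freq, float[] lengthNorm, float tie) {
        int n = idf.length;
        // 1. queryWeight before normalization, and sumOfSquaredWeights
        float sum = 0f, max = 0f;
        float[] qw = new float[n];
        for (int i = 0; i < n; i++) {
            qw[i] = idf[i] * queryBoost[i];
            float sub = qw[i] * qw[i];
            sum += sub;
            max = Math.max(max, sub);
        }
        float sumOfSquaredWeights = ((sum - max) * tie * tie) + max;
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        // 2. per-field weightValue and sub-score
        float scoreSum = 0f, scoreMax = 0f;
        for (int i = 0; i < n; i++) {
            float weightValue = qw[i] * queryNorm * idf[i]; // queryWeight * idf
            float subScore = (float) Math.sqrt(freq[i]) * weightValue * lengthNorm[i];
            scoreSum += subScore;
            scoreMax = Math.max(scoreMax, subScore);
        }
        // 3. DisMax combination
        return scoreMax + (scoreSum - scoreMax) * tie;
    }
}
```

With all inputs set to 1 and tie=0, every factor cancels and the score is exactly 1.0, which makes the sketch easy to sanity-check against the formulas above.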