After reading the Lucene scoring documentation, it is worth walking through the call stack inside TFIDFSimilarity.
public abstract class TFIDFSimilarity extends Similarity {
public TFIDFSimilarity() {}
@Override
// overlap: the number of query terms matched in the document
// maxOverlap: the total number of terms in the query
public abstract float coord(int overlap, int maxOverlap);
@Override
public abstract float queryNorm(float sumOfSquaredWeights);
public abstract float tf(float freq);
public abstract float idf(long docFreq, long numDocs);
public abstract float lengthNorm(FieldInvertState state);
@Override
public final long computeNorm(FieldInvertState state) {
float normValue = lengthNorm(state);
return encodeNormValue(normValue);
}
public abstract float decodeNormValue(long norm);
public abstract long encodeNormValue(float f);
public abstract float sloppyFreq(int distance);
public abstract float scorePayload(int doc, int start, int end, BytesRef payload);
@Override
public final SimScorer simScorer(SimWeight stats, AtomicReaderContext context) throws IOException {
IDFStats idfstats = (IDFStats)stats;
return new TFIDFSimScorer(idfstats, context.reader().getNormValues(idfstats.field));
}
@Override
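public final SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats) {
// A single term gets its own idf; for several terms (e.g. a phrase) idfExplain combines them.
final Explanation idf = termStats.length == 1
? idfExplain(collectionStats, termStats[0])
: idfExplain(collectionStats, termStats);
return new IDFStats(collectionStats.field(), idf, queryBoost);
}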
}
These abstract methods are all fairly self-explanatory.
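There are two inner classes: TFIDFSimScorer (which extends SimScorer) and IDFStats (which extends SimWeight). IDFStats holds a field named value. That value is only available after IDFStats has exposed getValueForNormalization to the upper layer (for a BooleanQuery, that is the query's weight), the upper layer has used it to compute queryNorm, and normalize has then been called. The value is essentially the idf after normalization. At scoring time, TFIDFSimScorer.score(int doc, float freq) multiplies tf by value and then by the norm (or by 1.0 if the field has no norms). To stay close to the vector space model (VSM) the norm is still needed, but its drawback is that it is imprecise (precision is lost when it is encoded). Also, no separate term boost appears to be multiplied in when the normalized idf is obtained, so TFIDFSimilarity falls short if you want to weight different fields and different terms differently. Given how score works, the upper-level BooleanScorer only has to sum the scores of the individual term and phrase queries.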
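The flow described above can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions, not the actual TFIDFSimScorer or IDFStats code: the class name TfIdfScoreSketch, the sample idf, freq and norm numbers, and the choice tf = sqrt(freq) are all assumptions, and value is modeled simply as idf * queryNorm, following the description in the text (the real IDFStats may fold in further factors such as the query boost).

```java
// Hypothetical, simplified model of the scoring flow; not Lucene source code.
class TfIdfScoreSketch {

    // Assumption: tf is the square root of the within-document frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // queryNorm derived from the sum of squared weights collected from all clauses.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Per-term score: tf * value * norm, with norm treated as 1.0 when the field has no norms.
    static float score(float freq, float value, Float norm) {
        float raw = tf(freq) * value;
        return norm == null ? raw : raw * norm;
    }

    public static void main(String[] args) {
        float[] idfs = {2.0f, 1.0f};   // hypothetical idf of two query terms
        float[] freqs = {4.0f, 1.0f};  // hypothetical term frequencies in one document
        float norm = 0.5f;             // hypothetical decoded norm of the field

        // Step 1: each clause contributes its squared weight (the getValueForNormalization role).
        float sumOfSquaredWeights = 0f;
        for (float idf : idfs) {
            sumOfSquaredWeights += idf * idf;
        }

        // Step 2: the upper layer computes queryNorm and hands it back (the normalize role).
        float qn = queryNorm(sumOfSquaredWeights);

        // Step 3: the boolean scorer simply sums the per-term scores.
        float total = 0f;
        for (int i = 0; i < idfs.length; i++) {
            float value = idfs[i] * qn; // simplified "normalized idf"
            total += score(freqs[i], value, norm);
        }
        System.out.println("document score = " + total);
    }
}
```

Passing null as the norm models a field indexed without norms, in which case the score is just tf * value.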
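DefaultSimilarity is the concrete implementation of the abstract methods above. Note its discountOverlaps flag, which controls whether overlapping tokens (tokens with a position increment of 0) are left out of the term count. lengthNorm = state.getBoost() * 1.0 / Math.sqrt(numTerms), where state.getBoost() is presumably the field boost. As for sloppyFreq, it is 1 / (1 + distance), and scorePayload always returns 1.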
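The formulas mentioned above can be tried out in a small standalone sketch. It takes plain numbers in place of Lucene's FieldInvertState, so the class name DefaultSimilaritySketch and the parameters boost, length and numOverlap are illustrative assumptions that stand in for state.getBoost(), the field's token count and its overlap-token count.

```java
// Hypothetical standalone version of the formulas; not the Lucene class itself.
class DefaultSimilaritySketch {

    // lengthNorm = boost * 1 / sqrt(numTerms); overlap tokens are excluded when discountOverlaps is true.
    static float lengthNorm(float boost, int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // sloppyFreq = 1 / (1 + distance): closer phrase matches count for more.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // scorePayload is a constant 1, so payloads do not change the score by default.
    static float scorePayload() {
        return 1.0f;
    }

    public static void main(String[] args) {
        // A 5-token field where 1 token is an overlap (for example, an injected synonym).
        System.out.println(lengthNorm(1.0f, 5, 1, true));
        System.out.println(lengthNorm(1.0f, 5, 1, false));
        System.out.println(sloppyFreq(0));
        System.out.println(sloppyFreq(3));
    }
}
```

With discountOverlaps enabled, the field in the example is treated as having 4 terms instead of 5, so its length norm comes out larger.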