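The inner product of vectors A and B is then:
A·B = a1×b1 + a2×b2 + ... + an×bn
A·B = |A| × |B| × cosθ
|A| = (a1^2 + a2^2 + ... + an^2)^(1/2)
|B| = (b1^2 + b2^2 + ... + bn^2)^(1/2)
where |A| and |B| are the magnitudes of vectors A and B, and θ is the angle between them (θ ∈ [0, π]).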
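The two forms of the inner product can be combined to obtain cosθ directly from the components. The following is a minimal sketch of that calculation; the class and method names are illustrative and do not come from Lucene.

```java
// Minimal sketch: cosine of the angle between two vectors of equal dimension.
class CosineSimilarity {
    // A·B = a1*b1 + a2*b2 + ... + an*bn
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // |A| = (a1^2 + a2^2 + ... + an^2)^(1/2)
    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // cosθ = A·B / (|A| * |B|); undefined (NaN) if either vector is all zeros
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 0.0};
        double[] b = {2.0, 4.0, 0.0};
        // The vectors point in the same direction, so the result is close to 1.
        System.out.println(cosine(a, b));
    }
}
```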
Substituting the formula for w gives:
Vq*Vd = tf(t1, q)*idf(t1, q)*tf(t1, d)*idf(t1, d) + tf(t2, q)*idf(t2, q)*tf(t2, d)*idf(t2, d) + ... + tf(tn, q)*idf(tn, q)*tf(tn, d)*idf(tn, d)
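Three points need to be made here:
- Because this is a dot product, among t1, t2, ..., tn only the terms in the intersection of the query and the document contribute non-zero values; a term that occurs only in the query or only in the document contributes zero.
- Users rarely repeat a word within a query, so we can assume tf(t, q) is 1 for every term.
- idf depends on how many documents a term occurs in, and the query itself counts as one small document. idf(t, q) and idf(t, d) are therefore the same quantity, computed over the number of documents in the index plus one. When the index is large enough, the query document can be ignored, so we can assume idf(t, q) = idf(t, d) = idf(t).
Based on these three points, the dot product becomes: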
Vq*Vd = tf(t1, d) * idf(t1) * idf(t1) + tf(t2, d) * idf(t2) * idf(t2) + ... + tf(tn, d) * idf(tn) * idf(tn)
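The simplified sum can be sketched in a few lines. The class name, method name, and sample numbers below are hypothetical and serve only to show the shape of the calculation.

```java
// Minimal sketch of the simplified dot product:
// Vq*Vd = sum over query terms of tf(t, d) * idf(t) * idf(t).
class SimplifiedDotProduct {
    /**
     * tfInDoc[i] is tf(t_i, d) and idf[i] is idf(t_i) for the i-th query term.
     * A query term that does not occur in d has tf 0 and contributes nothing.
     */
    static double dot(double[] tfInDoc, double[] idf) {
        if (tfInDoc.length != idf.length) {
            throw new IllegalArgumentException("one tf and one idf value per query term");
        }
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            sum += tfInDoc[i] * idf[i] * idf[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] tfInDoc = {1.0, 0.0, 2.0}; // the second term is absent from d
        double[] idf = {2.0, 3.0, 1.0};     // made-up idf values
        // 1*2*2 + 0*3*3 + 2*1*1 = 6.0
        System.out.println(dot(tfInDoc, idf)); // prints 6.0
    }
}
```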
Now suppose t equals 1; then Vq * Vd =
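(2) Deriving the denominator
# Deriving |Vq|
Every tf in the query is 1, and idf ignores the query as a small document.
The above is the mathematical model; in practice Lucene alters it slightly:
This corresponds to queryNorm(q) in the Lucene formula.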
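Under the two assumptions just stated, the length of the query vector can be written out as below. The second line restates the queryNorm definition given later in this article; reading it as the reciprocal of |Vq| with the boosts folded into the weights is an interpretation, not something the text spells out.

```latex
\begin{aligned}
|V_q| &= \sqrt{\sum_{t \in q} \bigl(tf(t,q)\, idf(t,q)\bigr)^2}
       = \sqrt{\sum_{t \in q} idf(t)^2} \\
queryNorm(q) &= \frac{1}{\sqrt{sumOfSquaredWeights}}
\end{aligned}
```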
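# Deriving |Vd|
By default Lucene uses DefaultSimilarity, which ignores the individual term weights when computing the length of the document vector and treats every weight as one,
so the derivation is:
The above is the mathematical model; in practice Lucene alters it slightly: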
This corresponds to lengthNorm(t.field in d) in the Lucene formula.
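If every term weight is taken to be one, the length of the document vector depends only on the number of terms in the field. The sketch below writes this out; numTerms is the name used in the default lengthNorm implementation quoted later in this article, and equating lengthNorm with 1/|Vd| is an assumption made here for illustration.

```latex
\begin{aligned}
|V_d| &= \sqrt{\sum_{t \in d} 1^2} = \sqrt{numTerms} \\
lengthNorm(t.field\ in\ d) &= \frac{1}{\sqrt{numTerms}}
\end{aligned}
```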
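After merging:
The normalization factors are the same for every search term, so the derivation gives:
Because the number of terms t in q equals the number of terms t in d, merging gives:
After substitution this becomes:
Adding the various boost factors and coord then yields Lucene's scoring formula.
This completes the derivation.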
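5. Lucene's implementation of the scoring formula
With the derivation complete, we now look at how Lucene implements each factor of the formula.
(1) tf
Meaning: the frequency of the term within the field
Lucene default implementation:
(2) idf
Meaning: inverse document frequency
Lucene default implementation: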
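As a rough sketch of the tf and idf defaults, the code below uses the formulas commonly documented for DefaultSimilarity in Lucene releases of this era: the square root of the term frequency, and 1 + ln(numDocs / (docFreq + 1)). Treat both as assumptions about the version in use. The class here is a standalone illustration and does not call Lucene; its idf result can be compared with the idf(docFreq=3, maxDocs=3) line in the explain output later in this article.

```java
// Standalone sketch of the assumed DefaultSimilarity tf and idf formulas.
class DefaultFactors {
    // Assumed default: tf grows with the square root of the raw term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed default: idf = 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(tf(1.0f));  // a term that occurs once
        System.out.println(idf(3, 3)); // a term present in all 3 documents
    }
}
```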
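(3) lengthNorm(t.field in d)
Meaning: the normalization factor for the field of the document in which the search term occurs
Lucene default implementation: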
lengthNorm(field) = (1.0 / Math.sqrt(numTerms))
Lucene also quantizes this value. Roughly, it defines a table of factor values
and maps the computed value to a nearby entry in that table.
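The quantization step can be sketched as a one-byte float encoding with a 3-bit mantissa. The code below is modeled on what Lucene's SmallFloat utility does as far as I recall it, so treat the bit layout as an assumption rather than a copy of the library. The token counts in main (8 terms and 16 terms) are also assumptions about how the sample titles in the verification section are analyzed; the source does not state them.

```java
// Sketch of a one-byte norm encoding: 5 exponent bits, 3 mantissa bits.
class NormQuantizer {
    // Keeps the exponent and the top 3 mantissa bits; the lower bits are dropped,
    // so values are rounded down to the nearest representable norm.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Too small to represent: 0 for non-positive input, else the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return (byte) -1; // too large: saturate at the largest code
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Expands the byte back into a float with the same exponent and 3-bit mantissa.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float raw8 = (float) (1.0 / Math.sqrt(8));   // assumed 8-term field
        float raw16 = (float) (1.0 / Math.sqrt(16)); // assumed 16-term field
        System.out.println(raw8 + " -> " + decode(encode(raw8)));
        System.out.println(raw16 + " -> " + decode(encode(raw16)));
    }
}
```

Under these assumptions the 8-term value is stored as 0.3125 rather than about 0.3536, which is one way to read the fieldNorm lines in the explain output.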
(4) queryNorm(q)
Meaning: the query normalization factor
Lucene default implementation:
queryNorm(q) = 1.0 / Math.sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = sum over the terms t in q of (idf(t) * boost(t))^2
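A small sketch of this calculation follows. The class and method names are illustrative. The sample values in main assume a three-term query over a 3-document index in which two terms have docFreq=3 and the third term has docFreq=0, with idf computed as 1 + ln(numDocs / (docFreq + 1)); both the docFreq of the third term and the idf formula are assumptions, not statements from the source.

```java
// Sketch of queryNorm(q) = 1 / sqrt(sumOfSquaredWeights).
class QueryNormSketch {
    // Sums (idf * boost)^2 over all query terms.
    static double sumOfSquaredWeights(double[] idf, double[] boost) {
        if (idf.length != boost.length) {
            throw new IllegalArgumentException("one idf and one boost per query term");
        }
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sum += w * w;
        }
        return sum;
    }

    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double idfSeen = 1.0 + Math.log(3.0 / 4.0);   // docFreq=3, numDocs=3
        double idfUnseen = 1.0 + Math.log(3.0 / 1.0); // assumed docFreq=0
        double[] idf = {idfSeen, idfSeen, idfUnseen};
        double[] boost = {1.0, 1.0, 1.0};             // no boosts applied
        System.out.println(queryNorm(sumOfSquaredWeights(idf, boost)));
    }
}
```

With these inputs the result agrees with the queryNorm value in the explain output below to about six decimal places, which supports reading sumOfSquaredWeights as a sum over all query terms, including the one that matches no document.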
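(5) boost(t.field in d)
Meaning: the weight of the query term in the document
(6) coord(q, d)
Meaning: the proportion of query terms that match the document
Lucene default implementation:
coord(q, d) = (number of query terms matched in d) / (number of terms in q)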
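6. Verifying the scoring formula
(1) Preparing the data
# Index data
# Tokenization
StandardAnalyzer is used for tokenization.
# Query
依云t
Tokenization yields three terms: 依, 云, t
(2) Test method
Print the scoring details with Lucene's Explanation class.
(3) Results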
5169----依云矿泉水喷雾50ml
0.09081932 = (MATCH) product of:
  0.13622898 = (MATCH) sum of:
    0.06811449 = (MATCH) weight(title:依 in 0), product of:
      0.30599588 = queryWeight(title:依), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.22259936 = (MATCH) fieldWeight(title:依 in 0), product of:
        1.0 = tf(termFreq(title:依)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.3125 = fieldNorm(field=title, doc=0)
    0.06811449 = (MATCH) weight(title:云 in 0), product of:
      0.30599588 = queryWeight(title:云), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.22259936 = (MATCH) fieldWeight(title:云 in 0), product of:
        1.0 = tf(termFreq(title:云)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.3125 = fieldNorm(field=title, doc=0)
  0.6666667 = coord(2/3)
5170----依云矿泉水喷雾150ml
0.09081932 = (MATCH) product of:
  0.13622898 = (MATCH) sum of:
    0.06811449 = (MATCH) weight(title:依 in 1), product of:
      0.30599588 = queryWeight(title:依), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.22259936 = (MATCH) fieldWeight(title:依 in 1), product of:
        1.0 = tf(termFreq(title:依)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.3125 = fieldNorm(field=title, doc=1)
    0.06811449 = (MATCH) weight(title:云 in 1), product of:
      0.30599588 = queryWeight(title:云), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.22259936 = (MATCH) fieldWeight(title:云 in 1), product of:
        1.0 = tf(termFreq(title:云)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.3125 = fieldNorm(field=title, doc=1)
  0.6666667 = coord(2/3)
5171----依云矿泉水喷雾你你你你你你你你300ml
0.072655454 = (MATCH) product of:
  0.10898318 = (MATCH) sum of:
    0.05449159 = (MATCH) weight(title:依 in 2), product of:
      0.30599588 = queryWeight(title:依), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.17807949 = (MATCH) fieldWeight(title:依 in 2), product of:
        1.0 = tf(termFreq(title:依)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.25 = fieldNorm(field=title, doc=2)
    0.05449159 = (MATCH) weight(title:云 in 2), product of:
      0.30599588 = queryWeight(title:云), product of:
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.42957768 = queryNorm
      0.17807949 = (MATCH) fieldWeight(title:云 in 2), product of:
        1.0 = tf(termFreq(title:云)=1)
        0.71231794 = idf(docFreq=3, maxDocs=3)
        0.25 = fieldNorm(field=title, doc=2)
  0.6666667 = coord(2/3)
The output confirms the calculation described by the scoring formula.
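To make the check concrete, the sketch below recomputes the final scores from the leaf values printed in the explain output: each matched term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is multiplied by coord. The class and method names are illustrative; the numbers are copied from the output above, and boosts are taken to be 1 because none appear in it.

```java
// Recomputes the explain output from its leaf values.
class ExplainCheck {
    // weight = queryWeight * fieldWeight
    //        = (idf * queryNorm) * (tf * idf * fieldNorm)
    static double termWeight(double tf, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // score = coord * sum of the matched term weights
    static double score(double[] termWeights, int matchedTerms, int queryTerms) {
        double sum = 0.0;
        for (double w : termWeights) {
            sum += w;
        }
        return sum * ((double) matchedTerms / queryTerms);
    }

    public static void main(String[] args) {
        double idf = 0.71231794;
        double queryNorm = 0.42957768;

        // Documents 0 and 1: fieldNorm = 0.3125, two of three query terms match.
        double w0 = termWeight(1.0, idf, queryNorm, 0.3125);
        System.out.println(score(new double[]{w0, w0}, 2, 3));

        // Document 2: fieldNorm = 0.25 because the title is longer.
        double w2 = termWeight(1.0, idf, queryNorm, 0.25);
        System.out.println(score(new double[]{w2, w2}, 2, 3));
    }
}
```

The only input that differs between the documents is fieldNorm, so the longer title in document 2 is what lowers its score.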