The Lucene scoring formula is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
- tf(t in d) = frequency^½, i.e. the square root of t's frequency in document d
- idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents and docFreq the number of documents containing t
- coord(q,d) is a scoring factor: the more query terms appear in a document, the better that document matches. For the query "A B C", a document containing all three of A, B and C matches 3 terms while a document containing only A and B matches 2, so the former scores higher (Lucene's default coord is matched terms / total query terms). coord can be disabled on the query.
- queryNorm(q) is a normalization factor for the query, making scores comparable across different queries
- t.getBoost() and norm(t,d) are both programmable hooks for adjusting the weight of a field, a document, or a query term
- queryNorm(q) = 1 / √(sumOfSquaredWeights), where
  sumOfSquaredWeights = q.getBoost()² · ∑ ( idf(t) · t.getBoost() )²
- norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost()
- lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more terms a field contains (i.e. the longer the document), the smaller this value; the shorter the document, the larger it is
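These factors are easy to check numerically. Below is a minimal Python sketch of the classic DefaultSimilarity formulas (the function names are mine, and Lucene's use of the natural logarithm is assumed):

```python
import math

def tf(freq):
    # tf(t in d) = sqrt(term frequency in the document)
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # idf(t) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # lengthNorm(field) = 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

def query_norm(sum_of_squared_weights):
    # queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum_of_squared_weights)

# The statistics from the debug output further down: docFreq=3, maxDocs=9
print(idf(3, 9))   # ≈ 1.8109303, matching idf(docFreq=3, maxDocs=9)
print(tf(3))       # ≈ 1.7320508, matching tf(freq=3.0)
```

Note that Lucene stores the norm in a single byte per field, so the fieldNorm seen in debug output (e.g. 0.21875) is a lossily quantized version of lengthNorm times the boosts, not the raw value.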
All these programmable hooks are rather cumbersome and can simply be left unused, so Lucene's scoring formula can be simplified to:
score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
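The simplified formula can be sketched as follows; `simplified_score` and the sample statistics are hypothetical, and coord is taken as matched terms / total query terms, its default definition in Lucene:

```python
import math

def simplified_score(query_terms, doc_terms, num_docs, doc_freqs):
    """score(q,d) = coord(q,d) * sum over matched terms of tf(t in d) * idf(t)^2."""
    matched = [t for t in set(query_terms) if t in doc_terms]
    coord = len(matched) / len(set(query_terms))              # coord(q,d)
    score = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))                    # tf(t in d)
        idf = 1.0 + math.log(num_docs / (doc_freqs[t] + 1))   # idf(t)
        score += tf * idf ** 2                                # idf is squared
    return coord * score

# Query "A B C" against two hypothetical documents in a 10-document index
doc_freqs = {"A": 4, "B": 4, "C": 4}
d1 = ["A", "B", "C"]   # matches all three terms -> coord = 3/3
d2 = ["A", "B"]        # matches two terms       -> coord = 2/3
q = ["A", "B", "C"]
assert simplified_score(q, d1, 10, doc_freqs) > simplified_score(q, d2, 10, doc_freqs)
```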
"id": "30",
"sName": "万科 绿地 万科 徐 徐 徐 万科 绿地",
"text_search": [ "万科 绿地 万科 徐 徐 徐 万科 绿地", "万科 绿地 万科 徐 徐 徐 万科 绿地","万科 绿地 万科 徐 徐 徐 万科 绿地","万科海洋公园","万科海洋公园", "万科海洋绿地公园" ],
"sAlias": "万科海洋公园",
"sAddress": "万科海洋绿地公园",
"_version_": 1554657424083779600 },
"debug": {
"rawquerystring": "sName:徐",
"querystring": "sName:徐",
"parsedquery": "(+sName:徐 FunctionQuery(product(norm(sName),tf(sName,徐),idf(sName,徐))))/no_coord",
"parsedquery_toString": "+sName:徐 product(norm(sName),tf(sName,徐),idf(sName,徐))",
"explain": {
"28": "\n1.2036215 = (MATCH) sum of:\n 0.77542824 = (MATCH) weight(sName:徐 in 0) [DefaultSimilarity], result of:\n 0.77542824 = score(doc=0,freq=5.0), product of:\n 0.87540054 = queryWeight, product of:\n 1.8109303 = idf(docFreq=3, maxDocs=9)\n 0.48339826 = queryNorm\n 0.8857982 = fieldWeight in 0, product of:\n 2.236068 = tf(freq=5.0), with freq of:\n 5.0 = termFreq=5.0\n 1.8109303 = idf(docFreq=3, maxDocs=9)\n 0.21875 = fieldNorm(doc=0)\n 0.4281933 = (MATCH) FunctionQuery(product(norm(sName),tf(sName,徐),idf(sName,徐))), product of:\n 0.8857982 = product(norm(sName)=0.21875,tf(sName,徐)=2.236068,idf(sName,徐)=1.8109302520751953)\n 1.0 = boost\n 0.48339826 = queryNorm\n",
"29": "\n0.761237 = (MATCH) sum of:\n 0.49042383 = (MATCH) weight(sName:徐 in 0) [DefaultSimilarity], result of:\n 0.49042383 = score(doc=0,freq=2.0), product of:\n 0.87540054 = queryWeight, product of:\n 1.8109303 = idf(docFreq=3, maxDocs=9)\n 0.48339826 = queryNorm\n 0.56022793 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 1.8109303 = idf(docFreq=3, maxDocs=9)\n 0.21875 = fieldNorm(doc=0)\n 0.27081323 = (MATCH) FunctionQuery(product(norm(sName),tf(sName,徐),idf(sName,徐))), product of:\n 0.560228 = product(norm(sName)=0.21875,tf(sName,徐)=1.4142135,idf(sName,徐)=1.8109302520751953)\n 1.0 = boost\n 0.48339826 = queryNorm\n",
"30": "\n
0.9323212 = (MATCH) sum of:\n
  0.6006441 = (MATCH) weight(sName:徐 in 0) [DefaultSimilarity], result of:\n
    // 1. Score of the term query clause (coord does not appear: the parsed query is marked /no_coord)
    0.6006441 = score(doc=0,freq=3.0), product of:\n
      // 1.1. The following is the queryWeight: idf(t) · queryNorm(q)
      0.87540054 = queryWeight, product of:\n
        1.8109303 = idf(docFreq=3, maxDocs=9)\n
        0.48339826 = queryNorm\n
      // 1.2. The following is the fieldWeight: tf(t in d) · idf(t) · norm(t,d)
      0.6861363 = fieldWeight in 0, product of:\n
        1.7320508 = tf(freq=3.0), with freq of:\n
          3.0 = termFreq=3.0\n
        1.8109303 = idf(docFreq=3, maxDocs=9)\n
        0.21875 = fieldNorm(doc=0)\n
  // 2. Score of the bf function query: ( norm(t,d) · tf(t in d) · idf(t) ) · boost · queryNorm
  0.3316771 = (MATCH) FunctionQuery(product(norm(sName),tf(sName,徐),idf(sName,徐))), product of:\n
    0.6861363 = product(
      norm(sName)=0.21875,
      tf(sName,徐)=1.7320508,
      idf(sName,徐)=1.8109302520751953)\n
    1.0 = boost\n
    0.48339826 = queryNorm\n"
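As a cross-check, document 30's final score can be rebuilt in a few lines from the values shown in the explain output (a sketch; the queryNorm here assumes sumOfSquaredWeights = idf² + 1², one squared weight from the term clause and one from the bf clause's boost of 1.0):

```python
import math

# Values taken directly from the debug explain for document 30
idf = 1.8109303          # idf(docFreq=3, maxDocs=9) = 1 + ln(9/4)
tf = math.sqrt(3.0)      # tf(freq=3.0)
field_norm = 0.21875     # fieldNorm(doc=0), the byte-encoded norm stored in the index

# sumOfSquaredWeights: (idf * boost)^2 from the term clause + boost^2 from the bf clause
query_norm = 1.0 / math.sqrt(idf ** 2 + 1.0 ** 2)   # ≈ 0.48339826

field_weight = tf * idf * field_norm     # ≈ 0.6861363
query_weight = idf * query_norm          # ≈ 0.87540054

term_score = query_weight * field_weight           # ≈ 0.6006441
bf_score = field_weight * 1.0 * query_norm         # ≈ 0.3316771

total = term_score + bf_score
print(total)   # ≈ 0.9323212, matching the explain output
```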