# [Translation] Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch

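The Practical BM25 series comes from the official Elastic blog. Its three parts explain how BM25, the default similarity algorithm in Elasticsearch, works. This post is a translation of Part 3; the original is titled "Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch".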

## Picking b and k1

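Part of what makes Elasticsearch so powerful is that you can build a very robust search experience on top of these fundamentals. Suppose you have already done everything else you can and simply want to squeeze out the last bit of relevance: how should you pick b and k1?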

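b must be between 0 and 1. Some experiments test values of b in increments of about 0.1, and most of them conclude that the best results come from b in the range 0.3-0.9 (Lipani, Lupu, Hanbury, Aizawa (2015); Taylor, Zaragoza, Craswell, Robertson, Burges (2006); Trotman, Puurula, Burgess (2014); etc.).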

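k1 usually falls in the range 0 to 3, although nothing stops you from setting it higher. Some experiments test values of k1 in increments of 0.1 to 0.2 and conclude that the best results come from k1 in the range 0.5-2.0.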
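
To make those ranges concrete, here is a minimal Python sketch that enumerates a candidate grid (b from 0.3 to 0.9, k1 from 0.5 to 2.0, both in steps of 0.1) and builds an index-settings body for each pair. The helper names and the similarity name `tuned_bm25` are assumptions made for illustration; the settings shape follows the custom `BM25` similarity described in the Elasticsearch documentation. Nothing here sends a request, and how you judge each candidate (relevance judgments, click data, and so on) is left open.

```python
import json


def frange(start, stop, step):
    """Inclusive float range, rounded to 2 decimals to avoid drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 2) for i in range(n + 1)]


def bm25_settings(b, k1, name="tuned_bm25"):
    """Index settings body defining a custom BM25 similarity.

    `name` is a hypothetical similarity name chosen for this sketch.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    name: {"type": "BM25", "b": b, "k1": k1}
                }
            }
        }
    }


# Ranges reported as working well in the experiments cited above.
b_values = frange(0.3, 0.9, 0.1)
k1_values = frange(0.5, 2.0, 0.1)
grid = [(b, k1) for b in b_values for k1 in k1_values]

print(len(grid), "candidate (b, k1) pairs")
print(json.dumps(bm25_settings(*grid[0]), indent=2))
```

Each body could be used when creating a test index, with the field mapping pointing at the named similarity. An exhaustive grid like this gets expensive quickly, so a coarser step is a reasonable first pass.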

## Explain API

    GET /people3/_doc/4/_explain
    {
      "query": {
        "match": {
          "title": "shane connelly"
        }
      }
    }

    {
      "_index": "people3",
      "_type": "_doc",
      "_id": "4",
      "matched": true,
      "explanation": {
        "value": 0.71437943,
        "description": "sum of:",
        "details": [
          {
            "value": 0.102611035,
            "description": "weight(title:shane in 3) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 0.102611035,
                "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                  {
                    "value": 0.074107975,
                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                    "details": [
                      {
                        "value": 6,
                        "description": "docFreq",
                        "details": []
                      },
                      {
                        "value": 6,
                        "description": "docCount",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 1.3846153,
                    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "termFreq=1.0",
                        "details": []
                      },
                      {
                        "value": 5,
                        "description": "parameter k1",
                        "details": []
                      },
                      {
                        "value": 1,
                        "description": "parameter b",
                        "details": []
                      },
                      {
                        "value": 3,
                        "description": "avgFieldLength",
                        "details": []
                      },
                      {
                        "value": 2,
                        "description": "fieldLength",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 0.61176836,
            "description": "weight(title:connelly in 3) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 0.61176836,
                "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                  {
                    "value": 0.44183275,
                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                    "details": [
                      {
                        "value": 4,
                        "description": "docFreq",
                        "details": []
                      },
                      {
                        "value": 6,
                        "description": "docCount",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 1.3846153,
                    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "termFreq=1.0",
                        "details": []
                      },
                      {
                        "value": 5,
                        "description": "parameter k1",
                        "details": []
                      },
                      {
                        "value": 1,
                        "description": "parameter b",
                        "details": []
                      },
                      {
                        "value": 3,
                        "description": "avgFieldLength",
                        "details": []
                      },
                      {
                        "value": 2,
                        "description": "fieldLength",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    }
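
The explain output carries everything needed to recompute the score by hand. The Python sketch below copies the `idf` and `tfNorm` formulas from the description strings in the response and plugs in the reported values (k1 = 5, b = 1, fieldLength = 2, avgFieldLength = 3, docCount = 6). The function names are illustrative, not part of any Elasticsearch client. Lucene works in single-precision floats, so the last digits can differ slightly from what Python prints.

```python
import math


def idf(doc_count, doc_freq):
    """idf as given in the explain description string."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm as given in the explain description string."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# Parameters reported in the explain output for document 4.
params = dict(freq=1, k1=5, b=1, field_length=2, avg_field_length=3)

# Per-term score = idf * tfNorm; "shane" appears in 6 of 6 docs,
# "connelly" in 4 of 6.
shane = idf(doc_count=6, doc_freq=6) * tf_norm(**params)
connelly = idf(doc_count=6, doc_freq=4) * tf_norm(**params)
total = shane + connelly

print("shane:   ", shane)
print("connelly:", connelly)
print("total:   ", total)
print("matches explain:", math.isclose(total, 0.71437943, abs_tol=1e-5))
```

Because "shane" appears in every document, its idf is close to zero and it contributes little; "connelly" is rarer and carries most of the score. Re-running the arithmetic with different b and k1 values is a cheap way to see how a change would shift a particular document's score before reindexing anything.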

## Summary

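BM25 is not the only scoring algorithm! There is also classic TF/IDF, divergence from randomness (https://en.wikipedia.org/wiki/Divergence-from-randomness_model), and more, including hyperlink-based approaches such as PageRank, and you can combine several of these algorithms. Many variants of the core BM25 algorithm have appeared over the years. For example, some academic BM25 variants try to select, suggest, or estimate k1 and b automatically. In fact, there is reasoning and evidence that k1 can be optimized, at least on a term-by-term basis (Lv, Zhai (2011)). Given all this, it is fair to ask "Why BM25?" or "Why BM25 with k1 = 1.2 and b = 0.75?"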

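"This investigation examined 9 ranking functions, 2 relevance feedback methods, 5 stemming algorithms, and 2 stop word lists. We found that stop words are ineffective, that stemming is effective, that relevance feedback is effective, and that the combination of not using stop words, stemming, and feedback can improve any ranking function. However, there is no clear evidence that any one ranking function is systematically better than the others."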

## References

Lipani, A., Lupu, M., Hanbury, A., Aizawa, A. (2015). Verboseness Fission for BM25 Document Length Normalization. Association for Computing Machinery

Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C. (2006). Optimisation methods for ranking functions with multiple parameters. Association for Computing Machinery

Trotman, A., Puurula, A., Burgess, B. (2014). Improvements to BM25 and Language Models Examined. Association for Computing Machinery

Lv, Y., Zhai, C. (2011). Adaptive term frequency normalization for BM25. Association for Computing Machinery
