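1. Recap of the problem
In Chapter 1 we introduced the basic workflow of map point-of-interest (POI) retrieval and built a simple demo with elasticsearch + ik. When we ran the demo and searched for "通州区万达广场" (Wanda Plaza, Tongzhou District), the top result turned out to be "建国路万达广场" (Wanda Plaza on Jianguo Road), which is in Chaoyang District. In Chapter 2 we explored how ES computes relevance scores and got an overall picture of its scoring strategy. In this article we use the interfaces ES provides to adjust the scoring rules so that the search results match our expectations.
First, let's use the ES explain parameter to print the scoring details and see why the second-ranked address, which is clearly the more sensible match, gets the lower score.
GET http://localhost:9200/idx_default/_search?explain=true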
{
"query": {
"match": {
"address": {
"query": "通州区万达广场"
}
}
}
}
The result is as follows (only the top two hits are shown):
{
"_shard": "[idx_default][0]",
"_node": "Crj7_cZOQT6w9sG0ryBbzQ",
"_index": "idx_default",
"_type": "_doc",
"_id": "138069",
"_score": 17.299044,
"_source": {
"address": "建国路万达广场",
"name": "恒大山水城",
"location": "39.90867476611688,116.46468505121267"
},
"_explanation": {
"value": 17.299044,
"description": "sum of:",
"details": [
{
"value": 10.175069,
"description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
"details": [
{
"value": 10.175069,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 7.7361317,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 89,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 204918,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.59784806,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 3.0,
"description": "dl, length of field",
"details": []
},
{
"value": 7.245098,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 7.1239743,
"description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
"details": [
{
"value": 7.1239743,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 5.416376,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 910,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 204918,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.59784806,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 3.0,
"description": "dl, length of field",
"details": []
},
{
"value": 7.245098,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
},
{
"_shard": "[idx_default][0]",
"_node": "Crj7_cZOQT6w9sG0ryBbzQ",
"_index": "idx_default",
"_type": "_doc",
"_id": "28730",
"_score": 16.216942,
"_source": {
"address": "北京市通州区新华西街58号万达广场F2",
"name": "手寓工坊(万达广场店)",
"location": "39.904175142894765,116.63712318703388"
},
"_explanation": {
"value": 16.216942,
"description": "sum of:",
"details": [
{
"value": 2.879858,
"description": "weight(address:通州区 in 28165) [PerFieldSimilarity], result of:",
"details": [
{
"value": 2.879858,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 2.8400025,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 11972,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 204918,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.46092433,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 7.0,
"description": "dl, length of field",
"details": []
},
{
"value": 7.245098,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 7.844697,
"description": "weight(address:万达 in 28165) [PerFieldSimilarity], result of:",
"details": [
{
"value": 7.844697,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 7.7361317,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 89,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 204918,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.46092433,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 7.0,
"description": "dl, length of field",
"details": []
},
{
"value": 7.245098,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 5.4923873,
"description": "weight(address:广场 in 28165) [PerFieldSimilarity], result of:",
"details": [
{
"value": 5.4923873,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 5.416376,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 910,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 204918,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.46092433,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 7.0,
"description": "dl, length of field",
"details": []
},
{
"value": 7.245098,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}
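The result shows that **"建国路万达广场"** (the "Jianguo Road address" below) scores 17.299044, while **"北京市通州区新华西街58号万达广场F2"** (the "Tongzhou address" below) scores only 16.216942. As mentioned earlier, the Jianguo Road address is in Chaoyang District, which is clearly far from what we queried for, whereas the Tongzhou address is the one we expect. The reason for the current ranking can be found inside _explanation, so let's analyze it with the score model from the previous chapter.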
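2. Root-cause analysis
Let's first look at the structure of _explanation. It is a JSON object with three properties, "value", "description" and "details", which hold the score, the formula used to compute it, and the values of the variables in that formula. details is an array whose elements are JSON objects of the same shape. In the output above these objects are nested five levels deep: level 1 is the overall score ("sum of:"); level 2 is the weight of each term; level 3 is the score formula for that term (boost * idf * tf); level 4 is a component of that formula, such as the term's idf; level 5 is a variable of that component, such as N in the idf formula. The total score is computed as: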
$$\text{total score}=\sum_{i=1}^{n}\text{term score}_i$$
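The summation can be checked directly against the explain output. The sketch below is only an illustration: `explanation` is an abbreviated copy of the first hit's _explanation (inner details omitted), and `term_scores` is a hypothetical helper name, not something ES provides.

```python
import re

# Abbreviated copy of the first hit's _explanation; the nested
# boost/idf/tf details are omitted because only level 2 is needed here.
explanation = {
    "value": 17.299044,
    "description": "sum of:",
    "details": [
        {"value": 10.175069,
         "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 7.1239743,
         "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}

def term_scores(expl):
    """Return {term: score} read from the per-term weight objects."""
    scores = {}
    for child in expl["details"]:
        # Descriptions look like "weight(<field>:<term> in <docid>) ..."
        match = re.match(r"weight\((\w+):(.+?) in \d+\)", child["description"])
        if match:
            scores[match.group(2)] = child["value"]
    return scores

scores = term_scores(explanation)
total = sum(scores.values())
print(scores, total)
```

The per-term values add up to the hit's top-level value, up to floating-point rounding.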
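Now let's look at a single term, taking "万达" in "建国路万达广场" as an example. Find the JSON object for "万达"; the description of its nested score object reads "score(freq=1.0), computed as boost * idf * tf from:", which needs three values: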
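**boost** is a weight applied to the query. We can set a boost value on a given field through the mapping when creating the index, so that different fields carry different weights in a multi-field query (newer ES versions deprecate mapping-level boost in favor of query-time boost).
**idf** is the inverse document frequency, described as "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:". See the previous chapter for what it means.
**tf** is the term frequency, described as "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:". See the previous chapter for what it means.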
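To make the three factors concrete, here is a small sketch that recomputes the score of "万达" in the Jianguo Road address from the variables printed in the explain output. The function names `idf` and `tf` are chosen for illustration; the formulas are the ones quoted in the descriptions above. ES works in single precision, so the last digits may differ slightly.

```python
import math

def idf(n, N):
    """idf as printed by ES: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, k1, b, dl, avgdl):
    """tf as printed by ES: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Values copied from the explain output for "万达" in the first hit.
boost = 2.2
term_idf = idf(n=89, N=204918)
term_tf = tf(freq=1.0, k1=1.2, b=0.75, dl=3.0, avgdl=7.245098)
score = boost * term_idf * term_tf

print(term_idf, term_tf, score)
```

The three printed numbers should land very close to the 7.7361317, 0.59784806 and 10.175069 reported by ES.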
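Now that we understand how ES computes the score and what the output means, let's analyze why the Jianguo Road address scores higher than the Tongzhou address. The JSON result can be turned into the per-term score table below, where each row gives the score of the term on the left in the two addresses. The term "通州区" appears only in the Tongzhou address, yet even with one extra term its total is still lower.

| Term | Jianguo Road address | Tongzhou address |
| --- | --- | --- |
| 通州区 | (no match) | 2.879858 |
| 万达 | 10.175069 | 7.844697 |
| 广场 | 7.1239743 | 5.4923873 |
| Total | 17.299044 | 16.216942 |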
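Breaking down the scores of "万达" and "广场" in the two addresses reveals the cause. The table below lists each factor used when scoring these two terms. The Tongzhou address is clearly lower on the tf factor for both terms, while the other factors are identical.

| Term | Factor | Jianguo Road address | Tongzhou address |
| --- | --- | --- | --- |
| 万达 | boost | 2.2 | 2.2 |
| 万达 | idf | 7.7361317 | 7.7361317 |
| 万达 | tf | 0.59784806 | 0.46092433 |
| 广场 | boost | 2.2 | 2.2 |
| 广场 | idf | 5.416376 | 5.416376 |
| 广场 | tf | 0.59784806 | 0.46092433 |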
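Looking further at how tf is computed, the only difference is dl, whose description is "dl, length of field", i.e. the length of the address field: 3.0 for the Jianguo Road address versus 7.0 for the Tongzhou address.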
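Plugging the two dl values into the tf expression that ES prints makes the gap concrete. This is just a worked substitution using the values from the explain output (freq = 1, k1 = 1.2, b = 0.75, avgdl = 7.245098):

$$tf_{dl=3}=\frac{1}{1+1.2\cdot\left(1-0.75+0.75\cdot\frac{3}{7.245098}\right)}\approx\frac{1}{1.6727}\approx 0.5978$$

$$tf_{dl=7}=\frac{1}{1+1.2\cdot\left(1-0.75+0.75\cdot\frac{7}{7.245098}\right)}\approx\frac{1}{2.1696}\approx 0.4609$$

Every term of the longer address is therefore scaled down by roughly the same ratio, about 0.4609 / 0.5978 ≈ 0.77.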
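Recall the tf formula introduced in the previous article (it differs slightly from the one ES prints: the ES version drops the k+1 factor from the numerator, but the overall effect is the same; in the output above boost is 2.2, which equals k1 + 1 for k1 = 1.2, so that factor appears to be carried by boost instead):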
$$TF=\frac{(k+1)\cdot f_i}{k\cdot\left(1-b+b\cdot\frac{dl}{avg(dl)}\right)+f_i}$$
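Here dl is the length of the current document and avg(dl) is the average document length in the corpus. avg(dl) is the same for every document, so the larger dl is, the lower the tf score. The conclusion is that the Tongzhou address, "北京市通州区新华西街58号万达广场F2", is simply too long. Although it covers more terms (it also matches "通州区"), dl lowers the score of every one of them. Next, let's see which parameter we can tune to reduce the influence of dl.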
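3. Tuning the parameter
At the end of the previous article we introduced the parameter b in the tf formula, the factor BM25 gives us to control how much document length matters. When b=0 the denominator becomes k+f_i, which removes the effect of document length entirely; the higher b is, the more length affects the TF score. Here we want to reduce or even eliminate the effect of length: the addresses in our address library do not differ much in length, so we want them to compete fairly, with the address that matches more terms scoring higher.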
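The effect of b can be previewed before touching the index. The sketch below re-scores the two addresses with b = 0.75 and b = 0, using the idf, dl and avgdl values from the explain output above. It assumes those statistics stay unchanged, which will not hold exactly once the index is rebuilt, so it is a what-if estimate rather than a prediction of the new scores. The names `DOCS` and `doc_score` are hypothetical.

```python
def tf(freq, k1, b, dl, avgdl):
    """tf as printed by ES: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Statistics copied from the explain output (assumed fixed for this what-if).
BOOST, K1, AVGDL = 2.2, 1.2, 7.245098
IDF = {"通州区": 2.8400025, "万达": 7.7361317, "广场": 5.416376}
DOCS = {
    "jianguo": {"dl": 3.0, "terms": ["万达", "广场"]},
    "tongzhou": {"dl": 7.0, "terms": ["通州区", "万达", "广场"]},
}

def doc_score(doc, b):
    """Sum boost * idf * tf over the query terms the document matches (freq = 1)."""
    return sum(BOOST * IDF[t] * tf(1.0, K1, b, doc["dl"], AVGDL)
               for t in doc["terms"])

for b in (0.75, 0.0):
    print(b, {name: round(doc_score(doc, b), 3) for name, doc in DOCS.items()})
```

With b = 0, tf no longer depends on dl, so under these assumptions the address that matches the extra term "通州区" comes out ahead.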
ES makes this easy: we only need to define the value of b inside settings when creating the index. The request is as follows:
PUT http://localhost:9200/idx_default
{
"settings": {
"index": {
"similarity": {
"BM25_b_0": {
"type": "BM25",
"b": "0.0"
}
}
}
},
"mappings": {
"poipo": {
"properties": {
"location": {
"type": "geo_point"
},
"address": {
"type": "text",
"similarity": "BM25_b_0"
}
}
}
}
}
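If you want to try several values of b, the request body above can be generated rather than edited by hand. This is a minimal sketch: `build_index_body` is a hypothetical helper, it only builds and prints the JSON body, and sending it to ES is left out.

```python
import json

def build_index_body(b, similarity_name="BM25_b_0", doc_type="poipo"):
    """Build a create-index body with a custom BM25 similarity on `address`."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {"type": "BM25", "b": b}
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "location": {"type": "geo_point"},
                    # The field opts in to the custom similarity by name.
                    "address": {"type": "text", "similarity": similarity_name},
                }
            }
        },
    }

body = build_index_body(b=0.0)
print(json.dumps(body, indent=2))
```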
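BM25_b_0 is the similarity model we define: type says it is a BM25 model, and b overrides that parameter with the value 0. In mappings below it, the similarity of the address field is then set to the new model. (A custom mapping type name such as poipo is accepted by ES 6.x; on 7.x the create-index request needs include_type_name=true, and 8.x removes mapping types.) That completes the new index. After re-importing the data we run the query again, with the following result: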
{
"_index": "idx_default",
"_type": "poipo",
"_id": "56963",
"_score": 17.46982,
"_source": {
"address": "北京市通州区新华街道建国路93号院万达广场11号楼",
"location": {
"lon": 116.6574382584145,
"lat": 39.92313729883979
}
}
},
{
"_index": "idx_default",
"_type": "poipo",
"_id": "87454",
"_score": 16.99757,
"_source": {
"address": "北京市通州区北苑街道手寓工坊(万达广场店)",
"location": {
"lon": 116.64295933891906,
"lat": 39.905244856754514
}
}
}
...
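Only the top two results are listed here. Both are clearly Wanda Plaza locations in Tongzhou District, which shows that the parameter change has taken effect.
In this article we used an example to show how to inspect ES query results and their scoring details, found the cause of the wrong ranking by analyzing how the scores are computed, and finally changed the model through the parameter ES exposes. This tuning is rather blunt, since it removes the length factor altogether; in later chapters we will explore finer-grained tuning strategies based on term priority.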