Map POI Search, Part 3 (Tuning ES Relevance Score Parameters)

1. Problem Recap

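In Chapter 1 we introduced the basic workflow of map point-of-interest (POI) retrieval and built a simple demo with elasticsearch + ik. When we ran the demo and searched for "通州区万达广场" (Tongzhou District Wanda Plaza), the top result turned out to be "建国路万达广场" (Jianguo Road Wanda Plaza), which is in Chaoyang District. In Chapter 2 we explored how ES computes relevance scores and got an overview of the scoring strategy. In this article we use the interfaces ES provides to adjust the scoring rules so that the search results match our expectations.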

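First, we use the ES explain parameter to print the scoring details and analyze why the second-ranked address, which is clearly the more sensible match, scores lower.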

GET http://localhost:9200/idx_default/_search?explain=true
{
  "query": {
    "match": {
      "address": {
        "query": "通州区万达广场"
      }
    }
  }
}
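
The same request can also be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_explain_request` and `run_search` are ours and not part of any ES client, and the sketch assumes a node at localhost:9200 as in the request above. It sends the body with POST, which the _search endpoint also accepts.

```python
import json
import urllib.request


def build_explain_request(base_url, index, field, text):
    """Build a match query against one field with explain enabled."""
    url = f"{base_url}/{index}/_search?explain=true"
    body = {"query": {"match": {field: {"query": text}}}}
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and parse the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_explain_request(
    "http://localhost:9200", "idx_default", "address", "通州区万达广场"
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `run_search(request)` against a running node returns the response as a dictionary; each hit then carries the _explanation object discussed in the rest of this article.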

The results are as follows (only the top two hits are shown):

{
                "_shard": "[idx_default][0]",
                "_node": "Crj7_cZOQT6w9sG0ryBbzQ",
                "_index": "idx_default",
                "_type": "_doc",
                "_id": "138069",
                "_score": 17.299044,
                "_source": {
                    "address": "建国路万达广场",
                    "name": "恒大山水城",
                    "location": "39.90867476611688,116.46468505121267"
                },
                "_explanation": {
                    "value": 17.299044,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 10.175069,
                            "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 10.175069,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 7.7361317,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 89,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.59784806,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 3.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 7.1239743,
                            "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 7.1239743,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 5.416376,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 910,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.59784806,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 3.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
{
                "_shard": "[idx_default][0]",
                "_node": "Crj7_cZOQT6w9sG0ryBbzQ",
                "_index": "idx_default",
                "_type": "_doc",
                "_id": "28730",
                "_score": 16.216942,
                "_source": {
                    "address": "北京市通州区新华西街58号万达广场F2",
                    "name": "手寓工坊(万达广场店)",
                    "location": "39.904175142894765,116.63712318703388"
                },
                "_explanation": {
                    "value": 16.216942,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 2.879858,
                            "description": "weight(address:通州区 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 2.879858,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 2.8400025,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 11972,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 7.844697,
                            "description": "weight(address:万达 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 7.844697,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 7.7361317,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 89,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 5.4923873,
                            "description": "weight(address:广场 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 5.4923873,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 5.416376,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 910,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }

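The results show that **"建国路万达广场"** (hereafter the "Jianguo Road address") scores 17.299044, while **"北京市通州区新华西街58号万达广场F2"** (hereafter the "Tongzhou address") scores only 16.216942. As noted earlier, the Jianguo Road address is in Chaoyang District, which is clearly far from what we searched for, while the Tongzhou address is the expected match. The reason for the current ranking can be found inside _explanation. Below we use the score model from the previous chapter to analyze it.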

2. Root Cause Analysis

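Let us first look at the structure of _explanation. It is a JSON object with three properties, "value", "description" and "details", which hold the score, the formula, and the values of the variables in that formula. details is an array whose elements are JSON objects of the same shape. We read these objects as four levels: level 1 is the overall score; level 2 is the score of each term (the weight(...) object and the score(...) object nested directly inside it carry the same value, so we count them as one level); level 3 is a component of the term score, such as the term's idf; level 4 is a variable of that component, such as the value of N in the idf formula. The total score is computed as: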
$$\text{total score}=\sum_{i=1}^{n}\text{score of term } i$$
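
To illustrate the structure and the sum, here is a small Python sketch that reads the level-2 objects of an _explanation and adds up their values. The dictionary is a hand-trimmed copy of the first hit above with the deeper levels removed, and `term_scores` with its regular expression is a hypothetical helper written for this particular output format.

```python
import re

# Trimmed copy of the first hit's _explanation (deeper levels removed).
explanation = {
    "value": 17.299044,
    "description": "sum of:",
    "details": [
        {
            "value": 10.175069,
            "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 7.1239743,
            "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

# Captures the field name and the term from "weight(field:term in docid)".
TERM_PATTERN = re.compile(r"weight\((\w+):(\S+) in \d+\)")


def term_scores(expl):
    """Map each matched term to the score of its level-2 object."""
    scores = {}
    for detail in expl["details"]:
        match = TERM_PATTERN.match(detail["description"])
        if match:
            scores[match.group(2)] = detail["value"]
    return scores


scores = term_scores(explanation)
total = sum(scores.values())
print(scores)
print(f"sum of term scores: {total:.6f}, reported total: {explanation['value']}")
```

The summed term scores agree with the reported total up to rounding in the last digit.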
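Now let us look at a single term, taking "万达" in "建国路万达广场" as the example. We find the JSON object for "万达"; the description inside its details reads "score(freq=1.0), computed as boost * idf * tf from:", which needs three values: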

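**boost** is a query weight. At index creation we can set a boost value on a given field through the mapping, so that different fields carry different weights in a multi-field query. (Elasticsearch has deprecated mapping-level boost since 5.0 in favor of setting boost at query time.)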

**idf** is the inverse document frequency, described as "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:". See the previous chapter for its meaning.

**tf** is the term frequency, described as "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:". See the previous chapter for its meaning.
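
The two formulas quoted in these descriptions can be checked by hand. The Python sketch below plugs in the values reported for "万达" in the Jianguo Road address. The function names are illustrative, and tiny differences from the ES output are to be expected, presumably because ES computes these values in single precision.

```python
import math


def bm25_idf(N, n):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the explain output for "万达" in "建国路万达广场".
boost = 2.2
idf = bm25_idf(N=204918, n=89)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=3.0, avgdl=7.245098)
score = boost * idf * tf

print(f"idf={idf:.5f} tf={tf:.5f} score={score:.4f}")
```

The three results agree with the reported idf (7.7361317), tf (0.59784806) and term score (10.175069) to within rounding.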

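Now that we understand how ES computes the score and what the output means, let us analyze why the Jianguo Road address outscores the Tongzhou address. We convert the JSON results into the term score table below, where each row gives the score of the term on the left in each of the two addresses. The term "通州区" appears only in the Tongzhou address (where it contributes 2.879858), yet even with one extra matching term the Tongzhou address still scores lower.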

[Figure: term score table for the two addresses]

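A closer look at the scores of "万达" and "广场" in the two addresses reveals the cause. The table below lists the value of each factor used to score these two terms. The Tongzhou address has the lower tf for both terms (0.46092433 versus 0.59784806), while boost and idf are identical.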

[Figure: boost, idf and tf values of "万达" and "广场" in the two addresses]

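Looking further into how tf is computed, the only difference is dl, whose description is "dl, length of field", that is, the length of the address: 3.0 for the Jianguo Road address and 7.0 for the Tongzhou address.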

[Figure: tf variables (freq, k1, b, dl, avgdl) for the two addresses]

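Recall the tf formula introduced in the previous article. It differs slightly from the formula in the ES output: the ES version leaves (k+1) out of the numerator. The boost of 2.2 reported above equals k1 + 1 for k1 = 1.2, which is consistent with that factor being folded into boost, so the overall effect is the same: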
$$TF=\frac{(k+1)\cdot f_i}{k\cdot\left(1-b+b\cdot\frac{dl}{avg(dl)}\right)+f_i}$$
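Here dl is the length of the current document (counted in terms) and avg(dl) is the average document length in the corpus. avg(dl) is the same for every document, and the larger dl is, the lower the tf score. So the conclusion of the analysis is that the Tongzhou address, "北京市通州区新华西街58号万达广场F2", is too long. Although it covers more terms (one extra, "通州区"), dl lowers the score of every term. Next we look at which parameter can be tuned to reduce the influence of dl.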
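
A short Python sketch makes the effect of dl concrete, assuming the k1, b and avgdl values reported in the explain output. It evaluates the ES form of tf for freq=1 at a few field lengths; dl=14 is a made-up length added only for comparison, and the function name is ours.

```python
AVGDL = 7.245098  # average field length from the explain output


def tf_for_length(dl, k1=1.2, b=0.75, freq=1.0, avgdl=AVGDL):
    """ES form of the BM25 tf factor as a function of field length dl."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


tf_short = tf_for_length(3.0)    # Jianguo Road address
tf_long = tf_for_length(7.0)     # Tongzhou address
tf_longer = tf_for_length(14.0)  # hypothetical longer address

for dl, value in ((3, tf_short), (7, tf_long), (14, tf_longer)):
    print(f"dl={dl:>2}: tf={value:.5f}")

# Every shared term is scaled by the same ratio in the longer address.
ratio = tf_long / tf_short
print(f"tf(dl=7) / tf(dl=3) = {ratio:.3f}")
```

Because boost and idf are the same for a shared term, this ratio (roughly 0.77) is also the ratio between the two addresses' scores for "万达" and for "广场": the longer address loses close to a quarter of the score on every term it shares with the shorter one.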

3. Adjusting the Parameter

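At the end of the previous article we introduced the parameter b in the tf formula and noted that it is the factor BM25 gives us to control how much document length matters. When b=0 the denominator becomes k+fi, which removes the influence of document length entirely; the higher b is, the more length affects the TF score. Here we want to reduce or even eliminate the influence of length: addresses in the address database do not differ much in length, so we want them to compete fairly, with whichever matches more terms scoring higher.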
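
Before rebuilding the index we can estimate what b=0 does to this ranking. The Python sketch below recomputes both scores from the n, N, dl and avgdl values in the explain output, first with the default b=0.75 and then with b=0. It assumes freq=1 for every term and keeps the boost of 2.2 from the output; the dictionary keys and function names are ours. The absolute numbers need not match a rebuilt index, whose statistics may differ.

```python
import math

N = 204918        # documents that have the address field
AVGDL = 7.245098  # average field length
K1 = 1.2
BOOST = 2.2       # as reported in the explain output

# n: number of documents containing each term
DOC_FREQ = {"通州区": 11972, "万达": 89, "广场": 910}

# Matched query terms and field length (dl) of the two addresses.
ADDRESSES = {
    "jianguo_road": (["万达", "广场"], 3.0),
    "tongzhou": (["通州区", "万达", "广场"], 7.0),
}


def idf(n):
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def tf(dl, b, freq=1.0):
    return freq / (freq + K1 * (1 - b + b * dl / AVGDL))


def total_score(name, b):
    """Sum boost * idf * tf over the terms matched by one address."""
    terms, dl = ADDRESSES[name]
    return sum(BOOST * idf(DOC_FREQ[term]) * tf(dl, b) for term in terms)


for b in (0.75, 0.0):
    for name in ADDRESSES:
        print(f"b={b} {name}: {total_score(name, b):.4f}")
```

With b=0.75 this reproduces the order seen above (about 17.30 versus 16.22). With b=0 the tf factor is identical for both addresses, so the Tongzhou address, which matches three terms instead of two, comes out ahead.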

ES provides a convenient interface for this: we only need to define the value of b inside settings when creating the index. The request is:

PUT http://localhost:9200/idx_default
{
    "settings": {
        "index": {
            "similarity": {
                "BM25_b_0": {
                    "type": "BM25",
                    "b": "0.0"
                }
            }
        }
    },
    "mappings": {
        "poipo": {
            "properties": {
                "location": {
                    "type": "geo_point"
                },
                "address": {
                    "type": "text",
                    "similarity": "BM25_b_0"
                }
            }
        }
    }
}

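BM25_b_0 is the similarity model we define: type declares it a BM25 model, and b overrides that variable, setting it to 0. Then, in mappings below, we set the similarity of the address field to the new model. That completes the new index. After re-importing the data we query again, with the following results: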

{
                "_index": "idx_default",
                "_type": "poipo",
                "_id": "56963",
                "_score": 17.46982,
                "_source": {
                    "address": "北京市通州区新华街道建国路93号院万达广场11号楼",
                    "location": {
                        "lon": 116.6574382584145,
                        "lat": 39.92313729883979
                    }
                }
            },
            {
                "_index": "idx_default",
                "_type": "poipo",
                "_id": "87454",
                "_score": 16.99757,
                "_source": {
                    "address": "北京市通州区北苑街道手寓工坊(万达广场店)",
                    "location": {
                        "lon": 116.64295933891906,
                        "lat": 39.905244856754514
                    }
                }
            }
...

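Only the top two results are listed here. Both are Wanda Plaza addresses in Tongzhou District, which shows that the parameter adjustment has taken effect.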

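In this article we used an example to show how to inspect ES query results and their scoring details, and by analyzing how the score is computed we found the cause of the wrong ranking. Finally, we used the parameter interface ES provides to modify the model. This tuning case removes the length factor rather bluntly; in later chapters we will look at finer-grained tuning strategies based on term priority.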
