Approximate (proximity) matching in Elasticsearch using the slop parameter

私念

Article link: https://blog.csdn.net/tiancityycf/article/details/116276455

Reference: https://www.phpmianshi.com/?id=248

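What slop means

When the terms in the query string (the search text) do not sit in the same relative positions as in a document, some of them have to be moved before the phrase lines up with that document. The number of moves required is the slop.

Term positions

When a string is analyzed, the analyzer returns more than a list of terms: it also returns the position, or order, of each term in the original string.

For example: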

POST /_analyze
{
  "analyzer": "standard",
  "text": "区块链比特币"
}

Result:

{
  "tokens" : [
    {
      "token" : "区",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "块",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "链",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "比",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "特",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "币",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}
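To make the position bookkeeping concrete, here is a minimal Python sketch that imitates what the standard analyzer returned above for Chinese text: one token per ideograph, each with its offsets and position. It is not the real analyzer. The function name `analyze_per_char` is hypothetical, and the check only covers the basic CJK Unified Ideographs block, which is an assumption that happens to hold for the strings in this article.

```python
def analyze_per_char(text):
    """Emit one token per CJK ideograph and skip every other character.

    A simplification of the standard analyzer's behaviour on Chinese text,
    not a reimplementation of it.
    """
    tokens = []
    position = 0
    for offset, ch in enumerate(text):
        # Assumption: only the basic CJK Unified Ideographs block is handled.
        if "\u4e00" <= ch <= "\u9fff":
            tokens.append({
                "token": ch,
                "start_offset": offset,
                "end_offset": offset + 1,
                "position": position,
            })
            position += 1  # positions count tokens, offsets count characters
    return tokens


for tok in analyze_per_char("区块链比特币"):
    print(tok)
```

Note that skipped characters such as commas advance the offsets but not the positions, which is why offsets and positions drift apart in longer strings.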

Example

Suppose we have a theme field that stores “区块链,新能源,比特币,军工,医疗保健,医药”. The standard analyzer tokenizes it as follows:

POST /_analyze
{
  "analyzer": "standard",
  "text": "区块链,新能源,比特币,军工,医疗保健,医药"
}

Result:

{
  "tokens" : [
    {
      "token" : "区",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "块",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "链",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "新",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "能",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "比",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "特",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "币",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "军",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "工",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "医",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "疗",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "保",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    },
    {
      "token" : "健",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "医",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<IDEOGRAPHIC>",
      "position" : 15
    },
    {
      "token" : "药",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    }
  ]
}

We query with the following request:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase":{
            "theme":{
              "query":"区块链比特币",
              "slop":0
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 20
}

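As we can see, the query returns no results.

Cause analysis

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all of those terms but keeps only the documents that contain every term, with the terms at adjacent positions and in the same order as in the query.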
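The matching rule just described can be sketched in a few lines of Python. This is a simplified model for illustration, not Elasticsearch's implementation: the helper names `one_token_per_char` and `exact_phrase_match` are hypothetical, and tokenization is reduced to "one token per letter-like character", which mirrors the standard analyzer output shown earlier for these particular strings.

```python
def one_token_per_char(text):
    """Simplified tokenizer: keep letter-like characters, drop punctuation."""
    return [ch for ch in text if ch.isalpha()]


def exact_phrase_match(query_terms, doc_terms):
    """True if the query terms occur in the document consecutively and in order."""
    # Build a tiny positional index: term -> list of positions in the document.
    doc_index = {}
    for pos, term in enumerate(doc_terms):
        doc_index.setdefault(term, []).append(pos)

    # Every query term must be present in the document at all.
    if any(term not in doc_index for term in query_terms):
        return False

    # Try each occurrence of the first term as the start of the phrase.
    for start in doc_index[query_terms[0]]:
        if all(start + i in doc_index[term] for i, term in enumerate(query_terms)):
            return True
    return False


doc = one_token_per_char("区块链,新能源,比特币,军工,医疗保健,医药")
print(exact_phrase_match(one_token_per_char("区块链比特币"), doc))  # False
print(exact_phrase_match(one_token_per_char("比特币"), doc))        # True
```

Every character of “区块链比特币” is present in the document, so a plain match query would find it; the phrase check fails only because the positions are not consecutive.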

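With the standard analyzer, “区块链比特币” is split into single characters, as shown in the first result above.

“区块链,新能源,比特币,军工,医疗保健,医药” is likewise split into single characters by the standard analyzer, as shown in the second result above.

All the other characters are adjacent in both the query and the document. In the document, however, “链” is at position 2 and “比” is at position 6, a difference of 4, whereas in the query they are adjacent (positions 2 and 3). With slop set to 0, the analyzed terms must sit at exactly adjacent positions (no moves allowed), so nothing is found. Following this analysis, we increase slop step by step and find that the document is returned only once slop reaches 3. In other words, 3 moves are needed: moving “链” from position 2 to position 5 puts it right next to “比” at position 6, and the phrase matches. Equivalently, “比特币” sits 3 positions later in the document (6, 7, 8) than in the query (3, 4, 5), because the three tokens “新”, “能”, “源” lie in between.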
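The arithmetic above can be checked with a small Python sketch. It models the required slop as the spread between the largest and smallest shift (document position minus query position) over all query terms. This is a simplified model under stated assumptions, not Lucene's actual matcher: the name `min_slop` is hypothetical, tokenization is again one token per letter-like character, and queries with repeated terms are not handled faithfully.

```python
from itertools import product


def min_slop(query_terms, doc_terms):
    """Smallest slop for which this simplified model reports a match, or None."""
    doc_positions = {}
    for pos, term in enumerate(doc_terms):
        doc_positions.setdefault(term, []).append(pos)

    # A phrase can never match if a query term is missing from the document.
    if any(term not in doc_positions for term in query_terms):
        return None

    best = None
    # Try every way of assigning one document occurrence to each query term.
    for combo in product(*(doc_positions[t] for t in query_terms)):
        # Shift of each term: where it is in the doc minus where it is in the query.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        needed = max(shifts) - min(shifts)  # 0 means an exact phrase
        if best is None or needed < best:
            best = needed
    return best


doc = [ch for ch in "区块链,新能源,比特币,军工,医疗保健,医药" if ch.isalpha()]
print(min_slop(list("区块链比特币"), doc))  # 3
print(min_slop(list("比特币"), doc))        # 0
```

For the example, “区”, “块”, “链” have shift 0 and “比”, “特”, “币” have shift 3, so the spread is 3. That agrees with the observation that the query starts returning the document once "slop" is set to 3.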

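Summary

1. Position information can be stored in the inverted index. Position-aware queries such as match_phrase use it to match only documents that contain the words in the correct order with no other words in between. The slop parameter adds some flexibility to phrase matching: it tells the match_phrase query how far apart terms may be while the document still counts as a match. "How far apart" means how many times you need to move a term to make the query and the document match.

2. slop is not the exact number of moves that makes the query string terms match a doc. It is the maximum number of moves the terms may make when trying to match a doc.

3. In a search with slop, the closer together the keywords are, the higher the relevance score.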
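Points 2 and 3 can be illustrated with a short Python sketch. Both function names are hypothetical. The weight formula 1 / (1 + distance) is an assumption modeled on how Lucene weights sloppy phrase matches; the real relevance score also passes through the similarity function (BM25 by default), so treat the numbers as a trend, not as actual scores.

```python
def phrase_matches(needed_moves, slop):
    """slop is an upper bound: any match needing at most `slop` moves is accepted."""
    return needed_moves <= slop


def sloppy_weight(needed_moves):
    """Assumed weighting: the fewer moves a match needs, the more it counts."""
    return 1.0 / (1.0 + needed_moves)


# The example document needs 3 moves for the query "区块链比特币".
print([slop for slop in range(6) if phrase_matches(3, slop)])  # [3, 4, 5]

# Closer terms get a larger weight.
for moves in (0, 1, 3):
    print(moves, sloppy_weight(moves))
```

Setting a generous slop therefore does not flatten the ranking: documents where the terms sit closer together still come out ahead of looser matches.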