ElasticSearch match and match_phrase Queries

攻城狮阿楠

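Problem:

The index contains a field with the value 『第十人民医院』 ("Tenth People's Hospital"). Analyzing it with the IK analyzer produces the following tokens: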

POST http://localhost:9200/development_hospitals/_analyze?pretty&field=hospital.names&analyzer=ik

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "人民医院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人",
      "start_offset": 2,
      "end_offset": 3,
      "type": "COUNT",
      "position": 5
    },
    {
      "token": "民医院",
      "start_offset": 3,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "医院",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}

A match query built in Postman returns results, but a match_phrase query for 『第十』 returns nothing.
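The original query bodies are not preserved in the text, so the following is only a sketch of what the two requests might look like. The field name hospital.names is taken from the _analyze request above; the helper build_query is hypothetical, and the search text used for the match query is assumed to be the same 『第十』.

```python
import json

def build_query(query_type, field, text):
    """Return a search request body for a `match` or `match_phrase` query."""
    if query_type not in ("match", "match_phrase"):
        raise ValueError("unsupported query type: " + query_type)
    return {"query": {query_type: {field: text}}}

# Assumed field name, copied from the _analyze request above.
FIELD = "hospital.names"

match_body = build_query("match", FIELD, "第十")
phrase_body = build_query("match_phrase", FIELD, "第十")

# These bodies would be sent to the index's _search endpoint.
print(json.dumps(match_body, ensure_ascii=False, indent=2))
print(json.dumps(phrase_body, ensure_ascii=False, indent=2))
```

The two bodies differ only in the query type, which is why the difference in results has to come from how each query uses the analyzed tokens.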

Analysis:

Reference documentation: The Definitive Guide [2.x] | Elastic

A phrase search depends on the positions of the terms. Analyzing 『第十』 with ik_max_word gives the following result:

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 1
    }
  ]
}

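Both tokens, 『第十』 and 『十』, exist in the indexed document, but match_phrase also requires the relative positions of the analyzed terms to match exactly. The query expects 『十』 directly after 『第十』 (positions 0 and 1). When 『第十人民医院』 is analyzed with ik_max_word, the token 『十人』 sits between them (『第十』 at position 0, 『十』 at position 2), so the phrase does not match.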
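The position argument can be checked with a small simulation. This is a simplified model of an exact phrase match (no slop) over (token, position) pairs, not Elasticsearch's or Lucene's actual implementation; the function name phrase_matches is hypothetical, and the token positions are copied from the analyzer outputs above.

```python
def phrase_matches(doc_tokens, query_tokens):
    """Simplified exact-phrase check (slop 0) on (token, position) pairs.

    The phrase matches if some offset exists such that every query token
    appears in the document at (query position + offset).
    """
    doc_positions = {}
    for token, pos in doc_tokens:
        doc_positions.setdefault(token, set()).add(pos)

    first_token, first_pos = query_tokens[0]
    for start in doc_positions.get(first_token, ()):
        offset = start - first_pos
        if all(pos + offset in doc_positions.get(token, ())
               for token, pos in query_tokens):
            return True
    return False

# Positions copied from the analyzer output for the indexed value.
doc = [("第十", 0), ("十人", 1), ("十", 2), ("人民医院", 3),
       ("人民", 4), ("人", 5), ("民医院", 6), ("医院", 7)]
# Positions copied from the analyzer output for the query text.
query = [("第十", 0), ("十", 1)]

print(phrase_matches(doc, query))  # False
```

The check fails because the query needs 『十』 one position after 『第十』, while the document has it two positions after.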

Solution:

Using the ik_smart analyzer avoids this problem. The ik_smart results for 『第十人民医院』 and 『第十』 are, respectively:

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "人民医院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

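With these tokens the phrase query matches reliably: 『第十』 is at position 0 in both the document and the query.

Best practice:

A match_phrase query is very strict, so it also misses relevant results. In my view it works better to combine both query types in a bool query and give the match_phrase clause a higher boost. For example, index name with two analyzers, then query the ik_smart field with match_phrase and the ik_max_word field with match: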

{
  "query": {
    "bool": {
      "should": [
        {"match_phrase": {"name1": {"query": "第十", "boost": 2}}},
        {"match": {"name2": "第十"}}
      ]
    }
  },
  "explain": true
}
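The query above presupposes that the same name is indexed into two differently analyzed fields. A minimal sketch of such a mapping fragment follows. The helper build_name_properties is hypothetical, the field type is an assumption ("text" on Elasticsearch 5.x and later, "string" on 2.x), and the application is assumed to write the same value into both name1 and name2.

```python
import json

def build_name_properties(field_type="text"):
    """Build the `properties` part of a mapping with two analyzed fields.

    field_type is an assumption: "text" on Elasticsearch 5.x and later,
    "string" on 2.x.
    """
    return {
        # Coarse-grained tokens, queried with match_phrase.
        "name1": {"type": field_type, "analyzer": "ik_smart"},
        # Fine-grained tokens, queried with match.
        "name2": {"type": field_type, "analyzer": "ik_max_word"},
    }

properties = build_name_properties()
print(json.dumps({"properties": properties}, indent=2))
```

Keeping the two analyzers on separate fields lets the strict phrase clause and the looser match clause each run against the tokens that suit them.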

Reposted from: https://zhuanlan.zhihu.com/p/25970549
