Learning Elasticsearch, Part 7: Full text queries, the match_phrase and match_phrase_prefix queries

Latest recommended article published on 2024-08-11 15:22:00

Cape_sir

Category: Learning Elasticsearch. Tags: elasticsearch, big data, es

Article link: https://blog.csdn.net/weixin_42652596/article/details/110188403

As in the previous articles, we first create an index and add some data.

# Create the index
PUT /test_003
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "_doc": {
      "dynamic": false,
      "properties": {
        "id": {
          "type": "integer"
        },
        "content": {
          "type": "keyword",
          "fields": {
            "ik_max_field": {
              "type": "text",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_max_word"
            },
            "ik_smart_field": {
              "type": "text",
              "analyzer": "ik_smart"
            }
          }
        },
        "name": {
          "type": "text"
        },
        "createAt": {
          "type": "date"
        }
      }
    }
  }
}

# Import the test data
POST _bulk
{ "index" : { "_index" : "test_003", "_type" : "_doc", "_id" : "1" } }
{ "id" : 1,"content":"关注我,系统学编程" }
{ "index" : { "_index" : "test_003", "_type" : "_doc", "_id" : "2" } }
{ "id" : 2,"content":"系统学编程,关注我" }
{ "index" : { "_index" : "test_003", "_type" : "_doc", "_id" : "3" } }
{ "id" : 3,"content":"系统编程,关注我" }
{ "index" : { "_index" : "test_003", "_type" : "_doc", "_id" : "4" } }
{ "id" : 4,"content":"关注我,间隔系统学编程" }

1. match_phrase query

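The match_phrase query analyzes the search text and builds a phrase query from the resulting tokens.
For a document to match, every token of the analyzed query must appear in the analyzed field, in the same order and, by default, at consecutive positions.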

For example:

# 1. Query the ik_smart_field field with match_phrase; only document 1 matches
POST /test_003/_doc/_search
{
    "query": {
        "match_phrase": {
            "content.ik_smart_field": {
                "query": "关注我,系统学"
            }
        }
    }
}

# 2. Query the ik_smart_field field with match; all documents are returned
POST /test_003/_doc/_search
{
    "query": {
        "match": {
            "content.ik_smart_field": {
                "query": "关注我,系统学"
            }
        }
    }
}

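Analysis: the example above uses the ik_smart analyzer, so the search text "关注我,系统学" is split into 3 tokens [关注, 我, 系统学].
The analyzed content of documents 1, 2, and 4 contains all 3 tokens, but only in document 1 do they appear in the same order as in the query and at consecutive positions, so the match_phrase query returns only document 1.
In document 2 the tokens are in a different order and are not consecutive; in document 4 they are not consecutive, because the token [间隔] sits in the middle; document 3 does not contain all of the tokens (it has [系统] rather than [系统学]).
The match query returns every document because, with its default OR operator, a document only needs to contain one of the query tokens, and all four documents contain [关注] and [我].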
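
The difference between the two queries can be reproduced with a small Python sketch. It is a simplified model, not Elasticsearch's implementation: the token lists for documents 1 to 3 are assumed from the analysis above (only document 4's tokens are confirmed by the `_analyze` output shown later in this article), and `match_any` and `match_phrase` are hypothetical helper names.

```python
# Simplified model of match vs. match_phrase over token lists.
# The token lists are assumptions based on the article's analysis,
# not output captured from Elasticsearch.
DOCS = {
    1: ["关注", "我", "系统学", "编程"],
    2: ["系统学", "编程", "关注", "我"],
    3: ["系统", "编程", "关注", "我"],
    4: ["关注", "我", "间隔", "系统学", "编程"],
}
QUERY = ["关注", "我", "系统学"]


def match_any(doc_tokens, query_tokens):
    """Model of match with the default OR operator: one shared token is enough."""
    return any(token in doc_tokens for token in query_tokens)


def match_phrase(doc_tokens, query_tokens):
    """Model of match_phrase with slop 0: tokens must be consecutive and in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


match_hits = [doc_id for doc_id, tokens in DOCS.items() if match_any(tokens, QUERY)]
phrase_hits = [doc_id for doc_id, tokens in DOCS.items() if match_phrase(tokens, QUERY)]
print("match:", match_hits)
print("match_phrase:", phrase_hits)
```

Under these assumed token lists, the model returns all four documents for match and only document 1 for match_phrase, which agrees with the query results described above.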

Core match_phrase parameter: slop, the tolerance on the positional distance between tokens (default 0)

# Add a slop parameter to the match_phrase query above; documents 1 and 4 are both returned
POST /test_003/_doc/_search
{
    "query": {
        "match_phrase": {
            "content.ik_smart_field": {
                "query": "关注我,系统学",
                "slop": 1
            }
        }
    }
}

Analyzing how the content of document 4 is tokenized:

POST /_analyze
{
  "text": "关注我,间隔系统学编程",
  "analyzer": "ik_smart"
}
# Result
{
  "tokens": [
    {
      "token": "关注",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "我",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "间隔",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "系统学",
      "start_offset": 6,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "编程",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}

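The output shows that [我] is at position 1 and [系统学] at position 3, a difference of 2, whereas the query expects them to be adjacent (a difference of 1). [系统学] therefore has to move by 1 position, which equals the slop value, so document 4 is returned as well.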
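
To make the position arithmetic concrete, here is a minimal Python sketch that estimates the slop a document needs. It is a simplified model and not Elasticsearch's actual scorer: for each query token it computes how far the document position is shifted from the query position, and takes the spread between the largest and smallest shift. The function name `required_slop` is hypothetical, the token lists for documents 1 to 3 are assumed from the article's analysis, and the sketch assumes the query tokens are all distinct.

```python
from itertools import product

# Token lists: document 4 is taken from the _analyze output above,
# documents 1 to 3 are assumptions based on the article's analysis.
DOCS = {
    1: ["关注", "我", "系统学", "编程"],
    2: ["系统学", "编程", "关注", "我"],
    3: ["系统", "编程", "关注", "我"],
    4: ["关注", "我", "间隔", "系统学", "编程"],
}
QUERY = ["关注", "我", "系统学"]


def required_slop(doc_tokens, query_tokens):
    """Return the smallest slop under this model, or None if a token is missing."""
    positions = []
    for token in query_tokens:
        found = [i for i, t in enumerate(doc_tokens) if t == token]
        if not found:
            return None  # a query token is absent, no slop value can help
        positions.append(found)

    best = None
    # Try every combination of occurrences, one position per query token.
    for combo in product(*positions):
        # Shift = where the token is in the document minus where the query expects it.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best


for doc_id, tokens in DOCS.items():
    print(doc_id, required_slop(tokens, QUERY))
```

In this model document 1 needs no slop and document 4 needs a slop of 1, matching the query results above; document 3 can never match because it lacks [系统学].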

2. match_phrase_prefix query

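This query is similar to match_phrase, but the last token of the query is treated as a prefix and expanded against the terms in the inverted index.
The max_expansions parameter limits how many terms that prefix can expand to; its default value is 50.
We use two tokens from the content.ik_smart_field field, [系统学] (in documents 1, 2, and 4) and [系统] (in document 3), to explain how match_phrase_prefix works.
Because the field uses the ik_smart analyzer, [系统学] is only ever indexed as a single token.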

# 1. First query with match_phrase: no results
POST /test_003/_doc/_search
{
  "query": {
    "match_phrase": {
      "content.ik_smart_field": {
        "query": "系"
      }
    }
  }
}

# 2. Query with match_phrase_prefix and "max_expansions": 1: returns document 3
POST /test_003/_doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "content.ik_smart_field": {
        "query": "系",
        "max_expansions": 1
      }
    }
  }
}

# 3. Query with match_phrase_prefix and "max_expansions": 2: returns all documents
POST /test_003/_doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "content.ik_smart_field": {
        "query": "系",
        "max_expansions": 2
      }
    }
  }
}

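Result analysis:
[Query 1] returns nothing because the inverted index built by the ik_smart analyzer contains no token [系] for any document.
[Query 2] returns document 3: with "max_expansions": 1, the prefix [系] expands to only one term, the first matching term in the term dictionary, which is [系统], and only document 3 contains it.
[Query 3] returns all documents: with "max_expansions": 2, the prefix [系] expands to two terms, [系统] and [系统学], and every document contains one of them.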
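
One way to read these results is that max_expansions caps the number of index terms the prefix expands to, taken in sorted term order. The Python sketch below models that reading for a single-token query. It is an illustration under assumptions, not Elasticsearch's implementation: the token lists are assumed from the article's analysis, `expand_prefix` and `phrase_prefix_hits` are hypothetical names, and the sketch assumes max_expansions is at least 1.

```python
# Token lists are assumptions based on the article's analysis of ik_smart output.
DOCS = {
    1: ["关注", "我", "系统学", "编程"],
    2: ["系统学", "编程", "关注", "我"],
    3: ["系统", "编程", "关注", "我"],
    4: ["关注", "我", "间隔", "系统学", "编程"],
}


def expand_prefix(term_dictionary, prefix, max_expansions):
    """Return the first max_expansions terms, in sorted order, that start with prefix."""
    matches = [term for term in sorted(term_dictionary) if term.startswith(prefix)]
    return matches[:max_expansions]


def phrase_prefix_hits(docs, prefix, max_expansions):
    """Model a single-token match_phrase_prefix query over the given documents."""
    # Build the term dictionary from every token of every document.
    terms = {token for tokens in docs.values() for token in tokens}
    expanded = expand_prefix(terms, prefix, max_expansions)
    return sorted(doc_id for doc_id, tokens in docs.items()
                  if any(term in tokens for term in expanded))


print(expand_prefix({"系统", "系统学", "编程"}, "系", 1))
print(phrase_prefix_hits(DOCS, "系", 1))
print(phrase_prefix_hits(DOCS, "系", 2))
```

Because [系统] sorts before [系统学], a limit of 1 keeps only [系统] and the model returns document 3; a limit of 2 keeps both terms and returns all four documents.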

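Note: the effective minimum of "max_expansions" is 1; even if you set it to 0, the prefix is still expanded to 1 term. Try to avoid this query where possible: the last token always has to scan a large part of the index for prefix matches, so performance can be poor.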