****** This post is just a record of things I ran into on a project, so that next time I don't have to go searching all over again. It reflects my personal understanding!!! For reference only!!!
The official docs say only a little about match_phrase, so refer to this article, which explains it in more detail; click here for a comparison of match and match_phrase.
note: match and match_phrase both analyze the query text before searching, but one point in that article (the part highlighted in red in its figure) was unclear to me, so here is an example:
This example uses the edge_ngram tokenizer.
An ngram tokenizer cuts substrings starting at every offset: with min_gram 1 and max_gram 2, "name" becomes n, na, a, am, m, me, e. edge_ngram only cuts from the beginning of the word, giving just n, na.
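The difference can be sketched in a few lines of Python (illustrative only, not the actual Lucene implementation):

```python
def ngrams(word, min_gram, max_gram):
    # ngram: substrings of every allowed length, from every start offset
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]

def edge_ngrams(word, min_gram, max_gram):
    # edge_ngram: only prefixes, anchored at the start of the word
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("name", 1, 2))       # ['n', 'na', 'a', 'am', 'm', 'me', 'e']
print(edge_ngrams("name", 1, 2))  # ['n', 'na']
```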
1. Create the mapping and specify the custom edge_ngram analyzer
PUT localhost:9200/edge_ngram_custom_example
{
"mappings":{
"properties": {
"content": {
"type": "text",
"analyzer": "my_edge_ngram"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_edge_ngram": {
"tokenizer": "custom_edge_ngram"
}
},
"tokenizer": {
"custom_edge_ngram": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 2
,"token_chars": [
"letter",
"punctuation",
"symbol",
"digit"
]
}
}
}
}
}
2. Index a document
POST localhost:9200/edge_ngram_custom_example/_doc/2
{
"content": "that isnot a test"
}
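As an aside, the tokenizer configured above can be modeled in a few lines of Python. This is a rough sketch: since token_chars covers letters, digits, punctuation, and symbols, whitespace is effectively the only separator here, and each word emits its 1- and 2-character prefixes.

```python
import re

def custom_edge_ngram_analyze(text, min_gram=1, max_gram=2):
    # Rough model of the custom tokenizer: split on whitespace (the only
    # characters outside token_chars in this text), then emit each word's
    # leading prefixes of length min_gram..max_gram.
    tokens = []
    for word in re.findall(r"\S+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

print(custom_edge_ngram_analyze("that isnot a test"))
# ['t', 'th', 'i', 'is', 'a', 't', 'te']
```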
3. Query 1
POST localhost:9200/edge_ngram_custom_example/_search
{
"query":{
"match_phrase":{
"content": {
"query": "th is a t"
// ,"slop": 0
}
}
}
}
Result:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7260926,
"hits": [
{
"_index": "edge_ngram_custom_example",
"_type": "_doc",
"_id": "2",
"_score": 1.7260926,
"_source": {
"content": "that isnot a test"
}
}
]
}
}
4. Query 2
POST localhost:9200/edge_ngram_custom_example/_search
{
"query":{
"match_phrase":{
"content": {
"query": "th i a t"
// ,"slop": 0
}
}
}
}
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
You can see that query 2's condition differs from query 1's only by a single missing "s". So let's look at how the document content is analyzed:
POST localhost:9200/edge_ngram_custom_example/_analyze
{
"analyzer": "my_edge_ngram",
"text": "that isnot a test"
}
Result:
{
"tokens": [
{
"token": "t",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "th",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "i",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
},
{
"token": "a",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 4
},
{
"token": "t",
"start_offset": 13,
"end_offset": 14,
"type": "word",
"position": 5
},
{
"token": "te",
"start_offset": 13,
"end_offset": 15,
"type": "word",
"position": 6
}
]
}
You can see that "is" is tokenized into "i" and "is". By the article's rule that positions must be consecutive, "th" and "a" have "i" and "is" sitting between them, so in theory they should never be adjacent; yet the query whose tokens land on positions 1,3,4,5 matches, while 1,2,4,5 finds nothing, which at first I could not explain. As far as I understand it now, the reason is that match_phrase also runs the query string through the field's analyzer: "th is a t" itself analyzes to t, th, i, is, a, t at positions 0-5, which lines up token for token with the document's stream, so the phrase matches. "th i a t" analyzes to t, th, i, a, t, but position 3 of the document holds "is", not "a", so no consecutive alignment exists and nothing is returned. If I have this wrong, please point it out in the comments.
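The position logic can be simulated in a few lines of plain Python (a sketch, not Elasticsearch itself, and ignoring slop): analyze both the document and the query into token sequences, then require the query sequence to appear at consecutive positions in the document stream. The doc token list below is copied from the _analyze output above; the query token lists follow the same min_gram 1 / max_gram 2 prefix rule.

```python
def phrase_match(doc_tokens, query_tokens):
    # match_phrase with slop 0: the analyzed query token sequence must
    # appear at consecutive positions in the analyzed document stream.
    m = len(query_tokens)
    return any(doc_tokens[i:i + m] == query_tokens
               for i in range(len(doc_tokens) - m + 1))

doc = ['t', 'th', 'i', 'is', 'a', 't', 'te']   # "that isnot a test" analyzed
q1  = ['t', 'th', 'i', 'is', 'a', 't']         # "th is a t" analyzed
q2  = ['t', 'th', 'i', 'a', 't']               # "th i a t" analyzed

print(phrase_match(doc, q1))  # True  -- lines up with doc positions 0-5
print(phrase_match(doc, q2))  # False -- doc position 3 is "is", not "a"
```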