****** This post is just a record of things I ran into on a project, so that next time I don't have to go searching all over again. It reflects my personal understanding!!! For reference only!!!
The official docs say only a little about match_phrase, so refer to this article, which explains it in more detail; click here for a comparison of match and match_phrase.
note: match and match_phrase both analyze the query text before searching, but one point in that article (the part highlighted in red in its figure) was unclear to me, so here is an example:
This example uses the edge_ngram tokenizer.
An ngram tokenizer cuts substrings starting at every offset: with min_gram 1 and max_gram 2, "name" becomes n, na, a, am, m, me, e. edge_ngram only cuts from the beginning of the word, giving just n, na.
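The difference can be sketched in a few lines of Python (illustrative only, not the actual Lucene implementation):

```python
def ngrams(word, min_gram, max_gram):
    # ngram: substrings of every allowed length, from every start offset
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]

def edge_ngrams(word, min_gram, max_gram):
    # edge_ngram: only prefixes, anchored at the start of the word
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("name", 1, 2))       # ['n', 'na', 'a', 'am', 'm', 'me', 'e']
print(edge_ngrams("name", 1, 2))  # ['n', 'na']
```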
1. Create the mapping and specify the custom edge_ngram analyzer
PUT localhost:9200/edge_ngram_custom_example
{
"mappings":{
"properties": {
"content": {
"type": "text",
"analyzer": "my_edge_ngram"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_edge_ngram": {
"tokenizer": "custom_edge_ngram"
}
},
"tokenizer": {
"custom_edge_ngram": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 2
,"token_chars": [
"letter",
"punctuation",
"symbol",
"digit"
]
}
}
}
}
}
2. Index a document
POST localhost:9200/edge_ngram_custom_example/_doc/2
{
"content": "that isnot a test"
}
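As an aside, the tokenizer configured above can be modeled in a few lines of Python. This is a rough sketch: since token_chars covers letters, digits, punctuation, and symbols, whitespace is effectively the only separator here, and each word emits its 1- and 2-character prefixes.

```python
import re

def custom_edge_ngram_analyze(text, min_gram=1, max_gram=2):
    # Rough model of the custom tokenizer: split on whitespace (the only
    # characters outside token_chars in this text), then emit each word's
    # leading prefixes of length min_gram..max_gram.
    tokens = []
    for word in re.findall(r"\S+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

print(custom_edge_ngram_analyze("that isnot a test"))
# ['t', 'th', 'i', 'is', 'a', 't', 'te']
```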
3. Query 1
POST localhost:9200/edge_ngram_custom_example/_search
{
"query":{
"match_phrase":{
"content": {
"query": "th is a t"
// ,"slop": 0
}
}
}
}
Result:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7260926,
"hits": [
{
"_index": "edge_ngram_custom_example",
"_type": "_doc",
"_id": "2",
"_score": 1.7260926,
"_source": {
"content": "that isnot a test"
}
}
]
}
}
4. Query 2
POST localhost:9200/edge_ngram_custom_example/_search
{
"query":{
"match_phrase":{
"content": {
"query": "th i a t"
// ,"slop": 0
}
}
}
}
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
You can see that query 2's condition differs from query 1's only by a single missing "s". So let's look at how the document content is analyzed:
POST localhost:9200/edge_ngram_custom_example/_analyze
{
"analyzer": "my_edge_ngram",
"text": "that isnot a test"
}
Result:
{
"tokens": [
{
"token": "t",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "th",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "i",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
},
{
"token": "a",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 4
},
{
"token": "t",
"start_offset": 13,
"end_offset": 14,
"type": "word",
"position": 5
},
{
"token": "te",
"start_offset": 13,
"end_offset": 15,
"type": "word",
"position": 6
}
]
}
You can see that "is" is tokenized into "i" and "is". By the article's rule that positions must be consecutive, "th" and "a" have "i" and "is" sitting between them, so in theory they should never be adjacent; yet the query whose tokens land on positions 1,3,4,5 matches, while 1,2,4,5 finds nothing, which at first I could not explain. As far as I understand it now, the reason is that match_phrase also runs the query string through the field's analyzer: "th is a t" itself analyzes to t, th, i, is, a, t at positions 0-5, which lines up token for token with the document's stream, so the phrase matches. "th i a t" analyzes to t, th, i, a, t, but position 3 of the document holds "is", not "a", so no consecutive alignment exists and nothing is returned. If I have this wrong, please point it out in the comments.
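The position logic can be simulated in a few lines of plain Python (a sketch, not Elasticsearch itself, and ignoring slop): analyze both the document and the query into token sequences, then require the query sequence to appear at consecutive positions in the document stream. The doc token list below is copied from the _analyze output above; the query token lists follow the same min_gram 1 / max_gram 2 prefix rule.

```python
def phrase_match(doc_tokens, query_tokens):
    # match_phrase with slop 0: the analyzed query token sequence must
    # appear at consecutive positions in the analyzed document stream.
    m = len(query_tokens)
    return any(doc_tokens[i:i + m] == query_tokens
               for i in range(len(doc_tokens) - m + 1))

doc = ['t', 'th', 'i', 'is', 'a', 't', 'te']   # "that isnot a test" analyzed
q1  = ['t', 'th', 'i', 'is', 'a', 't']         # "th is a t" analyzed
q2  = ['t', 'th', 'i', 'a', 't']               # "th i a t" analyzed

print(phrase_match(doc, q1))  # True  -- lines up with doc positions 0-5
print(phrase_match(doc, q2))  # False -- doc position 3 is "is", not "a"
```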