elasticsearch模糊查询和智能搜索推荐

最新推荐文章于 2024-01-28 14:32:24 发布

jaaawaaa

最新推荐文章于 2024-01-28 14:32:24 发布

阅读量178

点赞数

分类专栏： elasticsearch 文章标签： elasticsearch

本文链接：https://blog.csdn.net/weixin_41063119/article/details/134155079

版权

elasticsearch 专栏收录该内容

10 篇文章 0 订阅

订阅专栏

模糊查询

前缀搜索：prefix

概念：以xx开头的搜索，不计算相关度评分。

注意：

- 前缀搜索匹配的是term，而不是field。

- 前缀搜索的性能很差

- 前缀搜索没有缓存

- 前缀搜索尽可能把前缀长度设置的更长

语法：

GET <index>/_search
{
  "query": {
    "prefix": {
      "<field>": {
        "value": "<word_prefix>"
      }
    }
  }
}
index_prefixes: 默认   "min_chars" : 2,   "max_chars" : 5

通配符：wildcard

概念：通配符运算符是匹配一个或多个字符的占位符。例如，*通配符运算符匹配零个或多个字符。您可以将通配符运算符与其他字符结合使用以创建通配符模式。

注意：

- 通配符匹配的也是term，而不是field

语法：

GET <index>/_search
{
  "query": {
    "wildcard": {
      "<field>": {
        "value": "<word_with_wildcard>"
      }
    }
  }
}

正则：regexp

概念：regexp查询的性能可以根据提供的正则表达式而有所不同。为了提高性能，应避免使用通配符模式，如.或 .?+未经前缀或后缀

语法：

GET <index>/_search
{
  "query": {
    "regexp": {
      "<field>": {
        "value": "<regex>",
        "flags": "ALL",
      }
    }
  }
}

flags

- ALL
  启用所有可选操作符。

- COMPLEMENT
  启用操作符。可以使用对下面最短的模式进行否定。例如
  a~bc # matches 'adc' and 'aec' but not 'abc'

- INTERVAL
  启用<>操作符。可以使用<>匹配数值范围。例如
  foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
  foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'

- INTERSECTION
  启用&操作符，它充当AND操作符。如果左边和右边的模式都匹配，则匹配成功。例如:
  aaa.+&.+bbb # matches 'aaabbb'

- ANYSTRING
  启用@操作符。您可以使用@来匹配任何整个字符串。
  您可以将@操作符与&和~操作符组合起来，创建一个“everything except”逻辑。例如:
  @&~(abc.+) # matches everything except terms beginning with 'abc'

混淆字符 (box → fox) 缺少字符 (black → lack)
多出字符 (sic → sick) 颠倒次序 (act → cat)

模糊查询：fuzzy

语法

GET <index>/_search
{
  "query": {
    "fuzzy": {
      "<field>": {
        "value": "<keyword>"
      }
    }
  }
}

参数：

- value：（必须，关键词）

- fuzziness：编辑距离，（0，1，2）并非越大越好，召回率高但结果不准确

1. 1. 两段文本之间的Damerau-Levenshtein距离是使一个字符串与另一个字符串匹配所需的插入、删除、替换和调换的数量
  2. 距离公式：Levenshtein是lucene的，es改进版：Damerau-Levenshtein，

axe=>aex Levenshtein=2 Damerau-Levenshtein=1

- transpositions：（可选，布尔值）指示编辑是否包括两个相邻字符的变位（ab→ba）。默认为true。

短语前缀：match_phrase_prefix

match_phrase：

- match_phrase会分词

- 被检索字段必须包含match_phrase中的所有词项并且顺序必须是相同的

- 被检索字段包含的match_phrase中的词项之间不能有其他词项

概念：

match_phrase_prefix与match_phrase相同,但是它多了一个特性,就是它允许在文本的最后一个词项(term)上的前缀匹配,如果是一个单词,比如a,它会匹配文档字段所有以a开头的文档,如果是一个短语,比如 "this is ma" ,他会先在倒排索引中做以ma做前缀搜索,然后在匹配到的doc中做match_phrase查询,(网上有的说是先match_phrase,然后再进行前缀搜索, 是不对的)

参数

- analyzer 指定何种分析器来对该短语进行分词处理

- max_expansions 限制匹配的最大词项

- boost 用于设置该查询的权重

- slop 允许短语间的词项(term)间隔：slop 参数告诉 match_phrase 查询词条相隔多远时仍然能将文档视为匹配什么是相隔多远？意思是说为了让查询和文档匹配你需要移动词条多少次？

原理解析：How to Use Fuzzy Searches in Elasticsearch | Elastic Blog

N-gram和edge ngram

tokenizer

GET _analyze
{
  "tokenizer": "ngram",
  "text": "reba always loves me"
}

token filter

min_gram：创建索引所拆分字符的最小阈值

max_gram：创建索引所拆分字符的最大阈值

ngram：从每一个字符开始,按照步长,进行分词,适合前缀中缀检索

edge_ngram：从第一个字符开始,按照步长,进行分词,适合前缀匹配场景

#prefix: 前缀搜索
DELETE my_index
# elasticsearch stack
# elasticsearch search
# el
# ela 
# elas elasticsearch
PUT my_index
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "ik_max_word",
        "type": "text",
        "index_prefixes":{
          "min_chars":2,
          "max_chars":4
        },
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
GET my_index/_mapping
POST /my_index/_bulk?filter_path=items.*.error
{"index":{"_id":"1"}}
{"text":"城管打电话喊商贩去摆摊摊"}
{"index":{"_id":"2"}}
{"text":"笑果文化回应商贩老农去摆摊"}
{"index":{"_id":"3"}}
{"text":"老农耗时17年种出椅子树"}
{"index":{"_id":"4"}}
{"text":"夫妻结婚30多年AA制,被城管抓"}
{"index":{"_id":"5"}}
{"text":"黑人见义勇为阻止抢劫反被铐住"}
GET my_index/_search
GET my_index/_mapping
GET _analyze
{
  "text": ["夫妻结婚30多年AA制,被城管抓"]
}
GET my_index/_search
{
  "query": {
    "prefix": {
      "text": {
        "value": "城管"
      }
    }
  }
}

###############################################################
# 通配符
DELETE my_index
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }
DELETE product_en
POST /product_en/_bulk
{ "index": { "_id": "1"} }
{ "title": "my english","desc" :  "shouji zhong de zhandouji","price" :  3999, "tags": [ "xingjiabi", "fashao", "buka", "1"]}
{ "index": { "_id": "2"} }
{ "title": "xiaomi nfc phone","desc" :  "zhichi quangongneng nfc,shouji zhong de jianjiji","price" :  4999, "tags": [ "xingjiabi", "fashao", "gongjiaoka" , "asd2fgas"]}
{ "index": { "_id": "3"} }
{ "title": "nfc phone","desc" :  "shouji zhong de hongzhaji","price" :  2999, "tags": [ "xingjiabi", "fashao", "menjinka" , "as345"]}
{ "title": { "_id": "4"} }
{ "text": "xiaomi erji","desc" :  "erji zhong de huangmenji","price" :  999, "tags": [ "low", "bufangshui", "yinzhicha", "4dsg" ]}
{ "index": { "_id": "5"} }
{ "title": "hongmi erji","desc" :  "erji zhong de kendeji","price" :  399, "tags": [ "lowbee", "xuhangduan", "zhiliangx" , "sdg5"]}
GET my_index/_search
GET product_en/_search

GET my_index/_search
{
  "query": {
    "wildcard": {
      "text.keyword": {
        "value": "my eng*ish"
      }
    }
  }
}
GET product_en/_mapping
#exact value
GET product_en/_search
{
  "query": {
    "wildcard": {
      "tags.keyword": {
        "value": "men*inka"
      }
    }
  }
}

#######################################################
#正则
GET product_en/_search
GET product_en/_search
{
  "query": {
    "regexp": {
      "title": "[\\s\\S]*nfc[\\s\\S]*"
    }
  }
}
GET product_en/_search
GET product_en/_search
{
  "query": {
    "regexp": {
      "desc": {
        "value": "zh~dng",
        "flags": "COMPLEMENT"
      }
    }
  }
}
GET product_en/_search
{
  "query": {
    "regexp": {
      "tags.keyword": {
        "value": ".*<2-3>.*",
        "flags": "INTERVAL"
      }
    }
  }
}
#############################################
# fuzzy:模糊查询
GET product_en/_search
GET product_en/_search
{
  "query": {
    "fuzzy": {
      "desc": {
        "value": "quangongneng nfc",
        "fuzziness": "2"
      }
    }
  }
}

GET product_en/_search
{
  "query": {
    "match": {
      "desc": {
        "query": "nfe quasdasdasdasd",
        "fuzziness": 1
      }
    }
  }
}
#####################################
# match_phrase_prefix
GET product_en/_search
{
  "query": {
    "match_phrase": {
      "desc": "shouji zhong de"
    }
  }
}

GET product_en/_search
{
  "query": {
    "match_phrase_prefix": {
      "desc": {
        "query": "de zhong shouji hongzhaji",
        "max_expansions": 50,
        "slop":3
      }
    }
  }
}


GET product_en/_search
{
  "query": {
    "match_phrase_prefix": {
      "desc": {
        "query": "zhong hongzhaji",
        "max_expansions": 50,
        "slop": 3
      }
    }
  }
}


# source: zhong de hongzhaji
# query:  zhong >  hongzhaji

# source: shouji zhong de hongzhaji
# query:  de zhong shouji hongzhaji

# de shouji/zhong  hongzhaji  1次
# shouji/de zhong  hongzhaji  2次
# shouji zhong/de  hongzhaji  3次
# shouji zhong de  hongzhaji  4次

#############################################
# ngram 和 edge-ngram
#ngram   min_gram =1   "max_gram": 2

GET _analyze
{
  "tokenizer": "ik_max_word",
  "filter": [ "edge_ngram" ],
  "text": "reba always loves me"
}

#min_gram =1   "max_gram": 1
#r a l m

#min_gram =1   "max_gram": 2
#r a l m
#re al lo me

#min_gram =2   "max_gram": 3
#re al lo me
#reb alw lov me

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "2_3_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_edge_ngram": {
          "type":"custom",
          "tokenizer": "standard",
          "filter": [ "2_3_edge_ngram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer":"my_edge_ngram",
        "search_analyzer": "standard"
      }
    }
  }
}
GET /my_index/_mapping


POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }


GET /my_index/_search
GET /my_index/_mapping
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": "my eng is goo"
    }
  }
}



PUT my_index2
{
  "settings": {
    "analysis": {
      "filter": {
        "2_3_grams": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_edge_ngram": {
          "type":"custom",
          "tokenizer": "standard",
          "filter": [ "2_3_grams" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer":"my_edge_ngram",
        "search_analyzer": "standard"
      }
    }
  }
}
GET /my_index2/_mapping
POST /my_index2/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }

GET /my_index2/_search
{
  "query": {
    "match_phrase": {
      "text": "my eng is goo"
    }
  }
}

GET _analyze
{
  "tokenizer": "ik_max_word",
  "filter": [ "ngram" ],
  "text": "用心做皮肤,用脚做游戏"
}

jaaawaaa

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
elasticsearch模糊查询和智能搜索推荐

match_phrase_prefix与match_phrase相同,但是它多了一个特性,就是它允许在文本的最后一个词项(term)上的前缀匹配,如果是一个单词,比如a,它会匹配文档字段所有以a开头的文档,如果是一个短语,比如 "this is ma" ,他会先在倒排索引中做以ma做前缀搜索,然后在匹配到的doc中做match_phrase查询,(网上有的说是先match_phrase,然后再进行前缀搜索, 是不对的)
复制链接

扫一扫

专栏目录