Elasticsearch - Fuzzy query

Latest recommended article published on 2025-03-05 00:24:01

weixin_42692506

Article link: https://blog.csdn.net/weixin_42692506/article/details/101555035

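This article takes a close look at the fuzzy query in Elasticsearch. It uses examples to explain the fuzzy search mechanism based on Levenshtein edit distance, and analyzes the query parameters and their effect on which documents are recalled.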

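Introduction

The fuzzy query performs fuzzy search over indexed documents based on the Levenshtein edit distance. When the user's input contains mistakes, this feature can, to some extent, still recall documents that are close to what was typed.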

Example

First, let's get an intuitive feel for this feature.

The index currently contains the following document:

PUT levtest/_doc/_bulk
{ "index" : { "_id": 1 } }
{ "title": "lucky" }

Now send the following request to the index:

GET /_search
{
    "query": {
       "fuzzy" : { "title" : "luky" }
    }
}

Because the edit distance between the query term luky and the indexed term lucky is 1, the document lucky is recalled.
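To check the distance claimed above by hand, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein` is our own, and this is only an illustration of the definition, not the code path Elasticsearch or Lucene actually uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("luky", "lucky"))  # 1: insert the missing "c"
```

A single insertion turns luky into lucky, which is why a fuzziness of 1 is enough to recall the document.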

Parameters of the fuzzy query

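Parameter	Meaning
fuzziness	The maximum edit distance. Defaults to AUTO, i.e. the es default behavior. The allowed numeric values are 0, 1 and 2, so the edit distance can be set to at most 2. AUTO strategy: in AUTO mode, es chooses the edit distance from the length of the query term. The lower and upper length boundaries can be customized as AUTO:[low],[high]; if they are omitted, the defaults are 3 and 6, which is equivalent to AUTO:3,6 and gives the following scheme by query term length: 0-2: must match exactly; 3-5: edit distance 1; >5: edit distance 2.
prefix_length	The number of leading characters of the term that are not "fuzzified". It rests on the assumption that users rarely make a mistake at the very beginning of their input. Setting it reduces the number of terms that have to be examined when collecting terms within the allowed edit distance. Defaults to 0.
max_expansions	The maximum number of terms the fuzzy query expands to. Defaults to 50.
transpositions	Whether swapping two adjacent characters (e.g. ab -> ba) counts as a single edit when computing the edit distance. When set to true, the distance computed is in effect the Damerau-Levenshtein distance. Defaults to false in the version described here (the Elasticsearch 7.x reference documentation lists true as the default).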
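The AUTO rule in the fuzziness row can be written down as a tiny function. This is a sketch of the rule as stated in the table, with `auto_fuzziness`, `low` and `high` being names chosen here for illustration; it measures length with Python's `len`, which is an assumption about how the term length is counted.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Maximum edit distance allowed for a term under AUTO:low,high."""
    length = len(term)
    if length < low:
        return 0  # short terms must match exactly
    if length < high:
        return 1  # medium-length terms tolerate one edit
    return 2      # long terms tolerate two edits


for term in ["ki", "luky", "lucky", "search"]:
    print(term, len(term), auto_fuzziness(term))
```

With the default boundaries, a two-character term such as ki gets no fuzziness at all under AUTO, while luky and lucky get one edit and a six-character term gets two.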

Note: if prefix_length is 0 and max_expansions is set to a very large number, the query becomes very expensive to compute, and may well end up examining every term in the index.
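The cost described in this note can be illustrated with a brute-force stand-in for term expansion. Everything here is an assumption made for illustration: the `expand` helper, the toy term list, and the linear scan itself (Lucene intersects an automaton with the term dictionary rather than scanning a list). The sketch only shows how prefix_length shrinks the set of candidate terms and how max_expansions caps the result.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def expand(query: str, terms: List[str], max_edits: int = 1,
           prefix_length: int = 0, max_expansions: int = 50) -> Tuple[List[str], int]:
    """Return (matching terms, number of terms whose distance was computed)."""
    prefix = query[:prefix_length]
    # Terms that do not share the fixed prefix are skipped without any distance work.
    candidates = [t for t in terms if t.startswith(prefix)]
    scored = sorted((levenshtein(query, t), t) for t in candidates)
    within = [t for d, t in scored if d <= max_edits]
    # Keep only the closest max_expansions terms.
    return within[:max_expansions], len(candidates)


terms = ["lucky", "luck", "lucy", "duck", "look", "puck"]  # hypothetical term dictionary
print(expand("luky", terms, max_edits=1, prefix_length=0))  # (['lucky', 'lucy'], 6)
print(expand("luky", terms, max_edits=1, prefix_length=2))  # (['lucky', 'lucy'], 3)
```

Both calls find the same terms, but with prefix_length set to 2 only half of this toy dictionary has to be compared against the query.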

An example that uses these parameters:

GET /_search
{
    "query": {
        "fuzzy" : {
            "user" : {
                "value": "ki",
                "boost": 1.0,
                "fuzziness": 2,
                "prefix_length": 0,
                "max_expansions": 100
            }
        }
    }
}

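How the computation works

For what an FST (finite state transducer) is, see: lucene字典实现原理 (how the Lucene term dictionary is implemented)

For a deeper look at how recall by edit distance works, see: Levenshtein Automata

To understand better how the es fuzzy query works, let's look at a few examples:

Our index currently contains the following documents: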

 {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "组合沙发",
          "title_pinyin": "zu he sha fa",
          "title_pinyin_continuous": "zuheshafa"
        }
      },
      {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "4",
        "_score": 1,
        "_source": {
          "id": 4,
          "title": "卧室电视柜",
          "title_pinyin": "wo shi dian shi gui",
          "title_pinyin_continuous": "woshidianshigui"
        }
      },
      {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "5",
        "_score": 1,
        "_source": {
          "id": 5,
          "title": "酒柜",
          "title_pinyin": "jiu gui",
          "title_pinyin_continuous": "jiugui"
        }
      },
      {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "6",
        "_score": 1,
        "_source": {
          "id": 6,
          "title": "橱柜",
          "title_pinyin": "chu gui",
          "title_pinyin_continuous": "chugui"
        }
      },
      {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "沙发组合",
          "title_pinyin": "sha fa zu he",
          "title_pinyin_continuous": "shfazuhe"
        }
      },
      {
        "_index": "bitao_fuzzy_test",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "title": "电视柜",
          "title_pinyin": "dian shi gui",
          "title_pinyin_continuous": "dianshigui"
        }
      }

Every document is passed through the ik_max_word Chinese analyzer. After analysis, the resulting term dictionary contains the following terms:

"token": "卧室",
"token": "电视机",
"token": "电视",
"token": "机柜",
"token": "组合",
"token": "沙发",
"token": "酒柜",
"token": "橱柜",
"token": "电视柜",
"token": "电视",
"token": "柜",

We now run the following fuzzy query:

 {
 "profile":"true",
  "query": {
    "multi_match": {
      "fields":  [ "title" ],
      "query":     "卧室电视机柜",
      "fuzziness": "1"
    }
  }
}

This returns the following hits:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 3.1329184,
    "hits": [
      {
        "_index": "bitao_suggester_test",
        "_type": "doc",
        "_id": "4",
        "_score": 3.1329184,
        "_source": {
          "id": 4,
          "title": "卧室电视柜",
          "title_pinyin": "wo shi dian shi gui",
          "title_pinyin_continuous": "woshidianshigui"
        }
      },
      {
        "_index": "bitao_suggester_test",
        "_type": "doc",
        "_id": "3",
        "_score": 1.708598,
        "_source": {
          "id": 3,
          "title": "电视柜",
          "title_pinyin": "dian shi gui",
          "title_pinyin_continuous": "dianshigui"
        }
      },
      {
        "_index": "bitao_suggester_test",
        "_type": "doc",
        "_id": "5",
        "_score": 0.75678295,
        "_source": {
          "id": 5,
          "title": "酒柜",
          "title_pinyin": "jiugui"
        }
      },
      {
        "_index": "bitao_suggester_test",
        "_type": "doc",
        "_id": "6",
        "_score": 0.75678295,
        "_source": {
          "id": 6,
          "title": "橱柜",
          "title_pinyin": "chugui"
        }
      }
    ]
  }
}

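You are probably wondering why so many documents were recalled: by the definition of edit distance, only "卧室电视柜" is at edit distance 1 from the original query "卧室电视机柜".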
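To make the puzzle concrete, the whole-string edit distances between the query and the four recalled titles can be computed directly. This is a minimal sketch using a hand-written `levenshtein` helper (our own name, not an Elasticsearch API) that compares the strings character by character.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


query = "卧室电视机柜"
titles = ["卧室电视柜", "电视柜", "酒柜", "橱柜"]  # titles of the four hits above

distances = {title: levenshtein(query, title) for title in titles}
for title, distance in distances.items():
    print(title, distance)
# 卧室电视柜 1
# 电视柜 3
# 酒柜 5
# 橱柜 5
```

Measured on the whole string, three of the four hits are well beyond a fuzziness of 1, so the recall cannot be explained by comparing the full query against the full title.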

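To clear this up, let's look more closely at how es actually performs the recall:

When the request is sent to the index, we can see that the query actually executed is: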

"title:卧室 ((title.smart_word:电视)^0.5 (title.smart_word:电视柜)^0.6666666) (title.smart_word:电视 (title.smart_word:电视柜)^0.5) ((ConstantScore(title.smart_word:柜))^0.0 (title.smart_word:橱柜)^0.5 (title.smart_word:酒柜)^0.5)"

{
 "id": "[QWv_XBWmTh6oskkO8axWag][bitao_suggester_test][0]",
 "searches": [
 {
 "query": [
 {
 "type": "BooleanQuery",
 "description": "title.smart_word:卧室 ((title.smart_word:电视)^0.5 (title.smart_word:电视柜)^0.6666666) (title.smart_word:电视 (title.smart_word:电视柜)^0.5) ((ConstantScore(title.smart_word:柜))^0.0 (title.smart_word:橱柜)^0.5 (title.smart_word:酒柜)^0.5)",
 "time_in_nanos": 892767,
 "breakdown": {
 "score": 16041,
 "build_scorer_count": 7,
 "match_count": 0,
 "create_weight": 114345,
 "next_doc": 34802,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 6,
 "score_count": 3,
 "build_scorer": 727562,
 "advance": 0,
 "advance_count": 0
 },
 "children": [
 {
 "type": "TermQuery",
 "description": "title.smart_word:卧室",
 "time_in_nanos": 52344,
 "breakdown": {
 "score": 1303,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 10432,
 "next_doc": 948,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 39651,
 "advance": 0,
 "advance_count": 0
 }
 },
 {
 "type": "BooleanQuery",
 "description": "(title.smart_word:电视)^0.5 (title.smart_word:电视柜)^0.6666666",
 "time_in_nanos": 473710,
 "breakdown": {
 "score": 2531,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 18098,
 "next_doc": 5875,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 447196,
 "advance": 0,
 "advance_count": 0
 },
 "children": [
 {
 "type": "BoostQuery",
 "description": "(title.smart_word:电视)^0.5",
 "time_in_nanos": 11668,
 "breakdown": {
 "score": 388,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 4243,
 "next_doc": 735,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 6292,
 "advance": 0,
 "advance_count": 0
 }
 },
 {
 "type": "BoostQuery",
 "description": "(title.smart_word:电视柜)^0.6666666",
 "time_in_nanos": 8557,
 "breakdown": {
 "score": 351,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 2994,
 "next_doc": 700,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 4502,
 "advance": 0,
 "advance_count": 0
 }
 }
 ]
 },
 {
 "type": "BooleanQuery",
 "description": "title.smart_word:电视 (title.smart_word:电视柜)^0.5",
 "time_in_nanos": 85704,
 "breakdown": {
 "score": 2154,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 15697,
 "next_doc": 6118,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 61725,
 "advance": 0,
 "advance_count": 0
 },
 "children": [
 {
 "type": "TermQuery",
 "description": "title.smart_word:电视",
 "time_in_nanos": 10744,
 "breakdown": {
 "score": 319,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 3454,
 "next_doc": 768,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 6193,
 "advance": 0,
 "advance_count": 0
 }
 },
 {
 "type": "BoostQuery",
 "description": "(title.smart_word:电视柜)^0.5",
 "time_in_nanos": 10845,
 "breakdown": {
 "score": 300,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 3548,
 "next_doc": 806,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 6181,
 "advance": 0,
 "advance_count": 0
 }
 }
 ]
 },
 {
 "type": "BooleanQuery",
 "description": "(ConstantScore(title.smart_word:柜))^0.0 (title.smart_word:橱柜)^0.5 (title.smart_word:酒柜)^0.5",
 "time_in_nanos": 145692,
 "breakdown": {
 "score": 3934,
 "build_scorer_count": 10,
 "match_count": 0,
 "create_weight": 42304,
 "next_doc": 8922,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 6,
 "score_count": 3,
 "build_scorer": 90512,
 "advance": 0,
 "advance_count": 0
 },
 "children": [
 {
 "type": "BoostQuery",
 "description": "(ConstantScore(title.smart_word:柜))^0.0",
 "time_in_nanos": 40335,
 "breakdown": {
 "score": 548,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 19903,
 "next_doc": 2305,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 17569,
 "advance": 0,
 "advance_count": 0
 },
 "children": [
 {
 "type": "TermQuery",
 "description": "title.smart_word:柜",
 "time_in_nanos": 22716,
 "breakdown": {
 "score": 0,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 13273,
 "next_doc": 755,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 0,
 "build_scorer": 8679,
 "advance": 0,
 "advance_count": 0
 }
 }
 ]
 },
 {
 "type": "BoostQuery",
 "description": "(title.smart_word:橱柜)^0.5",
 "time_in_nanos": 15172,
 "breakdown": {
 "score": 626,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 4212,
 "next_doc": 912,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 9412,
 "advance": 0,
 "advance_count": 0
 }
 },
 {
 "type": "BoostQuery",
 "description": "(title.smart_word:酒柜)^0.5",
 "time_in_nanos": 16461,
 "breakdown": {
 "score": 662,
 "build_scorer_count": 6,
 "match_count": 0,
 "create_weight": 3340,
 "next_doc": 852,
 "match": 0,
 "create_weight_count": 1,
 "next_doc_count": 2,
 "score_count": 1,
 "build_scorer": 11597,
 "advance": 0,
 "advance_count": 0
 }
 }
 ]
 }
 ]
 }
 ],
 "rewrite_time": 1073750,
 "collector": [
 {
 "name": "CancellableCollector",
 "reason": "search_cancelled",
 "time_in_nanos": 78907,
 "children": [
 {
 "name": "SimpleTopScoreDocCollector",
 "reason": "search_top_hits",
 "time_in_nanos": 22879
 }
 ]
 }
 ]
 }
 ],
 "aggregations": [
]
 }

How is this query built, and where do the terms in it come from?

Let's look at how the input query is analyzed: