Elasticsearch Suggester: Smart Search Suggestions

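Modern search engines usually offer a "suggest as you type" feature: while the user is typing a query, the engine auto-completes or corrects it. By helping the user enter more precise keywords, it improves how well documents match in the subsequent full-text search. On JD.com, for example, typing only part of a keyword, or even a misspelled one, still brings up what the user meant to type.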
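If you try it yourself, you can see that JD.com auto-completes as soon as the user starts typing. Once the input reaches a certain length and can no longer be completed because of a misspelling, it starts suggesting similar words instead.
How can similar functionality be implemented in Elasticsearch? The answer is the Suggesters API. A suggester basically breaks the input text into tokens, looks up similar terms in the index's term dictionary, and returns them. Elasticsearch provides four kinds of suggester for different use cases: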

  • Term Suggester
  • Phrase Suggester
  • Completion Suggester
  • Context Suggester

Let's look at a Term Suggester example.
First, prepare an index named blogs with a single text field:

PUT /blogs/
{
  "mappings": {
    "properties": {
      "body":{
        "type": "text"
      }
    }
  }
}

POST _bulk/?refresh=true
{"index":{"_index":"blogs"}}
{"body":"Lucene is cool"}
{"index":{"_index":"blogs"}}
{"body":"Elasticsearch builds on top of lucene"}
{"index":{"_index":"blogs"}}
{"body":"Elasticsearch rocks"}
{"index":{"_index":"blogs"}}
{"body":"Elastic is the company behind ELK stack"}
{"index":{"_index":"blogs"}}
{"body":"elk rocks"}
{"index":{"_index":"blogs"}}
{"body":"elasticsearch is rock solid"}
  • Analyze the text
POST _analyze
{
  "text": [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid"
  ]
}
{
  "tokens" : [
    {
      "token" : "lucene",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "cool",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 15,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "builds",
      "start_offset" : 29,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "on",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "top",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "of",
      "start_offset" : 43,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lucene",
      "start_offset" : 46,
      "end_offset" : 52,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 53,
      "end_offset" : 66,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "rocks",
      "start_offset" : 67,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "elastic",
      "start_offset" : 73,
      "end_offset" : 80,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "is",
      "start_offset" : 81,
      "end_offset" : 83,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "the",
      "start_offset" : 84,
      "end_offset" : 87,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "company",
      "start_offset" : 88,
      "end_offset" : 95,
      "type" : "<ALPHANUM>",
      "position" : 14
    },
    {
      "token" : "behind",
      "start_offset" : 96,
      "end_offset" : 102,
      "type" : "<ALPHANUM>",
      "position" : 15
    },
    {
      "token" : "elk",
      "start_offset" : 103,
      "end_offset" : 106,
      "type" : "<ALPHANUM>",
      "position" : 16
    },
    {
      "token" : "stack",
      "start_offset" : 107,
      "end_offset" : 112,
      "type" : "<ALPHANUM>",
      "position" : 17
    },
    {
      "token" : "elk",
      "start_offset" : 113,
      "end_offset" : 116,
      "type" : "<ALPHANUM>",
      "position" : 18
    },
    {
      "token" : "rocks",
      "start_offset" : 117,
      "end_offset" : 122,
      "type" : "<ALPHANUM>",
      "position" : 19
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 123,
      "end_offset" : 136,
      "type" : "<ALPHANUM>",
      "position" : 20
    },
    {
      "token" : "is",
      "start_offset" : 137,
      "end_offset" : 139,
      "type" : "<ALPHANUM>",
      "position" : 21
    },
    {
      "token" : "rock",
      "start_offset" : 140,
      "end_offset" : 144,
      "type" : "<ALPHANUM>",
      "position" : 22
    },
    {
      "token" : "solid",
      "start_offset" : 145,
      "end_offset" : 150,
      "type" : "<ALPHANUM>",
      "position" : 23
    }
  ]
}

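Every token produced by analysis becomes a term in the dictionary. Note that some tokens occur more than once, so the inverted index records a higher frequency for them, together with their positions in the original documents (and their offsets, if the field is configured to index them).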
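To make the idea of a term dictionary with frequencies concrete, here is a minimal Python sketch. It is not how Lucene stores its index. The `tokenize` helper is only a rough stand-in for the standard analyzer, and counting the number of documents that contain each term is an assumption made for illustration.

```python
import re
from collections import Counter

# The six sample documents indexed into the blogs index above.
DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(docs):
    """Count, for every term, how many documents contain it."""
    freq = Counter()
    for doc in docs:
        # set() ensures a term is counted once per document.
        freq.update(set(tokenize(doc)))
    return freq


DOC_FREQ = document_frequencies(DOCS)
print(DOC_FREQ["lucene"], DOC_FREQ["rocks"], DOC_FREQ["rock"])
```

In this toy dictionary, rocks appears in more documents than rock, which matters once the suggester has to decide between similar terms.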
Now run a suggester search to see the effect:


POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne rock",
      "term": {
        "suggest_mode": "missing",
        "field": "body"
      }
    }
  }
}

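A suggest request is a special kind of search. Inside the DSL, "text" is the text supplied by the API caller, usually whatever the user typed in the UI. Here lucne is a deliberate misspelling that simulates a user typo. "term" indicates that this is a term suggester, and "field" specifies the field the suggester works on. There is also an optional "suggest_mode". The "missing" value used in this example is in fact the default. What does it mean? Let's look at the response first: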

  • Result
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my-suggestion" : [
      {
        "text" : "lucne",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "rock",
        "offset" : 6,
        "length" : 4,
        "options" : [ ]
      }
    ]
  }
}

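In the response, the "suggest" -> "my-suggestion" section contains an array. Each entry corresponds to one token extracted from the input text (stored under the "text" key) together with the suggested terms for that token (stored in the options array). The example returns entries for the two tokens "lucne" and "rock". The options for "rock" are empty, meaning there is nothing to suggest. Why? As mentioned above, the request uses the "missing" suggest mode. Since "rock" already exists in the index's term dictionary, it is considered accurate enough and no suggestion is made. Similar options are offered only for tokens that cannot be found in the dictionary.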
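A client usually wants to turn these entries back into a corrected query string. The following is a minimal Python sketch of that step, using the "offset" and "length" fields from the response above. The `apply_suggestions` helper is hypothetical, and it assumes the first option of each entry is the one to use.

```python
# The "my-suggestion" entries copied from the response shown above.
ENTRIES = [
    {
        "text": "lucne",
        "offset": 0,
        "length": 5,
        "options": [{"text": "lucene", "score": 0.8, "freq": 2}],
    },
    {"text": "rock", "offset": 6, "length": 4, "options": []},
]


def apply_suggestions(text, entries):
    """Replace each token that has at least one option with its first option."""
    result = text
    # Work from right to left so earlier offsets stay valid after a replacement.
    for entry in sorted(entries, key=lambda e: e["offset"], reverse=True):
        if entry["options"]:
            best = entry["options"][0]["text"]
            start = entry["offset"]
            end = start + entry["length"]
            result = result[:start] + best + result[end:]
    return result


print(apply_suggestions("lucne rock", ENTRIES))
```

Tokens with an empty options array, such as rock here, are left untouched.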
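What happens if "suggest_mode" is changed to "popular"? Rerun the query and the options for "rock" are no longer empty: "rocks" is suggested instead.

Both rock and rocks exist in the index's term dictionary. So even when the token the user typed is already in the dictionary, a similar term with a higher frequency may be a better fit, and it is picked into options. Finally, there is the "always" mode, which returns similar terms regardless of whether the token exists in the dictionary.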
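The three modes can be summarized with a small Python sketch. This only models the filtering rule described above, not Elasticsearch's implementation. The `filter_by_mode` function is hypothetical, the list of similar candidates is assumed to be given, and the frequencies are the document counts of the sample data.

```python
# Document frequencies of a few terms in the sample blogs index.
DOC_FREQ = {"lucene": 2, "rock": 1, "rocks": 2}


def filter_by_mode(token, candidates, doc_freq, mode="missing"):
    """Decide which similar terms to return for a token under a suggest mode."""
    token_freq = doc_freq.get(token, 0)
    if mode == "missing":
        # Suggest only when the token itself is absent from the dictionary.
        return [] if token_freq > 0 else list(candidates)
    if mode == "popular":
        # Suggest only terms that are more frequent than the token.
        return [c for c in candidates if doc_freq.get(c, 0) > token_freq]
    if mode == "always":
        # Suggest every similar term, whether or not the token exists.
        return list(candidates)
    raise ValueError(f"unknown suggest mode: {mode}")


print(filter_by_mode("rock", ["rocks"], DOC_FREQ, "missing"))
print(filter_by_mode("rock", ["rocks"], DOC_FREQ, "popular"))
print(filter_by_mode("lucne", ["lucene"], DOC_FREQ, "missing"))
```

With "popular", the reverse direction yields nothing: rocks is already more frequent than rock, so rock is not offered as a replacement for it.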
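You may ask how the similarity of two terms is determined. ES uses an algorithm based on Levenshtein edit distance, whose core idea is to count how many single-character edits are needed to turn one word into another. The term suggester also has many other optional parameters, such as max_edits, prefix_length and min_word_length, for controlling how fuzzy the matching is.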
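For reference, here is a minimal Python sketch of the classic Levenshtein distance using dynamic programming. It is a textbook version for illustration, not the optimized implementation Lucene uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("lucne", "lucene"))
print(levenshtein("rock", "rocks"))
```

Both misspellings used in this article are a single edit away from a term in the dictionary, which is why they are easy for the suggester to correct.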

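On top of the term suggester, the phrase suggester also takes into account the relationship between multiple terms, for example whether they appear together in the indexed text, how close they are to each other, and their frequencies.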

POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne and elasticsear rock",
      "phrase": {
        "field": "body",
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

Result:

{
  "took" : 45,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my-suggestion" : [
      {
        "text" : "lucne and elasticsear rock",
        "offset" : 0,
        "length" : 26,
        "options" : [
          {
            "text" : "lucene and elasticsearch rock",
            "highlighted" : "<em>lucene</em> and <em>elasticsearch</em> rock",
            "score" : 0.004993905
          },
          {
            "text" : "lucne and elasticsearch rock",
            "highlighted" : "lucne and <em>elasticsearch</em> rock",
            "score" : 0.0033391973
          },
          {
            "text" : "lucene and elasticsear rock",
            "highlighted" : "<em>lucene</em> and elasticsear rock",
            "score" : 0.0029183894
          }
        ]
      }
    ]
  }
}

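options directly contains a list of phrases, and because the highlight option was set, the replaced terms are highlighted. Since lucene and elasticsearch appeared together in the same source document, replacing both terms at once is more credible, so that phrase scores highest and comes first. The phrase suggester has a good number of parameters for controlling how fuzzy the matching is. They need to be chosen and tuned for the actual application.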
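On the client side, a "did you mean" feature typically needs just the top phrase. Here is a minimal Python sketch of picking it out of the response above. The `best_phrase` helper is hypothetical and operates on the response body as a plain dict, trimmed to the fields it uses.

```python
# The "suggest" part of the phrase suggester response shown above.
RESPONSE = {
    "suggest": {
        "my-suggestion": [
            {
                "text": "lucne and elasticsear rock",
                "offset": 0,
                "length": 26,
                "options": [
                    {"text": "lucene and elasticsearch rock", "score": 0.004993905},
                    {"text": "lucne and elasticsearch rock", "score": 0.0033391973},
                    {"text": "lucene and elasticsear rock", "score": 0.0029183894},
                ],
            }
        ]
    }
}


def best_phrase(response, name):
    """Return the highest-scoring phrase for the named suggestion, or None."""
    entries = response.get("suggest", {}).get(name, [])
    options = [opt for entry in entries for opt in entry.get("options", [])]
    if not options:
        return None
    return max(options, key=lambda opt: opt["score"])["text"]


print(best_phrase(RESPONSE, "my-suggestion"))
```

Returning None when there are no options lets the caller skip the "did you mean" hint entirely.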

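Completion Suggester

Its main use case is auto-completion. In this scenario a query has to be sent to the backend for every character the user types, so when the user types quickly the backend must respond very fast. The implementation therefore uses a different data structure from the two suggesters above: instead of relying on the inverted index, the analyzed data is encoded into an FST (finite state transducer) and stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. However, an FST can only serve prefix lookups, and that is the main limitation of the Completion Suggester.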
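The prefix-only behavior can be illustrated with a small Python sketch. This is not an FST: a sorted list searched with `bisect` stands in for it, and lowercasing stands in for analysis. The sample strings are the ones used in this article's completion example, and the `complete` function is hypothetical.

```python
from bisect import bisect_left

INPUTS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]

# Sorted (lowercased key, original text) pairs play the role of the FST here.
SORTED_INPUTS = sorted((text.lower(), text) for text in INPUTS)
KEYS = [key for key, _ in SORTED_INPUTS]


def complete(prefix):
    """Return every input whose lowercased form starts with the prefix."""
    prefix = prefix.lower()
    # Binary search for the first key that is >= prefix.
    start = bisect_left(KEYS, prefix)
    matches = []
    for key, original in SORTED_INPUTS[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(original)
    return matches


print(complete("elastic i"))
print(complete("rock"))
```

The second call returns an empty list even though several inputs contain "rock", because the word never appears at the start of an input.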
To use the Completion Suggester, the field must be mapped with a dedicated type, as follows:

PUT /blogs_completion/
{
  "mappings": {
    "properties": {
      "body": {
        "type": "completion"
      }
    }
  }
}


POST _bulk/?refresh=true
{"index":{"_index":"blogs_completion"}}
{"body":"Lucene is cool"}
{"index":{"_index":"blogs_completion"}}
{"body":"Elasticsearch builds on top of lucene"}
{"index":{"_index":"blogs_completion"}}
{"body":"Elasticsearch rocks"}
{"index":{"_index":"blogs_completion"}}
{"body":"Elastic is the company behind ELK stack"}
{"index":{"_index":"blogs_completion"}}
{"body":"the elk stack rocks"}
{"index":{"_index":"blogs_completion"}}
{"body":"elasticsearch is rock solid"}
  • Search
POST /blogs_completion/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}


  • Result

{
  "took" : 34,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "blog-suggest" : [
      {
        "text" : "elastic i",
        "offset" : 0,
        "length" : 9,
        "options" : [
          {
            "text" : "Elastic is the company behind ELK stack",
            "_index" : "blogs_completion",
            "_type" : "_doc",
            "_id" : "V3jB93UBZ_tNtiojBI9q",
            "_score" : 1.0,
            "_source" : {
              "body" : "Elastic is the company behind ELK stack"
            }
          }
        ]
      }
    ]
  }
}

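Note that the Completion Suggester also runs the original data through the analyze phase at index time. Depending on the analyzer chosen, some words may be transformed and others removed, which affects the FST encoding and therefore the matching results.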
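A rough Python sketch shows how removing words during analysis can break a prefix match. It makes several simplifying assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical subset of an English stop word list, stemming is ignored, and the analyzed tokens are simply joined and compared as strings, which is not how the FST is actually encoded.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = frozenset({"is", "the", "of", "on"})


def analyze(text, stop_words=frozenset()):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


def prefix_matches(prefix, document, stop_words=frozenset()):
    """Check whether the analyzed prefix is a prefix of the analyzed document."""
    indexed = " ".join(analyze(document, stop_words))
    query = " ".join(analyze(prefix, stop_words))
    return indexed.startswith(query)


DOC = "Elastic is the company behind ELK stack"
print(prefix_matches("elastic i", DOC))              # no stop word removal
print(prefix_matches("elastic i", DOC, STOP_WORDS))  # with stop word removal
```

Once "is" and "the" are dropped, the analyzed document starts with "elastic company", so the prefix "elastic i" no longer lines up with it.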


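# The blogs_completion index already exists from the previous example,
# so delete it before recreating it with the english analyzer
DELETE /blogs_completion

PUT /blogs_completion/
{
  "mappings": {
    "properties": {
      "body": {
        "type": "completion",
        "analyzer": "english"
      }
    }
  }
}
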
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "the elk stack rocks"}
{ "index" : { "_index" : "blogs_completion"} }
{ "body": "elasticsearch is rock solid"}
  • Query
POST /blogs_completion/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}

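Surprisingly, this time there is no match. The reason is that the english analyzer strips stop words, and "is" is one of them, so it was removed at index time.

Context Suggester

The Context Suggester is an extension of the Completion Suggester. It lets you add context information to the search so that the same input, for example "star", yields different suggestions depending on the context:
  • Coffee-related: starbucks
  • Movie-related: star wars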
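The idea of context-dependent completion can be sketched in a few lines of Python. This only models the concept of filtering completions by a context value and does not use the Elasticsearch API. The category names "coffee" and "movies" and the `suggest` function are hypothetical.

```python
# Each completion input is tagged with a context value (hypothetical categories).
SUGGESTIONS = [
    {"input": "starbucks", "category": "coffee"},
    {"input": "star wars", "category": "movies"},
]


def suggest(prefix, category):
    """Return completions for the prefix, restricted to one context category."""
    prefix = prefix.lower()
    return [
        item["input"]
        for item in SUGGESTIONS
        if item["category"] == category and item["input"].startswith(prefix)
    ]


print(suggest("star", "coffee"))
print(suggest("star", "movies"))
```

The same prefix produces different results only because the caller passes a different context along with it.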
