09.phrase_suggester

1. Introduction to the Phrase Suggester

The term suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while the selection among the returned suggestions is left to the API consumer. Yet, often pre-selected suggestions are required in order to present them to the end user. The phrase suggester adds additional logic on top of the term suggester to select entire corrected phrases instead of individual tokens weighted based on ngram language models. In practice this suggester can make better decisions about which tokens to pick based on co-occurrence and frequencies.

In general the phrase suggester requires a special mapping up front to work. The phrase suggester examples on this page need the following mapping; the reverse analyzer is used only in the last example.

PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase","shingle"]
          },
          "reverse": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase","reverse"]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "trigram": {
            "type": "text",
            "analyzer": "trigram"
          },
          "reverse": {
            "type": "text",
            "analyzer": "reverse"
          }
        }
      }
    }
  }
}

POST test/_doc?refresh=true
{"title": "noble warriors"}

POST test/_doc?refresh=true
{"title": "nobel prize"}


1. An introduction to the shingle filter

I originally thought this filter wasn't of much use, but it keeps showing up in the rest of this document, so to avoid it getting in the way of the later material I came back and sorted it out here.
The shingle filter joins multiple consecutive tokens of a token stream into new tokens, which gives better support for match_phrase queries. That said, the official docs actually advise against using the shingle filter directly to produce phrases; the better option is to enable the index_phrases option on the corresponding text field (a minimal sketch of that follows below).
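A minimal sketch of that alternative (the index name my_index is just for illustration): with index_phrases enabled on a text field, Elasticsearch indexes two-term shingles into a separate sub-field automatically, so exact phrase queries can run efficiently without a hand-rolled shingle analyzer.

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_phrases": true
      }
    }
  }
}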

A simple usage example of the shingle filter itself:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3,
      "output_unigrams": false
    }
  ],
  "text": "quick brown fox jumps"
}

This returns:
[ quick brown, quick brown fox, brown fox, brown fox jumps, fox jumps ]

As you can see, every returned token is a combination of two or three of the original words.

The shingle filter supports the following settings:
max_shingle_size: the maximum number of original tokens joined into one shingle token, defaults to 2.
min_shingle_size: the minimum number of original tokens joined into one shingle token, defaults to 2.
output_unigrams: whether the original single tokens are emitted as well; defaults to true (emit them).
output_unigrams_if_no_shingles: only takes effect when output_unigrams is false; if set to true, the original tokens are emitted whenever no shingle tokens can be produced. It has no effect when output_unigrams is true.
token_separator: the string used to join the original tokens when forming a new shingle token.
filler_token: mainly useful in combination with the stop filter; tokens removed by the stop filter are replaced inside the shingles by the string defined here, "_" by default (see the sketch below).
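A minimal sketch of filler_token in action, combining the shingle filter with the built-in stop filter (the sample text is arbitrary). The position left behind by a removed stop word should show up as "+" inside the shingles, e.g. "over +" and "+ lazy":

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "stop",
    {
      "type": "shingle",
      "output_unigrams": false,
      "filler_token": "+"
    }
  ],
  "text": "jump over the lazy dog"
}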

Once you have the analyzers and mappings set up, you can use the phrase suggester in the same spot you'd use the term suggester:

POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "gram_size": 3,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

The response contains suggestions scored by the most likely spell correction first. In this case we received the expected correction "nobel prize".

{
  "_shards": ...
  "hits": ...
  "timed_out": false,
  "took": 3,
  "suggest": {
    "simple_phrase" : [
      {
        "text" : "noble prize",
        "offset" : 0,
        "length" : 11,
        "options" : [ {
          "text" : "nobel prize",
          "highlighted": "<em>nobel</em> prize",
          "score" : 0.48614594
        }]
      }
    ]
  }
}

2. Basic Phrase suggest API parameters

1. field: The name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections. This field is mandatory.

2. gram_size: Sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field; if the field uses a shingle filter, gram_size is set to the filter's max_shingle_size when not explicitly set.

3. real_word_error_likelihood: The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.

4. confidence: The confidence level defines a factor applied to the input phrase's score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the top N candidates are returned. The default is 1.0.

5. max_errors: The maximum percentage of the terms considered to be misspellings in order to form a correction. This accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is 1.0, meaning only corrections with at most one misspelled term are returned. Note that setting this too high can negatively impact performance; low values like 1 or 2 are recommended, otherwise the time spent in suggest calls might exceed the time spent in query execution.

6. separator: The separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.

7. size: The number of candidates that are generated for each individual query term. Low numbers like 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.

8. analyzer: Sets the analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field passed via field.

9. shard_size: Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to 5.
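A sketch with size, analyzer and shard_size set explicitly (the values are illustrative; trigram is the analyzer defined in the mapping above):

POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 3,
        "analyzer": "trigram",
        "shard_size": 10,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ]
      }
    }
  }
}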

10. text: The suggest text, i.e. the query terms to generate suggestions for.

11. highlight: Sets up suggestion highlighting. If not provided, no highlighted field is returned. If provided, it must contain exactly pre_tag and post_tag, which are wrapped around the changed tokens. If multiple tokens in a row are changed, the entire phrase of changed tokens is wrapped rather than each token individually.

12. collate: Checks each suggestion against the specified query to prune suggestions for which no matching docs exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion has been generated. The query must be specified and it can be templated (see search templates for more information). The current suggestion is automatically made available as the {{suggestion}} variable, which should be used in your query. You can still specify your own template params; the suggestion value will be added to the variables you specify. Additionally, you can specify a prune option to control whether all phrase suggestions will be returned: when set to true, each suggestion will carry an additional option collate_match, which is true if matching documents for the phrase were found and false otherwise. The default value for prune is false.

POST _search
{
  "suggest": {
    "text" : "noble prize",
    "simple_phrase" : {
      "phrase" : {
        "field" :  "title.trigram",
        "size" :   1,
        "direct_generator" : [ {
          "field" :            "title.trigram",
          "suggest_mode" :     "always",
          "min_word_length" :  1
        } ],
        "collate": {
          "query": { 
            "source" : {
              "match": {
                "{{field_name}}" : "{{suggestion}}" 
              }
            }
          },
          "params": {"field_name" : "title"}, 
          "prune": true 
        }
      }
    }
  }
}

The collate query will be run once for every suggestion.

The {{suggestion}} variable will be replaced by the text of each suggestion.

An additional field_name variable has been specified in params and is used by the match query.

Every suggestion will return an extra collate_match option indicating whether the generated phrase matched any document.
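With prune set to true, each entry under options should carry the extra collate_match flag, roughly in this shape (scores omitted here):

"options" : [ {
  "text" : "nobel prize",
  "collate_match" : true
} ]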

3. Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams/shingles that do not exist in the index) and frequent grams (that appear at least once in the index). The smoothing model can be selected by setting the smoothing parameter to one of the following options. Each smoothing model supports specific properties that can be configured.

stupid_backoff
A simple backoff model that backs off to lower order n-gram models if the higher order count is 0, and discounts the lower order n-gram model by a constant factor. The default discount is 0.4. stupid_backoff is the default model.

laplace
A smoothing model that uses additive smoothing, where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.

linear_interpolation
A smoothing model that takes the weighted mean of the unigrams, bigrams, and trigrams based on user-supplied weights (lambdas). Linear interpolation has no default values; all parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied (a sketch follows after the laplace example below).

POST _search
{
  "suggest": {
    "text" : "obel prize",
    "simple_phrase" : {
      "phrase" : {
        "field" : "title.trigram",
        "size" : 1,
        "smoothing" : {
          "laplace" : {
            "alpha" : 0.7
          }
        }
      }
    }
  }
}
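For comparison, a sketch using linear_interpolation instead; all three lambdas must be supplied, and the weights below are purely illustrative:

POST test/_search
{
  "suggest": {
    "text": "obel prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "smoothing": {
          "linear_interpolation": {
            "trigram_lambda": 0.65,
            "bigram_lambda": 0.25,
            "unigram_lambda": 0.1
          }
        }
      }
    }
  }
}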

4. Candidate Generators

The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms for suggestion candidates.

Currently only one type of candidate generator is supported, the direct_generator. The Phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.

5. Direct Generators

  1. field: The field to fetch candidate suggestions from. Like the phrase-level field, it can be set globally or per suggestion.

  2. size: The maximum number of corrections to be returned per suggest text token.

  3. suggest_mode: Controls which suggestions are included, and for which suggest text terms suggestions should be generated at all. Three values are possible:
    missing: only generate suggestions for suggest text terms that are not in the index. This is the default.
    popular: only suggest terms that occur in more documents than the original suggest text term.
    always: suggest any matching suggestion based on the terms in the suggest text.

  4. max_edits
    The maximum edit distance candidate suggestions can have. Only values between 1 and 2 are allowed; any other value results in a bad request error. Defaults to 2. (Several of these options are combined in the sketch after this list.)

  5. prefix_length
    The number of minimal prefix characters that must match in order for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spell-check performance; it is typically useful when misspellings rarely occur in the first few characters of a word, as is the case for English words. (The old name "prefix_len" is deprecated.)

  6. min_word_length
    The minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)

  7. max_inspections
    A factor that is multiplied with shard_size in order to inspect more candidate spelling corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.

  8. min_doc_freq
    The minimal threshold in number of documents a suggestion should appear in. Can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, the number cannot be fractional. The shard-level document frequencies are used for this option.

  9. max_term_freq
    The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage (e.g. 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, fractional values are not allowed. Defaults to 0.01f. This can be used to exclude high-frequency terms (which are usually spelled correctly) from spell-checking, which also improves spell-check performance. The shard-level document frequencies are used for this option.

  10. pre_filter
    A filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated.

  11. post_filter
    A filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer.
The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators that they require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.

POST _search
{
  "suggest": {
    "text" : "obel prize",
    "simple_phrase" : {
      "phrase" : {
        "field" : "title.trigram",
        "size" : 1,
        "direct_generator" : [ {
          "field" : "title.trigram",
          "suggest_mode" : "always"
        }, {
          "field" : "title.reverse",
          "suggest_mode" : "always",
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}

pre_filter and post_filter can also be used to inject synonyms after candidates are generated. For instance, for the query "captain usq" we might generate the candidate "usa" for the term "usq", which is a synonym for "america". This allows us to present "captain america" to the user if this phrase scores high enough.
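A sketch of what that could look like. It assumes a custom analyzer, here called usa_synonyms (hypothetical), has been added to the index's analysis settings with a synonym token filter such as "usa, america"; applied as a post_filter it expands the generated candidate "usa", so that "captain america" can be scored as a phrase:

POST test/_search
{
  "suggest": {
    "text": "captain usq",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always",
          "post_filter": "usa_synonyms"
        } ]
      }
    }
  }
}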
