Elasticsearch: The Definitive Guide - Partial Matching

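This article describes partial matching in Elasticsearch, including how to use prefix, wildcard, and regular-expression queries to match parts of a term. Using UK postcodes as an example, it shows how to run prefix matches on structured data, and it discusses using n-grams for partial matching in languages with compound words, such as German. It also covers potential performance problems and how completion suggesters and edge n-grams can improve efficiency.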

[[partial-matching]]
== Partial Matching

A keen observer will notice that all the queries so far in this book have
operated on whole terms.((("partial matching"))) To match something, the smallest unit had to be a
single term. You can find only terms that exist in the inverted index.

But what happens if you want to match parts of a term but not the whole thing?
Partial matching allows users to specify a portion of the term they are
looking for and find any words that contain that fragment.

The requirement to match on part of a term is less common in the full-text
search-engine world than you might think. If you have come from an SQL
background, you likely have, at some stage of your career,
implemented a poor man’s full-text search using SQL constructs like this:

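[source,sql]
--------------------------------------------------
    WHERE text LIKE '%quick%'
      AND text LIKE '%brown%'
      AND text LIKE '%fox%' <1>
--------------------------------------------------
<1> `%fox%` would match "fox" and "foxes."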

Of course, with Elasticsearch, we have the analysis process and the inverted
index that remove the need for such brute-force techniques. To handle the
case of matching both "fox" and "foxes," we could simply use a stemmer to
index words in their root form. There is no need to match partial terms.

That said, on some occasions partial matching can be useful.
Common use ((("partial matching", "common use cases")))cases include the following:

* Matching postal codes, product serial numbers, or other `not_analyzed` values
  that start with a particular prefix or match a wildcard pattern
  or even a regular expression

* Search-as-you-type: displaying the most likely results before the
  user has finished typing the search terms

* Matching in languages like German or Dutch, which contain long compound
  words, like _Weltgesundheitsorganisation_ (World Health Organization)

We will start by examining prefix matching on exact-value `not_analyzed`
fields.

=== Postcodes and Structured Data

We will use United Kingdom postcodes (postal codes in the United States) to illustrate how((("partial matching", "postcodes and structured data"))) to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows:

* `W1V`: This outer part identifies the postal area and district:

** `W` indicates the area (one or two letters)
** `1V` indicates the district (one or two numbers, possibly followed by a letter)

* `3DG`: This inner part identifies a street or building:

** `3` indicates the sector (one number)
** `DG` indicates the unit (two letters)
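
The breakdown above can be expressed as a regular expression. The following Python sketch is illustrative only: the pattern and the `split_postcode` helper are assumptions that follow the simplified structure described here, not the complete UK postcode rules.

```python
import re

# Hypothetical pattern following the simplified structure described above:
# area (1-2 letters), district (1-2 digits plus an optional letter),
# a space, sector (1 digit), unit (2 letters).
POSTCODE = re.compile(
    r"^(?P<area>[A-Z]{1,2})"
    r"(?P<district>[0-9]{1,2}[A-Z]?)"
    r" (?P<sector>[0-9])"
    r"(?P<unit>[A-Z]{2})$"
)

def split_postcode(postcode):
    """Return the four parts of a postcode as a dict, or None if it does not fit."""
    match = POSTCODE.match(postcode)
    return match.groupdict() if match else None

print(split_postcode("W1V 3DG"))
# {'area': 'W', 'district': '1V', 'sector': '3', 'unit': 'DG'}
```

Because the parts appear in a fixed order from most general to most specific, a prefix of the postcode always corresponds to a larger geographic grouping, which is what makes prefix matching meaningful for this kind of data.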

Let's assume that we are indexing postcodes as exact-value `not_analyzed`
fields, so we could create our index as follows:

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/10_Prefix_query.json

And index some ((("indexing", "postcodes")))postcodes:

[source,js]
--------------------------------------------------
PUT /my_index/address/1
{ "postcode": "W1V 3DG" }

PUT /my_index/address/2
{ "postcode": "W2F 8HW" }

PUT /my_index/address/3
{ "postcode": "W1F 7HW" }

PUT /my_index/address/4
{ "postcode": "WC1N 1LZ" }

PUT /my_index/address/5
{ "postcode": "SW5 0BE" }
--------------------------------------------------
// SENSE: 130_Partial_Matching/10_Prefix_query.json

Now our data is ready to be queried.

[[prefix-query]]
=== prefix Query

To find all postcodes beginning with `W1`, we could use a ((("prefix query")))((("postcodes (UK), partial matching with", "prefix query")))simple `prefix`
query:

[source,js]
--------------------------------------------------
GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/10_Prefix_query.json

The `prefix` query is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
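
One consequence is that the prefix is compared with the indexed terms character for character. The Python sketch below imitates this with a plain string comparison; `prefix_match` is a hypothetical helper for illustration, not an Elasticsearch API.

```python
# Terms exactly as they were indexed in the not_analyzed postcode field.
terms = ["W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE"]

def prefix_match(prefix):
    """Hypothetical stand-in for a term-level prefix lookup: no lowercasing,
    no tokenizing, just a literal comparison against each stored term."""
    return [term for term in terms if term.startswith(prefix)]

print(prefix_match("W1"))  # ['W1V 3DG', 'W1F 7HW']
print(prefix_match("w1"))  # []
```

The lowercase prefix finds nothing because nothing normalizes it to match the uppercase terms that were stored.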

[TIP]
==================================================

By default, the `prefix` query does no relevance scoring. It just finds
matching documents and gives them all a score of `1`. Really, it behaves more
like a filter than a query. The only practical difference between the
`prefix` query and the `prefix` filter is that the filter can be cached.

==================================================

Previously, we said that "you can find only terms that exist in the inverted
index," but we haven't done anything special to index these postcodes; each
postcode is simply indexed as the exact value specified in each document. So
how does the `prefix` query work?

[role="pagebreak-after"]
Remember that the inverted index consists((("inverted index", "for postcodes"))) of a sorted list of unique terms (in
this case, postcodes). For each term, it lists the IDs of the documents
containing that term in the postings list. The inverted index for our
example documents looks something like this:

    Term:          Doc IDs:
    -------------------------
    "SW5 0BE"    |  5
    "W1F 7HW"    |  3
    "W1V 3DG"    |  1
    "W2F 8HW"    |  2
    "WC1N 1LZ"   |  4
    -------------------------

To support prefix matching on the fly, the query does the following:

1. Skips through the terms list to find the first term beginning with `W1`.
2. Collects the associated document IDs.
3. Moves to the next term.
4. If that term also begins with `W1`, the query repeats from step 2; otherwise, we're finished.
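
The steps above can be sketched in Python using the example index. This is a simplified model, assuming a sorted list of terms and a dictionary of postings lists; `TERMS`, `POSTINGS`, and `prefix_search` are illustrative names, and the real implementation inside Lucene is considerably more involved.

```python
from bisect import bisect_left

# Sorted term dictionary and postings lists for the five example documents.
TERMS = ["SW5 0BE", "W1F 7HW", "W1V 3DG", "W2F 8HW", "WC1N 1LZ"]
POSTINGS = {
    "SW5 0BE": [5],
    "W1F 7HW": [3],
    "W1V 3DG": [1],
    "W2F 8HW": [2],
    "WC1N 1LZ": [4],
}

def prefix_search(prefix):
    """Walk the sorted terms and collect doc IDs for every term with the prefix."""
    doc_ids = []
    # Step 1: skip to the first term that sorts at or after the prefix.
    position = bisect_left(TERMS, prefix)
    # Steps 2-4: collect doc IDs and advance while terms still start with the prefix.
    while position < len(TERMS) and TERMS[position].startswith(prefix):
        doc_ids.extend(POSTINGS[TERMS[position]])
        position += 1
    return doc_ids

print(prefix_search("W1"))  # [3, 1]
```

Because the terms are sorted, all terms sharing a prefix sit next to each other, so the scan can stop at the first term that no longer matches.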

While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with `W1`. The `prefix` query
would need to visit all one million terms in order to calculate the result!

And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix `W` instead of `W1`, perhaps we would match 10 million
terms instead of just one million.
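
To get a feel for how quickly the work grows as the prefix shrinks, the following Python sketch counts matching terms in a synthetic term list. The generated terms are postcode-like strings invented for this illustration, not real postcodes, and `terms_visited` is a hypothetical helper.

```python
from itertools import product
from string import ascii_uppercase, digits

# Synthetic, postcode-like terms (not real postcodes): area letter,
# district digit, district letter, a space, and a sector digit.
terms = sorted(
    f"{area}{district}{letter} {sector}"
    for area, district, letter, sector in product("SWN", digits, ascii_uppercase, digits)
)

def terms_visited(prefix):
    """Count the terms that a prefix scan would have to walk over."""
    return sum(1 for term in terms if term.startswith(prefix))

for prefix in ("W", "W1", "W1V"):
    print(prefix, terms_visited(prefix))
# W 2600
# W1 260
# W1V 10
```

In this toy data set, each extra character in the prefix cuts the number of visited terms by a factor of 10 or more.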

CAUTION: The `prefix` query and filter are useful for ad hoc prefix matching, but
should be used with care. ((("prefix query", "caution with"))) They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
