Elasticsearch: The Definitive Guide - Proximity Matching

uxff

Article link: https://blog.csdn.net/xuduorui/article/details/79417834

Proximity Matching

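Standard full-text search uses TF/IDF to score the text of a document, or of a single field within it. The text is split into individual words (terms) to build the index, and the `match` query looks up the terms from the query string in that index. The `match` query can tell us whether a document contains the terms we asked for, but that is only part of the story: it says nothing about the relationship between those words.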

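For example, consider these three sentences:

* Sue ate the alligator.
* The alligator ate Sue.
* Sue never goes anywhere without her alligator-skin purse.

A `match` query for `sue alligator` would match all three sentences, but TF/IDF does not tell us whether the two words appear in the same sentence or merely in the same paragraph; it only reports how often each word occurs in the text.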
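To make the limitation concrete, here is a minimal Python sketch of bag-of-words matching over the three example sentences. It is not how Elasticsearch implements the `match` query: the `tokenize` and `bag_of_words_match` helpers are hypothetical names, a simple regular expression stands in for a real analyzer, and every query term is required to be present. It only shows that a check based on term presence cannot tell the sentences apart.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a stand-in for a real analyzer)."""
    return re.findall(r"\w+", text.lower())

def bag_of_words_match(query, document):
    """Return True if every query term occurs somewhere in the document, in any order."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))

documents = [
    "Sue ate the alligator",
    "The alligator ate Sue",
    "Sue never goes anywhere without her alligator-skin purse",
]

# Word order and distance play no role, so all three sentences match.
matches = [bag_of_words_match("sue alligator", doc) for doc in documents]
print(matches)
```

All three sentences match, even though only the first two describe Sue and an alligator interacting.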

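Understanding how words relate to one another is a complicated problem, and one that cannot be solved just by rephrasing the query. What we can determine, though, is how close together two words are in the text, or even whether they are adjacent, and that is a reasonable signal that the words are related.

Real documents may be much longer than the ones in our example. As mentioned, Sue and alligator may be separated by paragraphs of other text. We still want to find documents in which the two words are widely separated, but we want to give a higher relevance score to documents in which they appear close together.

This is what phrase matching, or proximity matching, does.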
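The idea of using distance as a relevance signal can be sketched in a few lines of Python. This is an illustration only, under the assumption that word positions come from a simple regular-expression tokenizer; the helper names `term_positions` and `min_distance` and the second, longer sentence are made up for this example and do not come from Elasticsearch or the original text.

```python
import re

def term_positions(text):
    """Map each lowercase term to the list of word positions where it occurs."""
    positions = {}
    for position, term in enumerate(re.findall(r"\w+", text.lower())):
        positions.setdefault(term, []).append(position)
    return positions

def min_distance(text, term_a, term_b):
    """Return the smallest gap in word positions between two terms, or None if either is missing."""
    positions = term_positions(text)
    if term_a not in positions or term_b not in positions:
        return None
    return min(abs(a - b) for a in positions[term_a] for b in positions[term_b])

near = "Sue ate the alligator"
far = "Sue went to the zoo early in the morning and later saw an alligator"

# A smaller gap suggests the two words are more likely to be related.
near_gap = min_distance(near, "sue", "alligator")
far_gap = min_distance(far, "sue", "alligator")
print(near_gap, far_gap)
```

A proximity-aware scorer could rank the first sentence above the second because its gap is smaller, while still returning both.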

[TIP]
==================================================
In this chapter, we are using the same example documents that we used for the
==================================================

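[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
--------------------------------------------------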

// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

Like the `match` query, the `match_phrase` query first analyzes the query
string to produce a list of terms. It then searches for all the terms, but
keeps only documents that contain all of the search terms, in the same
positions relative to each other. A query for the phrase `quick fox`
would not match any of our documents, because no document contains the word
`quick` immediately followed by `fox`.

[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:

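[source,js]
--------------------------------------------------
"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
--------------------------------------------------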

// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

==================================================

==== Term Positions

When a string is analyzed, the analyzer returns not only a list of terms, but
also the position, or order, of each term in the original string:

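[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------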

// SENSE: 120_Proximity_Matching/05_Term_positions.json

This returns the following:

[role="pagebreak-before"]
[source,js]
--------------------------------------------------
{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 <1>
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 <1>
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 <1>
      }
   ]
}
--------------------------------------------------

<1> The position of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like
the `match_phrase` query can use them to match only documents that contain
all the words in exactly the order specified, with no words in between.
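As a rough illustration of what storing positions in an inverted index means, the following Python sketch maps each term to the documents that contain it and to the positions where it occurs. It is a toy model, not Lucene's index format: `analyze` and `build_positional_index` are hypothetical helpers, the two documents are invented for this example, and positions start at 1 only to match the `_analyze` response shown above.

```python
import re
from collections import defaultdict

def analyze(text):
    """Return (term, position) pairs with 1-based positions (a stand-in for a real analyzer)."""
    matches = re.finditer(r"\w+", text)
    return [(m.group().lower(), position) for position, m in enumerate(matches, start=1)]

def build_positional_index(documents):
    """Build a mapping of term -> {doc_id: [positions]} from a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, position in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical documents, not the book's test data.
documents = {
    1: "The quick brown fox",
    2: "The quick fox is brown",
}
index = build_positional_index(documents)

print(analyze("Quick brown fox"))
print(index["quick"])
print(index["brown"])
```

With the positions available per document, a query can check not just that `quick` and `brown` both occur, but also where they occur relative to each other.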

==== What Is a Phrase

For a document to be considered a match for the phrase "quick brown fox," the following must be true:

* `quick`, `brown`, and `fox` must all appear in the field.
* The position of `brown` must be `1` greater than the position of `quick`.
* The position of `fox` must be `2` greater than the position of `quick`.

If any of these conditions is not met, the document is not considered a match.
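The three conditions above can be sketched as a small Python function. This is a simplified model for illustration, not the code Elasticsearch or Lucene actually runs: `positions_by_term` and `matches_phrase` are hypothetical names, a regular expression stands in for the analyzer, and the sample title is invented.

```python
import re

def positions_by_term(text):
    """Map each lowercase term to the list of positions where it occurs in the text."""
    positions = {}
    for position, term in enumerate(re.findall(r"\w+", text.lower())):
        positions.setdefault(term, []).append(position)
    return positions

def matches_phrase(phrase, text):
    """Return True if all phrase terms occur in the text at consecutive positions, in order."""
    phrase_terms = re.findall(r"\w+", phrase.lower())
    positions = positions_by_term(text)
    # Condition 1: every term of the phrase must appear in the field.
    if not phrase_terms or any(term not in positions for term in phrase_terms):
        return False
    # Conditions 2 and 3: the n-th term must sit n positions after the first term.
    for start in positions[phrase_terms[0]]:
        if all(start + offset in positions[term]
               for offset, term in enumerate(phrase_terms)):
            return True
    return False

title = "The quick brown fox jumps"  # hypothetical field value
exact = matches_phrase("quick brown fox", title)
gap = matches_phrase("quick fox", title)
reordered = matches_phrase("fox brown quick", title)
print(exact, gap, reordered)
```

Only the exact phrase matches. `quick fox` fails because `brown` sits between the two words, and the reordered phrase fails because the relative positions are wrong.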

[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. Span queries are term-level queries, so they have
no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the `span` queries directly, as the
`match_phrase` query is usually good enough. However, certain specialized
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.

==================================================
[[slop]]
=== Mixing It Up

Requiring exact-phrase matches may be too strict a constraint. Perhaps we do
want documents that contain
