A Study of Proximity Matching in Elasticsearch

xiaobo_z

Category: elasticsearch. Tags: elasticsearch

Article link: https://blog.csdn.net/qq_29579431/article/details/111992239

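This article explores phrase matching in Elasticsearch: the difference between match and match_phrase, why term positions matter, and how the slop parameter enables proximity matching. It also covers multi-value fields and how proximity can improve the relevance of query results. Finally, it describes a performance optimization: using rescoring to balance query cost against result quality.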

1. Loading Data

1.1 Creating and deleting an index

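In Elasticsearch, an index plays a role similar to a database in a relational database system. The following command lists all indices visible from the current node: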

curl -X GET 'http://localhost:9200/_cat/indices?v'

To create an index, send a PUT request to the Elasticsearch server. The following creates the mysql_log index:

curl -X PUT 'localhost:9200/mysql_log'

To delete an index, send a DELETE request:

curl -X DELETE 'localhost:9200/mysql_log'

Loading data

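In Elasticsearch, a document is loaded by sending a PUT request with a JSON body to index/type/id. The following requests load four documents: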

curl -X PUT "localhost:9200/mysql_log/test/1?pretty" -H 'Content-Type: application/json' -d'
{
	"time" : "2020/12/30 11:30:24",
	"sql_type" : "select",
	"sql" : "select col1 from tab1 where col2 = 3;"
}
'
curl -X PUT "localhost:9200/mysql_log/test/2?pretty" -H 'Content-Type: application/json' -d'
{
	"time" : "2020/12/30 11:30:24",
	"sql_type" : "insert",
	"sql" : "insert into tab1 values (1, 1, 1);"
}
'
curl -X PUT "localhost:9200/mysql_log/test/3?pretty" -H 'Content-Type: application/json' -d'
{
	"time" : "2020/12/30 11:31:15",
	"sql_type" : "update",
	"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"
}
'
curl -X PUT "localhost:9200/mysql_log/test/4?pretty" -H 'Content-Type: application/json' -d'
{
	"time" : "2020/12/30 11:30:24",
	"sql_type" : "delete",
	"sql" : "delete from tab1 WHERE col1 = 4;"
}
'

On success, a response like the following is returned:

{
  "_index" : "mysql_log",
  "_type" : "test",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

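Each record here is called a document, and a document is uniquely identified by the combination of _index, _type, and _id. (Custom mapping types such as test are deprecated in Elasticsearch 7.x and removed in 8.0.)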

Phrase matching

Comparing match and match_phrase

In the previous section we loaded four documents, whose sql fields hold the following values:

select col1 from tab1 where col2 = 3;
insert into tab1 values (1, 1, 1);
update tab1 set col1 = 4 WHERE col3 = 4;
delete from tab1 WHERE col1 = 4;

If we search with match:

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "sql": "col1 from"
        }
    }
}
'

three results are returned. For brevity, only part of each hit is shown:

"_score" : 1.1149836,
"sql" : "delete from tab1 WHERE col1 = 4;"

"_score" : 1.0498221,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 0.33698124,
"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"

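Here, _score rates how well each document matches. In Elasticsearch 7.x, match scores with BM25 by default, which is a refinement of TF/IDF. It combines term frequency (how often col1 and from occur in a document's sql field), inverse document frequency (the more documents whose sql field contains a term, the less weight that term carries), and field length (the shorter the field, the higher the relevance).
In the example above, the sql field of the first hit is shorter than that of the second, so the first hit gets the higher _score. The third hit matches only col1, so its _score is lower.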
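The three factors can be made concrete with a small Python sketch. It assumes the default BM25 parameters (k1 = 1.2, b = 0.75), a single shard holding only the four documents, and a hypothetical regex tokenizer standing in for the standard analyzer; the function names are illustrative and not part of any Elasticsearch API.

```python
import math
import re

# The four sql values loaded earlier in this article.
DOCS = [
    "select col1 from tab1 where col2 = 3;",
    "insert into tab1 values (1, 1, 1);",
    "update tab1 set col1 = 4 WHERE col3 = 4;",
    "delete from tab1 WHERE col1 = 4;",
]

K1, B = 1.2, 0.75  # assumed BM25 defaults


def tokenize(text):
    # Rough stand-in for the standard analyzer on this data:
    # lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_score(query, doc, docs):
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_terms = tokenize(doc)
    score = 0.0
    for term in tokenize(query):
        freq = doc_terms.count(term)  # term frequency in this document
        if freq == 0:
            continue
        n = sum(1 for t in tokenized if term in t)  # documents containing the term
        idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
        # Field-length normalization: longer fields dampen the term frequency.
        norm = K1 * (1 - B + B * len(doc_terms) / avg_len)
        score += idf * (K1 + 1) * freq / (freq + norm)
    return score


for doc in DOCS:
    print(round(bm25_score("col1 from", doc, DOCS), 6), doc)
```

Under these assumptions the sketch gives roughly 1.1150 for the delete statement, 1.0498 for the select statement, and 0.3370 for the update statement, which agrees with the _score values returned by the match query.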
Next, we run the same search with match_phrase:

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "sql": "col1 from"
        }
    }
}
'

Only one result is returned:

"_score" : 1.0498221,
"sql" : "select col1 from tab1 where col2 = 3;"

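This is because a match_phrase query first analyzes the query string into a list of terms and searches for those terms, but keeps only the documents that contain all of the terms in the same relative positions as in the query. Compared with match, match_phrase is stricter.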

Term positions

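When a string is analyzed, the analyzer not only splits it into a list of terms but also records the position of each term. For example, analyzing col1 from: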

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "col1 from"
}'

returns the following:

{
	"tokens": [{
		"token": "col1",
		"start_offset": 0,
		"end_offset": 4,
		"type": "<ALPHANUM>",
		"position": 0
	}, {
		"token": "from",
		"start_offset": 5,
		"end_offset": 9,
		"type": "<ALPHANUM>",
		"position": 1
	}]
}

match_phrase therefore uses the term positions stored in the inverted index to run position-sensitive queries.
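To illustrate what position information in an inverted index looks like, here is a minimal Python sketch. The analyze and build_index functions are hypothetical helpers, and the regex only approximates the standard analyzer on simple ASCII input such as the SQL strings in this article.

```python
import re


def analyze(text):
    # Emit one entry per token, mirroring the fields of the _analyze output.
    tokens = []
    for position, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": m.group().lower(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        })
    return tokens


def build_index(docs):
    # term -> {doc_id: [positions]}, a toy positional inverted index.
    index = {}
    for doc_id, text in docs.items():
        for tok in analyze(text):
            index.setdefault(tok["token"], {}).setdefault(doc_id, []).append(tok["position"])
    return index


index = build_index({
    1: "select col1 from tab1 where col2 = 3;",
    4: "delete from tab1 WHERE col1 = 4;",
})
print(index["col1"])  # positions of col1 in each document
print(index["from"])  # positions of from in each document
```

With the positions stored per term, a phrase query only has to check whether the positions of its terms line up inside the same document.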

Phrases

The match_phrase query in this section matches on phrases: a document matches only if it contains the same terms at the same relative positions.

Mixing it up

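Phrase matching is often too strict. For example, delete from tab1 WHERE col1 = 4; above contains both col1 and from, but their positions do not satisfy the phrase, so phrase matching fails. The slop parameter introduces some flexibility into phrase matching: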

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "sql": {
            	"query": "col1 from",
            	"slop":  1
            }
        }
    }
}
'

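With slop set to 1, only one result is returned: "sql" : "select col1 from tab1 where col2 = 3;". With slop set to 4, two results are returned, including delete from tab1 WHERE col1 = 4;. This is because in delete from tab1 WHERE col1 = 4; the term from sits 4 positions away from where the phrase col1 from would place it relative to col1, so a slop of 4 is enough to match both documents.
If slop is large enough, match_phrase matches every document that contains both col1 and from. In other words, the slop parameter widens the scope of a match_phrase search.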
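The slop needed for a given document can be estimated with a short Python sketch. It is a simplified model, not Lucene's actual matcher: each term's position in the document is shifted back by that term's offset in the query, and the spread of the shifted positions is taken as the required slop. The min_slop helper is hypothetical and does not handle query terms that repeat.

```python
import itertools
import re


def positions(text):
    # term -> list of positions, using a rough stand-in for the standard analyzer.
    index = {}
    for pos, tok in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(tok, []).append(pos)
    return index


def min_slop(query, doc):
    """Smallest slop for which this model says doc matches, or None if a term is missing."""
    q_terms = re.findall(r"[a-z0-9]+", query.lower())
    doc_pos = positions(doc)
    if any(t not in doc_pos for t in q_terms):
        return None
    best = None
    # Try every combination of occurrences, one per query term.
    for combo in itertools.product(*(doc_pos[t] for t in q_terms)):
        adjusted = [p - offset for offset, p in enumerate(combo)]
        spread = max(adjusted) - min(adjusted)
        if best is None or spread < best:
            best = spread
    return best


print(min_slop("col1 from", "select col1 from tab1 where col2 = 3;"))
print(min_slop("col1 from", "delete from tab1 WHERE col1 = 4;"))
print(min_slop("col1 from", "update tab1 set col1 = 4 WHERE col3 = 4;"))
```

In this model the select statement needs a slop of 0 and the delete statement a slop of 4, consistent with the query results above, while the update statement can never match because it does not contain from.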

Multi-value fields

Suppose a field in a document holds an array, for example:

curl -X PUT "localhost:9200/test2/groups/1?pretty" -H 'Content-Type: application/json' -d'
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
'

and we then query it with match_phrase:

curl -X GET "localhost:9200/test2/groups/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
'

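In early versions this document matched, because the values of the array were indexed with consecutive positions, leaving Abraham at position 2 and Lincoln at position 3, right next to each other. This behavior later changed: since Elasticsearch 2.0, text fields default to a position_increment_gap of 100 between array values, so in the Elasticsearch 7.10.1 used here the document does not match.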
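The effect of the gap between array values can be sketched in Python. This is a simplified model under two assumptions: positions are counted from 0, and the first token of each later value lands at the previous position plus the gap plus 1. The array_positions and phrase_spread helpers are hypothetical.

```python
import re


def array_positions(values, gap):
    # term -> position; good enough here because no term repeats.
    index = {}
    position = -1
    for i, value in enumerate(values):
        if i > 0:
            position += gap  # artificial gap inserted between array values
        for tok in re.findall(r"[a-z0-9]+", value.lower()):
            position += 1
            index[tok] = position
    return index


def phrase_spread(q_terms, index):
    # Shift each position back by the term's offset in the query;
    # a spread of 0 means the terms form an exact phrase.
    adjusted = [index[t] - offset for offset, t in enumerate(q_terms)]
    return max(adjusted) - min(adjusted)


names = ["John Abraham", "Lincoln Smith"]
for gap in (0, 100):
    index = array_positions(names, gap)
    print(gap, index, phrase_spread(["abraham", "lincoln"], index))
```

With a gap of 0 the two terms are adjacent and the phrase matches across the two values; with a gap of 100 they are far apart, so in this model only a match_phrase query with a slop of around 100 or more would match.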

Closer is better

A match_phrase query with the slop parameter folds the proximity of the query terms into the final relevance _score, as shown below:

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "sql": {
            	"query": "col1 from",
            	"slop":  4
            }
        }
    }
}
'

The results are:

"_score" : 1.0498221,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 0.36330914,
"sql" : "delete from tab1 WHERE col1 = 4;"

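The document in which col1 and from are closer together gets the higher _score; the one in which they are further apart gets the lower _score.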
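One way to see where these two scores come from is the Python sketch below. It assumes that a sloppy phrase match contributes a frequency of 1 / (1 + distance) to the BM25 formula, with default parameters k1 = 1.2 and b = 0.75. The document frequencies and field lengths are counted by hand from the four documents in this article, and phrase_score is a hypothetical helper.

```python
import math

K1, B = 1.2, 0.75              # assumed BM25 defaults
N_DOCS = 4                      # documents in the index
DOC_FREQ = {"col1": 3, "from": 2}  # documents whose sql field contains each term
AVG_LEN = (7 + 7 + 8 + 6) / 4   # average number of terms in the sql field


def idf(term):
    n = DOC_FREQ[term]
    return math.log(1 + (N_DOCS - n + 0.5) / (n + 0.5))


def phrase_score(distance, doc_len):
    # A perfect phrase (distance 0) counts as frequency 1;
    # the further apart the terms, the smaller the frequency.
    freq = 1.0 / (1.0 + distance)
    norm = K1 * (1 - B + B * doc_len / AVG_LEN)
    idf_sum = idf("col1") + idf("from")
    return idf_sum * (K1 + 1) * freq / (freq + norm)


print(phrase_score(0, 7))  # select statement: exact phrase, 7 terms
print(phrase_score(4, 6))  # delete statement: distance 4, 6 terms
```

Under these assumptions the sketch gives about 1.0498 and 0.3633, in line with the two _score values above: the extra distance shrinks the phrase frequency and with it the score.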

Using proximity to improve relevance

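A proximity query still requires every term to appear in the document, which is too strict. Instead, we can build a query that matches documents containing only some of the terms and use the proximity query purely for scoring. A bool query adds the scores of several queries together: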

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "sql": {
            "query":                "col1 from tab1",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { 
          "sql": {
            "query": "col1 from tab1",
            "slop":  50
          }
        }
      }
    }
  }
}
'

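In this bool query, the score of the match query in must and the score of the match_phrase query in should are added to produce the final _score. A must clause has to match for a document to be included in the result set, and minimum_should_match sets the minimum share of query terms that must match. A should clause does not exclude documents; it only raises the _score of the documents it matches. The results of the query above are: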

"_score" : 2.3103652,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 1.6266546,
"sql" : "delete from tab1 WHERE col1 = 4;"

"_score" : 0.4365243,
"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"

"_score" : 0.10536051,
"sql" : "insert into tab1 values (1, 1, 1);"

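The results show that although match_phrase queries for col1 from tab1, it does not prevent update tab1 set col1 = 4 WHERE col3 = 4; and insert into tab1 values (1, 1, 1); from appearing, even though they do not contain all of the terms. If we raise minimum_should_match to 70%, only three results match, and insert into tab1 values (1, 1, 1); is filtered out because too few of its terms match: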

"_score" : 2.3103652,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 1.6266546,
"sql" : "delete from tab1 WHERE col1 = 4;"

"_score" : 0.4365243,
"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"

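The effect of minimum_should_match can be checked with a small Python sketch. It assumes that the percentage is applied to the number of query terms and rounded down, and that a match query always needs at least one matching term; the matching_docs helper and the regex tokenizer are hypothetical.

```python
import re

# The four sql values loaded earlier in this article.
DOCS = [
    "select col1 from tab1 where col2 = 3;",
    "insert into tab1 values (1, 1, 1);",
    "update tab1 set col1 = 4 WHERE col3 = 4;",
    "delete from tab1 WHERE col1 = 4;",
]


def tokenize(text):
    # Rough stand-in for the standard analyzer on this data.
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_docs(query, percent):
    q_terms = set(tokenize(query))
    # Round the percentage down; assume at least one term is always required.
    required = max(1, len(q_terms) * percent // 100)
    hits = []
    for doc in DOCS:
        matched = len(q_terms & set(tokenize(doc)))
        if matched >= required:
            hits.append(doc)
    return hits


print(len(matching_docs("col1 from tab1", 30)))  # 30% of 3 terms rounds down to 0, so 1 term is required
print(len(matching_docs("col1 from tab1", 70)))  # 70% of 3 terms rounds down to 2
```

At 30% a single term such as tab1 is enough, so all four documents match. At 70% two terms are required, which drops the insert statement because it contains only tab1.
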
Performance optimization

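A match_phrase query is considerably more expensive than a match query, because it has to compute and compare the positions of multiple, possibly repeated, terms. We can make match_phrase cheaper by reducing the total number of documents that the phrase query has to examine.
The main idea is as follows:
First use a match query to select the documents we need, then use match_phrase to boost document scores according to term proximity. The match_phrase query is applied only to the top N documents by score on each shard, to readjust their scores. This process is called rescoring.
The query is as follows: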

curl -X GET "localhost:9200/mysql_log/test/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {  
            "sql": {
                "query":                "col1 from tab1",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 4, 
        "query": {         
            "rescore_query": {
                "match_phrase": {
                    "sql": {
                        "query": "col1 from tab1",
                        "slop":  50
                    }
                }
            }
        }
    }
}
'

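The data is still the original data set: the index contains only 4 documents, all on the same shard. window_size is the number of top documents on each shard to rescore; setting it to 4 in this example covers every document returned by the match query. The results are: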

"_score" : 2.3103652,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 1.6266546,
"sql" : "delete from tab1 WHERE col1 = 4;"

"_score" : 0.4365243,
"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"

"_score" : 0.10536051,
"sql" : "insert into tab1 values (1, 1, 1);"

With window_size set to 1, the results are:

"_score" : 1.6266546,
"sql" : "delete from tab1 WHERE col1 = 4;"

"_score" : 1.1551826,
"sql" : "select col1 from tab1 where col2 = 3;"

"_score" : 0.4365243,
"sql" : "update tab1 set col1 = 4 WHERE col3 = 4;"

"_score" : 0.10536051,
"sql" : "insert into tab1 values (1, 1, 1);"

Only delete from tab1 WHERE col1 = 4; was rescored: it had the highest _score in the match query, and with window_size set to 1 only that top document is passed to the rescore query.
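The rescoring step can be modelled in a few lines of Python. This is a minimal sketch with made-up scores and hypothetical document names; it assumes the first-pass and rescore scores are simply added for documents inside the window, and that documents outside the window keep their first-pass score and order.

```python
def rescore(first_pass, rescore_scores, window_size):
    # Rank by the cheap first-pass query.
    ranked = sorted(first_pass.items(), key=lambda kv: kv[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # Only documents inside the window pay for the expensive rescore query.
    rescored = [(doc, score + rescore_scores.get(doc, 0.0)) for doc, score in window]
    rescored.sort(key=lambda kv: kv[1], reverse=True)
    # Assumption of this sketch: the rest stay behind the window, unchanged.
    return rescored + rest


# Made-up scores for illustration only.
first_pass = {"doc_a": 3.0, "doc_b": 2.5, "doc_c": 1.0}
phrase = {"doc_a": 0.5, "doc_b": 2.0}

print(rescore(first_pass, phrase, window_size=3))
print(rescore(first_pass, phrase, window_size=1))
```

With a window of 3, doc_b overtakes doc_a thanks to its larger phrase score. With a window of 1, only doc_a is rescored, so doc_b never receives its phrase bonus, which is the same pattern as in the window_size = 1 results above.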