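The query_string Query in Elasticsearch
Background
Preface: this is my second technical blog post on CSDN. There is still a lot here I don't know how to use, so please bear with me, and if anything is wrong, corrections in the comments below are welcome.
I work on a B2C project at a cross-border e-commerce company, where I am responsible for the product search engine. A few days ago my boss hit a problem while demoing product search: searching for "雪花秀" (the brand Sulwhasoo) returned a large number of products unrelated to "雪花秀". These products had one thing in common: each was related to one of the single characters 雪, 花, or 秀. My guess was that the search text was being tokenized and the resulting tokens were being matched individually.
The product uses Elasticsearch as its search engine. Here is the search request: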
{
  "size": 50,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "xxxx": {
              "query": "yyyy",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "text for query",
            "fields": [],
            "default_operator": "or",
            "auto_generate_phrase_queries": false,
            "use_dis_max": true,
            "tie_breaker": 0,
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "split_on_whitespace": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [],
    "excludes": [
      "categoryVO",
      "brandVO"
    ]
  },
  "sort": [
    {
      "isStock": {
        "order": "desc"
      }
    },
    {
      "saleQuantity": {
        "order": "desc"
      }
    },
    {
      "spuId": {
        "order": "desc"
      }
    }
  ]
}
How the Query String Query Works
The official documentation describes it as follows: A query that uses a query parser in order to parse its content. Here is an example:
GET /_search
{
  "query": {
    "query_string" : {
      "default_field" : "content",
      "query" : "this AND that OR thus"
    }
  }
}
The query_string query parses the input and splits text around operators. Each textual part is analyzed independently of each other. For instance the following query:
GET /_search
{
  "query": {
    "query_string" : {
      "default_field" : "content",
      "query" : "(new york city) OR (big apple)"
    }
  }
}
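So this query type has its own syntax rules, which makes the root cause easier to track down: once we know the syntax and how it is parsed, we can fix the problem in a targeted way.
Here is how query_string parsing works: the content of the "query" field is parsed into a series of terms and operators, which are then combined and matched according to the other settings inside the "query_string" clause. For example:
Entering "excellent work" in "query" is interpreted as "excellent OR work" before matching. Another example:
"小米手机" (Xiaomi phone) is analyzed into "小米", "手机", and "小米手机" before matching. Note that the tokens produced depend on the analyzer defined when the index was created; the default analyzer in my index is "ik_max_word". This explains why "雪花秀" matched unrelated products: it was split into the three characters "雪", "花", and "秀", and with the OR operator a hit on any one of them was enough. That is not what we want, so a few small changes are needed.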
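To make the effect of the operator concrete, here is a small toy model in Python. It is not Elasticsearch and does not call the IK analyzer: the token lists are hard-coded assumptions based on the behavior described above, and the product names are hypothetical. It only shows why OR lets single-character hits through while AND does not.

```python
# Toy model of OR vs AND matching over analyzed tokens (not Elasticsearch itself).

def matches(doc_tokens, query_tokens, operator="OR"):
    """Return True if a document matches the query tokens under the operator."""
    doc = set(doc_tokens)
    hits = [tok in doc for tok in query_tokens]
    if operator.upper() == "AND":
        return all(hits)   # every query token must be present
    return any(hits)       # a single shared token is enough

# Assumption: the analyzer split the brand name into single characters.
query_tokens = ["雪", "花", "秀"]

# Hypothetical products, with tokens written out by hand.
products = {
    "雪花秀 面霜": ["雪", "花", "秀", "面霜"],
    "雪地靴": ["雪", "地", "靴"],
    "花茶": ["花", "茶"],
}

for operator in ("OR", "AND"):
    matched = [name for name, toks in products.items()
               if matches(toks, query_tokens, operator)]
    print(operator, matched)
# OR ['雪花秀 面霜', '雪地靴', '花茶']
# AND ['雪花秀 面霜']
```

To see the tokens a real index produces, the Elasticsearch _analyze API can be called with the index's analyzer and the search text.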
Parameter Meanings
Parameter | Description | Paraphrase |
---|---|---|
query | The actual query to be parsed | The query text that actually gets parsed |
default_field | The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index settings, which in turn defaults to *. * extracts all fields in the mapping that are eligible to term queries and filters the metadata fields. All extracted fields are then combined to build a query when no prefix field is provided | The field to search when the query does not name one. Defaults to * (all eligible fields) |
default_operator | The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR | Defaults to OR when unset, so capital of Hungary becomes capital OR of OR Hungary. With AND it becomes capital AND of AND Hungary |
analyzer | The analyzer name used to analyze the query string | Name of the analyzer applied to the query text |
quote_analyzer | The name of the analyzer that is used to analyze quoted phrases in the query string. For those parts, it overrides other analyzers that are set using the analyzer parameter or the search_quote_analyzer setting | Analyzer for quoted parts of the query. It overrides analyzers set elsewhere and has the highest priority |
fuzziness | Set the fuzziness for fuzzy queries. Defaults to AUTO | Sets the rule for fuzzy matching |
minimum_should_match | A value controlling how many “should” clauses in the resulting boolean query should match. It can be an absolute value (2), a percentage (30%) or a combination of both | Controls the minimum number of clauses that must match. It can be a number, a percentage, or a mix of both |
The table above lists only the parameters I have used. For the rest, see the official Elasticsearch 6.4 documentation.
We do not want a hit on any single token to count as a match, so we can set default_operator to AND, which requires every token to match. The query becomes:
{
  "size": 50,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "xxxx": {
              "query": "yyyy",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "手",
            "fields": [
              "brand.productBrandDetails.brandName^0.0",
              "detail.spuName^0.0"
            ],
            "use_dis_max": true,
            "tie_breaker": 0,
            "default_operator": "and",
            "auto_generate_phrase_queries": false,
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "split_on_whitespace": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [],
    "excludes": []
  },
  "sort": []
}
With that, the problem was solved.
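If the request is assembled in code rather than written by hand, the relevant part can be sketched as below. This is a minimal Python sketch, not the project's actual implementation: the helper names build_query_string and build_search_body are hypothetical, the "match" clause and most query_string options from the request above are left out for brevity, and the two field names are taken from that request without their boost suffixes.

```python
import json

def build_query_string(text, fields, default_operator="AND"):
    """Build a query_string clause; only a few of its options are set here."""
    operator = default_operator.upper()
    if operator not in ("AND", "OR"):
        raise ValueError("default_operator must be AND or OR")
    return {
        "query_string": {
            "query": text,
            "fields": list(fields),
            "default_operator": operator,
        }
    }

def build_search_body(text, fields, size=50):
    """Wrap the clause in the same bool/must structure as the request above."""
    return {
        "size": size,
        "query": {"bool": {"must": [build_query_string(text, fields)]}},
    }

body = build_search_body(
    "雪花秀",
    ["brand.productBrandDetails.brandName", "detail.spuName"],
)
# ensure_ascii=False keeps the Chinese text readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of a search request. Keeping default_operator as a parameter makes it easy to compare AND and OR results for the same search text.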
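Pitfall
Setting default_operator to AND improves precision, but it also demands more of the search text. Overly specific input can return nothing: for example, "手机" (phone) matches dozens of records, but "大屏幕 手机" (large-screen phone) matches none. That hurts the user experience and still needs work. The direction I have in mind: a single word or phrase should not be split by the analyzer, but the input should still be split on whitespace, with each part matched on its own.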
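One possible way to try that direction is to split the user input on whitespace and wrap each part in double quotes, since query_string treats a quoted part as a phrase. The Python sketch below only builds the query text. The function name to_phrase_query is hypothetical, and this has not been verified against the index described here. How phrase matching behaves with the overlapping tokens that ik_max_word produces would need testing.

```python
def to_phrase_query(user_input, operator="AND"):
    """Turn whitespace-separated input into quoted phrases joined by an operator."""
    quoted = []
    for chunk in user_input.split():  # split() also handles full-width spaces
        # Inside a quoted phrase, backslashes and double quotes must be escaped.
        escaped = chunk.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return (" " + operator + " ").join(quoted)

print(to_phrase_query("大屏幕 手机"))        # "大屏幕" AND "手机"
print(to_phrase_query("大屏幕 手机", "OR"))  # "大屏幕" OR "手机"
```

The resulting string would go into the "query" field of the query_string clause. Whether the parts should be joined with AND or OR is left open here; OR returns more results at the cost of precision.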
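I have not found a complete solution yet. If anyone knows one, please leave a comment. I am new to programming and learned Java; with your help I could surely avoid a lot of pitfalls. Thanks in advance.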