1 Environment setup
- Create an index named item
PUT item
{
"mappings":{
"properties":{
"id":{
"type":"long"
},
"title":{
"type":"text",
"analyzer":"ik_max_word"
},
"content":{
"type":"text",
"analyzer":"ik_max_word"
},
"price":{
"type":"float"
},
"category":{
"type":"keyword"
}
}
}
}
- Insert the sample data
PUT item/_doc/1
{
"id":1,
"title":"小米手机",
"content":"手机中的性价比之王",
"price":1000.00,
"category":"手机"
}
PUT item/_doc/2
{
"id":2,
"title":"小米电视",
"content":"电视中的性价比之王",
"price":1005.00,
"category":"电视"
}
PUT item/_doc/3
{
"id":3,
"title":"华为电视盒子",
"content":"电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
"price":1005.00,
"category":"电视"
}
PUT item/_doc/4
{
"id":4,
"title":"海信冰箱",
"content":"食品保险冷冻首先农品",
"price":3005.00,
"category":"冰箱"
}
PUT item/_doc/5
{
"id":5,
"title":"华为手机",
"content":"首款5g手机",
"price":4005.00,
"category":"手机"
}
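2 Boolean query (bool)
The bool query combines other queries through must (AND), must_not (NOT), and should (OR):
- must: the clause must appear in matching documents, and it contributes to the score.
- filter: the clause must appear in matching documents, but its score is ignored (filter does not affect the score).
- should: the clause should appear in matching documents. If a bool query has no must or filter clause, a document must match at least one should clause. The minimum number of should clauses that have to match can be set with the minimum_should_match parameter.
- must_not: the clause must not appear in matching documents.
The bool query takes a "more matches is better" approach: the scores of all matching must and should clauses are added together to give each document its final score (_score).
Demo
For example, search for phones priced between 1000 and 20000, where 5G support is optional and the brand is Huawei: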
GET item/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "手机"
}
},
{
"range": {
"price": {
"gte": 1000,
"lte": 20000
}
}
}
],
"should": [
{
"match": {
"content": "5g"
}
}
],
"filter": {
"term": {
"title": "华为"
}
}
}
}
}
If we remove the filter:
GET item/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "手机"
}
},
{
"range": {
"price": {
"gte": 1000,
"lte": 20000
}
}
}
],
"should": [
{
"match": {
"content": "5g"
}
}
]
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 7.009481,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 7.009481,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.8754687,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
}
}
]
}
}
In both queries the Huawei phone (document 5) gets the same score (_score), 7.009481, which shows that filter does not affect the score.
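To make that observation concrete, here is a minimal Python sketch of how a bool query combines its clauses. It is a simplified model, not Elasticsearch's actual implementation: `bool_score` is a hypothetical helper, and the split of document 5's score into a must part and a should part is derived by subtraction from the response above (document 1 matches only the two must clauses, so its score is taken as the must part).

```python
import math
from typing import List, Optional


def bool_score(must_scores: List[float],
               should_scores: List[float],
               filter_ok: bool = True,
               must_not_hit: bool = False) -> Optional[float]:
    """Simplified model of bool scoring.

    Returns None when the document is excluded (a filter failed or a
    must_not clause matched); otherwise the sum of the must and should
    scores. filter and must_not only gate the document in or out.
    """
    if not filter_ok or must_not_hit:
        return None
    return sum(must_scores) + sum(should_scores)


# Assumption: the must clauses score the same for documents 1 and 5,
# so document 1's total is used as the must part of document 5.
MUST_PART = 1.8754687
SHOULD_PART = 7.009481 - MUST_PART  # derived by subtraction, not read from ES

# Document 5 (title contains 华为): passes the filter.
with_filter = bool_score([MUST_PART], [SHOULD_PART], filter_ok=True)
# Same document when the query has no filter clause at all.
without_filter = bool_score([MUST_PART], [SHOULD_PART])
# Document 1 (小米手机) fails the title filter and drops out of the hits.
excluded = bool_score([MUST_PART], [], filter_ok=False)

print(with_filter, without_filter, excluded)
```

In this model the filter decides only whether a document is in the result set; the score comes from must and should alone, which matches what the two queries show.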
3 Best-matching field
3.1 Introduction
PUT /my_index/_doc/1
{
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
PUT /my_index/_doc/2
{
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
Let's run the following bool query:
GET my_index/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
{
"took" : 544,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.90425634,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.90425634,
"_source" : {
"title" : "Quick brown rabbits",
"body" : "Brown rabbits are commonly seen."
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.77041256,
"_source" : {
"title" : "Keeping pets healthy",
"body" : "My quick brown fox eats rabbits on a regular basis."
}
}
]
}
}
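Judging by the scores, document 1 ranks above document 2. Yet for the search Brown fox, document 2 looks like the better match: its body field contains both brown and fox.
Here is how the bool query arrives at its score:
- It runs both queries in the should clause.
- It adds together the scores of the clauses that matched.
- Versions before 6.0 then multiplied the sum by the number of matching clauses and divided it by the total number of clauses (the coordination factor). The responses in this article come from a newer version that skips this step, so the sum is the final score.
Document 1 contains brown in both fields, so both match clauses succeed and their scores are added together. Document 2 contains both brown and fox in body but none of the search terms in title, so it gets a high score for body plus zero for title. Its total, 0.77041256, is still lower than document 1's combined 0.90425634.
In this example the title and body fields compete with each other. What we want is the single best-matching field.
What if, instead of combining the scores from each field, we used the score of the best-matching clause as the overall score of the query? That would favor a field that contains both of the words we are looking for over the same word repeated across different fields.
3.2 dis_max query
Instead of the bool query, we can use the dis_max query (Disjunction Max Query). Disjunction means "OR" (while conjunction means "AND"), so the Disjunction Max Query returns documents that match any of the queries, and the score is that of the best-matching query: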
GET my_index/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.77041256,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.77041256,
"_source" : {
"title" : "Keeping pets healthy",
"body" : "My quick brown fox eats rabbits on a regular basis."
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6931472,
"_source" : {
"title" : "Quick brown rabbits",
"body" : "Brown rabbits are commonly seen."
}
}
]
}
}
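The difference between the two responses can be reproduced with a few lines of Python. This is only an arithmetic sketch: the per-clause scores below are not returned by Elasticsearch directly. For document 1, the larger value is its dis_max score and the smaller one is derived by subtracting it from the bool score; for document 2, only one clause matches, so the other is 0.

```python
# Per-clause scores for the "Brown fox" search (two match clauses each).
# Assumption: values are reconstructed from the bool and dis_max responses
# above, not read from an explain call.
clause_scores = {
    "1": [0.6931472, 0.21110914],  # both clauses match on "brown"
    "2": [0.0, 0.77041256],        # only the body clause matches
}


def bool_should_score(scores):
    """bool/should: add up the scores of all clauses."""
    return sum(scores)


def dis_max_score(scores):
    """dis_max without tie_breaker: keep only the best clause."""
    return max(scores)


def rank(score_fn):
    """Return document ids ordered by descending score."""
    return sorted(clause_scores,
                  key=lambda doc_id: score_fn(clause_scores[doc_id]),
                  reverse=True)


bool_ranking = rank(bool_should_score)
dis_max_ranking = rank(dis_max_score)

print("bool:   ", bool_ranking)     # ['1', '2']
print("dis_max:", dis_max_ranking)  # ['2', '1']
```

Summing rewards document 1 for matching the same word in two fields, while taking the maximum lets document 2's single strong field win.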
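3.3 tie_breaker parameter
What happens if we search for "quick pets"? Both documents contain the word quick, but only document 2 contains the word pets. Neither document contains both words in the same field:
- quick: the title field of document 1 contains quick; the body field of document 2 contains quick.
- pets: document 1 does not contain pets; the title field of document 2 contains pets.
A simple dis_max query like the one below picks the query clause with the best-matching field and ignores the scores of the other clauses: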
GET my_index/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
]
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6931472,
"_source" : {
"title" : "Quick brown rabbits",
"body" : "Brown rabbits are commonly seen."
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"title" : "Keeping pets healthy",
"body" : "My quick brown fox eats rabbits on a regular basis."
}
}
]
}
}
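As you can see, the two documents have exactly the same score.
We would expect the document that matches in both the title and body fields to rank higher, but that is not what happens. Remember: the dis_max query simply uses the _score of the best-matching clause.
Use the tie_breaker parameter to take the other matching clauses into account: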
GET my_index/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.3
}
}
}
{
"hits": [
{
"_id": "2",
"_score": 0.14757764,
"_source": {
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
},
{
"_id": "1",
"_score": 0.124275915,
"_source": {
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
}
]
}
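Now document 2 scores slightly higher than document 1, which is closer to what we expect.
The tie_breaker parameter makes the dis_max query behave like a compromise between dis_max and bool. It changes the score calculation as follows:
- Take the _score of the best-matching clause.
- Multiply the score of every other matching clause by tie_breaker.
- Add everything up (and, in older versions, normalize the result).
With tie_breaker, all matching clauses count, but the best-matching clause counts the most.
tie_breaker is a floating-point value between 0 and 1. At 0, only the best-matching clause is used (translator's note: the same as a dis_max query without tie_breaker); at 1, all matching clauses count equally. The exact value should be tuned to your data and queries, but a reasonable value is close to 0 (for example, 0.1 to 0.4), so that it does not overwhelm the best-matching nature of dis_max.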
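The steps above can be sketched in Python as follows. This is an illustration only: it leaves out any normalization, `dis_max_score` is a hypothetical helper, and the 0.2 used for document 2's second clause is a made-up number (only the 0.6931472 best-clause score comes from the response shown earlier).

```python
import math
from typing import List


def dis_max_score(scores: List[float], tie_breaker: float = 0.0) -> float:
    """Best clause score plus tie_breaker times the other matching scores."""
    matching = [s for s in scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    return best + tie_breaker * others


# "Quick pets": document 1 matches in one field, document 2 in two.
doc1 = [0.6931472, 0.0]
doc2 = [0.6931472, 0.2]  # 0.2 is a hypothetical score for the weaker clause

for tb in (0.0, 0.3, 1.0):
    print(tb, dis_max_score(doc1, tb), dis_max_score(doc2, tb))
```

With tie_breaker at 0 the two documents tie; any positive value lets document 2's second matching field break the tie, and at 1 the result equals a plain sum of the clause scores.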
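4 Filtering (filter)
4.1 Getting started
Filtering inside a query
Every query affects the scoring and ranking of documents. If we need to filter the results and do not want the filter condition to affect the score, we should not express it as a query condition but use filter instead.
We have already seen this above: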
GET item/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "手机"
}
},
{
"range": {
"price": {
"gte": 1000,
"lte": 20000
}
}
}
],
"should": [
{
"match": {
"content": "5g"
}
}
],
"filter": {
"term": {
"title": "华为"
}
}
}
}
}
4.2 constant_score
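If a request consists only of filters, with no scoring query, and we do not want scoring, we can use constant_score in place of a bool query that has only a filter clause. Performance is identical, but the query becomes more concise and easier to read.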
GET item/_search
{
"query": {
"constant_score": {
"filter": {
"terms": {
"title": [
"冰箱",
"手机"
]
}
},
"boost": 1.2
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.2,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.2,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.2,
"_source" : {
"id" : 4,
"title" : "海信冰箱",
"content" : "食品保险冷冻首先农品",
"price" : 3005.0,
"category" : "冰箱"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.2,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
}
}
]
}
}
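For comparison, here is a small Python sketch that builds both request bodies from the same filter. The helper names are hypothetical and nothing is sent to a cluster; the bodies are only printed. One difference worth keeping in mind, as far as I know: a bool query with only a filter clause returns a _score of 0 for every hit, while constant_score returns the boost (1.2 in the response above).

```python
import json


def constant_score_body(filter_clause: dict, boost: float = 1.0) -> dict:
    """Request body for a constant_score query wrapping one filter."""
    return {"query": {"constant_score": {"filter": filter_clause,
                                         "boost": boost}}}


def bool_filter_body(filter_clause: dict) -> dict:
    """Equivalent filter-only bool query (no boost, no scoring clauses)."""
    return {"query": {"bool": {"filter": filter_clause}}}


# Same terms filter as in the example above.
terms_filter = {"terms": {"title": ["冰箱", "手机"]}}

cs_body = constant_score_body(terms_filter, boost=1.2)
bool_body = bool_filter_body(terms_filter)

print(json.dumps(cs_body, ensure_ascii=False, indent=2))
print(json.dumps(bool_body, ensure_ascii=False, indent=2))
```

Either body can be pasted after GET item/_search in the console.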
5 Highlighting
5.1 Getting started
Use highlight to mark the matched terms in the queried field:
GET item/_search
{
"query": {
"term": {
"title": "手机"
}
},
"highlight": {
"fields": {
"title":{}
}
}
}
{
"took" : 135,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.9395274,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9395274,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"小米<em>手机</em>"
]
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.9395274,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"华为<em>手机</em>" # <em> is the default tag
]
}
}
]
}
}
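5.2 Custom highlight tags
By default ES wraps matched keywords in <em> tags; pre_tags and post_tags override them. Note that the example below passes <strong> as post_tags as well, so the highlighted output ends with a second opening tag. In practice post_tags should be the closing tag, </strong>.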
GET item/_search
{
"query": {
"term": {
"title": "手机"
}
},
"highlight": {
"fields": {
"title": {
"pre_tags": ["<strong>"],
"post_tags": ["<strong>"]
}
}
}
}
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.9395274,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9395274,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"小米<strong>手机<strong>"
]
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.9395274,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"华为<strong>手机<strong>"
]
}
}
]
}
}
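The response above shows the effect of reusing the opening tag as post_tags. A small Python sketch of a helper that always pairs an opening tag with its closing tag might look like this. `highlight_config` is a hypothetical helper, and it sets the tags once at the top level of highlight rather than per field.

```python
import json
from typing import Iterable


def highlight_config(fields: Iterable[str], tag: str = "em") -> dict:
    """Build a highlight section whose post tag closes the pre tag."""
    return {
        "pre_tags": [f"<{tag}>"],
        "post_tags": [f"</{tag}>"],  # closing tag, not a second opening tag
        "fields": {name: {} for name in fields},
    }


body = {
    "query": {"term": {"title": "手机"}},
    "highlight": highlight_config(["title"], tag="strong"),
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```

Sent as the body of GET item/_search, this should produce fragments such as 小米<strong>手机</strong> instead of the unclosed form shown above.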
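5.3 Highlighting multiple fields
For example, when searching the title field we may also want matches in the content field to be highlighted. To do that, set require_field_match to false (it defaults to true):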
GET item/_search
{
"query": {
"term": {
"title": "手机"
}
},
"highlight": {
"require_field_match": "false",
"fields": {
"title": {
"pre_tags": ["<strong>"],
"post_tags": ["<strong>"]
},
"content": {}
}
}
}
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.9395274,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9395274,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"小米<strong>手机<strong>"
],
"content" : [
"<em>手机</em>中的性价比之王"
]
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 0.9395274,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
},
"highlight" : {
"title" : [
"华为<strong>手机<strong>"
],
"content" : [
"首款5g<em>手机</em>"
]
}
}
]
}
}
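5.4 Highlighter performance
ES provides three highlighters:
- highlighter (the default)
  - Has to re-analyze the original document stored in _source to produce highlights. It is the slowest, but it needs no extra storage.
- postings-highlighter
  - Does not re-analyze the original document stored in _source, but the field mapping must set index_options to offsets so that term offsets are stored.
- fast-vector-highlighter
  - The fastest. The field mapping must set term_vector to with_positions_offsets so that term positions and offsets are stored; it uses the most storage.
(These are the names used by older versions. Recent versions call the highlighter types plain, unified, and fvh, with unified as the default.)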
6 Sorting
6.1 Default sorting
ES sorts results by how relevant each document is to the query. By default that means descending _score:
GET item/_search
{
"query": {
"term": {
"title": "手机"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
}
]
}
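With match_all, which simply returns every document, no scoring is needed (every _score is 1.0), so in this single-shard example the hits come back in the order they were indexed: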
GET item/_search
{
"query": {
"match_all": {}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "item",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"title" : "小米手机",
"content" : "手机中的性价比之王",
"price" : 1000.0,
"category" : "手机"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 2,
"title" : "小米电视",
"content" : "电视中的性价比之王",
"price" : 1005.0,
"category" : "电视"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"id" : 3,
"title" : "华为电视盒子",
"content" : "电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
"price" : 1005.0,
"category" : "电视"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"id" : 4,
"title" : "海信冰箱",
"content" : "食品保险冷冻首先农品",
"price" : 3005.0,
"category" : "冰箱"
}
},
{
"_index" : "item",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"id" : 5,
"title" : "华为手机",
"content" : "首款5g手机",
"price" : 4005.0,
"category" : "手机"
}
}
]
}
}
6.2 Sorting on multiple fields
For example, sort by price ascending first, then by id descending:
GET item/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"price": {
"order": "asc"
}
},
{
"id": {
"order": "desc"
}
}
]
}
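To see what order this request should produce, the same two-level sort can be emulated in Python over the five sample documents. This is just a local sketch of the ordering rule, not a call to Elasticsearch; only the id and price fields are copied from the sample data.

```python
# id and price of the five sample documents inserted in section 1.
items = [
    {"id": 1, "price": 1000.0},
    {"id": 2, "price": 1005.0},
    {"id": 3, "price": 1005.0},
    {"id": 4, "price": 3005.0},
    {"id": 5, "price": 4005.0},
]

# price ascending first; ties on price are broken by id descending.
ordered = sorted(items, key=lambda doc: (doc["price"], -doc["id"]))
ids = [doc["id"] for doc in ordered]

print(ids)  # [1, 3, 2, 4, 5]
```

Documents 2 and 3 share the price 1005.0, so the second sort key decides between them and document 3 comes first.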