[ElasticSearch Basics] Advanced ES Queries: Full-Text Search with the Query DSL

Latest recommended article published on 2024-08-10 15:42:28

皮卡皮卡皮·

Category: ElasticSearch. Tags: full-text search, elasticsearch, spring boot

Article link: https://blog.csdn.net/qq_42890164/article/details/135584453

Query DSL: Full-Text Search

What Is Full-Text Search
1. Data Preparation
2. match query
3. multi_match query
4. match_phrase query
5. query_string query
6. simple_query_string

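What Is Full-Text Search

Unlike term-level queries, full-text queries are designed to search and match text data by relevance. They analyze the input text, split it into terms (individual words), and apply steps such as tokenization, stemming, and normalization.

Key characteristics of full-text search:

The input text is analyzed, and searching and matching are performed on the resulting terms.
Matching is relevance-based. A relevance algorithm determines how well each document matches the query, and results are sorted by that score. Relevance can be computed from term frequency, weights, and other factors.
Full-text queries are suited to fields that hold free text, such as a document's content, an article body, or a product description.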

1. Data Preparation

PUT full_index
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      },
      "description" : {
          "type" : "text",
          "analyzer": "ik_max_word",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
    }
  }
}

The test data is as follows:
{name=张三, description=北京故宫圆明园, age=11}
{name=王五, description=南京总统府, age=15}
{name=李四, description=北京市天安门广场, age=18}
{name=富贵, description=南京市中山陵, age=22}
{name=来福, description=山东济南趵突泉, age=8}
{name=憨憨, description=安徽黄山九华山, age=27}
{name=小七, description=上海东方明珠, age=31}

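2. match query

Match query: match analyzes the search keywords into terms and then looks up documents by those terms.

match supports the following parameters:

query: the value to match
operator: how the analyzed terms are combined
and: every term produced by analysis must match
or: at least one term produced by analysis must match (default)
minimum_should_match: the minimum degree of matching, that is, how many of the analyzed query terms a document must contain in the inverted index

DSL: find documents whose description field matches "南京总统府"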

GET  full_index/_search
{
  "query": {
    "match": {
      "description": "南京总统府"
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2667978,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.2667978,
        "_source" : {
          "name" : "王五",
          "age" : 15,
          "description" : "南京总统府"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0751815,
        "_source" : {
          "name" : "富贵",
          "age" : 22,
          "description" : "南京市中山陵"
        }
      }
    ]
  }
}

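Spring Boot implementation (the fields and methods below are members of the FullTextQuery class):

    import io.swagger.annotations.ApiOperation;
    import javax.annotation.Resource;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.SearchHits;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;

    private static final Logger LOGGER = LoggerFactory.getLogger(FullTextQuery.class);

    private static final String INDEX_NAME = "full_index";

    @Resource
    private RestHighLevelClient client;
    
    @RequestMapping(value = "/match_query", method = RequestMethod.GET)
    @ApiOperation(value = "DSL - match_query")
    public void match_query() throws Exception {
        // Build the search request for the target index
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
        // match query on the description field (the query text is analyzed)
        searchRequest.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("description","南京总统府")));
        // Print the returned hits
        printLog(client.search(searchRequest, RequestOptions.DEFAULT));
    }

    // Prints the number of hits followed by the _source of each hit
    private void printLog(SearchResponse searchResponse) {
        SearchHits hits = searchResponse.getHits();
        System.out.println("返回hits数组长度:" + hits.getHits().length);
        for (SearchHit hit: hits.getHits()) {
            System.out.println(hit.getSourceAsMap().toString());
        }
    }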
    
The output is as follows:
返回hits数组长度:2
{name=王五, description=南京总统府, age=15}
{name=富贵, description=南京市中山陵, age=22}

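Analysis: searching for "南京总统府" returns two documents. Why was "南京市中山陵" matched as well?
The reason is that full-text search splits the search text into terms. The description field was mapped with the "ik_max_word" analyzer when the index was created, and that analyzer splits "南京总统府" into the following terms before looking them up in the inverted index: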

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["南京总统府"]
}

{
  "tokens" : [
    {
      "token" : "南京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "总统府",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "总统",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "府",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}

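One of these terms is "南京". Because the default operator is or, a single shared term is enough, so searching for "南京总统府" also finds the "南京市中山陵" document. The operator parameter of match follows directly from this: with and, a document must contain every term produced by analysis.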

For example, insert one more document:
POST /full_index/_bulk
{"index":{"_id":8}}
{"name":"张三","age":11,"description":"南京总统"}

Searching for "南京总统" with operator set to and returns two documents:
GET  full_index/_search
{
  "query": {
    "match": {
      "description": {
        "query": "南京总统",
        "operator": "and"
      }
    }
  }
}
The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.898355,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 2.898355,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "南京总统"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.35562,
        "_source" : {
          "name" : "王五",
          "age" : 15,
          "description" : "南京总统府"
        }
      }
    ]
  }
}

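However, searching for "南京总统府" with operator and returns only one document, because analysis produces terms such as "总统府" and "府" that do not exist in the "南京总统" document.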
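
The operator behavior described above can be sketched as a toy model in plain Java. This is not Elasticsearch code: the class and method names are hypothetical, the query terms are copied from the _analyze output shown earlier, and the indexed terms assumed for document 8 ("南京" and "总统") are an assumption consistent with the results above. The integer form of minimum_should_match is modeled as a simple count.

```java
import java.util.List;
import java.util.Set;

// Toy model of how a match query combines analyzed terms; not Elasticsearch code.
final class MatchOperatorSketch {

    // Counts how many query terms occur among the document's indexed terms.
    static long matchedTerms(Set<String> docTerms, List<String> queryTerms) {
        return queryTerms.stream().filter(docTerms::contains).count();
    }

    // operator "or": at least one query term must be present.
    static boolean matchesOr(Set<String> docTerms, List<String> queryTerms) {
        return matchedTerms(docTerms, queryTerms) >= 1;
    }

    // operator "and": every query term must be present.
    static boolean matchesAnd(Set<String> docTerms, List<String> queryTerms) {
        return matchedTerms(docTerms, queryTerms) == queryTerms.size();
    }

    // minimum_should_match as a plain integer: at least `minimum` terms must be present.
    static boolean matchesAtLeast(Set<String> docTerms, List<String> queryTerms, int minimum) {
        return matchedTerms(docTerms, queryTerms) >= minimum;
    }

    public static void main(String[] args) {
        // Terms from the _analyze output for "南京总统府".
        List<String> query = List.of("南京", "总统府", "总统", "府");
        // Document 2 ("南京总统府") contains the same terms as the query.
        Set<String> doc2 = Set.of("南京", "总统府", "总统", "府");
        // Assumed indexed terms for document 8 ("南京总统").
        Set<String> doc8 = Set.of("南京", "总统");

        System.out.println(matchesOr(doc8, query));         // true
        System.out.println(matchesAnd(doc8, query));        // false
        System.out.println(matchesAnd(doc2, query));        // true
        System.out.println(matchesAtLeast(doc8, query, 2)); // true
    }
}
```

In this model, document 8 drops out under and because two of the four query terms are missing from it, which mirrors the single-hit result described above.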

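3. multi_match query

Multi-field query: whether the query text is analyzed depends on the type of each field, and the highest-scoring documents come first.
Note: for an analyzed field, the query text is analyzed into terms before searching; for a non-analyzed field, the query text is searched as a whole.

DSL: find documents whose "name" or "description" field matches "北京王五"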

GET  full_index/_search
{
  "query": {
    "multi_match": {
      "query": "北京王五",
      "fields": ["name","description"]
    }
  }
}

The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.583519,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.583519,
        "_source" : {
          "name" : "王五",
          "age" : 15,
          "description" : "南京总统府"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.4959542,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.98645234,
        "_source" : {
          "name" : "李四",
          "age" : 18,
          "description" : "北京市天安门广场"
        }
      }
    ]
  }
}

Spring Boot implementation:

    @RequestMapping(value = "/multi_match", method = RequestMethod.GET)
    @ApiOperation(value = "DSL - multi_match")
    public void multi_match() throws Exception {
        // Build the search request for the target index
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
        // multi_match query over the name and description fields
        searchRequest.source(new SearchSourceBuilder().query(
                QueryBuilders.multiMatchQuery("北京王五", new String[]{"name","description"})));
        // Print the returned hits
        printLog(client.search(searchRequest, RequestOptions.DEFAULT));
    }

The output is as follows:
返回hits数组长度:3
{name=王五, description=南京总统府, age=15}
{name=张三, description=北京故宫圆明园, age=11}
{name=李四, description=北京市天安门广场, age=18}

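As noted above, for an analyzed field the query text is analyzed into terms before searching, while for a non-analyzed field it is searched as a whole.
Let's test this by querying "description" without analysis: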

GET  full_index/_search
{
  "query": {
    "multi_match": {
      "query": "北京王五",
      "fields": ["name","description.keyword"]
    }
  }
}
The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.583519,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.583519,
        "_source" : {
          "name" : "王五",
          "age" : 15,
          "description" : "南京总统府"
        }
      }
    ]
  }
}

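As shown, when "description.keyword" is used, that is, when "description" is not analyzed, only one document is returned. It matches solely because its "name" field, "王五", matches the analyzed terms of the query.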
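
The difference between the analyzed "name" field and the non-analyzed "description.keyword" field can be sketched in plain Java. This is a toy model with hypothetical names, not Elasticsearch code. It assumes that "name" uses the default standard analyzer (the mapping sets no analyzer for it), which emits one term per Chinese character, and it models the keyword sub-field as a whole-string comparison.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model of multi_match over ["name", "description.keyword"]; not Elasticsearch code.
final class MultiMatchFieldSketch {

    // Stand-in for analysis of a text field: one term per character (assumed to
    // mirror how the standard analyzer handles Chinese text).
    static List<String> analyzeText(String value) {
        return value.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    // A text field matches when any analyzed query term is among the field's terms.
    static boolean textFieldMatches(String fieldValue, String query) {
        List<String> fieldTerms = analyzeText(fieldValue);
        return analyzeText(query).stream().anyMatch(fieldTerms::contains);
    }

    // A keyword field is not analyzed: the whole query must equal the whole value.
    static boolean keywordFieldMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    // A hit on either field is enough for the document to match.
    static boolean matches(String name, String description, String query) {
        return textFieldMatches(name, query) || keywordFieldMatches(description, query);
    }

    public static void main(String[] args) {
        String query = "北京王五";
        System.out.println(matches("王五", "南京总统府", query));       // true: name shares 王 and 五
        System.out.println(matches("张三", "北京故宫圆明园", query));   // false
        System.out.println(matches("李四", "北京市天安门广场", query)); // false
    }
}
```

In this model no description equals the full string "北京王五", so only the document whose name shares characters with the query remains, which is consistent with the single hit above.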

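4. match_phrase query

A phrase search (match_phrase) analyzes the search text, then looks up each resulting term in the index and requires the terms to be adjacent. The slop parameter sets the maximum gap allowed between terms.

DSL: find documents whose "description" field contains the phrase "北京故宫"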

GET  full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京故宫"
      }
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.5884824,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 3.5884824,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      }
    ]
  }
}

Spring Boot implementation:

    @RequestMapping(value = "/match_phrase", method = RequestMethod.GET)
    @ApiOperation(value = "DSL - match_phrase")
    public void match_phrase() throws Exception {
        // Build the search request for the target index
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
        // match_phrase query on the description field
        searchRequest.source(new SearchSourceBuilder().query(
                QueryBuilders.matchPhraseQuery("description","北京故宫")));
        // Print the returned hits
        printLog(client.search(searchRequest, RequestOptions.DEFAULT));
    }

The output is as follows:
返回hits数组长度:1
{name=张三, description=北京故宫圆明园, age=11}

Question: searching the "description" field for "北京故宫" returns a result, so why does searching for "北京圆明园" return nothing?

GET  full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京圆明园"
      }
    }
  }
}
The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Cause: first look at how "北京故宫圆明园" is analyzed:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["北京故宫圆明园"]
}

{
  "tokens" : [
    {
      "token" : "北京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "故宫",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "圆明园",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

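As shown, "北京" and "圆明园" are not adjacent terms; one term sits between them. This is where "slop" comes in: the slop parameter tells match_phrase how far apart terms may be while the document still counts as a match.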

GET  full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京圆明园",
        "slop": 1
      }
    }
  }
}
The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.4425511,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.4425511,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      }
    ]
  }
}
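
The effect of slop can be sketched with term positions in plain Java. This is a simplified model with hypothetical names, not the matcher Elasticsearch actually runs: it assumes each term occurs once in the document and that the query "北京圆明园" is analyzed into the two terms "北京" and "圆明园". For each query term it computes the shift between its document position and its query position, and accepts the document when the spread of those shifts does not exceed slop.

```java
import java.util.List;
import java.util.Map;

// Simplified model of phrase matching with slop; not Elasticsearch code.
final class PhraseSlopSketch {

    // docPositions: term -> position in the document (each term assumed to occur once).
    // queryTerms: analyzed query terms in order; the list index is the query position.
    static boolean phraseMatches(Map<String, Integer> docPositions,
                                 List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int q = 0; q < queryTerms.size(); q++) {
            Integer p = docPositions.get(queryTerms.get(q));
            if (p == null) {
                return false; // every phrase term must be present
            }
            int shift = p - q;
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        // All terms line up exactly when every shift is equal (spread 0).
        return maxShift - minShift <= slop;
    }

    public static void main(String[] args) {
        // Positions from the _analyze output for "北京故宫圆明园".
        Map<String, Integer> doc = Map.of("北京", 0, "故宫", 1, "圆明园", 2);

        System.out.println(phraseMatches(doc, List.of("北京", "故宫"), 0));   // true
        System.out.println(phraseMatches(doc, List.of("北京", "圆明园"), 0)); // false
        System.out.println(phraseMatches(doc, List.of("北京", "圆明园"), 1)); // true
    }
}
```

With slop 0 the pair "北京" and "圆明园" is rejected because one position separates them; raising slop to 1 accepts the document, matching the responses above.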

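5. query_string query

Lets you specify AND | OR | NOT conditions in a single query string and, like multi_match query, supports multi-field search. It is similar to match, but match requires a field name, whereas query_string searches all fields when no field is specified, so its scope is broader.
Note: for an analyzed field, the query text is analyzed before searching; for a non-analyzed field, the query text is searched without analysis.

DSL: search all fields of the index for documents matching "安徽张三"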

GET  full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽张三"
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.5618675,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "南京总统"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.7342355,
        "_source" : {
          "name" : "憨憨",
          "age" : 27,
          "description" : "安徽黄山九华山"
        }
      }
    ]
  }
}

Spring Boot implementation:

    @RequestMapping(value = "/query_string", method = RequestMethod.GET)
    @ApiOperation(value = "DSL - query_string")
    public void query_string() throws Exception {
        // Build the search request for the target index
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
        // query_string query without explicit fields
        searchRequest.source(new SearchSourceBuilder().query(
                QueryBuilders.queryStringQuery("安徽张三")));
        // Print the returned hits
        printLog(client.search(searchRequest, RequestOptions.DEFAULT));
    }

返回hits数组长度:3
{name=张三, description=北京故宫圆明园, age=11}
{name=张三, description=南京总统, age=11}
{name=憨憨, description=安徽黄山九华山, age=27}

Query a specific field: documents whose "description" field matches "安徽张三"

GET  full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽张三",
      "fields": ["description"]
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.7342355,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.7342355,
        "_source" : {
          "name" : "憨憨",
          "age" : 27,
          "description" : "安徽黄山九华山"
        }
      }
    ]
  }
}

Query multiple fields: both "安徽" and "憨憨" must match

GET  full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽 AND 憨憨",
      "fields": ["description","name"]
    }
  }
}

The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.6615744,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 6.6615744,
        "_source" : {
          "name" : "憨憨",
          "age" : 27,
          "description" : "安徽黄山九华山"
        }
      }
    ]
  }
}

AND and OR can also be combined, with parentheses for grouping:

GET  full_index/_search
{
  "query": {
    "query_string": {
      "query": "(安徽 AND 憨憨) OR 张三",
      "fields": ["description","name"]
    }
  }
}
The response is as follows:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 6.6615744,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 6.6615744,
        "_source" : {
          "name" : "憨憨",
          "age" : 27,
          "description" : "安徽黄山九华山"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "南京总统"
        }
      }
    ]
  }
}

The query_string query behaves like a match query combined with a multi_match multi-field query.
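
The boolean grouping in "(安徽 AND 憨憨) OR 张三" can be sketched with composed predicates in plain Java. This is a toy model with hypothetical names, not Elasticsearch code. In particular, a term is treated as matching when either field contains it as a substring, which is only a crude stand-in for analysis and inverted-index lookup.

```java
import java.util.function.Predicate;

// Toy model of boolean operators in a query_string query; not Elasticsearch code.
final class QueryStringBooleanSketch {

    // Minimal document holding the two fields searched in the article.
    static final class Doc {
        final String name;
        final String description;

        Doc(String name, String description) {
            this.name = name;
            this.description = description;
        }
    }

    // Crude stand-in for a term match: substring check on either field.
    static Predicate<Doc> term(String text) {
        return doc -> doc.name.contains(text) || doc.description.contains(text);
    }

    // Equivalent of "(安徽 AND 憨憨) OR 张三" over the name and description fields.
    static Predicate<Doc> exampleQuery() {
        return term("安徽").and(term("憨憨")).or(term("张三"));
    }

    public static void main(String[] args) {
        Predicate<Doc> query = exampleQuery();
        System.out.println(query.test(new Doc("憨憨", "安徽黄山九华山"))); // true
        System.out.println(query.test(new Doc("张三", "南京总统")));       // true
        System.out.println(query.test(new Doc("王五", "南京总统府")));     // false
    }
}
```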

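6. simple_query_string

Similar to query_string, but it ignores invalid syntax and supports only part of the query syntax. AND, OR, and NOT are not supported as operators and are treated as plain text. The following operators are supported instead:

"+" replaces "AND"
"|" replaces "OR"
"-" replaces "NOT"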

GET full_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "(安徽 + 憨憨) | 张三",
      "fields": ["description","name"]
    }
  }
}

The response is as follows:
{
  "took" : 41,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 6.6615744,
    "hits" : [
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 6.6615744,
        "_source" : {
          "name" : "憨憨",
          "age" : 27,
          "description" : "安徽黄山九华山"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "北京故宫圆明园"
        }
      },
      {
        "_index" : "full_index",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 2.5618675,
        "_source" : {
          "name" : "张三",
          "age" : 11,
          "description" : "南京总统"
        }
      }
    ]
  }
}
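
The operator mapping listed in this section can be illustrated with a small string rewrite in plain Java. This is only a sketch with hypothetical names: it handles upper-case, whitespace-delimited AND / OR / NOT keywords and nothing else, and it is not part of any Elasticsearch API.

```java
// Toy rewrite from query_string keywords to simple_query_string operators.
final class SimpleQuerySyntaxSketch {

    // Replaces whitespace-delimited AND / OR with + / | and a leading NOT with -.
    static String toSimpleSyntax(String queryString) {
        return queryString
                .replaceAll("\\s+AND\\s+", " + ")
                .replaceAll("\\s+OR\\s+", " | ")
                .replaceAll("\\bNOT\\s+", "-");
    }

    public static void main(String[] args) {
        // Prints: (安徽 + 憨憨) | 张三
        System.out.println(toSimpleSyntax("(安徽 AND 憨憨) OR 张三"));
    }
}
```

The rewritten string is the same one used in the simple_query_string request above, which returned the same three documents as the query_string version.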