Elasticsearch (DSL)

三小姐YY

Last modified 2024-06-27 18:15:55

First published 2024-05-31 18:07:31

Article link: https://blog.csdn.net/qq_35720068/article/details/139356755

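Forward index and inverted index

Forward index: given an id, look up the content.

Inverted index: given a piece of content, look up the ids of the documents that contain it.

Example:

A forward index goes from the title 《静夜思》 to the full text of the poem.

An inverted index goes from the term “明月” to 《静夜思》, 《望月怀古》, 《关山月》, and so on.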
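
The two directions can be sketched in a few lines of Python. This is only a toy model, not how Elasticsearch stores data: the token lists are hand-split, illustrative placeholders rather than the output of any real analyzer, and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Forward index: document id -> tokens of its content.
# The token lists are illustrative placeholders, split by hand.
forward_index = {
    "静夜思": ["床前", "明月", "光"],
    "望月怀古": ["海上", "生", "明月"],
    "关山月": ["明月", "出", "天山"],
}


def build_inverted_index(forward):
    """Invert id -> tokens into token -> set of ids."""
    inverted = defaultdict(set)
    for doc_id, tokens in forward.items():
        for token in tokens:
            inverted[token].add(doc_id)
    return inverted


inverted_index = build_inverted_index(forward_index)

# Forward lookup: id -> content.
print(forward_index["静夜思"])
# Inverted lookup: term -> ids of the documents containing it.
print(sorted(inverted_index["明月"]))
```

A search for “明月” then becomes a single dictionary lookup instead of a scan over every document.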

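GET queries

Basic index information:

GET your_index/_mapping // much like viewing the columns of a MySQL table; shows each field's type, e.g. keyword or text
GET your_index/_alias // show the index's aliases
GET /_cat/health?v // show cluster health
GET _cat/indices // list all indices
GET _cat/shards/your_index // list the shards of the given index; each shard is either a primary (p) or a replica (r)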

Querying index contents:

match_all


GET /your_index/_search
{
  "query": {
    "match_all": {}
  }
}

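bool

The bool query is a powerful and widely used compound query that combines multiple query clauses. It has four kinds of clauses:

must: the clause must match the document, and it contributes to the score. Multiple must clauses combine like AND in SQL.
filter: the clause must match the document but does not affect scoring; it only filters documents.
should: the clause may match the document, and each matching should clause raises the score. When the bool query has no must or filter clause, at least one should clause must match, like OR in SQL; otherwise should clauses are optional unless minimum_should_match is set.
must_not: the clause must not match the document, like NOT in SQL.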
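
The matching rules of the four clauses can be modelled with plain Python predicates. This is a minimal sketch under stated assumptions: it ignores scoring entirely, each clause is a hypothetical function `doc -> bool`, and the default for `minimum_should_match` (1 when there is no must or filter clause, otherwise 0) follows the documented Elasticsearch behavior.

```python
def bool_matches(doc, must=(), filter_=(), should=(), must_not=(),
                 minimum_should_match=None):
    """Return True if doc satisfies the bool query (scoring is ignored)."""
    if minimum_should_match is None:
        # Default: should clauses are required only when they stand alone.
        minimum_should_match = 1 if should and not (must or filter_) else 0
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filter_):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched = sum(1 for clause in should if clause(doc))
    return matched >= minimum_should_match


# Hypothetical document and clauses, for illustration only.
doc = {"name": "林俊凯", "fans_num": 900, "tag": [1010]}
name_is = lambda d: d["name"] == "林俊凯"
many_fans = lambda d: d["fans_num"] >= 800
has_tag_1013 = lambda d: 1013 in d["tag"]

print(bool_matches(doc, must=[name_is], should=[has_tag_1013]))
print(bool_matches(doc, should=[has_tag_1013]))
print(bool_matches(doc, should=[has_tag_1013, many_fans]))
print(bool_matches(doc, must=[name_is], must_not=[many_fans]))
```

Nesting works the same way: a nested bool query is just another predicate passed into must or should.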

Example:

GET your_index/_search
{
  "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "term": {
                                    "name": {
                                        "value": "林俊凯",
                                        "boost": 1
                                    }
                                }
                            },
                            {
                                "term": {
                                    "zh_name": {
                                        "value": "林俊凯",
                                        "boost": 1
                                    }
                                }
                            }
                        ]
                    }
                },
                {
                    "bool": {
                        "should": [
                            {
                                "range": {
                                    "fans_num": {
                                        "gte": 800
                                    }
                                }
                            },
                            {
                                "terms": {
                                    "tag": [
                                        1010,
                                        1013
                                    ]
                                }
                            }
                        ]
                    }
                }
            ]
        }
    },
    "sort": [
        { "_score": { "order": "desc" } },
        { "score": { "order": "desc" } }
    ]
}

range

"range": {
  "fans_num": {
    "gte": 800,
    "lte": 126334
  }
}

gte: greater than or equal to; lte: less than or equal to.

term

The query value is not analyzed; term performs an exact match against the whole value.

GET your_index_search/_search
{
  "query": {
    "term": {
      "name": {
        "value": "天空"
      }
    }
  }
}

terms

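The query values are not analyzed; a document matches if it contains any one of the listed values, not necessarily all of them.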

GET your_index_search/_search
{
  "query": {
    "terms": {
      "tag": [
        "美食",
        "购物"
      ]
    }
  }
}

prefix

Prefix matching: the query value is not analyzed and must exactly match the beginning of the field value.

GET your_index_search/_search
{
  "query": {
    "prefix": {
      "name_full": {
        "value": "林俊"
      }
    }
  }
}

Both 林俊凯 and 林俊xxx match.

multi_match

The query text is analyzed (tokenized).

GET your_index_search/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "北京景点",
            "fields": [
              "name",
              "name_full",
              "name_lower"
            ],
            "analyzer": "ik",
            "minimum_should_match": "3<80%"
          }
        }
      ]
    }
  }
}

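Here "minimum_should_match": "3<80%" specifies the following rule:

If the query produces 3 tokens or fewer, all of them must match.
If the query produces more than 3 tokens, at least 80% of them must match.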
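
The arithmetic behind this rule can be sketched as follows. `required_matches` is a hypothetical helper that handles only a single `N<P%` spec with a positive percentage, and it assumes the percentage is rounded down, as the Elasticsearch documentation for minimum_should_match describes.

```python
def required_matches(token_count, spec="3<80%"):
    """How many tokens must match under a single 'N<P%' rule."""
    threshold_text, _, percent_text = spec.partition("<")
    threshold = int(threshold_text)
    percent = int(percent_text.rstrip("%"))
    if token_count <= threshold:
        # At or below the threshold, every token is required.
        return token_count
    # Above the threshold, the percentage applies, rounded down.
    return token_count * percent // 100


for n in (2, 3, 4, 5, 10):
    print(n, required_matches(n))
```

With 2 tokens both are required, while with 5 tokens 4 are enough and one may be missing.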

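Here “analyzer” names the analyzer used to tokenize the query text; common ones include ik, ik_smart, standard, and mla. The _analyze API shows how an analyzer tokenizes a given text: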

GET _analyze
{
  "analyzer":"mla",
  "text":"北京景点"
}

// Result:
{
  "tokens": [
    {
      "token": "北京",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "景点",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

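Example: “北京景点” is tokenized into [北京, 景点]. That is 2 tokens, fewer than 3, so both 北京 and 景点 must match. multi_match looks for the query tokens in several fields, but with the default type (best_fields) the minimum_should_match rule is applied to each field separately, so both tokens have to match within one field. For 北京 to match in name and 景点 to match in name_lower and still count as a hit, the query needs "type": "cross_fields", which treats the listed fields as one combined field.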

Note ⚠️: text that is entirely English is tokenized on whitespace.

match

match also analyzes the query text. multi_match searches across several fields; match targets a single field.

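// Build a match query on the "name" field with the Elastica PHP client.
// In PHP 8+, `match` is a reserved keyword, so newer Elastica releases
// name this class \Elastica\Query\MatchQuery.
$match = new \Elastica\Query\Match();
$match->setField("name", array(
    "query" => strtolower($params['keyword']), // lowercase the user's keyword
    "fuzziness" => "AUTO",                     // tolerate small typos
    "operator" => "or",                        // any one token may match
    "analyzer" => "ik_max_word"                // analyzer applied to the query text
));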

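match_phrase

The query text is analyzed here too (some people assume it is not ❌). match_phrase is strict: every token produced from the query must match, and the tokens must appear in the matched document in the same order as in the query.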

GET your_index_search/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "name": "孤独的战士"
          }
        }
      ]
    }
  }
}

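Take name = “孤独的战士”, which ik_max_word tokenizes into [孤独的, 孤独, 战士]. Can a document in ES with name = “孤独星球的战士” be recalled? Look at its tokens: [孤独, 星球, 战士]. The token “孤独的” has no match, so the document is not recalled.

Now suppose “孤独的战士” were tokenized into [孤独, 战士] instead. Could “孤独星球的战士” be recalled then? Yes, by setting slop, which bounds how far apart the tokens may be; it defaults to 0.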
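
A simplified model of how slop is measured, using the token lists from the example above. This is a sketch under assumptions, not Lucene's implementation: `min_slop` is a hypothetical helper, token positions are taken to be consecutive list indexes, and repeated query terms are not treated specially. The request body at the end shows where slop goes in a match_phrase query.

```python
from itertools import product


def min_slop(query_tokens, doc_tokens):
    """Smallest slop at which the phrase matches, or None if a token is absent."""
    candidates = []
    for token in query_tokens:
        positions = [i for i, t in enumerate(doc_tokens) if t == token]
        if not positions:
            return None
        candidates.append(positions)
    best = None
    for combo in product(*candidates):
        # Shift each position by the token's offset in the query; a perfect
        # phrase shifts every token to the same value.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        needed = max(shifted) - min(shifted)
        if best is None or needed < best:
            best = needed
    return best


doc_tokens = ["孤独", "星球", "战士"]
print(min_slop(["孤独", "战士"], doc_tokens))
print(min_slop(["孤独的", "孤独", "战士"], doc_tokens))

# Request body with slop set on match_phrase.
body = {"query": {"match_phrase": {"name": {"query": "孤独的战士", "slop": 1}}}}
```

In this model one intervening token costs one unit of slop, and swapping two adjacent tokens costs two.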

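Why can match find a document that term cannot?

First, check the field's type in the mapping defined when the index was created. A keyword field is not analyzed; its value is indexed as a single term.

Second, if the field turns out to be text, its value was tokenized at index time. A term query compares the unanalyzed search string with those tokens, so searching for the full original value usually finds nothing. Run term against a keyword field instead, such as the keyword_full.keyword sub-field in this mapping:

"keyword_full": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
},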
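
The difference can be reproduced with a toy model. Everything here is an assumption made for illustration: `toy_analyzer` lowercases and splits on whitespace as a stand-in for a real analyzer, and the three query functions only mimic the matching rule, not scoring.

```python
def toy_analyzer(text):
    """Stand-in analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def index_text(value):
    """A text field stores the analyzed tokens."""
    return set(toy_analyzer(value))


def index_keyword(value):
    """A keyword field stores the whole value as one term."""
    return {value}


def term_query(indexed_terms, search_value):
    """term: the search value is compared as-is, without analysis."""
    return search_value in indexed_terms


def match_query(indexed_terms, search_value):
    """match: the search value is analyzed first, then any token may hit."""
    return any(token in indexed_terms for token in toy_analyzer(search_value))


value = "Hello World"
text_terms = index_text(value)        # tokens of the text field
keyword_terms = index_keyword(value)  # single term of the keyword sub-field

print(match_query(text_terms, "Hello World"))
print(term_query(text_terms, "Hello World"))
print(term_query(keyword_terms, "Hello World"))
```

The term query fails on the text field because the index holds the tokens hello and world, never the string "Hello World" itself.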

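Whether a query returns the documents you want depends on many factors. First, the analyzer in use (ik_max_word, ik_smart): the tokens it produces determine what can match. Second, the DSL itself: for exact matching, use term.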

Finally, here is an example of multi-path recall, which aims to return as many results as possible.