Elasticsearch --- (12) A First Look at the Search Engine, Part 2

梦里梦见梦不见的

Article link: https://blog.csdn.net/weixin_43240792/article/details/108194215

1. Basic syntax of the search API

(1) GET /_search {}                  ------- search everything

(2) GET /index1,index2/type1,type2/_search {}                     ------ search specific indices and types

(3) GET /_search {                          --------- pagination
           "from": 0,
           "size": 10
        }

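2. Can an HTTP GET carry a request body?

GET requests do not normally carry a request body (here, the JSON inside {}), and the HTTP specification defines no semantics for one. Elasticsearch uses GET anyway because GET best describes a read-only query, and many HTTP clients and servers happen to accept GET with a body. Where GET with a body is not supported, use POST /_search instead.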

GET /test_index/test_type/_search
{
  "from":0,
  "size":3
}


GET /_search?from=0&size=3


POST /_search
{
    "from":0,
    "size":3
}
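The from/size paging above can also be derived from a page number. The following is a minimal sketch, not part of the original post: the helper names `page_to_body` and `page_to_query_string` are hypothetical, and it only builds the request body and URL without contacting a cluster.

```python
from urllib.parse import urlencode


def page_to_body(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into the from/size request body."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


def page_to_query_string(page: int, page_size: int = 10) -> str:
    """Same paging, expressed as URL parameters for a body-less GET."""
    return "/_search?" + urlencode(page_to_body(page, page_size))


print(page_to_body(1, 3))          # {'from': 0, 'size': 3}
print(page_to_query_string(1, 3))  # /_search?from=0&size=3
```

The body form suits POST /_search or GET with a body; the query-string form suits clients that cannot send a body with GET.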

3. Hands-on with the Query DSL search syntax

(1) An example of the Query DSL (the query-string equivalent appends parameters after ? in the URL)

GET /_search
{
  "query": {
    "match_all": {}
  }
}

(2) Basic Query DSL syntax

{
    QUERY_NAME: {
        ARGUMENT: VALUE,
        ARGUMENT: VALUE, ...
    }
}

{
    QUERY_NAME: {
        FIELD_NAME: {
            ARGUMENT: VALUE, ...
        }
    }
}

//------ documents whose test_field1 contains test_field1
GET /test_index/test_type/_search
{
  "query": {
    "match": {
      "test_field1": "test_field1"
    }
  }
}

(3) Combining multiple search conditions: {"query":{"bool":{"must":[{"match":{"field":value}}]}}}

//------------ insert test data
PUT /website/article/3
{
  "title":"my elasticsearch article",
  "content":"elasticsearch is very bad",
  "author_id":112
}

PUT /website/article/2
{
  "title":"my hadoop article",
  "content":"hadoop is very good",
  "author_id":111
}

PUT /website/article/1
{
  "title":"my es article",
  "content":"es is very good",
  "author_id":110
}

//------ Search requirement: title must contain elasticsearch; content may or may not contain elasticsearch; author_id must not be 111

GET website/article/_search
{
  "query":{
    "bool":{
      "must":[
        {
          "match":
            {
              "title":"elasticsearch"
            }
        }
      ],
      "should":[
        {
         "match":{
            "content":"elasticsearch"
          }
        }
      ],
      "must_not":[
        {
          "match":{
            "author_id":111
          }
        }
      ]
    }
  }
}
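To make the must / should / must_not semantics concrete, here is a toy in-memory model run against the three test articles. It is a sketch under simplifying assumptions, not how Elasticsearch executes the query: `matches` and `bool_search` are hypothetical helpers, the "score" is just a count of matched clauses, and analysis is reduced to lowercasing and splitting on word characters.

```python
import re

DOCS = {
    1: {"title": "my es article", "content": "es is very good", "author_id": 110},
    2: {"title": "my hadoop article", "content": "hadoop is very good", "author_id": 111},
    3: {"title": "my elasticsearch article", "content": "elasticsearch is very bad", "author_id": 112},
}


def matches(doc: dict, field: str, text) -> bool:
    """Toy 'match': true if any lowercased query token occurs in the field."""
    field_tokens = set(re.findall(r"\w+", str(doc[field]).lower()))
    query_tokens = set(re.findall(r"\w+", str(text).lower()))
    return bool(field_tokens & query_tokens)


def bool_search(docs: dict, must=(), should=(), must_not=()) -> list:
    """Return doc ids ordered by how many clauses matched (a stand-in for _score)."""
    hits = []
    for doc_id, doc in docs.items():
        if not all(matches(doc, f, t) for f, t in must):
            continue  # every must clause is required
        if any(matches(doc, f, t) for f, t in must_not):
            continue  # any must_not clause excludes the document
        bonus = sum(matches(doc, f, t) for f, t in should)
        hits.append((len(must) + bonus, doc_id))
    return [doc_id for _, doc_id in sorted(hits, key=lambda h: -h[0])]


print(bool_search(
    DOCS,
    must=[("title", "elasticsearch")],
    should=[("content", "elasticsearch")],
    must_not=[("author_id", 111)],
))  # [3]
```

Only article 3 has elasticsearch in its title, so it is the only hit; the should clause does not filter anything, it only raises the rank of documents that also match it.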

4. filter vs. query in depth

(1) Example

//-------- Search request: age must be greater than or equal to 30, and join_date must be 2016-01-01
GET company/employee/_search
{
  "query":{
    "bool": {
      "must": [
        {
          "match": {
            "join_date":"2016-01-01"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "gte": 30
          }
        }
      }
    }
  }
}

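(2) filter vs. query

A filter only selects the documents that satisfy the condition; it computes no relevance score and has no effect on relevance.
A query computes how relevant each document is to the search condition and sorts by that relevance.
In general, use query when you are searching and want the best-matching documents returned first; use filter when you only need to select a subset of the data by some condition and do not care about its order.

(3) filter vs. query performance

A filter does not compute relevance scores or sort by them, and Elasticsearch automatically caches the results of the most frequently used filters.
A query, by contrast, has to compute relevance scores and sort by them, and its results are not cached in the same way.

5. Commonly used query types in practice

(1) match_all ----- match every document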

GET company/employee/_search
{
  "query": {
    "match_all": {}
  }
}

(2) match ------ conditional query (the query text is analyzed into terms)

GET /test_index/test_type/_search
{
  "query": {
    "match": {
      "test_field1": "test_field1"
    }
  }
}

(3) multi_match ----- like an OR across several fields

//------------- test-content or test_field1 contains test
GET test_index/test_type/_search
{
  "query": {
    "multi_match": {
      "query": "test",
      "fields": ["test-content","test_field1"]
    }
  }
}

(4) range query (range works in both query and filter context; in query context it takes part in relevance scoring)

//------------- age greater than or equal to 30
GET company/employee/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 30
      }
    }
  }
}

(5) term query ------- looks up the search text (my test) in the inverted index as one exact term, without analyzing it

GET test_index/test_type/_search
{
  "query": {
    "term": {
      "test-content": "my test"
    }
  }
}
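The difference between match and term can be seen on a toy inverted index. This is a sketch only: `analyze`, `build_inverted_index`, `term_query` and `match_query` are hypothetical names, the two sample documents are made up, and the analyzer is reduced to lowercasing and splitting on non-word characters.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list:
    """Toy analyzer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map each analyzed term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index: dict, text: str) -> set:
    """term: look the text up as ONE exact term, with no analysis."""
    return set(index.get(text, set()))


def match_query(index: dict, text: str) -> set:
    """match: analyze the text first, then OR the per-term lookups."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


index = build_inverted_index({1: "my test", 2: "test client"})
print(term_query(index, "my test"))   # set(): no single term equals "my test"
print(match_query(index, "my test"))  # {1, 2}
print(term_query(index, "test"))      # {1, 2}
```

Because the field was analyzed at index time, the index holds my and test as separate terms, so a term query for the whole string "my test" finds nothing.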

6. How to find an invalid search and the reason it is invalid

For example:

//----------- incorrect query (math instead of match)
GET test_index/test_type/_validate/query?explain
{
  "query":{
    "math":{
      "test_field":"test"
    }
  }
}


//---------- error returned
{
  "valid": false,
  "error": "org.elasticsearch.common.ParsingException: no [query] registered for [math]"
}



//--------------- when the query is written correctly
GET test_index/test_type/_validate/query?explain
{
  "query":{
    "match":{
      "test_field":"test"
    }
  }
}

//------------- result
{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "test_index",
      "valid": true,
      "explanation": "+test_field:test #(#_type:test_type)"
    }
  ]
}

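7. Customizing the sort order of search results

(1) Default sort order: by _score, descending

By default, results are sorted by _score in descending order.

In some cases, however, there is no useful _score, for example when only a filter is used.

(2) Custom sort order (sort)

//----------- age greater than or equal to 30, sorted by join_date ascending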
GET company/employee/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 30
      }
    }
  },
  "sort": [
    {
      "join_date": {
        "order": "asc"
      }
    }
  ]
}

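8. Indexing a field twice to solve the string-sorting problem

Sorting on a string field often gives unexpected results: after analysis the field holds several separate terms, so the sort works on individual terms rather than on the whole string.

The usual solution is to index the string field twice: once analyzed, for searching, and once not analyzed, for sorting.

//---- 1. Manually define the mapping for the type in the index
PUT website
{
  "mappings": {
    "article":{
      "properties": {
        "title":{                         // indexed twice
          "type":"text",                   // analyzed, for searching
          "fields": {                     // not analyzed, for sorting
            "raw":{
              "type": "keyword"
            }
          },
          "fielddata": true    // allows sorting on the analyzed title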
        },
        "content":{
          "type": "text"
        },
        "post_date":{
          "type": "date"
        },
        "author_id":{
          "type": "long"
        }
      }
    }
  }
}

//---- 2. Insert data
PUT website/article/3
{
  "title":"third article",
  "content":"this is my third article",
  "post_date":"2017-03-01",
  "author_id":110
}

PUT website/article/2
{
  "title":"first article",
  "content":"this is my first article",
  "post_date":"2017-02-01",
  "author_id":110
}

PUT website/article/1
{
  "title":"second article",
  "content":"this is my second article",
  "post_date":"2017-01-01",
  "author_id":110
}

//---- 3. Sort by the analyzed title (sorts on the individual terms: first, second, third)
GET website/article/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title": {
        "order": "desc"
      }
    }
  ]
}

//---- 4. Sort by the unanalyzed title.raw (sorts on the whole strings: first article, second article, third article)
GET website/article/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title.raw": {
        "order": "desc"
      }
    }
  ]
}
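Why the analyzed field sorts badly can be sketched in a few lines. This toy model assumes that a multi-valued sort key is the smallest term when sorting ascending and the largest when sorting descending; that rule, and the helper names `sort_by_analyzed` and `sort_by_raw`, are assumptions for illustration rather than something stated in the original post.

```python
def analyze(text: str) -> list:
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


TITLES = {1: "second article", 2: "first article", 3: "third article"}


def sort_by_analyzed(titles: dict, descending: bool = False) -> list:
    """Sort on one representative term per doc: the smallest term when
    ascending, the largest when descending (an assumed multi-value rule)."""
    pick = max if descending else min
    keyed = [(pick(analyze(t)), doc_id) for doc_id, t in titles.items()]
    return sorted(keyed, key=lambda pair: pair[0], reverse=descending)


def sort_by_raw(titles: dict, descending: bool = False) -> list:
    """Sort on the whole, unanalyzed string."""
    keyed = [(t, doc_id) for doc_id, t in titles.items()]
    return sorted(keyed, key=lambda pair: pair[0], reverse=descending)


print(sort_by_analyzed(TITLES))  # every key is 'article': the order is arbitrary
print(sort_by_raw(TITLES))       # first article, second article, third article
```

In ascending order every title is represented by the shared term article, so the analyzed sort cannot tell the documents apart, while the raw strings sort as expected.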

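9. Relevance scoring: the TF/IDF algorithm

(1) About the algorithm

The relevance score measures how well the text in an indexed document matches the search text.

Elasticsearch's classic scoring uses the term frequency/inverse document frequency algorithm, TF/IDF for short. (Since version 5.0 the default similarity is BM25, which builds on the same three factors.)

Term frequency: how many times each term of the search text appears in the field. The more often it appears, the more relevant the document.

Example search: hello world

doc1: hello you, and world is very good

doc2: hello, how are you

doc1 is more relevant, because it contains both terms.

Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents contain a term, the less that term contributes to relevance.

Example search: hello world

doc1: hello, today is very good

doc2: hi world, how are you

Suppose the index holds 10,000 documents, hello appears in 1,000 of them, and world appears in 100 of them. world is the rarer term, so doc2 is more relevant.

Field-length norm: the longer the field, the weaker the relevance.

Example search: hello world

doc1: {"title": "hello article", "content": "blah blah, 10,000 words"}

doc2: {"title": "my article", "content": "blah blah, 10,000 words, hi world"}

If hello and world appear equally often across the index, doc1 is more relevant, because its match is in the much shorter title field.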
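The three factors can be written down as small functions. The formulas below follow Lucene's classic TF/IDF similarity in simplified form (square-root term frequency, logarithmic IDF, inverse square-root length norm); the real score also involves query normalization and boosts, so treat `term_weight` as an illustrative sketch rather than the exact _score.

```python
import math


def tf(term_count: int) -> float:
    """Term frequency: grows with the number of occurrences in the field."""
    return math.sqrt(term_count)


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(total_docs / (docs_with_term + 1))


def field_norm(terms_in_field: int) -> float:
    """Field-length norm: shorter fields score higher."""
    return 1.0 / math.sqrt(terms_in_field)


def term_weight(term_count, total_docs, docs_with_term, terms_in_field):
    """Simplified per-term weight: the product of the three factors."""
    return tf(term_count) * idf(total_docs, docs_with_term) * field_norm(terms_in_field)


# Numbers from the IDF example: 10,000 docs, hello in 1,000, world in 100.
print(round(idf(10_000, 1_000), 2))  # 3.3
print(round(idf(10_000, 100), 2))    # 5.6
```

With these numbers world carries a noticeably higher IDF than hello, which is why the document matching world ranks higher in the IDF example.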

(2) How _score is calculated (append ?explain to the request)

GET test_index/test_type/_search?explain
{
  "query": {
    "match": {
      "test-content": "test test"
    }
  }
}

(3) Analyzing how a single document was matched

GET test_index/test_type/2/_explain
{
  "query": {
    "match": {
      "test-content": "test test"
    }
  }
}

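10. Core internals: doc values (the forward index)

Searching relies on the inverted index. Sorting relies on a forward index, which lets Elasticsearch read each field of each document and order the results. This forward index is what doc values are. At index time Elasticsearch builds the inverted index for searching and, alongside it, the doc values used for sorting, aggregations, filtering and similar operations. Doc values are stored on disk. If there is enough memory, the OS caches them automatically and performance stays high; if memory is short, the OS evicts them from the cache and they are read from disk when needed.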
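The two structures answer opposite questions, which a small sketch makes visible. The sample documents and the helper names `build_inverted_index` and `build_doc_values` are hypothetical, and real doc values are a compressed on-disk columnar format rather than a Python dict.

```python
from collections import defaultdict

DOCS = {
    1: {"name": "jack", "age": 27},
    2: {"name": "tom", "age": 30},
    3: {"name": "jack", "age": 33},
}


def build_inverted_index(docs: dict, field: str) -> dict:
    """term -> doc ids: answers 'which docs contain this value?'."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        index[doc[field]].append(doc_id)
    return dict(index)


def build_doc_values(docs: dict, field: str) -> dict:
    """doc id -> value: answers 'what is this doc's value?'."""
    return {doc_id: doc[field] for doc_id, doc in docs.items()}


name_index = build_inverted_index(DOCS, "name")
age_values = build_doc_values(DOCS, "age")

hits = name_index["jack"]                                  # search step
ordered = sorted(hits, key=age_values.get, reverse=True)   # sort step
print(ordered)  # [3, 1]
```

The search step goes from term to documents, and the sort step goes from document to value, which is why both structures are built at index time.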

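11. Query phase

(1) The search request arrives at a coordinating node, which builds a priority queue of length from + size (10 by default, since from defaults to 0 and size to 10).

(2) The coordinating node forwards the request to every shard (one copy of each, primary or replica); each shard searches locally and builds its own local priority queue.

(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.

(4) How replica shards increase search throughput

A single request has to reach one copy, primary or replica, of every shard. If each shard has several replicas, concurrent search requests can be spread across the different copies.

12. Fetch phase

(1) Once the coordinating node has built the global priority queue, it sends a multi-get (mget) request to the shards that hold the matching documents.

(2) Each shard returns its documents to the coordinating node.

(3) The coordinating node merges the documents and returns the result to the client.

(4) A search without from and size returns the top 10 results by default, sorted by _score.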
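The query and fetch phases described above can be sketched as a scatter-gather over in-memory "shards". This is a toy model with hypothetical data and function names (`query_phase`, `fetch_phase`); it ignores replicas, networking and tie-breaking details.

```python
import heapq

# shard name -> {doc id: (score, source)}; hypothetical data
SHARDS = {
    "shard0": {"a": (0.9, {"title": "A"}), "b": (0.2, {"title": "B"})},
    "shard1": {"c": (0.7, {"title": "C"}), "d": (0.5, {"title": "D"})},
}


def query_phase(shards: dict, from_: int, size: int) -> list:
    """Each shard returns its top from+size (score, shard, id) entries;
    the coordinator merges them and keeps the requested window."""
    window = from_ + size
    per_shard = []
    for shard, docs in shards.items():
        entries = [(score, shard, doc_id) for doc_id, (score, _) in docs.items()]
        per_shard.extend(heapq.nlargest(window, entries))  # local priority queue
    merged = heapq.nlargest(window, per_shard)              # global priority queue
    return merged[from_:]


def fetch_phase(shards: dict, winners: list) -> list:
    """Ask the owning shard for the source of each winning document."""
    return [shards[shard][doc_id][1] for _, shard, doc_id in winners]


winners = query_phase(SHARDS, from_=1, size=2)
print(fetch_phase(SHARDS, winners))  # [{'title': 'C'}, {'title': 'D'}]
```

Note that every shard has to return from + size entries, not just size, because the coordinator cannot know in advance which shard holds the documents for the requested window. Only ids and scores travel in the query phase; the document bodies are fetched afterwards for the winners alone.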

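13. Overview of search-related parameters

(1) preference

Determines which shard copies are used to execute the search.
_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3

(2) The bouncing results problem

When two documents have the same value in the sort field, different copies of a shard may order them differently. Because search requests are sent round-robin to the different copies (primary and replicas), the order of the results shown on the page can change with every request. This is the bouncing results problem.

Solution: set preference to an arbitrary string such as the user_id, so that every search by the same user is executed on the same shard copies and the results no longer bounce.

(3) timeout

Limits how long a search may run: when the time is up, whatever has been collected so far is returned, which keeps slow queries from taking too long.

(4) routing

Document routing. By default a document is routed by its _id; with routing=user_id, all data belonging to the same user is placed on the same shard.

(5) search_type

default: query_then_fetch
dfs_query_then_fetch: can improve the accuracy of relevance sorting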
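Both routing and a custom preference string work by hashing a value to a stable choice. The sketch below illustrates that idea only: `stable_hash`, `shard_for` and `copy_for` are hypothetical helpers, and CRC32 stands in for whatever hash function Elasticsearch actually uses.

```python
import zlib


def stable_hash(value: str) -> int:
    """Deterministic stand-in hash (Elasticsearch uses its own function)."""
    return zlib.crc32(value.encode("utf-8"))


def shard_for(routing: str, primary_shards: int) -> int:
    """Routing: the same routing value always lands on the same shard."""
    return stable_hash(routing) % primary_shards


def copy_for(preference: str, copies: int) -> int:
    """Preference: the same string always selects the same shard copy."""
    return stable_hash(preference) % copies


# Every document and search for user_42 maps to one shard and one copy.
print(shard_for("user_42", 5) == shard_for("user_42", 5))  # True
print(copy_for("user_42", 3) == copy_for("user_42", 3))    # True
```

Since the choice depends only on the string, a given user keeps hitting the same copy, which is what removes the bouncing.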

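14. Scroll queries

Fetching, say, 100,000 documents in a single request performs very poorly. The usual approach is a scroll query, which retrieves the data batch by batch until everything has been fetched and processed.

A scroll search fetches one batch, then the next, and so on until all the data has been returned.
On the first request, scroll saves a snapshot of the index as it was at that moment; later requests are served only from that snapshot, so changes made to the data in the meantime are not visible.
Sorting by _doc gives the best performance.
Each scroll request must also pass a scroll parameter, a time window that keeps the search context alive; each batch only needs to be processed and the next request sent within that window.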

GET /test_index/test_type/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "sort": [ "_doc" ],
  "size": 3
}


{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3",
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "8",
        "_score": null,
        "_source": {
          "test_field": "test client 2"
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": null,
        "_source": {
          "test_field": "tes test"
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "AVp4RN0bhjxldOOnBxaE",
        "_score": null,
        "_source": {
          "test_content": "my test"
        },
        "sort": [
          0
        ]
      }
    ]
  }
}

The response includes a _scroll_id; the next scroll request must carry this scroll_id.

GET /_search/scroll
{
    "scroll": "1m", 
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3"
}

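The _id values returned by the following scroll requests, until all 10 hits have been consumed:

11, 4, 7
3, 2, 1
20

Scroll looks a lot like pagination, but the use cases differ. Pagination retrieves results page by page for a user to read; scroll retrieves data batch by batch for a system to process.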
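The snapshot behaviour of scroll can be imitated with a generator. This is a toy sketch, not a client for the scroll API: `scroll` here is a hypothetical function over an in-memory dict, and the snapshot is simply a list copy taken when the first batch is requested.

```python
def scroll(index: dict, size: int):
    """Yield batches of (doc id, source) from a snapshot taken at the first call."""
    snapshot = list(index.items())  # frozen view: later writes are not seen
    for start in range(0, len(snapshot), size):
        yield snapshot[start:start + size]


index = {str(i): {"test_field": f"doc {i}"} for i in range(10)}
batches = scroll(index, size=3)

first = next(batches)            # the snapshot is taken here
index["new"] = {"test_field": "added during the scroll"}

sizes = [len(first)] + [len(batch) for batch in batches]
print(sizes)       # [3, 3, 3, 1]
print(sum(sizes))  # 10: the document added mid-scroll is not returned
```

With 10 documents and size 3, the batches come back as 3, 3, 3 and 1, and the loop ends when a batch comes back empty or the generator is exhausted.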
