Elasticsearch Advanced Study Notes

LaiDeJi_

Article link: https://blog.csdn.net/LaiDeJi_/article/details/123703672

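This post is my set of advanced notes from learning Elasticsearch (hereafter "es").
I used to dislike taking notes when learning a new technology. I figured remembering things was enough, and anything I didn't understand I could look up on Baidu. Lately, though, I've come around to the old saying: the palest ink beats the best memory.
After finishing es I may well forget it all if I don't use it for a month, but I can come back to these notes and quickly recall the gist. I also want to practice writing; the older I get, the more I enjoy writing things down and sharing them, haha.
I know I'm still a beginner, but if I keep at it I'll grow bit by bit. Stick with what you love, just do it.
Enough chatter, on to the notes.
These notes are based on es 7.4.2.
Please be patient while reading and don't be put off by how long the page looks. Much of it is just JSON output; follow the headings and you'll be fine.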

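Preparation

First, we need to import some practice data through Kibana. The data is linked below; just copy its contents.
accounts.json: test data from the official Elasticsearch site
Then import it with the following command, pasting the copied contents as the request body
POST bank/account/_bulk
Once that's done, let's note down a few basics before we start.
All of this comes from the official es documentation, so readers who prefer the official docs can find the same material there; all I did was run the English docs through my browser's translator.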

Query DSL && match_all

Let's run the following command.

The following request retrieves all documents in the index, sorted by account number
GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "account_number": "asc"
    }
  ]
}
By default, the hits section of the response contains the first 10 documents that match the search criteria:
```json
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
        "value": 1000,
        "relation": "eq"
    },
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "_doc",
      "_id" : "0",
      "sort": [0],
      "_score" : null,
      "_source" : {"account_number":0,"balance":16623,"firstname":"Bradshaw","lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"bradshawmckenzie@euron.com","city":"Hobucken","state":"CO"}
    }, {
      "_index" : "bank",
      "_type" : "_doc",
      "_id" : "1",
      "sort": [1],
      "_score" : null,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
    }, ...
    ]
  }
}
```

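The response also provides the following information about the search request:

took - how long Elasticsearch took to run the query, in milliseconds
timed_out - whether the search request timed out
_shards - how many shards were searched, and how many succeeded, failed, or were skipped
max_score - the score of the most relevant document found
hits.total.value - how many matching documents were found
hits.sort - the document's sort position (when not sorting by relevance score)
hits._score - the document's relevance score (not applicable when using match_all)

Each search request is self-contained: Elasticsearch does not maintain any state information across requests. To page through the hits, specify the from and size parameters in the request.
For example, the following request gets hits 10 through 19: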

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "account_number": "asc"
    }
  ],
  "from": 10,
  "size": 10
}
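
The relationship between a page number and from/size can be sketched in a few lines of Python. This is only an illustration: the helper name search_page and the 1-based page convention are my own assumptions, not part of es, and the function only builds the request body without sending it.

```python
def search_page(page, size=10):
    """Build a match_all search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "query": {"match_all": {}},
        "sort": [{"account_number": "asc"}],
        # "from" is the zero-based offset of the first hit on this page
        "from": (page - 1) * size,
        "size": size,
    }


body = search_page(2)
print(body["from"], body["size"])  # 10 10
```

Page 2 with the default size therefore produces the same from and size values as the request above.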

The following is similar to select field ... in MySQL: _source limits which fields are returned for each hit.

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "from": 10,
  "size": 10,
  "_source": [
    "balance",
    "firstname"
  ]
}

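match full-text search

Next, let's look at an exact-match query. On a numeric field such as account_number, match looks for that exact value.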

GET /bank/_search
{
  "query": {
    "match": {
      "account_number": 20
    }
  }
}

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "20",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 20,
          "balance" : 16418,
          "firstname" : "Elinor",
          "lastname" : "Ratliff",
          "age" : 36,
          "gender" : "M",
          "address" : "282 Kings Place",
          "employer" : "Scentric",
          "email" : "elinorratliff@scentric.com",
          "city" : "Ribera",
          "state" : "WA"
        }
      }
    ]
  }
}

If the query value is a string on a text field, match performs a full-text search.

GET /bank/_search
{
  "query": {
    "match": {
      "address": "Kings"
    }
  }
}

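We get two documents back, and both contain our search term in address. This works because es splits text fields into terms and looks them up in an inverted index.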

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 5.9908285,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "20",
        "_score" : 5.9908285,
        "_source" : {
          "account_number" : 20,
          "balance" : 16418,
          "firstname" : "Elinor",
          "lastname" : "Ratliff",
          "age" : 36,
          "gender" : "M",
          "address" : "282 Kings Place",
          "employer" : "Scentric",
          "email" : "elinorratliff@scentric.com",
          "city" : "Ribera",
          "state" : "WA"
        }
      },
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "722",
        "_score" : 5.9908285,
        "_source" : {
          "account_number" : 722,
          "balance" : 27256,
          "firstname" : "Roberts",
          "lastname" : "Beasley",
          "age" : 34,
          "gender" : "F",
          "address" : "305 Kings Hwy",
          "employer" : "Quintity",
          "email" : "robertsbeasley@quintity.com",
          "city" : "Hayden",
          "state" : "PA"
        }
      }
    ]
  }
}
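
The remark about the inverted index can be made concrete with a toy version in Python. This is a rough sketch, not how es or Lucene is implemented: the analyze function (lowercase, split on non-alphanumeric characters) is only a stand-in for a real analyzer, and the three addresses are taken from responses shown in this post.

```python
import re
from collections import defaultdict

# Two addresses from the response above plus one that should not match.
docs = {
    "20": "282 Kings Place",
    "722": "305 Kings Hwy",
    "1": "880 Holmes Lane",
}


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
kings_hits = sorted(index["kings"], key=int)
print(kings_hits)  # ['20', '722']
```

Looking up a term is a single dictionary access, which is why a query for Kings finds both documents without scanning every address.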

With the query below, any document whose address matches Holmes or Lane is returned, and the results are sorted by relevance score in descending order.

GET /bank/_search
{
  "query": {
    "match": {
      "address": "Holmes Lane"
    }
  }
}

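match_phrase phrase search

If we want to search for a phrase rather than match individual terms, we can use match_phrase. Holmes Lane is then treated as one phrase instead of being searched word by word.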

GET /bank/_search
{
  "query": {
    "match_phrase": {
      "address": "Holmes Lane"
    }
  }
}
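
The difference between the two queries can be mimicked locally. The sketch below is a simplification under stated assumptions: whitespace tokenization with lowercasing stands in for the analyzer, scoring is ignored, and the function names are my own.

```python
docs = {
    1: "880 Holmes Lane",
    70: "685 School Lane",
    20: "282 Kings Place",
}


def tokens(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()


def match_any(query, text):
    """match-like: at least one query term occurs in the text."""
    return any(term in tokens(text) for term in tokens(query))


def match_phrase(query, text):
    """match_phrase-like: the query terms occur consecutively and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


match_ids = [i for i, addr in docs.items() if match_any("Holmes Lane", addr)]
phrase_ids = [i for i, addr in docs.items() if match_phrase("Holmes Lane", addr)]
print(match_ids)   # [1, 70]
print(phrase_ids)  # [1]
```

Document 70 contains Lane but not Holmes, so it satisfies the term-by-term check and fails the phrase check.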

multi_match

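A tokenized search: a document is returned as long as any of the terms appears in address or city, and results are sorted by relevance in descending order.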

GET /bank/_search
{
  "query": {
    "multi_match": {
      "query": "Brogan Lane",
      "fields": [
        "address",
        "city"
      ]
    }
  }
}

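bool compound query

Each must, should, and must_not element in a bool query is called a query clause. How well a document meets the criteria in each must or should clause contributes to its relevance score. The higher the score, the better the document matches your search criteria. By default, Elasticsearch returns documents ranked by these relevance scores. must clauses have to be satisfied, must_not clauses must not be satisfied, and should clauses are preferred but optional.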

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "address": "Lane"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "gender": "M"
          }
        }
      ],
      "should": [
        {
          "match": {
            "state": "VT"
          }
        }
      ]
    }
  }
}
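
When such requests are assembled in code, a small builder keeps the clause structure readable. The following is a minimal sketch: bool_query is a hypothetical helper of my own (not an es client API), and it only produces the request body.

```python
import json


def bool_query(must=None, must_not=None, should=None, filter_=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {
        "must": must,
        "must_not": must_not,
        "should": should,
        "filter": filter_,
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


body = bool_query(
    must=[{"match": {"address": "Lane"}}],
    must_not=[{"match": {"gender": "M"}}],
    should=[{"match": {"state": "VT"}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request above and can be pasted into the Kibana console.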

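filter

The criteria in a must_not clause are treated as a filter. They affect whether a document is included in the results, but not how documents are scored. You can also explicitly specify arbitrary filters to include or exclude documents based on structured data.
For example, the following request uses a range filter to limit the results to accounts with a balance between $20,000 and $30,000 (inclusive).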

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

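term query

Warning from the official docs: avoid using term queries on text fields. By default, Elasticsearch changes the values of text fields as part of analysis, which can make finding exact matches for text field values difficult. The official docs have a more detailed note on this.
So what happens if we use term on a text field anyway?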

GET /bank/_search
{
  "query": {
    "term": {
      "age": {
        "address": "960 Glendale Court"
      }
    }
  }
}

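es answers with a friendly error.
Note that this particular error is a parsing_exception: the body above nests address inside the age object, which term does not accept. With the default mapping, a well-formed term query on address with this value would parse but find nothing, because the index holds the analyzed tokens rather than the whole string.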

{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[term] query does not support [address]",
        "line": 5,
        "col": 20
      }
    ],
    "type": "parsing_exception",
    "reason": "[term] query does not support [address]",
    "line": 5,
    "col": 20
  },
  "status": 400
}

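The official docs recommend using term queries to find documents by precise values such as a price, a product ID, or a username.
For searching text fields, match is recommended instead.
As we've seen, both match and match_phrase can search text. match tokenizes the query, so any document containing at least one of the words is returned, whereas match_phrase matches the whole phrase: only documents containing all the words, adjacent and in the same order, are returned.
Let's look at match first.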

GET /bank/_search
{
  "query": {
    "match": {
      "address": "Holmes Lane"
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 16,
      "relation" : "eq"
    },
    "max_score" : 10.605789,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "1",
        "_score" : 10.605789,
        "_source" : {
          "account_number" : 1,
          "balance" : 39225,
          "firstname" : "Amber",
          "lastname" : "Duke",
          "age" : 32,
          "gender" : "M",
          "address" : "880 Holmes Lane",
          "employer" : "Pyrami",
          "email" : "amberduke@pyrami.com",
          "city" : "Brogan",
          "state" : "IL"
        }
      },
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "70",
        "_score" : 4.1042743,
        "_source" : {
          "account_number" : 70,
          "balance" : 38172,
          "firstname" : "Deidre",
          "lastname" : "Thompson",
          "age" : 33,
          "gender" : "F",
          "address" : "685 School Lane",
          "employer" : "Netplode",
          "email" : "deidrethompson@netplode.com",
          "city" : "Chestnut",
          "state" : "GA"
        }
      },
      ...
    ]
  }
}

Now match_phrase

GET /bank/_search
{
  "query": {
    "match_phrase": {
      "address": "Holmes Lane"
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 10.605789,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "account",
        "_id" : "1",
        "_score" : 10.605789,
        "_source" : {
          "account_number" : 1,
          "balance" : 39225,
          "firstname" : "Amber",
          "lastname" : "Duke",
          "age" : 32,
          "gender" : "M",
          "address" : "880 Holmes Lane",
          "employer" : "Pyrami",
          "email" : "amberduke@pyrami.com",
          "city" : "Brogan",
          "state" : "IL"
        }
      }
    ]
  }
}

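The difference between the two results is obvious.
In fact, match can be made even stricter: querying FIELD.keyword requires the entire field value to match exactly, no more and no less.
(In the official es docs, all-uppercase words are placeholders; think of them like #{} in MyBatis.)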

GET /bank/_search
{
  "query": {
    "match": {
      "address.keyword": "Holmes Lane"
    }
  }
}

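This time we get 0 hits, because no document's address is exactly "Holmes Lane"; the full value "880 Holmes Lane" would be needed to match.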

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
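
Why the keyword query above finds nothing, and why term is risky on text fields, can be illustrated with one address. This is a toy model under a stated assumption: a text field is indexed as lowercased word tokens, while its .keyword sub-field keeps the whole value unchanged. The function names are my own.

```python
address = "880 Holmes Lane"

# Assumption: a text field is indexed as lowercased word tokens,
# while its .keyword sub-field stores the whole value unchanged.
text_terms = set(address.lower().split())
keyword_term = address


def term_on_text(value):
    """Exact lookup against the analyzed tokens of the text field."""
    return value in text_terms


def match_on_keyword(value):
    """Exact comparison against the unanalyzed keyword value."""
    return value == keyword_term


print(term_on_text("880 Holmes Lane"))      # False
print(term_on_text("holmes"))               # True
print(match_on_keyword("Holmes Lane"))      # False
print(match_on_keyword("880 Holmes Lane"))  # True
```

The whole string is not one of the indexed tokens, so an exact lookup on the text field misses it, while the keyword side only accepts the complete, unmodified value.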

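aggregations

The aggregations framework provides aggregated data based on a search query. It is built from simple building blocks called aggregations, which can be composed to produce complex summaries of the data.
An aggregation can be seen as a unit of work that builds analytic information over a set of documents. The context of the execution defines what this document set is (for example, a top-level aggregation executes within the context of the executed query/filters of the search request).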

## Average balance for each age.
## The inner aggs is a sub-aggregation that runs inside each age bucket.
GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "ageAgg": {
      "terms": {
        "field": "age",
        "size": 100
      },
      "aggs": {
        "balanceAvg": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

## Average balance per gender within each age.
## gender is a text field here, so the terms aggregation must use gender.keyword;
## otherwise es returns an illegal_argument_exception. The alternative is to set fielddata=true on gender.
GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "ageAgg": {
      "terms": {
        "field": "age",
        "size": 100
      },
      "aggs": {
        "genderAvg": {
          "terms": {
            "field": "gender.keyword",
            "size": 10
          },
          "aggs": {
            "balanceAvg": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
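
What these two requests compute can be reproduced locally on a handful of records. The sketch below is only an analogy: grouping by a tuple of keys plays the role of nested terms buckets, and the mean plays the role of the avg sub-aggregation. The seven accounts are copied from responses shown in this post, and avg_balance_by is a hypothetical helper.

```python
from collections import defaultdict

# A few accounts taken from the responses shown in this post.
accounts = [
    {"age": 32, "gender": "M", "balance": 39225},
    {"age": 36, "gender": "M", "balance": 16418},
    {"age": 36, "gender": "M", "balance": 5686},
    {"age": 34, "gender": "F", "balance": 27256},
    {"age": 34, "gender": "F", "balance": 48086},
    {"age": 33, "gender": "F", "balance": 38172},
    {"age": 33, "gender": "M", "balance": 4180},
]


def avg_balance_by(records, *keys):
    """Group records by the given keys and average balance per bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[tuple(record[k] for k in keys)].append(record["balance"])
    return {bucket: sum(vals) / len(vals) for bucket, vals in buckets.items()}


by_age = avg_balance_by(accounts, "age")
by_age_gender = avg_balance_by(accounts, "age", "gender")
print(by_age[(36,)])              # 11052.0
print(by_age_gender[(33, "F")])   # 38172.0
```

Each extra grouping key corresponds to one more level of nested aggs in the request.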

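Creating a mapping

Elasticsearch 7: removal of the type concept

In a relational database, two tables are independent of each other, and columns with the same name in both cause no problems. That is not the case in ES. Elasticsearch is a search engine built on Lucene, and fields with the same name under different types in one ES index are handled the same way in Lucene.
1. Two user_name fields under two different types in the same index are in fact treated as the same field, so you must define identical field mappings in both types. Otherwise, the same field name in different types would conflict during processing and reduce Lucene's efficiency.
2. Types were removed to make ES process data more efficiently.
In Elasticsearch 7.x, the type parameter in URLs is optional. For example, indexing a document no longer requires a document type.
Elasticsearch 8.x no longer supports the type parameter in URLs.
Solution:
Migrate indices from multiple types to a single type, with a separate index for each document type.
Move all the typed data in existing indices to the new location. See the data migration section below.

Next, let's create a mapping, much like creating a table in MySQL.
es data types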

PUT /my_index
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}

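Updating a mapping && data migration

We can use this command to add a new field mapping to my_index. Note that the mapping of an existing field cannot be changed this way; changing it requires creating a new index and migrating the data.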

PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false
    }
  }
}

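As mentioned above, types were already marked as deprecated in es 6, so how do we migrate our existing data to a new index?
First, create the mapping in the new index.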

PUT /newbank
{
  "mappings": {
    "properties": {
      "account_number": {
        "type": "long"
      },
      "address": {
        "type": "text"
      },
      "age": {
        "type": "long"
      },
      "balance": {
        "type": "long"
      },
      "city": {
        "type": "keyword"
      },
      "email": {
        "type": "keyword"
      },
      "employer": {
        "type": "keyword"
      },
      "firstname": {
        "type": "text"
      },
      "gender": {
        "type": "keyword"
      },
      "lastname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "state": {
        "type": "keyword"
      }
    }
  }
}

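Then we use the reindex API to migrate the data. Note: if both the source and the destination index were created in es 6 or later and have no type, the "type" key-value pair can be removed from the JSON below. When migrating from an older, typed index, "type" must be included.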

POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "newbank"
  }
}
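
The rule about when to include "type" can be captured in a small body builder. This is a sketch under stated assumptions: reindex_body is a hypothetical helper name, and it only builds the JSON body for POST _reindex without sending anything.

```python
def reindex_body(source_index, dest_index, source_type=None):
    """Build a _reindex request body; add "type" only for typed source indices."""
    source = {"index": source_index}
    if source_type is not None:
        # Only needed when the source index still uses a custom type.
        source["type"] = source_type
    return {"source": source, "dest": {"index": dest_index}}


typed = reindex_body("bank", "newbank", source_type="account")
typeless = reindex_body("bank", "newbank")
print(typed)
print(typeless)
```

The first call reproduces the request above; the second is the form for indices that have no type.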

If we look at the data now, we can see that the migration succeeded!

GET /newbank/_search

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 1,
          "balance" : 39225,
          "firstname" : "Amber",
          "lastname" : "Duke",
          "age" : 32,
          "gender" : "M",
          "address" : "880 Holmes Lane",
          "employer" : "Pyrami",
          "email" : "amberduke@pyrami.com",
          "city" : "Brogan",
          "state" : "IL"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 6,
          "balance" : 5686,
          "firstname" : "Hattie",
          "lastname" : "Bond",
          "age" : 36,
          "gender" : "M",
          "address" : "671 Bristol Street",
          "employer" : "Netagy",
          "email" : "hattiebond@netagy.com",
          "city" : "Dante",
          "state" : "TN"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 13,
          "balance" : 32838,
          "firstname" : "Nanette",
          "lastname" : "Bates",
          "age" : 28,
          "gender" : "F",
          "address" : "789 Madison Street",
          "employer" : "Quility",
          "email" : "nanettebates@quility.com",
          "city" : "Nogal",
          "state" : "VA"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "18",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 18,
          "balance" : 4180,
          "firstname" : "Dale",
          "lastname" : "Adams",
          "age" : 33,
          "gender" : "M",
          "address" : "467 Hutchinson Court",
          "employer" : "Boink",
          "email" : "daleadams@boink.com",
          "city" : "Orick",
          "state" : "MD"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 20,
          "balance" : 16418,
          "firstname" : "Elinor",
          "lastname" : "Ratliff",
          "age" : 36,
          "gender" : "M",
          "address" : "282 Kings Place",
          "employer" : "Scentric",
          "email" : "elinorratliff@scentric.com",
          "city" : "Ribera",
          "state" : "WA"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "25",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 25,
          "balance" : 40540,
          "firstname" : "Virginia",
          "lastname" : "Ayala",
          "age" : 39,
          "gender" : "F",
          "address" : "171 Putnam Avenue",
          "employer" : "Filodyne",
          "email" : "virginiaayala@filodyne.com",
          "city" : "Nicholson",
          "state" : "PA"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "32",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 32,
          "balance" : 48086,
          "firstname" : "Dillard",
          "lastname" : "Mcpherson",
          "age" : 34,
          "gender" : "F",
          "address" : "702 Quentin Street",
          "employer" : "Quailcom",
          "email" : "dillardmcpherson@quailcom.com",
          "city" : "Veguita",
          "state" : "IN"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "37",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 37,
          "balance" : 18612,
          "firstname" : "Mcgee",
          "lastname" : "Mooney",
          "age" : 39,
          "gender" : "M",
          "address" : "826 Fillmore Place",
          "employer" : "Reversus",
          "email" : "mcgeemooney@reversus.com",
          "city" : "Tooleville",
          "state" : "OK"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "44",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 44,
          "balance" : 34487,
          "firstname" : "Aurelia",
          "lastname" : "Harding",
          "age" : 37,
          "gender" : "M",
          "address" : "502 Baycliff Terrace",
          "employer" : "Orbalix",
          "email" : "aureliaharding@orbalix.com",
          "city" : "Yardville",
          "state" : "DE"
        }
      },
      {
        "_index" : "newbank",
        "_type" : "_doc",
        "_id" : "49",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 49,
          "balance" : 29104,
          "firstname" : "Fulton",
          "lastname" : "Holt",
          "age" : 23,
          "gender" : "F",
          "address" : "451 Humboldt Street",
          "employer" : "Anocha",
          "email" : "fultonholt@anocha.com",
          "city" : "Sunriver",
          "state" : "RI"
        }
      }
    ]
  }
}