Elasticsearch Basics

Author: 阿无

Last modified: 2024-01-19 16:09:50

Category: Databases. Tags: java

First published: 2020-11-14 19:14:13

Article link: https://blog.csdn.net/weixin_44431371/article/details/109694563

Data types

Core data types

Strings: text, keyword
Numbers: long, integer, short, byte, double, float, half_float, scaled_float
Date: date
Date with nanosecond resolution: date_nanos
Boolean: boolean
Binary: binary
Ranges: integer_range, float_range, long_range, double_range, date_range

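Complex data types

Arrays: there is no dedicated array type; any field can hold multiple values of the same type
Objects: object
Nested objects: nested (for arrays of JSON objects)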

Geo data types

geo_point (a point), geo_shape (a shape)

Specialized types

ip: stores IP addresses
completion: powers autocomplete suggestions
token_count: counts the number of tokens in a string
murmur3: stores a hash of the string value

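Multi-fields

The same field can be indexed in several ways, for example with different analyzers. To support pinyin search on a person's name, add a sub-field named pinyin under the name field.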
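
As a rough sketch of the multi-field idea, the snippet below builds such a mapping as a Python dict and prints it as JSON, ready to be used as the body of an index-creation request. The field name `name`, the sub-field names, and the analyzer name `pinyin_analyzer` are assumptions made for illustration; a pinyin analyzer is not built into Elasticsearch and would have to be provided by a separately installed analysis plugin.

```python
import json


def build_name_mapping(sub_analyzer: str = "pinyin_analyzer") -> dict:
    """Return a mapping where `name` is full-text and has two sub-fields.

    `pinyin_analyzer` is a hypothetical analyzer name, not one this
    article defines.
    """
    return {
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "fields": {
                        # name.pinyin: same text, analyzed differently
                        "pinyin": {"type": "text", "analyzer": sub_analyzer},
                        # name.keyword: the raw value, for exact matching
                        "keyword": {"type": "keyword", "ignore_above": 256},
                    },
                }
            }
        }
    }


mapping = build_name_mapping()
print(json.dumps(mapping, indent=2))
```

Queries would then address the sub-fields as `name.pinyin` or `name.keyword`, while `name` itself keeps its default analysis.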

Node information and index queries

# List all nodes
GET /_cat/nodes
# Check cluster health
GET /_cat/health
# Show the master node
GET /_cat/master
# List all indices
# Roughly equivalent to MySQL's SHOW DATABASES
GET /_cat/indices

Operations on types (deprecated since version 7)

Document operations

Basic CRUD

// Get a document by id: index/type/id
// Lookup is by id only; the id is required
// This request works only on a concrete index, not on a wildcard pattern:
// index_2022.01.01 works, index* does not
GET	localhost:9200/blog1/article/1

DELETE	localhost:9200/blog1/article/1

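// article is the type (comparable to a table name); 1 is the document id
// (the id Elasticsearch uses, not our business id). If no id is given, ES generates a string id.
// Creating and updating use the same format; if the id already exists, the document is updated
POST	localhost:9200/blog1/article/1
{
	"id":1,
	"title":"亮剑野狼峪白刃战"
}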


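// This is also an update. It differs from the update without _update as follows:
// with the request above, re-sending identical content still increments _seq_no and _version;
// with _update, re-sending identical content leaves _seq_no and _version unchanged and result is noop (no operation)

// For the plain index request above, POST and PUT behave the same
// New fields can also be added by simply including them; this works with _update and with the plain POST or PUT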
POST /customer/external/1/_update
{
  "doc":{
    "name":"jay"
  }
}


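// _bulk: write documents in batches

// Each JSON object must stay on one line; splitting it across lines causes an error
// The first line of each pair is the action line
// index: create or replace, create: create only, update: update, delete: delete
// A failed action does not affect the ones after it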
POST /customer/external/_bulk
{"index":{"_id":"1"}}
{"author":"gaosilin","price":1000,"id":2,"birth":"2022-03-28T06:59:59Z"}
{"index":{"_id":"2"}}
{"author":"gaosiling","price":1000,"id":3}

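// To run the request below, remove the blank lines between the actions;
// they are separated here only for readability
// _retry_on_conflict: 3 retries a failed update up to 3 times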
POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}

{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":"my first blog first"}

{"index":{"_index":"website","_type":"blog"}}
{"title":"my first blog first"}

{"update":{"_index":"website","_type":"blog","_id":"123"}}
{"doc":{"title":"my first blog first"}}
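
Since the bulk body is newline-delimited JSON, it is easy to get wrong by hand. The following is a minimal sketch, using only the Python standard library, of how such a body could be assembled; the helper name `to_bulk_body` is made up for this example, and the actions mirror the ones in the request above.

```python
import json


def to_bulk_body(actions):
    """Serialize (action_line, source_or_None) pairs into a _bulk body."""
    lines = []
    for action, source in actions:
        # Compact separators keep every JSON object on a single line.
        lines.append(json.dumps(action, separators=(",", ":")))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source, separators=(",", ":")))
    return "\n".join(lines) + "\n"  # the body ends with a newline


body = to_bulk_body([
    ({"delete": {"_index": "website", "_type": "blog", "_id": "123"}}, None),
    ({"create": {"_index": "website", "_type": "blog", "_id": "123"}},
     {"title": "my first blog first"}),
    ({"update": {"_index": "website", "_type": "blog", "_id": "123"}},
     {"doc": {"title": "my first blog first"}}),
])
print(body, end="")
```

The resulting string has no blank lines between actions, which is the form the request has to take when it is actually sent.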

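Reading a query result

// A _version other than 1 means the document has been updated

// _seq_no and _primary_term together provide optimistic locking
// _seq_no: concurrency-control field; it increases with every write
// _primary_term: changes when the primary shard is reassigned, e.g. after a restart

// found: the document was found

// Optimistic locking
// Example: A and B concurrently modify a value that is 1; A wants to change 1 to 2, B wants to change 1 to 3.
// Each of them means to change the value 1, not whatever the other has just written (2 or 3).
// _seq_no enforces this; older versions used _version instead.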
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 4,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "john doe"
  }
}

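Optimistic locking with _seq_no and _primary_term

// Only one of the two requests below succeeds; to write again, pass the document's new _seq_no as if_seq_no
// Whether if_seq_no and if_primary_term are scoped to the document, the type or the index is left aside for now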

PUT /customer/external/1?if_seq_no=9&if_primary_term=2
{"name":"1"}

PUT /customer/external/1?if_seq_no=9&if_primary_term=2
{"name":"2"}
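
The compare-and-set behaviour behind if_seq_no and if_primary_term can be illustrated without a cluster. The sketch below is an in-memory stand-in, not the Elasticsearch client API: `TinyStore` and `VersionConflict` are hypothetical names, and the store models a single shard with one sequence counter.

```python
class VersionConflict(Exception):
    """Raised when the caller's if_seq_no / if_primary_term is stale."""


class TinyStore:
    """In-memory stand-in for one shard (illustration only)."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}  # doc id -> (source, seq_no, primary_term)

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            # Reject the write unless both values still match the stored ones.
            if current is None or (current[1], current[2]) != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (source, seq_no, self.primary_term)
        return {"_seq_no": seq_no, "_primary_term": self.primary_term}


store = TinyStore()
first = store.index("1", {"name": "john doe"})

# Clients A and B both read the document now and remember the same _seq_no.
seen = (first["_seq_no"], first["_primary_term"])

store.index("1", {"name": "1"}, if_seq_no=seen[0], if_primary_term=seen[1])
try:
    store.index("1", {"name": "2"}, if_seq_no=seen[0], if_primary_term=seen[1])
    outcome = "second write applied"
except VersionConflict:
    outcome = "second write rejected"
print(outcome)
```

The second writer has to re-read the document, pick up the new _seq_no, and decide whether its change still makes sense before retrying.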

Test data

POST /bank/account/_bulk

https://gitee.com/xlh_blog/common_content/blob/master/es%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.json#

Advanced document operations

Search

// q=* matches all documents; only 10 hits are returned by default
GET bank/_search?q=*&sort=account_number:asc

// Query DSL
// This is the form used from here on
GET bank/_search
{
  "query":{
    "match_all": {}
  },
  "sort": [
    {
        "account_number": "asc"
    },
    {
      "balance": "desc"
    }
  ]
}



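// Sort can also be written in this long form; the version above is shorthand
// Page through all documents and return only the listed fields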
GET bank/_search
{
  "query":{
    "match_all": {}
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "from": 0,
  "size": 20,
  "_source": ["balance","firstname"]
}




// On a numeric field, match is an exact match
GET bank/_search
{
  "query": {
    "match": {
      "balance": "49989"
    }
  }
}


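// Full-text search, sometimes loosely called fuzzy search (hits are sorted by _score)
// "Mill road" is analyzed into terms, so documents containing either mill or road are returned
// To match the words as a phrase instead, replace match with match_phrase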
GET bank/_search
{
  "query": {
    "match": {
      "address": "Mill road"
    }
  }
}


// Multi-field match
// The query is analyzed here too: documents with mill or movico in address or city are returned
GET bank/_search
{
  "query": {
    "multi_match": {
      "query": "mill movico",
      "fields": ["address","city"]
    }
  }
}


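// bool: compound query combining several conditions
// should: documents ideally match the clause but are not required to;
// its main effect is on _score, and therefore on result order
// must and should contribute to _score; must_not only excludes documents and does not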
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "gender": "M"
          }
        },
        {
          "match": {
            "address": "mill"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "age": "28"
          }
        }
      ],
      "should": [
        {
          "match": {
            "lastname": "Hines"
          }
        }
      ]
    }
  }
  
}







// filter has no effect on _score
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "age": {
              "gte": 18,
              "lte": 30
            }
          }
        },
        {
          "match": {
            "address": "mill road"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "gte": 10,
            "lte": 29
          }
        }
      }
    }
  }
}


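// term looks up one exact value
// Convention: use match for full-text search and term for non-text fields
// The second query below returns nothing, even though one document's address is exactly 451 Humboldt Street,
// because a text field is analyzed into separate terms when it is indexed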
GET bank/_search
{
  "query": {
    "term": {
      "age": {
        "value": "28"
      }
    }
  }
}





GET bank/_search
{
  "query": {
    "term": {
      "address": {
        "value": "451 Humboldt Street"
      }
    }
  }
}




// match_phrase versus <field>.keyword
// match_phrase matches a phrase; the field only needs to contain it
// .keyword matches exactly; the field must equal the value
GET bank/_search
{
  "query": {
    "match": {
      "address.keyword": "789 Madison"
    }
  }
}






GET bank/_search
{
  "query": {
    "match_phrase": {
      "address": "789 Madison"
    }
  }
}





// Multi-field search with query_string
POST	localhost:9200/blog1/article/_search

{
	"query":{
			"query_string":{
				"fields":["title","content"],
				"query":"我是程序员"
			}
		},
	"size":50,
	"from":0
	
}

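Query keywords

match: full-text matching on a named field. The input is analyzed, so "hello world" is split into hello and world; documents whose field contains hello, world, or both are returned. It is a partial match with fairly loose conditions.
term: sometimes equivalent to match. For a single word such as hello the results are the same, but for "hello world" they differ greatly because the input is not analyzed. The query looks for the literal term "hello world" among the field's indexed terms, not for a field containing those words. Elasticsearch analyzes the field content, so "hello world" is indexed as hello and world; no term "hello world" exists and the result is empty. This is the difference between term and match.
match_phrase: the input is analyzed, and the result must contain all the terms, adjacent and in the same order. For "hello world", the field must contain hello immediately followed by world; "hello that world" and "world hello" do not match.
query_string: similar to match, but when no fields are given it searches across all fields, so its scope is wider.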
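
The differences between match, term and match_phrase can be made concrete with a toy model. The sketch below assumes a deliberately simple analyzer (lowercase, split on whitespace), which is much cruder than the analyzers Elasticsearch actually ships; the function names are chosen for this example only.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def match(field_value, query):
    """True if any analyzed query term appears among the field's terms."""
    terms = set(analyze(field_value))
    return any(t in terms for t in analyze(query))


def term(field_value, query):
    """True if the unanalyzed query equals one indexed term exactly."""
    return query in analyze(field_value)


def match_phrase(field_value, query):
    """True if the analyzed query terms occur adjacent and in order."""
    terms, wanted = analyze(field_value), analyze(query)
    n = len(wanted)
    return n > 0 and any(terms[i:i + n] == wanted for i in range(len(terms) - n + 1))


doc = "hello that world"
print(match(doc, "hello world"))         # True: hello is present
print(term(doc, "hello world"))          # False: no single term "hello world"
print(match_phrase(doc, "hello world"))  # False: the words are not adjacent
print(match_phrase("say hello world", "hello world"))  # True
```

In this model `term(doc, "Hello")` is also False, because the indexed terms were lowercased while the term query input was not, which is one more reason to keep term for non-text fields.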
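Others

wildcard: wildcard patterns such as 苏* (usually avoided because they are slow)
prefix: prefix matching (usually avoided because it is slow)
fuzzy: approximate matching
range: range queries (for example, 6 to 10)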
text：
missing：

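Aggregations

Aggregations group documents and extract statistics from them. The simplest aggregations are roughly equivalent to SQL GROUP BY plus aggregate functions.

# Age distribution and average age of everyone whose address contains mill
# aggs runs on the documents selected by query

# ageAgg: a name chosen for the aggregation
# terms: aggregation type; buckets the documents by distinct value
# size: if age has 100 distinct values, return only 10 buckets
# In the response, key is the age value and doc_count is the number of documents with that value

# avg: computes the average

# "size": 0 hides the search hits so that only the aggregations are returned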
GET bank/_search
{
  "query": {
    "match": {
      "address": "mill"
    }
  },
  "aggs": {
    "ageAgg": {
      "terms": {
        "field": "age",
        "size": 10
      }
    },
    "ageAvg": {
      "avg": {
        "field": "age"
      }
    },
    "balanceAvg":{
      "avg": {
        "field": "balance"
      }
    }
  },
  "size": 0
}







# Sub-aggregation
# Bucket by age and compute the average balance within each age bucket
GET bank/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "ageAgg":{
      "terms": {
        "field": "age",
        "size": 100
      },
      "aggs": {
        "balanceAvg": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}








# Age distribution, plus for each age the average balance of M, the average balance of F, and the overall average balance


GET bank/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "ageAgg": {
      "terms": {
        "field": "age",
        "size": 100
      },
      
      "aggs": {
        "genderAgg": {
          "terms": {
            "field": "gender.keyword",
            "size": 100
          },
          "aggs": {
            "balanceAvg": {
              "avg": {
                "field": "balance"
              }
            }
          }
        },
        "balanceAvg":{
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
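
To make the shape of the nested aggregation result easier to picture, here is a small pure-Python equivalent of the request above. The four account records are made up for the example, and the function only imitates the grouping logic (terms by age, terms by gender inside it, avg at both levels); it is not how Elasticsearch computes aggregations internally.

```python
from collections import defaultdict

# Made-up accounts; only age, gender and balance matter here.
accounts = [
    {"age": 30, "gender": "M", "balance": 1000},
    {"age": 30, "gender": "F", "balance": 3000},
    {"age": 31, "gender": "M", "balance": 500},
    {"age": 30, "gender": "M", "balance": 2000},
]


def avg(values):
    return sum(values) / len(values)


def age_buckets(docs):
    """Group by age, then by gender, with an average balance at each level."""
    by_age = defaultdict(list)
    for d in docs:
        by_age[d["age"]].append(d)
    buckets = []
    # Order buckets by document count, largest first, then by key.
    for age, group in sorted(by_age.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        by_gender = defaultdict(list)
        for d in group:
            by_gender[d["gender"]].append(d["balance"])
        buckets.append({
            "key": age,
            "doc_count": len(group),
            "balanceAvg": avg([d["balance"] for d in group]),
            "genderAgg": {g: avg(b) for g, b in sorted(by_gender.items())},
        })
    return buckets


result = age_buckets(accounts)
for bucket in result:
    print(bucket)
```

Each printed bucket corresponds to one entry of ageAgg.buckets in the real response, with genderAgg and balanceAvg nested inside it.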

Mappings

# Show the mapping of an index
GET bank/_mapping


# Create an index with an explicit mapping
PUT /my_index
{
  "mappings": {
    "properties": {
      "age":{
        "type": "integer"
      },
      "email":{
        "type": "keyword"
      },
      "name":{
        "type": "text"
      }
    }
  }
}



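# Add to a mapping: only new fields can be added; the mapping of an existing field cannot be changed
# index: whether the field is searchable
# store: whether the field value is stored separately
# analyzer: the analyzer to use
# doc_values: whether the field can be used in aggregations
# The index already exists, so the request goes to the _mapping endpoint
PUT /my_index/_mapping
{
  "properties": {
    "employee-id":{
      "type": "keyword",
      "index": true
    }
  }
}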





GET bank/_search
GET bank/_mapping
GET new_bank/_mapping
# Changing a mapping
# The type of an existing field cannot be changed
# Create a new index with the desired mapping and migrate the data into it
PUT /new_bank
{
  "mappings": {
    "properties": {
      "account_number" : {
          "type" : "long"
        },
        "address" : {
          "type" : "text"
          
        },
        "age" : {
          "type" : "integer"
        },
        "balance" : {
          "type" : "long"
        },
        "city" : {
          "type" : "keyword"
        },
        "email" : {
          "type" : "keyword"
        },
        "employer" : {
          "type" : "keyword"
        },
        "firstname" : {
          "type" : "text"
        },
        "gender" : {
          "type" : "keyword"
        },
        "lastname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "state" : {
          "type" : "keyword"
        }
    }
  }
}



# Data migration
# type is given here because bank holds data created under an older, typed version
# From version 7 onward, type can be omitted
POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "new_bank"
  }
}
GET new_bank/_search

nested

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user.first": "Alice"
          }
        },
        {
          "match": {
            "user.last": "Smith"
          }
        }
      ]
    }
  }
}

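Both conditions in the query above are required, and no single user is named Alice Smith, so in theory nothing should match. In practice the document is returned.

This is because ES stores all first values in one array and all last values in another. At query time both values are found in their respective arrays, so the document matches.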

// Mapping user as nested prevents this; set the mapping when the index is created (delete my_index first if it already exists)
PUT my_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}
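
The difference between the default object mapping and nested can be imitated in a few lines of Python. This is only a model of the flattening described in this section, with function names invented for the example; it does not call Elasticsearch.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flatten(objects):
    """Mimic the default object mapping: one value list per field path."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault("user." + key, []).append(value)
    return flat


def match_flattened(objects, first, last):
    """Each condition is checked against its own list, so pairs get mixed up."""
    flat = flatten(objects)
    return first in flat["user.first"] and last in flat["user.last"]


def match_nested(objects, first, last):
    """Nested-style matching: both conditions must hold inside one object."""
    return any(o["first"] == first and o["last"] == last for o in objects)


print(flatten(users))
print(match_flattened(users, "Alice", "Smith"))  # True: the false positive
print(match_nested(users, "Alice", "Smith"))     # False
print(match_nested(users, "Alice", "White"))     # True
```

With a nested mapping in place, the search itself also has to be wrapped in a nested query whose path is user; the plain bool query shown earlier does not reach into nested objects.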

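Special queries

Query the data recorded at 7:50 every day

The hits are actually the documents from 15:50 local time, because ES works with dates in UTC and the local zone here is UTC+8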

POST your_es_index*/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['@timestamp'].value.getHour() == params.queryHour  && doc['@timestamp'].value.getMinute() == params.queryMinute ",
            "params": {
              "queryHour":7,
              "queryMinute": 50
            }
          }
        }  
      },
      "must": [
        {
          "term": {
            "license.keyword":"巴啦啦"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2023-01-16T15:50:59.999+0800",
              "lte": "2024-01-17T15:50:00.000+0800"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "from": 0,
  "size": 366
}
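
The hour and minute passed to the script have to be expressed in UTC, which is why 7 and 50 select documents written at 15:50 in a UTC+8 zone. The conversion can be checked with the Python standard library; the helper `utc_hour_minute` is a made-up name for this sketch, and the fixed +8 offset is an assumption taken from the +0800 timestamps in the request.

```python
from datetime import datetime, timedelta, timezone

local_zone = timezone(timedelta(hours=8))  # UTC+8, as in the +0800 offsets
local_time = datetime(2024, 1, 17, 15, 50, tzinfo=local_zone)
utc_time = local_time.astimezone(timezone.utc)

print(utc_time.hour, utc_time.minute)  # 7 50


def utc_hour_minute(local_hour, local_minute, offset_hours=8):
    """Convert a local wall-clock time to the UTC (hour, minute) pair."""
    total = (local_hour * 60 + local_minute - offset_hours * 60) % (24 * 60)
    return divmod(total, 60)


print(utc_hour_minute(15, 50))  # (7, 50)
```

For local times earlier than the offset the UTC hour wraps to the previous day, for example a local 07:50 becomes 23:50 UTC, so the values for queryHour and queryMinute should always be derived rather than typed in as local time.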