elasticsearch

Latest recommended article published on 2023-12-10 11:11:49

不负经年

Category: Big Data. Tags: elasticsearch

Article link: https://blog.csdn.net/test_number1/article/details/114757187

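I. Elasticsearch basics

1. Core concepts in ES

index: an index, comparable to a database in MySQL
type: a type created inside an index, comparable to a table in a MySQL database (mapping types are deprecated since ES 7.0 and removed in 8.0)
document: a document; each record in ES is one document
field: a field; a document is made up of multiple fields
mapping: the schema; it defines each field's type and its analysis, indexing, and storage behavior
settings: index-level settings, such as the number of primary shards and replicas
cluster: a cluster; every ES instance is a node, and all nodes joined together form the cluster
node: a single ES instance in the cluster, roughly one sub-server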

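2. Tokenization, pagination, and shards in ES

2.1 Tokenization

Tokenization splits a piece of text (Chinese or any other language) into individual terms. ES tokenizes field values when they are indexed and tokenizes the query text at search time, then matches the two sets of terms.
If a field name is mapped as text, its value is analyzed. For example, with the default standard analyzer, name: "张韶涵" is split into the single characters "张", "韶", and "涵", so a search for any of these terms finds the document.
To keep "张韶涵" together as one term, use an analyzer plugin with a custom dictionary:
Install elasticsearch-analysis-ik (the IK Chinese analyzer plugin).
In the plugin's config directory, edit IKAnalyzer.cfg.xml to point at a custom dictionary file (or dictionary directory), then add the word to that dictionary file.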
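To make the effect of a custom dictionary concrete, here is a toy sketch in Python. It is not the algorithm used by the standard analyzer or by IK; it only assumes a simple "longest dictionary word first, otherwise one character" rule, and the function name tokenize is made up for this illustration.

```python
def tokenize(text, dictionary=frozenset()):
    """Toy tokenizer: longest dictionary match first, else single characters."""
    tokens = []
    # Longest word we ever need to try (at least 1 so single characters work).
    max_len = max([1] + [len(word) for word in dictionary])
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback token.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Without a dictionary every character becomes its own term.
print(tokenize("张韶涵"))
# With the name in the dictionary it survives as one term.
print(tokenize("张韶涵", {"张韶涵"}))
```

The second call mirrors what adding the word to the IK dictionary is meant to achieve: the name is indexed as a single term instead of three separate characters.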

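2.2 Pagination

ES offers shallow pagination and deep pagination.
from + size ("shallow") pagination: to return page 5, ES has to collect the hits for pages 1 to 5 and then discard the first 4 pages.
Drawback: the deeper the page, the more hits every shard has to collect and sort, so performance degrades on large result sets. By default ES rejects requests where from + size exceeds 10,000 (index.max_result_window).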
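The cost of from + size can be sketched with a little arithmetic. The helper names page_request and hits_collected below are hypothetical, and the second function is only a rough upper bound that assumes every shard returns from + size candidates.

```python
def page_request(page, size, query=None):
    """Build a from/size search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "from": (page - 1) * size,
        "size": size,
        "query": query or {"match_all": {}},
    }


def hits_collected(page, size, shard_count):
    """Upper bound on the hits the coordinating node merges for one page."""
    # Each shard may return up to from + size candidates, i.e. page * size.
    return page * size * shard_count


print(page_request(5, 10))        # body to send to /<index>/_search
print(hits_collected(5, 10, 3))   # candidates merged for page 5 on 3 shards
```

Page 5 with 10 hits per page on 3 shards already means merging up to 150 candidates to return 10 documents, which is why deep pages get expensive.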
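scroll ("deep") pagination: a scroll works like a cursor in a traditional database and remembers the current read position. It is not meant for real-time queries but for reading a large number of documents in one pass. The first request takes a snapshot of the results; with scroll=1m the search context is kept alive for 1 minute after each request and released once that time passes without a new request.
The response contains a _scroll_id; pass it as scroll_id in the next request to fetch the following batch.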
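The scroll loop can be sketched as follows. The send(method, path, body) callable is a hypothetical transport that returns the parsed JSON response, and FakeCluster is an in-memory stand-in written only for this illustration; the request paths and the scroll, scroll_id, and _scroll_id fields follow the scroll API.

```python
def scroll_all(send, index, query, size=2, keep_alive="1m"):
    """Yield every hit by following _scroll_id until a batch comes back empty."""
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": query})
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        resp = send("POST", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})


class FakeCluster:
    """In-memory stand-in for ES, only good enough to exercise the loop."""

    def __init__(self, docs):
        self.docs = docs
        self.cursors = {}

    def send(self, method, path, body):
        if path == "/_search/scroll":
            scroll_id = body["scroll_id"]
            snapshot, pos, size = self.cursors[scroll_id]
        else:
            # First request: freeze a snapshot of the current documents.
            scroll_id = "cursor-1"
            snapshot, pos, size = list(self.docs), 0, body["size"]
        batch = snapshot[pos:pos + size]
        self.cursors[scroll_id] = (snapshot, pos + len(batch), size)
        return {"_scroll_id": scroll_id, "hits": {"hits": batch}}


cluster = FakeCluster([{"_id": str(n)} for n in range(5)])
for hit in scroll_all(cluster.send, "school", {"match_all": {}}):
    print(hit["_id"])
```

Against a real cluster it is good practice to clear the scroll (DELETE /_search/scroll) once finished instead of waiting for the keep-alive to expire.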

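2.3 Shards

Every index is made up of one or more shards. Shards come in two kinds, primary shards and replica shards; a replica shard is a copy of a primary shard.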
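How a client indexes (writes) a document:

The client sends the request to any node; that node becomes the coordinating node.
The coordinating node routes the document and forwards the request to the node that holds the target primary shard.
The primary shard on that node processes the request and then replicates the data to the replica shards.
Once the primary shard and all of its replica shards have finished, the coordinating node returns the response to the client.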
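The routing step can be sketched as "hash the routing value, then take it modulo the number of primary shards". By default the routing value is the document _id. In the sketch below zlib.crc32 is only a stand-in (ES uses its own Murmur3-based hash), and route_to_shard is a name made up for this illustration.

```python
import zlib


def route_to_shard(routing, number_of_primary_shards):
    """Pick the primary shard number for a routing value (default: the _id)."""
    # Stand-in hash; the real cluster hashes with Murmur3, not CRC32.
    digest = zlib.crc32(str(routing).encode("utf-8"))
    return digest % number_of_primary_shards


for doc_id in ("1", "2", "3"):
    print(doc_id, "-> shard", route_to_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, that number cannot simply be changed after the index is created; existing documents would no longer be found on the shard the formula points to.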

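How a client searches an index:

The client sends the request to a coordinating node.
The coordinating node forwards the search request to one copy of every shard of the index; the copy can be the primary shard or any of its replica shards.
Query phase: each shard returns its own search results (really just doc IDs and sort values) to the coordinating node, which merges, sorts, and paginates them to produce the final result list.
Fetch phase: the coordinating node then uses the doc IDs to pull the actual documents from the relevant nodes and returns them to the client.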
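The two phases can be mimicked in memory. This is a simplified sketch, not ES code: each shard is just a list of dicts with a precomputed score field, and the function names are invented for the example.

```python
import heapq


def query_phase(shard_docs, frm, size):
    """Per shard: return the top (from + size) entries as (score, doc_id)."""
    return heapq.nlargest(frm + size,
                          ((doc["score"], doc["_id"]) for doc in shard_docs))


def search(shards, frm=0, size=10):
    # Query phase: gather lightweight (score, doc_id, shard_no) entries.
    candidates = []
    for shard_no, docs in enumerate(shards):
        for score, doc_id in query_phase(docs, frm, size):
            candidates.append((score, doc_id, shard_no))
    # The coordinating node merges, sorts, and cuts out the requested page.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    page = candidates[frm:frm + size]
    # Fetch phase: load the full documents only for the hits on this page.
    results = []
    for _score, doc_id, shard_no in page:
        results.append(next(d for d in shards[shard_no] if d["_id"] == doc_id))
    return results


shards = [
    [{"_id": "a", "score": 3.0}, {"_id": "b", "score": 1.0}],
    [{"_id": "c", "score": 2.0}, {"_id": "d", "score": 0.5}],
]
print([doc["_id"] for doc in search(shards, frm=0, size=2)])
```

Only IDs and scores travel in the first phase; full documents are loaded for the final page alone, which keeps the merge step cheap.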

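How a client reads a document by ID:

The client sends the request to any node, which becomes the coordinating node.
The coordinating node routes by document ID and forwards the request to a node that holds the target shard; it rotates round-robin between the primary shard and all of its replicas so that read requests are load-balanced.
The node that receives the request returns the document to the coordinating node.
The coordinating node returns the document to the client.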
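Round-robin selection over the copies of one shard can be sketched in a few lines. ShardCopyPicker and the copy names are hypothetical; the point is only that consecutive reads land on different copies.

```python
import itertools


class ShardCopyPicker:
    """Rotate through the primary and replica copies of a single shard."""

    def __init__(self, copies):
        self._rotation = itertools.cycle(copies)

    def pick(self):
        return next(self._rotation)


picker = ShardCopyPicker(["primary", "replica-1", "replica-2"])
print([picker.pick() for _ in range(4)])
```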

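II. Defining field types with a mapping

1. Field types

A mapping is similar to a table schema in a database. Taking ES 7.11 as an example, the following types are available.
String types:
text: analyzed by default and supports full-text search (the string type was deprecated in 5.x and replaced by text and keyword)
keyword: not analyzed; doc_values are enabled by default to speed up aggregations and sorting, at the cost of extra disk space and I/O, so they can be disabled on fields that are never sorted or aggregated on
Date type:
date: millisecond precision; values are parsed according to the configured format and stored internally as a long (milliseconds since the epoch)
Numeric types:
long: -2^63 to 2^63-1
integer: -2^31 to 2^31-1
short: -32768 to 32767
byte: -128 to 127
double: IEEE 754 double-precision floating point, 8 bytes
float: IEEE 754 single-precision floating point, 4 bytes
half_float: IEEE 754 half-precision floating point, 2 bytes
scaled_float: a floating-point value stored as a long, scaled by a fixed scaling_factor
Boolean type:
boolean: true or false; store defaults to false, and the field is indexed and searchable by default
Range types:
integer_range: the largest representable range is [-2^31, 2^31-1]
float_range: a range of IEEE 754 single-precision floating-point values
long_range: the largest representable range is [-2^63, 2^63-1]
double_range: a range of IEEE 754 double-precision floating-point values
date_range: a range of 64-bit timestamps (in milliseconds)
Geo type:
geo_point: stores a latitude/longitude pair
IP type:
ip: stores IPv4 or IPv6 addresses, which makes CIDR and range queries on the field possible
Array type:
array: ES has no explicit array type; just use [] when indexing data, and all elements inside [] must have the same type
Nested types:
nested: a special object type for arrays of objects in which each inner object can be queried independently
object: a plain JSON object; arrays of objects are accepted but flattened, so the fields of one inner object can no longer be matched together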
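The difference between object and nested is easiest to see on an array of objects. The sketch below only imitates the behavior in plain Python: flatten_objects mimics how an object field loses the link between inner fields, and matches_nested checks each inner object on its own. All names and the sample data are assumptions for illustration.

```python
def flatten_objects(field, objects):
    """Mimic object mapping: collect each inner field into one flat list."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_object(flat, field, conditions):
    """Every condition only needs to match somewhere in the flattened lists."""
    return all(value in flat.get(f"{field}.{key}", [])
               for key, value in conditions.items())


def matches_nested(objects, conditions):
    """All conditions must hold inside the same inner object."""
    return any(all(obj.get(key) == value for key, value in conditions.items())
               for obj in objects)


users = [{"first": "John", "last": "Smith"}, {"first": "Alice", "last": "White"}]
wanted = {"first": "Alice", "last": "Smith"}  # no such person exists

print(matches_object(flatten_objects("user", users), "user", wanted))
print(matches_nested(users, wanted))
```

The flattened check reports a match for "Alice Smith" even though no such inner object exists, while the nested-style check does not; that false positive is the reason to choose nested when inner objects must be queried as units.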

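2. Hands-on

#Create the index
PUT /test_user
#Define the index mapping
POST /test_user/test_game/_mapping?include_type_name=true
{
  "test_game": {
    "properties": {
      "name": {"type": "keyword"},
      "game_name": {"type": "text"},
      "amount": {"type": "float"}
    }
  }
}
#View the index mapping
GET test_user/_mapping
#Add a field to the index (same type name as above, since an index can hold only one type)
PUT /test_user/test_game/_mapping?include_type_name=true
{
  "test_game": {
    "properties": {
      "creat_time": {"type": "date"}
    }
  }
}
Problem encountered:
In ES 7.11 mappings are typeless by default, so creating an index or mapping with a type name requires the parameter include_type_name=true.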
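On 7.x the type name can also be dropped entirely, in which case include_type_name is not needed. As a small sketch (the helper to_typeless is hypothetical), the typed body used above can be unwrapped into the typeless body that would be sent to PUT /test_user/_mapping:

```python
import json


def to_typeless(typed_body):
    """Unwrap {"<type>": {"properties": ...}} into {"properties": ...}."""
    if len(typed_body) != 1:
        raise ValueError("expected exactly one type name at the top level")
    (_type_name, mapping), = typed_body.items()
    return mapping


typed = {
    "test_game": {
        "properties": {
            "name": {"type": "keyword"},
            "game_name": {"type": "text"},
            "amount": {"type": "float"},
        }
    }
}

# Body for: PUT /test_user/_mapping  (no type in the path, no extra parameter)
print(json.dumps(to_typeless(typed), indent=2))
```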

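III. ES syntax

1. Create, update, delete

#Create an index named blog01 (pretty formats the JSON response so it is easier to read)
PUT /blog01?pretty
#Insert a document (article is the type, 1 is the document ID)
PUT /blog01/article/1?pretty
{"id":"1","title":"What is lucene"}
#Update the document with ID=1 (a PUT to an existing ID replaces the whole document)
PUT /blog01/article/1?pretty
{"id":"2","title":"What is es"}
#Get the document with ID=1
GET /blog01/article/1?pretty
#Search (documents whose title contains es)
GET /blog01/_search?q=title:es
(in the response, took is the time the request took, in milliseconds)
#Delete a document by ID
DELETE /blog01/article/1
#Bulk-insert data with bulk (adds the following documents to the student type of the school index)
POST /school/student/_bulk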
{ "index": { "_id": 1 }}
{ "name" : "liubei", "age" : 20 , "sex": "boy", "birth": "1996-01-02" ,"about": "i like diaocan he girl" }
{ "index": { "_id": 2 }}
{ "name" : "guanyu", "age" : 21 , "sex": "boy", "birth": "1995-01-02" ,"about": "i like diaocan" }
{ "index": { "_id": 3 }}
{ "name" : "zhangfei", "age" : 18 , "sex": "boy", "birth":"1998-01-02" , "about": "i like travel" }
{ "index": { "_id": 4 }}
{ "name" : "diaocan", "age" : 20 , "sex": "girl", "birth":"1996-01-02" , "about": "i like travel and sport" }
{ "index": { "_id": 5 }}
{ "name" : "panjinlian", "age" : 25 , "sex": "girl", "birth":"1991-01-02" , "about": "i like travel and wusong" }
{ "index": { "_id": 6 }}
{ "name" : "caocao", "age" : 30 , "sex": "boy", "birth": "1988-01-02" ,"about": "i like xiaoqiao" }
{ "index": { "_id": 7 }}
{ "name" : "zhaoyun", "age" : 31 , "sex": "boy", "birth":"1997-01-02" , "about": "i like travel and music" }
{ "index": { "_id": 8 }}
{ "name" : "xiaoqiao", "age" : 18 , "sex": "girl", "birth":"1998-01-02" , "about": "i like caocao" }
{ "index": { "_id": 9 }}
{ "name" : "daqiao", "age" : 20 , "sex": "girl", "birth":"1996-01-02" , "about": "i like travel and history" }
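A bulk body is newline-delimited JSON: one action line followed by one source line per document, and the body has to end with a newline. A minimal sketch for generating it from Python dicts (bulk_body is a hypothetical helper, and the sequential IDs are an assumption that matches the example above):

```python
import json


def bulk_body(docs, start_id=1):
    """Build an NDJSON _bulk body with sequential _id values."""
    lines = []
    for doc_id, doc in enumerate(docs, start=start_id):
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(doc))                          # source line
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


students = [
    {"name": "liubei", "age": 20, "sex": "boy"},
    {"name": "guanyu", "age": 21, "sex": "boy"},
]
print(bulk_body(students), end="")
```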

2. Queries

#Single-condition query with match
GET /school/_search?pretty
{
  "query": {
    "match": {
      "about": "travel"
    }
  }
}
#Multi-condition query with bool (girls who like travel)
GET /school/_search?pretty
{
  "query":{
    "bool": {
      "must": 
        {
          "match": {"about": "travel"}
          
        },
      "must_not": 
        {
          "match":{"sex":"boy"}     
        }
    }
  }
}

#Boys who like travel, using filter (both clauses must match, no relevance scoring)
GET /school/_search?pretty
{
  "query":{
    "bool": {
      "filter": [
        {
          "match": {"about": "travel"}
        },
        {
          "match":{"sex":"boy"}      
        }
      ]
    }
  }
}
#Match a field against either of two keywords (OR semantics, using bool, must, and terms)
GET /school/_search?pretty
{
  "query":{
    "bool": {
      "must": {"terms":{"about":["travel","history"]}
        }
    }
  }
}
#Range query with range (gt: greater than, lte: less than or equal to)
GET /school/_search?pretty
{
  "query":{
    "range": {
      "age": {
        "gt": 20,
        "lte": 25
      }
    }
  }
}
#Students who like travel and are older than 20 and younger than 30
GET /school/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        {"term": {"about": "travel"}},
        {"range": {
          "age": {
            "gt": 20,
            "lt": 30
          }
        }}
      ]
    }
  }
}
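#Querying a field that contains Chinese (when the field type is text); the .keyword sub-field can also be used
#A term query for a whole Chinese phrase on a text field always comes back empty, because the default analyzer indexes each character as a separate term. Either combine one term query per character, or query the .keyword sub-field (it exists when the field was dynamically mapped or explicitly given a keyword sub-field)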
GET /test_index/_search?pretty
{
  "query":{
    "bool": {
      "must":[
        {"term":{"game_name":"斗"}},
        {"term":{"game_name":"地"}},
        {"term":{"game_name":"主"}}
      ]
    }
  }
}
GET test_index/_search?pretty
{
  "query": {
    "term": {
      "game_name.keyword": {
        "value": "斗地主"
      }
    }
  }
}
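#Deduplicate (collapse), select returned fields, sort, and paginate the results
#(test_index and match_all are used here only to make the fragment a complete request)
GET /test_index/_search?pretty
{
  "query": {"match_all": {}},
  "collapse": {
    "field": "api_type"
  },
  "_source": ["_id","name","result","game_name","api_type","settle_time"],
  "sort": [
    {
      "_id": {
        "order": "desc"
      }
    },
    {
      "settle_time": {
        "order": "desc"
      }
    }
  ],
  "from": 0,
  "size": 20
}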

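Structured queries with the DSL
term: exact match against the indexed term (the query text is not analyzed); match: full-text match (the query text is analyzed, and a document matches if it contains any of the resulting terms); range: matches a range (gte: greater than or equal to, lte: less than or equal to)
bool compound query: combines clauses with must, should, must_not, and filter (must: like AND; should: like OR; must_not: like NOT; filter: like must but without scoring)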
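When queries are assembled in application code, a small builder keeps the bool structure readable. This is only a sketch; bool_query and its parameter names are invented for the example, and the output is a plain dict that can be serialized as the search body.

```python
import json


def bool_query(must=(), should=(), must_not=(), filters=()):
    """Assemble a bool query, leaving out clause lists that are empty."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "must_not": list(must_not),
        "filter": list(filters),
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


# Girls who like travel, aged over 20 and under 30.
body = bool_query(
    must=[{"match": {"about": "travel"}}],
    must_not=[{"match": {"sex": "boy"}}],
    filters=[{"range": {"age": {"gt": 20, "lt": 30}}}],
)
print(json.dumps(body, indent=2))
```

Putting the range in filter rather than must is a design choice: it still restricts the results but does not contribute to the relevance score.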
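Response fields
hits: the result set; hits.hits contains the top 10 documents by default, or as many as size specifies, in the requested sort order
hits.total.value: the total number of matching documents (not the number returned); by default it is counted exactly only up to 10,000
max_score: the highest relevance score in this search
took: how many milliseconds the whole request took
_shards: the total number of shards that took part in the query, plus how many succeeded and how many failed
timed_out: whether the query timed out; a timeout such as 10 milliseconds can be set with GET /_search?timeout=10ms
aggregations: the results of any aggregations in the request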
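Reading these fields from a parsed response is plain dictionary access. The sample_response below is hand-written for illustration in the 7.x response shape, not output captured from a real cluster, and summarize is a hypothetical helper.

```python
sample_response = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 0.6,
        "hits": [
            {"_id": "3", "_score": 0.6, "_source": {"name": "zhangfei"}},
            {"_id": "4", "_score": 0.5, "_source": {"name": "diaocan"}},
        ],
    },
}


def summarize(resp):
    """Pull the commonly inspected fields out of a search response."""
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "failed_shards": resp["_shards"]["failed"],
        "total": resp["hits"]["total"]["value"],
        # "gte" would mean the total is only a lower bound.
        "total_is_exact": resp["hits"]["total"]["relation"] == "eq",
        "names": [hit["_source"]["name"] for hit in resp["hits"]["hits"]],
    }


print(summarize(sample_response))
```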

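3. Changing a field's type in the mapping

The type of an existing field cannot be changed in place, so the data is copied into a new index that has the desired mapping.
#Create the new index
PUT /test_oder
#Define the mapping of the new index
POST /test_oder/test_game/_mapping?include_type_name=true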
{
  "test_game":{
    "properties": {
        "name": {"type": "keyword"},
        "api_type": {"type": "keyword"},
        "game_type": {"type": "keyword"},
        "game_name": {"type": "text"},
        "bean_amount": {"type": "float"},
        "status": {"type": "text"},
        "game_time": {"type": "date"},
        "game_no": {"type": "keyword"}
    }
  }
}
#Copy the data from the old index test_index into test_oder
POST _reindex
{
  "source":{
    "index":"test_index"
  },
  "dest":{
    "index":"test_oder"
  }
}
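#Delete the old index
DELETE test_index
#An index cannot be renamed, so give the new index an alias that carries the old name
POST _aliases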
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "test_oder",
        "alias": "test_index"
      }
    }
  ]
}
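The order of the steps matters: the alias can only be added after the old index is gone, because an alias cannot share a name with an existing index. As a sketch, the whole sequence can be written down as a list of (method, path, body) requests. remap_plan is a hypothetical helper, and unlike the requests above it assumes a typeless mapping supplied at index creation.

```python
def remap_plan(old_index, new_index, properties):
    """Return the requests, in order, for swapping in a re-mapped index."""
    return [
        # 1. Create the new index with the desired (typeless) mapping.
        ("PUT", f"/{new_index}", {"mappings": {"properties": properties}}),
        # 2. Copy every document from the old index.
        ("POST", "/_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 3. Remove the old index so its name becomes free.
        ("DELETE", f"/{old_index}", None),
        # 4. Point an alias with the old name at the new index.
        ("POST", "/_aliases",
         {"actions": [{"add": {"index": new_index, "alias": old_index}}]}),
    ]


for method, path, _body in remap_plan("test_index", "test_oder",
                                      {"status": {"type": "keyword"}}):
    print(method, path)
```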

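IV. Data migration

1. elasticdump

There are four ways to migrate data; only elasticdump is shown here as an example.
1. Create an elasticdump folder under the installation directory and open it in cmd
2. If npm is not installed, install Node.js first (npm ships with it)
3. Install elasticdump with npm install elasticdump -g
4. Run the migration commands from the elasticdump folder (input is the source server being copied from, output is the destination server)

#Copy analyzers, e.g. tokenizer settings
elasticdump --input=http://production.es.com:9200/my_index --output=http://staging.es.com:9200/my_index --type=analyzer
#Copy mappings
elasticdump --input=http://production.es.com:9200/my_index --output=http://staging.es.com:9200/my_index --type=mapping
#Copy data
elasticdump --input=http://production.es.com:9200/my_index --output=http://staging.es.com:9200/my_index --type=data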
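When several indices have to be migrated, the three commands can be generated instead of typed by hand. A small sketch, assuming elasticdump is on the PATH; elasticdump_commands is a hypothetical helper and the actual execution is left commented out.

```python
def elasticdump_commands(source, dest, index):
    """Build the analyzer, mapping, and data commands, in that order."""
    return [
        ["elasticdump",
         f"--input={source}/{index}",
         f"--output={dest}/{index}",
         f"--type={kind}"]
        for kind in ("analyzer", "mapping", "data")
    ]


commands = elasticdump_commands("http://production.es.com:9200",
                                "http://staging.es.com:9200",
                                "my_index")
for cmd in commands:
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)  (needs import subprocess)
```

Analyzers and mappings go first so that the data lands in an index that already has the right settings.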

2. Importing data from a file into ES with Python

# coding=utf-8
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import os

# Create the ES client (replace ip with the server address)
es = Elasticsearch(hosts=["http://ip:9200"])
# Define the index mapping

mappings = {
    "mappings": {
        "test_type": {
            "properties": {
                "name": {"type": "keyword"},
                "api_type": {"type": "keyword"},
                "game_name": {"type": "text"},
                "bean_amount": {"type": "float"},
                "status": {"type": "text"},
                "game_time": {"type": "date"},
                "game_no": {"type": "keyword"}
            }
        }
    }
}
# Delete the old index (ignore the 404 returned when it does not exist yet)
es.indices.delete(index="test_index", ignore=[404])
# Create the index (include_type_name is required on ES 7.x because the mapping names a type)
es.indices.create(index="test_index", include_type_name=True, body=mappings)
print("Index created")

fileUrl = os.path.dirname(os.path.abspath(".")) + os.sep + "learn" + os.sep + "test-1.csv"
actions = []
i = 1

# Open and read the file (assumed to be UTF-8), mapping each line onto the index fields
with open(fileUrl, encoding="utf-8") as f:
    for line in f:
        line = line.strip().split(',')
        action = {
            "_index": "test_index",
            "_type": "test_type",
            "_id": i,
            "_source": {
                "name": line[0].replace("\"", ""),
                "api_type": line[1].replace("\"", ""),
                "game_name": line[3].replace("\"", ""),
                "bean_amount": line[4],
                "status": line[11].replace("\"", ""),
                "game_time": line[13].replace("\"", ""),
                "game_no": line[16].replace("\"", "")
            }
        }
        i = i + 1
        actions.append(action)
        # Keep the actions list short: insert after every 10 documents
        if len(actions) == 10:
            helpers.bulk(es, actions)
            actions.clear()

# Insert whatever is left over
if actions:
    helpers.bulk(es, actions)
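Splitting on ',' breaks as soon as a quoted value contains a comma. A sketch of the same idea using the standard csv module and a generator is shown below. The column positions are taken from the script above, but the sample row, the csv_actions name, and the reduced set of fields are assumptions for illustration.

```python
import csv
import io


def csv_actions(lines, index="test_index"):
    """Yield one bulk action per CSV row; csv.reader handles quoted commas."""
    for i, row in enumerate(csv.reader(lines), start=1):
        yield {
            "_index": index,
            "_id": i,
            "_source": {
                "name": row[0],
                "api_type": row[1],
                "game_name": row[3],
                "bean_amount": float(row[4]),
            },
        }


# Stand-in for the real file; note the comma inside the quoted game name.
sample = io.StringIO('"tom","slots","x","Fight, the Landlord",12.5\n')
for action in csv_actions(sample):
    print(action)
```

Because helpers.bulk accepts any iterable and sends it in chunks (chunk_size defaults to 500), the generator can be passed to it directly, e.g. helpers.bulk(es, csv_actions(f)), without the manual batching of 10.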