Getting Started
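Overview
Elasticsearch (ES) is a scalable, open-source full-text search and analytics engine. It stores, searches, and analyzes large volumes of data in near real time and serves a wide range of search use cases.
Basic Concepts
NRT: near real time. There is a short delay (normally about one second) between indexing a document and that document becoming searchable.
Cluster: ES runs as a cluster identified by its name. Use a different cluster name for each environment so that nodes do not join the wrong cluster. A cluster can consist of a single node.
index: a collection of documents. Index names must be lowercase.
type: being phased out. Types are deprecated as of 6.0, and an index can now contain only one type.
document: the basic unit of information that can be indexed. A document lives in a type within an index.
shard: each shard is a self-contained search engine (a Lucene index). A single Lucene index has a hard limit on the number of documents it can hold; per LUCENE-5843 the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128).
replica: a copy of a shard, kept so that data survives a node failure.
An index is split into multiple shards, and each shard has zero or more replicas. Both counts are set when the index is created. The number of replicas can be changed dynamically afterwards, but the number of primary shards is fixed once the index exists (changing it requires creating a new index, for example by reindexing). By default (in 6.x), an index has five primary shards with one replica each, so the index has 10 shards in total, and the cluster needs at least two nodes for the replicas to be allocated.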
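The shard arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting; the function names are made up for illustration and are not part of any ES client.

```python
# Lucene's per-index document ceiling quoted above (LUCENE-5843).
INTEGER_MAX_VALUE = 2**31 - 1
LUCENE_MAX_DOCS = INTEGER_MAX_VALUE - 128


def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Primary shards plus all of their replica copies."""
    return primaries * (1 + replicas_per_primary)


def min_nodes_for_all_copies(replicas_per_primary: int) -> int:
    """Copies of the same shard are never placed on the same node,
    so each copy (primary or replica) needs a node of its own."""
    return 1 + replicas_per_primary


print(LUCENE_MAX_DOCS)              # 2147483519
print(total_shards(5, 1))           # 10
print(min_nodes_for_all_copies(1))  # 2
```

With a single node the five primaries are still allocated, but the replicas stay unassigned, which is why a one-node cluster with default settings reports yellow health.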
By default, ES listens on port 9200 for the REST API.
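Installation
- Download the ES package
- Download the Kibana package; its version must match the ES version, otherwise the two may be incompatible
- Start ES
- Start Kibana
- Open http://localhost:5601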
Cluster
Cluster health
GET /_cat/health?v
green: all primary shards and replicas are allocated
yellow: all primary shards are allocated, but some replicas are not
red: some primary shards are not allocated
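The three colors follow a simple rule, sketched below in Python. The tuple layout used to describe a shard is an assumption made for this example, not the shape of any ES response.

```python
from typing import List, Tuple

# (primary_assigned, replicas_assigned, replicas_configured) for one shard.
# This layout is hypothetical and exists only for the sketch.
Shard = Tuple[bool, int, int]


def cluster_health(shards: List[Shard]) -> str:
    if any(not primary_ok for primary_ok, _, _ in shards):
        return "red"      # at least one primary shard is unassigned
    if any(assigned < configured for _, assigned, configured in shards):
        return "yellow"   # all primaries are up, some replicas are not
    return "green"        # every primary and every replica is assigned


# A single-node cluster with one replica configured per shard:
print(cluster_health([(True, 0, 1)] * 5))
```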
Cluster nodes
GET /_cat/nodes?v
List indices
GET /_cat/indices?v
Create an index
PUT /customer?pretty
Index a document
PUT /customer/_doc/1?pretty
{
"name":"helios"
}
POST /customer/_doc?pretty
{
"name": "Jane Doe"
}
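PUT lets you specify the document ID; use POST when you want ES to generate one. Repeated PUTs with the same ID replace the existing document (an overwrite); a PUT with a new ID inserts a new document.
Note: indexing a document does not require the index to exist beforehand. If the index has not been created, ES creates it automatically when the document is indexed.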
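The indexing semantics just described can be modeled with a small in-memory class. ToyCluster and index_doc are hypothetical names for this sketch; the code does not talk to ES and only mirrors the create, overwrite, and auto-create behavior.

```python
import uuid
from typing import Dict, Optional


class ToyCluster:
    """In-memory stand-in for the indexing behavior described above."""

    def __init__(self) -> None:
        self.indices: Dict[str, Dict[str, dict]] = {}

    def index_doc(self, index: str, source: dict,
                  doc_id: Optional[str] = None) -> dict:
        # The index is created on first write if it does not exist yet.
        docs = self.indices.setdefault(index, {})
        if doc_id is None:
            # POST without an ID: an ID is generated.
            doc_id = uuid.uuid4().hex
        previous = docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        # Same ID: the stored document is replaced as a whole.
        docs[doc_id] = {"_source": dict(source), "_version": version}
        return {"_id": doc_id, "_version": version,
                "result": "updated" if previous else "created"}


cluster = ToyCluster()
print(cluster.index_doc("customer", {"name": "helios"}, "1"))
print(cluster.index_doc("customer", {"name": "helios-again"}, "1"))
print(cluster.index_doc("customer", {"name": "Jane Doe"}))
```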
Retrieve a document
GET /customer/_doc/1?pretty
Delete an index
DELETE /customer?pretty
Update a document
POST /customer/_doc/1/_update?pretty
{
"doc": {
"name":"helios-update",
"age": 27
}
}
POST /customer/_doc/1/_update?pretty
{
"script": "ctx._source.age += 5"
}
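Note: an update does not modify individual fields in place. ES deletes the old document and indexes a new one with the changes merged in. The second request above is a scripted update.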
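A minimal Python sketch of that behavior follows. apply_update is a hypothetical helper: it merges only top-level keys, and a plain Python callable stands in for the script. The age value is numeric so that adding 5 performs arithmetic rather than string concatenation.

```python
from typing import Callable, Optional


def apply_update(old_source: dict,
                 doc: Optional[dict] = None,
                 script: Optional[Callable[[dict], None]] = None) -> dict:
    new_source = dict(old_source)   # work on a copy of the stored _source
    if doc:
        new_source.update(doc)      # partial document: merge (top level only here)
    if script:
        script(new_source)          # scripted update: mutate the copy
    return new_source               # this would be indexed as a new document


def add_five_years(source: dict) -> None:
    # Plays the role of the script "ctx._source.age += 5".
    source["age"] += 5


stored = {"name": "helios"}
after_doc = apply_update(stored, doc={"name": "helios-update", "age": 27})
after_script = apply_update(after_doc, script=add_five_years)
print(stored, after_doc, after_script)
```

The old dictionary is never modified, which mirrors the fact that the previous version of the document is discarded rather than edited.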
Delete a document
DELETE /customer/_doc/1?pretty
Bulk operations
POST /customer/_doc/_bulk?pretty
{"index":{"_id":"1"}}
{"name":"helios-bulk-1"}
{"index":{"_id":"2"}}
{"name":"helios-bulk-2"}
POST /customer/_doc/_bulk?pretty
{"update":{"_id":"1"}}
{"doc":{"name":"bulk-update"}}
{"delete":{"_id":"2"}}
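The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, and the body must end with a newline. The sketch below builds such a body with the standard library; build_bulk_body and the tuple layout are assumptions for illustration, and no request is sent.

```python
import json
from typing import Iterable, Optional, Tuple

# (operation, metadata, body or None); this layout is hypothetical.
Action = Tuple[str, dict, Optional[dict]]


def build_bulk_body(actions: Iterable[Action]) -> str:
    lines = []
    for op, meta, body in actions:
        lines.append(json.dumps({op: meta}))   # action and metadata line
        if body is not None:                   # delete has no second line
            lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"             # trailing newline is required


body = build_bulk_body([
    ("update", {"_id": "1"}, {"doc": {"name": "bulk-update"}}),
    ("delete", {"_id": "2"}, None),
])
print(body, end="")
```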
Search
GET /bank/_search?q=*&sort=account_number:asc&pretty
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}
Note: match_all matches every document in the index.
_source
GET /bank/_search
{
"query": { "match_all": {} },
"_source": ["account_number", "balance"]
}
Note: by default _source contains the whole document; the request can select which fields to return.
match
GET /bank/_search
{
"query": { "match": { "address": "mill" } }
}
Note: unlike match_all, match targets a specific field, and the query text is analyzed into terms.
match_phrase
GET /bank/_search
{
"query": { "match_phrase": { "address": "mill lane" } }
}
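The difference between match and match_phrase can be shown with a toy model: match succeeds if any query term occurs in the field (the default operator is OR), while match_phrase needs the terms next to each other and in order. The analyze function is a crude stand-in for the standard analyzer, and the sample addresses are invented for the example.

```python
import re
from typing import List


def analyze(text: str) -> List[str]:
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def match(field_value: str, query: str) -> bool:
    tokens = set(analyze(field_value))
    return any(term in tokens for term in analyze(query))


def match_phrase(field_value: str, query: str) -> bool:
    tokens, terms = analyze(field_value), analyze(query)
    n = len(terms)
    # Slide a window over the field tokens and look for the exact sequence.
    return n > 0 and any(tokens[i:i + n] == terms
                         for i in range(len(tokens) - n + 1))


for address in ("990 Mill Road", "198 Mill Lane"):
    print(address, match(address, "mill lane"), match_phrase(address, "mill lane"))
```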
bool must
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}
Note: every clause in must has to match.
bool should
GET /bank/_search
{
"query": {
"bool": {
"should": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}
Note: at least one of the should clauses has to match.
bool must_not
GET /bank/_search
{
"query": {
"bool": {
"must_not": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}
Note: none of the must_not clauses may match.
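The three bool clause types combine as in the following sketch. It works on single lowercase terms only and ignores scoring; bool_match is a hypothetical helper, and the should handling assumes the default behavior where at least one should clause is required only when there is no must clause.

```python
import re
from typing import Iterable, Set


def terms_of(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bool_match(address: str,
               must: Iterable[str] = (),
               should: Iterable[str] = (),
               must_not: Iterable[str] = ()) -> bool:
    terms = terms_of(address)
    must, should, must_not = list(must), list(should), list(must_not)
    if any(t not in terms for t in must):
        return False                       # every must term is required
    if any(t in terms for t in must_not):
        return False                       # any must_not term excludes the doc
    if should and not must:
        return any(t in terms for t in should)   # at least one should term
    return True


print(bool_match("198 Mill Lane", must=["mill", "lane"]))
print(bool_match("990 Mill Road", should=["mill", "lane"]))
print(bool_match("990 Mill Road", must_not=["mill", "lane"]))
```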
Aggregations
terms:
request:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
}
}
}
}
response:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound": 20,
"sum_other_doc_count": 770,
"buckets" : [ {
"key" : "ID",
"doc_count" : 27
}, {
"key" : "TX",
"doc_count" : 27
}, {
"key" : "AL",
"doc_count" : 25
}, {
"key" : "MD",
"doc_count" : 25
}, {
"key" : "TN",
"doc_count" : 23
}, {
"key" : "MA",
"doc_count" : 21
}, {
"key" : "NC",
"doc_count" : 21
}, {
"key" : "ND",
"doc_count" : 21
}, {
"key" : "ME",
"doc_count" : 20
}, {
"key" : "MO",
"doc_count" : 20
} ]
}
}
}
Note: by default, terms returns only the top 10 buckets.
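The relationship between the returned buckets and sum_other_doc_count can be reproduced with collections.Counter. In the response above, the ten buckets add up to 230 documents, and 1000 - 230 = 770 is the reported remainder. terms_agg below is a single-process sketch, so it has no counterpart to doc_count_error_upper_bound.

```python
from collections import Counter
from typing import Iterable


def terms_agg(values: Iterable[str], size: int = 10) -> dict:
    counts = Counter(values)
    top = counts.most_common(size)                     # largest buckets first
    buckets = [{"key": key, "doc_count": n} for key, n in top]
    other = sum(counts.values()) - sum(n for _, n in top)
    return {"sum_other_doc_count": other, "buckets": buckets}


# Twelve made-up states with 1..12 documents each.
states = [f"S{i}" for i in range(12) for _ in range(i + 1)]
result = terms_agg(states)
print(len(result["buckets"]), result["sum_other_doc_count"])
```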
avg
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
range:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
},
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
}
}
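In a range aggregation, each bucket includes its from value and excludes its to value, so an age of 30 falls in the 30 to 40 bucket and not in 20 to 30. A short sketch of that bucketing, with a hypothetical range_buckets helper and invented ages:

```python
from typing import Iterable, List, Tuple


def range_buckets(values: Iterable[float],
                  ranges: List[Tuple[float, float]]) -> List[dict]:
    values = list(values)
    buckets = []
    for lo, hi in ranges:
        # from is inclusive, to is exclusive
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({"from": lo, "to": hi, "doc_count": count})
    return buckets


ages = [20, 29, 30, 39, 40, 50]
for bucket in range_buckets(ages, [(20, 30), (30, 40), (40, 50)]):
    print(bucket)
```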
Filter
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
Note: filter clauses do not contribute to the relevance score.
Set Up ES
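Installation
Check that ES is running
GET /
Run in the background
./bin/elasticsearch -d -p pid
Note: the -p option takes a file name; the process ID is written to that file.
Configure ES on the command line
./bin/elasticsearch -d -Ecluster.name=my_cluster -Enode.name=node_1
Aggregations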
- metrics:avg
- pipeline:max_bucket、min_bucket
- bucket:terms
- matrix:
query dsl
Mapping
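Overview
Meta-fields: _index, _type, _id, _source
Fields:
Basic types: text, keyword, date, long, double, boolean, ip
Settings that keep the mapping from growing without bound:
index.mapping.total_fields.limit: maximum number of fields in an index, default 1000.
index.mapping.depth.limit: maximum depth of a field, default 20.
index.mapping.nested_fields.limit: maximum number of nested fields in an index, default 50. Each nested object is indexed as a separate hidden document.
Note: the mapping of an existing field cannot be changed after it is created. To change it, create a new index with the desired mapping, reindex the data into it, and switch an alias over to the new index.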
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": { "type": "text" },
"name": { "type": "text" },
"age": { "type": "integer" },
"created": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
}
}
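Removal of mapping types
Before 6.x, an index could contain multiple types. This feature is being removed: indices created in 6.0 or later support only a single type.
Earlier multi-type implementation: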
PUT twitter
{
"mappings": {
"user": {
"properties": {
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" }
}
},
"tweet": {
"properties": {
"content": { "type": "text" },
"user_name": { "type": "keyword" },
"tweeted_at": { "type": "date" }
}
}
}
}
PUT twitter/user/kimchy
{
"name": "Shay Banon",
"user_name": "kimchy",
"email": "shay@kimchy.com"
}
PUT twitter/tweet/1
{
"user_name": "kimchy",
"tweeted_at": "2017-10-24T09:00:00Z",
"content": "Types are going away"
}
GET twitter/tweet/_search
{
"query": {
"match": {
"user_name": "kimchy"
}
}
}
Alternative implementation:
PUT twitter
{
"mappings": {
"_doc": {
"properties": {
"type": { "type": "keyword" },
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" },
"content": { "type": "text" },
"tweeted_at": { "type": "date" }
}
}
}
}
PUT twitter/_doc/user-kimchy
{
"type": "user",
"name": "Shay Banon",
"user_name": "kimchy",
"email": "shay@kimchy.com"
}
PUT twitter/_doc/tweet-1
{
"type": "tweet",
"user_name": "kimchy",
"tweeted_at": "2017-10-24T09:00:00Z",
"content": "Types are going away"
}
GET twitter/_search
{
"query": {
"bool": {
"must": {
"match": {
"user_name": "kimchy"
}
},
"filter": {
"match": {
"type": "tweet"
}
}
}
}
}
Migrating a multi-type index to single-type indices
PUT users
{
"settings": {
"index.mapping.single_type": true
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"email": {
"type": "keyword"
}
}
}
}
}
PUT tweets
{
"settings": {
"index.mapping.single_type": true
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"tweeted_at": {
"type": "date"
}
}
}
}
}
POST _reindex
{
"source": {
"index": "twitter",
"type": "user"
},
"dest": {
"index": "users"
}
}
POST _reindex
{
"source": {
"index": "twitter",
"type": "tweet"
},
"dest": {
"index": "tweets"
}
}
Custom type field
PUT new_twitter
{
"mappings": {
"_doc": {
"properties": {
"type": {
"type": "keyword"
},
"name": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"content": {
"type": "text"
},
"tweeted_at": {
"type": "date"
}
}
}
}
}
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
},
"script": {
"source": """
ctx._source.type = ctx._type;
ctx._id = ctx._type + '-' + ctx._id;
ctx._type = '_doc';
"""
}
}
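Field datatypes
String
text: analyzed (split into terms) before indexing.
keyword: not analyzed; the whole value is indexed as a single term.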
Numeric datatypes
long, integer, short, byte, double, float, half_float, scaled_float
Date datatype
date
Boolean datatypes
boolean
Binary datatype
binary
Range datatypes
integer_range, float_range, long_range, double_range, date_range
Array datatype
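There is no dedicated array type. A field that holds an array needs no explicit declaration; the only requirement is that all values in the array have the same datatype, directly or after coercion. Note that for an array of objects, the inverted index flattens the fields of the objects, so the association between fields of the same object is lost. To keep each object in the array as an independent entity, use nested.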
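The loss of association can be reproduced with a toy model. The flattened form stores one list of values per field path, so a query for first = Alice and last = Smith matches even though no single user has both. The helper names below are hypothetical, and exact lowercase comparison stands in for real text analysis.

```python
from typing import Dict, List


def flatten(objects: List[dict], prefix: str) -> Dict[str, List[str]]:
    flat: Dict[str, List[str]] = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", []).append(value.lower())
    return flat


def matches_flattened(objects: List[dict], prefix: str, wanted: dict) -> bool:
    flat = flatten(objects, prefix)
    # Each condition is checked against the pooled values of its field.
    return all(value.lower() in flat.get(f"{prefix}.{key}", [])
               for key, value in wanted.items())


def matches_nested(objects: List[dict], wanted: dict) -> bool:
    # All conditions must hold inside one and the same object.
    return any(all(obj.get(key, "").lower() == value.lower()
                   for key, value in wanted.items())
               for obj in objects)


users = [{"first": "John", "last": "Smith"},
         {"first": "Alice", "last": "White"}]
query = {"first": "Alice", "last": "Smith"}
print(flatten(users, "user"))
print(matches_flattened(users, "user", query), matches_nested(users, query))
```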
Object datatype
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"region": {
"type": "keyword"
},
"manager": {
"properties": {
"age": { "type": "integer" },
"name": {
"properties": {
"first": { "type": "text" },
"last": { "type": "text" }
}
}
}
}
}
}
}
}
Nested datatype
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"user": {
"type": "nested"
}
}
}
}
}
PUT my_index/_doc/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
}
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "White" }}
]
}
},
"inner_hits": {
"highlight": {
"fields": {
"user.first": {}
}
}
}
}
}
}
Geo-point datatype
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Geo-point as an object",
"location": {
"lat": 41.12,
"lon": -71.34
}
}
PUT my_index/_doc/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34"
}
PUT my_index/_doc/3
{
"text": "Geo-point as a geohash",
"location": "drm3btev3e86"
}
PUT my_index/_doc/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ]
}
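The input formats differ in coordinate order: the object and string forms are lat then lon, while the array form is lon then lat. The sketch below normalizes three of the four formats to a (lat, lon) tuple. to_lat_lon is a hypothetical helper, and geohash strings are deliberately not handled.

```python
from typing import Sequence, Tuple, Union

GeoInput = Union[dict, str, Sequence[float]]


def to_lat_lon(value: GeoInput) -> Tuple[float, float]:
    if isinstance(value, dict):
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        # String form is "lat,lon". A geohash has no comma and raises ValueError here.
        lat, lon = value.split(",")
        return float(lat), float(lon)
    lon, lat = value                 # array form is [lon, lat]
    return float(lat), float(lon)


print(to_lat_lon({"lat": 41.12, "lon": -71.34}))
print(to_lat_lon("41.12,-71.34"))
print(to_lat_lon([-71.34, 41.12]))
```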
GET my_index/_search
{
"query": {
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 42,
"lon": -72
},
"bottom_right": {
"lat": 40,
"lon": -74
}
}
}
}
}
Geo-shape datatype
Details omitted; see the official documentation if needed.
Ip datatype
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"ip_addr": {
"type": "ip"
}
}
}
}
}
PUT my_index/_doc/1
{
"ip_addr": "192.168.1.1"
}
GET my_index/_search
{
"query": {
"term": {
"ip_addr": "192.168.0.0/16"
}
}
}
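The term query above uses CIDR notation, so it matches any address inside 192.168.0.0/16. The same containment check can be done with Python's ipaddress module; ip_term_matches is a hypothetical helper written for this illustration.

```python
import ipaddress


def ip_term_matches(stored_ip: str, term: str) -> bool:
    address = ipaddress.ip_address(stored_ip)
    if "/" in term:
        # CIDR term: test whether the address lies inside the network.
        return address in ipaddress.ip_network(term, strict=False)
    return address == ipaddress.ip_address(term)   # plain term: exact match


print(ip_term_matches("192.168.1.1", "192.168.0.0/16"))
print(ip_term_matches("10.0.0.1", "192.168.0.0/16"))
```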
Alias datatype
PUT trips
{
"mappings": {
"_doc": {
"properties": {
"distance": {
"type": "long"
},
"route_length_miles": {
"type": "alias",
"path": "distance"
},
"transit_mode": {
"type": "keyword"
}
}
}
}
}
GET _search
{
"query": {
"range" : {
"route_length_miles" : {
"gte" : 39
}
}
}
}
Note: path must be the full path of the target field, including any parent objects.
Multi-fields
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"city": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
PUT my_index/_doc/1
{
"city": "New York"
}
PUT my_index/_doc/2
{
"city": "York"
}
GET my_index/_search
{
"query": {
"match": {
"city": "york"
}
},
"sort": {
"city.raw": "asc"
},
"aggs": {
"Cities": {
"terms": {
"field": "city.raw"
}
}
}
}
Analyzer
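An analyzer consists of character filters, a tokenizer, and token filters.
Character Filter: preprocesses the character stream by changing, adding, or removing characters.
Tokenizer: splits the character stream into tokens.
Token Filter: processes the token stream by removing or adding tokens, or by changing them, for example into synonyms.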
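The three stages can be sketched as a small Python pipeline. The functions below are simplified stand-ins, not the ES implementations, and the stop word set is a small illustrative subset rather than the real English list.

```python
import re
from typing import Callable, List

ENGLISH_STOP_SUBSET = {"a", "an", "the", "and", "or", "of"}  # illustrative only


def strip_html(text: str) -> str:                 # character filter
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:       # tokenizer
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:    # token filter
    return [t.lower() for t in tokens]


def stop(tokens: List[str]) -> List[str]:         # token filter
    return [t for t in tokens if t not in ENGLISH_STOP_SUBSET]


def analyze(text: str,
            char_filters: List[Callable[[str], str]],
            tokenizer: Callable[[str], List[str]],
            token_filters: List[Callable[[List[str]], List[str]]]) -> List[str]:
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The</b> Quick Brown Fox", [strip_html], word_tokenizer, [lowercase]))
print(analyze("<b>The</b> Quick Brown Fox", [strip_html], word_tokenizer, [lowercase, stop]))
```

The order of token filters matters: stop runs after lowercase here, so "The" is lowercased before it is compared with the stop word set.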
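At index time, the analyzer is looked up in this order:
- the analyzer defined in the field mapping
- the analyzer named default in the index settings
- standard
At search time, the analyzer is looked up in this order:
- the analyzer defined in the full-text query
- the search_analyzer defined in the field mapping
- the analyzer defined in the field mapping
- the analyzer named default_search in the index settings
- the analyzer named default in the index settings
- standard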
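Both lookup orders amount to "take the first one that is defined, else standard". A minimal sketch of that resolution follows; the function and parameter names are invented for the example and do not correspond to an ES API.

```python
from typing import Optional


def first_defined(*candidates: Optional[str]) -> str:
    for name in candidates:
        if name:
            return name
    return "standard"                # final fallback in both lists


def index_time_analyzer(field_analyzer: Optional[str] = None,
                        settings_default: Optional[str] = None) -> str:
    return first_defined(field_analyzer, settings_default)


def search_time_analyzer(query_analyzer: Optional[str] = None,
                         field_search_analyzer: Optional[str] = None,
                         field_analyzer: Optional[str] = None,
                         settings_default_search: Optional[str] = None,
                         settings_default: Optional[str] = None) -> str:
    return first_defined(query_analyzer, field_search_analyzer, field_analyzer,
                         settings_default_search, settings_default)


print(index_time_analyzer(field_analyzer="my_analyzer"))
print(search_time_analyzer(field_search_analyzer="my_stop_analyzer",
                           field_analyzer="my_analyzer"))
print(search_time_analyzer())
```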
Example:
PUT my_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase"
]
},
"my_stop_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase",
"english_stop"
]
}
},
"filter":{
"english_stop":{
"type":"stop",
"stopwords":"_english_"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"title": {
"type":"text",
"analyzer":"my_analyzer",
"search_analyzer":"my_stop_analyzer",
"search_quote_analyzer":"my_analyzer"
}
}
}
}
}
PUT my_index/_doc/1
{
"title":"The Quick Brown Fox"
}
PUT my_index/_doc/2
{
"title":"A Quick Brown Fox"
}
GET my_index/_search
{
"query":{
"query_string":{
"query":"\"the quick brown fox\""
}
}
}
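Walkthrough of the example:
- The filter entries are token filters.
- search_analyzer is used at search time for non-phrase queries.
- search_quote_analyzer is used at search time for phrase queries.
- The query_string above is wrapped in quotes, so it is a phrase query and my_analyzer (the search_quote_analyzer) applies. Stop words are kept, and the query is analyzed into [the, quick, brown, fox], which matches only document 1.
- A non-phrase query (the same words without quotes) is analyzed with the search_analyzer, my_stop_analyzer, which removes the stop word "the".
- Its query terms are therefore [quick, brown, fox], which match both documents.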
Analysis
This part no longer needs to be read.
Index aliases
PUT /my_index_v1
PUT /my_index_v1/_alias/my_index
GET /*/_alias/my_index
GET /my_index_v1/_alias/*
POST /_aliases
{
"actions": [
{ "remove": { "index": "my_index_v1", "alias": "my_index" }},
{ "add": { "index": "my_index_v2", "alias": "my_index" }}
]
}
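The remove and add actions are sent in one _aliases request so that the alias switches from the old index to the new one in a single atomic step, with no moment at which my_index points nowhere. The sketch below only builds the request body; alias_swap_body is a hypothetical helper and nothing is sent to a cluster.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> str:
    actions = [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
    return json.dumps({"actions": actions}, indent=2)


print(alias_swap_body("my_index", "my_index_v1", "my_index_v2"))
```

Clients that always read and write through the alias name do not need to change when the underlying index is rebuilt with a new mapping.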