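ElasticSearch from Getting Started to Giving Up (Part 1): Introduction, Mappings, Field Types, Queries, Aggregations [Based on the Official 7.5 Documentation]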

Latest recommended article published 2020-09-11 17:33:00

疯狂学习的白菜


Category: ElasticSearch

Article link: https://blog.csdn.net/xcvbxv01/article/details/103537585


View the original note (with source code and images): http://note.youdao.com/noteshare?id=d439afd2a88da302fd79634ff79c5359&sub=0302D1C67F6C40AB9105E138BA897D16

1. Terminology

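Near real-time (NRT)

ES is a near-real-time search engine (platform): there is only a short delay between indexing a document and that document becoming searchable (about 1 second by default).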

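Cluster

Several ES servers can run together as a cluster, and a search can be sent to any node in it. A cluster has a name, which defaults to "elasticsearch" and can be changed. The name must be unique within your environment, because a node joins a cluster by its cluster name. Make sure the same cluster name is not reused inside one environment, or a node may join a cluster you did not intend.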

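Node

A node is a single server that is part of a cluster. It stores data and takes part in the cluster's indexing and search capabilities.

Like a cluster, a node is identified by a name. In 7.x the default name is the hostname of the machine at startup (older releases assigned a random universally unique identifier, UUID, instead). If you do not want the default, you can set any node name you like. The name matters for administration, because you need to know which servers in your network correspond to which nodes in the ElasticSearch cluster.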

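Index

An index is a collection of documents that share some similar characteristics.

For example, you can have one index for customer data, another for a product catalog, and another for order data. An index is identified by a name (which must be all lowercase), and that name is used to refer to the index when indexing, searching, updating, and deleting its documents. You can define as many indices as you want in a single cluster. If you have learned MySQL, you can think of an index for now as a database in MySQL.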

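Type

In early versions an index could contain several types. For example, one index could hold an article type, a user type, and a comment type.

Since 6.0 an index can no longer contain more than one type. In 7.x types are deprecated (the single type is addressed as _doc), and the whole concept of types is to be removed in a later version.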

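Document

A document is the basic unit of information that can be indexed. For example, you can have one document for a single customer, one for a single product, and one for a single order. Documents are expressed in JSON (JavaScript Object Notation), a ubiquitous data interchange format on the internet.

Within an index/type you can store as many documents as you want. Note that although a document physically resides in an index, it must be indexed into (assigned to) a type inside that index; in 7.x that type is always _doc.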

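Version control

ElasticSearch uses optimistic locking to keep data consistent. When a client operates on a document it does not lock and unlock the document; it only states which version it expects to operate on. If the version matches, ElasticSearch lets the operation go through. If there is a version conflict, ElasticSearch rejects the operation and raises a VersionConflictEngineException.

ElasticSearch version numbers range from 1 to 2^63-1.

Internal versioning: uses _version. Note that in 7.x a write request can no longer pass the internal _version for this check; the if_seq_no and if_primary_term parameters are used instead.

External versioning: ElasticSearch handles an external version number a little differently from an internal one. Instead of checking that _version is equal to the value given in the request, it checks that the current _version is lower than the value given in the request. If the request succeeds, the external version number is stored as the document's _version.

To keep _version in step with a version number maintained in an external system, pass version_type=external.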
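The difference between the two checks can be sketched with a small in-memory model. This is only an illustration under stated assumptions: ToyStore and VersionConflict are hypothetical names, nothing here talks to ElasticSearch, and the internal branch models the "expected version must be equal" rule described above rather than the if_seq_no/if_primary_term mechanics of 7.x.

```python
class VersionConflict(Exception):
    """Stands in for VersionConflictEngineException in this toy model."""


class ToyStore:
    """In-memory model of the two version checks; not Elasticsearch code."""

    def __init__(self):
        self.docs = {}  # doc id -> (version, source)

    def index(self, doc_id, source, version=None, version_type="internal"):
        current, _ = self.docs.get(doc_id, (0, None))
        if version_type == "external":
            # External: the supplied version must be greater than the stored one.
            if version is None or version <= current:
                raise VersionConflict(f"current {current} >= supplied {version}")
            new_version = version
        else:
            # Internal-style: the expected version must equal the stored one.
            if version is not None and version != current:
                raise VersionConflict(f"current {current} != expected {version}")
            new_version = current + 1
        self.docs[doc_id] = (new_version, source)
        return new_version
```

With this model, a second writer that still holds version 1 after someone else has written version 2 gets a conflict, while an external writer only succeeds when it brings a higher number than the one already stored.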

2. Mapping

A mapping defines the data type of every field in an index (formerly in a type), together with related properties such as how each field is analyzed.

# View the mapping
GET bank/_mapping

# Response
{
  "bank" : {
    "mappings" : {
      "properties" : {
        "account_number" : { "type" : "long" },
        "address" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "age" : { "type" : "long" },
        "balance" : { "type" : "long" },
        "city" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "email" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "employer" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "firstname" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "gender" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "lastname" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        },
        "state" : {
          "type" : "text",
          "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
        }
      }
    }
  }
}

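Field types

When you create an index you can define the field types and their properties up front, so that date fields are handled as dates, numeric fields as numbers, string fields as strings, and so on.

Supported data types:

a. Core datatypes

1. String: text and keyword (the old string type no longer exists in 7.x)
1.1 text is used to index long text. The text is analyzed into terms before it is indexed, and ES searches on those terms. A text field cannot be used for sorting or aggregations by default.
1.2 keyword is not analyzed and can be used for filtering, sorting, and aggregations. A keyword field can only be matched by its exact value.
2. Numeric: long, integer, short, byte, double, float
3. Date: date
4. Boolean: boolean
5. Binary: binary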

b. Complex datatypes

1. Array datatype: arrays do not need a dedicated type for their elements. For example:
   String array: [ "one", "two" ]
   Integer array: [ 1, 2 ]
   Array of arrays: [ 1, [ 2, 3 ]], which is equivalent to [ 1, 2, 3 ]
   Array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]
2. Object datatype: object, for a single JSON object
3. Nested datatype: nested, for arrays of JSON objects

c. Geo datatypes

1. Geo-point datatype: geo_point, for latitude/longitude coordinates
2. Geo-shape datatype: geo_shape, for complex shapes such as polygons

d. Specialised datatypes

1. IP datatype: ip, for IPv4 and IPv6 addresses
2. Completion datatype: completion, provides auto-complete suggestions
3. Token count datatype: token_count, counts the number of tokens a string produces after analysis
4. mapper-murmur3: available through a plugin; murmur3 computes a hash of the field values at index time and stores it in the index
5. Attachment datatype: with the mapper-attachments plugin, attachment supports indexing formats such as Microsoft Office, Open Document, ePub, and HTML (in recent versions this role is filled by the ingest attachment processor plugin)
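The array rule listed under the complex datatypes, that [ 1, [ 2, 3 ]] is treated like [ 1, 2, 3 ], can be illustrated with a few lines of Python. This is a sketch of the flattening idea only; flatten is a hypothetical helper, not something ElasticSearch exposes.

```python
def flatten(values):
    """Recursively flatten nested lists, the way array fields ignore nesting."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value))
        else:
            flat.append(value)
    return flat


print(flatten([1, [2, 3]]))  # [1, 2, 3]
```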

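Field properties

"address" : {
  "type" : "text",
  "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
},
"age" : { "type" : "long" },
"balance" : { "type" : "long" }

Several of the settings below were originally written in pre-5.0 syntax; they are shown here in the form 7.x accepts, with the old form noted where it differs.

"store": false // Whether to store the field value separately from _source. Default is false; the value can still be read from _source.

"index": true // Whether the field is indexed. If set to false, the field is not indexed and cannot be searched.

"analyzer": "ik" // The analyzer to use (here one provided by the IK plugin). The default is the standard analyzer.

"boost": 1.23 // Field-level score boost. Default is 1.0.

"doc_values": false // Enabled by default on fields that are not analyzed (keyword, numeric, date, and so on); not available on text fields. Gives a large performance gain for sorting and aggregations and saves heap memory.

"fielddata": false // For text fields only, where it is disabled by default. It has to be enabled before a text field can be sorted or aggregated on, at the cost of heap memory; for fields that are not analyzed, use doc_values instead. (Before 5.0 this was written {"format":"disabled"}.)

"fields": {"raw": {"type": "keyword"}} // Indexes the same value in more than one way, for example once analyzed and once not analyzed. (Before 5.0 this was written {"type":"string","index":"not_analyzed"}.)

"ignore_above": 100 // Strings longer than 100 characters are ignored and not indexed.

"include_in_all": true // Legacy setting: whether the field is included in the _all field (default true unless indexing is turned off for the field). The _all field is no longer available in 7.x, so this parameter cannot be used there; copy_to is the usual replacement.

"index_options": "docs" // Four options: docs (document number), freqs (document number + term frequency), positions (document number + term frequency + position, typically used for proximity queries), offsets (document number + term frequency + position + character offsets, typically used for highlighting). Analyzed fields default to positions, all others to docs.

"norms": true // Default for analyzed fields; fields that are not analyzed default to false. Stores the length normalization factor used in scoring (and, in old releases, the index-time boost). Recommended for fields that take part in scoring; it costs extra memory. (Before 5.0 this was written {"enable":true,"loading":"lazy"}.)

"null_value": "NULL" // The value indexed in place of an explicit null, so that missing values can be searched. It must have the same type as the field.

"position_increment_gap": 0 // Affects phrase and proximity queries. It applies to analyzed fields that hold multiple values; queries can set a slop to bridge the gap. Default is 100.

"search_analyzer": "ik" // The analyzer used at search time. By default it is the same as analyzer. For example, index with standard+ngram and search with standard to implement auto-suggest.

"similarity": "BM25" // The scoring algorithm for the field. BM25 is the default in 7.x (older releases defaulted to classic TF/IDF). Only meaningful for string fields.

"term_vector": "no" // Term vectors are not stored by default. Supported values: yes (terms), with_positions (terms + positions), with_offsets (terms + offsets), with_positions_offsets (terms + positions + offsets). Speeds up the fast vector highlighter, but enlarges the index, so it is not a good fit for large data volumes.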

Creating a mapping

PUT /testmapping
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "name": { "type": "text", "index": false },
      "publish_date": { "type": "date", "index": false },
      "price": { "type": "double" },
      "number": { "type": "integer" }
    }
  }
}
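The field properties and the create-index request above can be combined into one request body. The following is a minimal sketch in Python that only builds and prints the JSON; the index name articles and the field names title, tag, and internal_note are hypothetical, and the parameter choices are illustrative rather than recommended settings.

```python
import json

# Hypothetical mapping body using 7.x-style boolean and string parameters.
create_index_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "standard",
                "index_options": "offsets",
                "norms": True,
                # Multi-field: the same value indexed again without analysis.
                "fields": {"raw": {"type": "keyword", "ignore_above": 100}},
            },
            "tag": {"type": "keyword", "null_value": "NULL", "doc_values": True},
            # Kept in _source but not searchable.
            "internal_note": {"type": "text", "index": False},
        }
    },
}

# This is the body you would send with: PUT /articles
print(json.dumps(create_index_body, indent=2))
```

Building the body as a dict and serializing it with json.dumps avoids hand-written JSON mistakes such as trailing commas or Python-style True/False literals.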

Viewing the created mapping

# Response of GET /testmapping
{
  "testmapping" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "name" : { "type" : "text", "index" : false },
        "number" : { "type" : "integer" },
        "price" : { "type" : "double" },
        "publish_date" : { "type" : "date", "index" : false },
        "title" : { "type" : "text" }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1575625879384",
        "number_of_shards" : "3",
        "number_of_replicas" : "0",
        "uuid" : "gsDYYk_NT0yIO7qT632x2g",
        "version" : { "created" : "7050099" },
        "provided_name" : "testmapping"
      }
    }
  }
}

3. REST API (create, update, delete)

# Cluster health
GET _cat/health?v

# Master info
GET _cat/master?v

# Node info
GET _cat/nodes?v

# Index info
GET _cat/indices?v

# Create the index customer
PUT customer?pretty
GET _cat/indices?v

# Add a document to the index
PUT /customer/_doc/1?pretty
{
  "name": "John Doe"
}

# Request format
<HTTP Verb> /<Index>/<Endpoint>/<ID>

# Add a document under a second type [fails: an index can have only one type]
PUT /customer/_type1/1?pretty
{
  "name": "John Doe"
}

# Get a document
GET /customer/_doc/1

# Delete the index
DELETE customer

# Delete a document
DELETE /customer/_doc/1

# Modify a document -- the new body replaces the whole document
POST /customer/_doc/1
{
  "name": "John Doe",
  "age": 12
}

POST /customer/_doc/1
{
  "age": 13
}

# Update the value of a field in the document
POST /customer/_update/1
{
  "script": "ctx._source.age += 5"
}

# Bulk API
## (1) delete: deletes a document; it needs only the one action line
## (2) create: like PUT /index/_create/id (PUT /index/type/id/_create before 7.0); forces creation and fails if the document already exists
## (3) index: an ordinary put; it either creates the document or replaces it in full
## (4) update: a partial update; the changed fields go under doc
POST /customer/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

POST /customer/_bulk
{ "update":{ "_id": 1 }}
{"doc": { "price" : 30, "productID" : "xxxxxxxxxx" }}

POST /customer01/_bulk
{"delete":{"_id":11}}
{"create":{"_id":3}}
{"test_field":"test3"}
{"index":{"_id":4}}
{"test_field":"test4"}
{"index":{"_id":4}}
{"test_field":"replaced test2"}
{"update":{"_id":1}}
{"doc":{"test_field2":"2"}}

# Multi-get
GET bank/_mget
{
  "ids": ["1", "2", "3"]
}
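A bulk body is newline-delimited JSON: one action line, followed by a source line for every action except delete, and a final newline at the end. The helper below is a small sketch of how such a body could be assembled in Python; build_bulk_body and its tuple format are assumptions made for this example, not part of any client library.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples into an NDJSON bulk body."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if action == "delete":
            continue  # delete has no source line
        if action == "update":
            lines.append(json.dumps({"doc": source}))  # partial document
        else:
            lines.append(json.dumps(source))  # index / create: full document
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = build_bulk_body([
    ("delete", {"_id": 11}, None),
    ("create", {"_id": 3}, {"test_field": "test3"}),
    ("update", {"_id": 1}, {"test_field2": "2"}),
])
print(body)
```

The resulting string is what would be sent as the request body of POST /customer01/_bulk with Content-Type: application/json.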

4. Bulk loading from an external file

Put the data file (accounts.json) in a directory and run the following curl commands from that directory:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"

5. Queries

Let's start with some simple searches. There are two basic ways to run a search:

One is to send the search parameters in the REST request URI.
The other is to send them in the REST request body.
Basic structure of a query: GET /index/_search
The body can then contain query/sort/size/from/_source
Conditions go inside match
A bool query can contain must/must_not/should/filter

GET /bank/_search
{
  // query
  "query": { "match_all": {} },
  // sort
  "sort": [
    { "account_number": "asc" }
  ],
  // limit
  "size": 20,
  // offset
  "from": 0,
  // fields to return; use one of the following three forms
  "_source": "account_number"
  // "_source": { "includes": ["name","address"], "excludes": ["age","birthday"] }
  // "_source": { "includes": "addr*", "excludes": ["name","bir*"] }
}

The query part breaks down further (a real request uses one of these at the top level):

"query": {
  // match everything
  "match_all": {},
  // match a given value
  "match": { "FIELD": "TEXT" },
  // treat a space-separated string as one phrase
  "match_phrase": { "FIELD": "PHRASE xxx" },
  // prefix match
  "match_phrase_prefix": { "email": "ab" },
  // search several fields at once
  "multi_match": { "query": "IL", "fields": ["state","email"] },
  // bool query
  "bool": {
    "must": [ {} ],
    "must_not": [ {} ],
    "should": [ {} ],
    "filter": { }
  },
  // range query / filter
  "range": { "FIELD": { "gte": 10, "lte": 20 } }
}

Conditions inside match:

"match": {
  // FIELD contains TEXT
  "FIELD": "TEXT",
  // age equals 20
  "age": 20
},

Clauses inside a bool query: must/must_not/should/filter

"bool": {
  "must": [ {} ],
  "must_not": [ {} ],
  "should": [ {} ],
  "filter": { }
}
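The skeleton above maps naturally onto a small builder function. This is a sketch under assumptions: build_search_body and its page/page_size parameters are names invented for this example, and the only logic it adds is converting a 1-based page number into the 0-based from offset.

```python
def build_search_body(query=None, sort=None, page=1, page_size=10, source=None):
    """Assemble a _search request body from the parts described above."""
    if page < 1 or page_size < 0:
        raise ValueError("page must be >= 1 and page_size must be >= 0")
    body = {
        "query": query or {"match_all": {}},
        "from": (page - 1) * page_size,  # 0-based offset of the first hit
        "size": page_size,
    }
    if sort:
        body["sort"] = sort
    if source is not None:
        body["_source"] = source
    return body


# Second page of 10 accounts, sorted by account number, two fields only.
print(build_search_body(
    sort=[{"account_number": "asc"}],
    page=2,
    source=["account_number", "balance"],
))
```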

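5.1 Score (_score)

The score is a number that gives a relative measure of how well a document matches the search query. The higher the score, the more relevant the document; the lower the score, the less relevant.
Queries do not always need to produce a score, though, especially when they are only used to filter the document set. Elasticsearch optimizes query execution automatically so that it does not compute scores that are not needed.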

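5.2 Simple examples

###### 1. A simple request #####
# [About the response]
## took: the time in milliseconds Elasticsearch took to run the search
## timed_out: whether the search timed out
## _shards: how many shards were searched, with counts of successful/failed shards
## hits: the search results
## hits.total: an object with information about the total number of matching documents
## hits.total.value: the value of the total hit count
## hits.total.relation: eq if hits.total.value is the exact hit count, gte if it is a lower bound (the real total is greater than or equal to it)
## hits.hits: the actual array of search results (the first 10 documents by default)
## hits.sort: the sort key of each result (absent when sorting by score)
## hits._score and max_score: ignore these fields for now
# The query part says what to search for; match_all is the type of query to run, and it simply matches every document in the given index
GET /bank/_search
{
  "query": { "match_all": {} }
}

###### 2. Sorting with sort #####
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

###### 3. size and from #####
# The from parameter (0-based) says which document to start at, and size says how many documents to return from there. This is what you use to page through search results
# If from is not given, it defaults to 0
# Comparable to LIMIT in MySQL
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "from": 10,
  "size": 10
}

##### 4. Returning only two fields, account number and balance #####
# Comparable to: select account_number, balance from bank
GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}

#### 5. The match query #####
# Comparable to a conditional query in MySQL
# Find the account numbered 20
# select account_number from bank where account_number = 20
GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}

# Find addresses containing the term mill, case-insensitively
# roughly: select address from bank where address like '%mill%'
GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

# All accounts whose address contains "mill" or "lane"
# roughly: select address from bank where address like '%mill%' or address like '%lane%'
GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

# What if I want the phrase "mill lane"? Use match_phrase
# In match, a space separates two terms, and a document containing either term is returned
# match_phrase treats the string as a single phrase and only matches documents that contain the terms next to each other, in that order
GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

##### 6. bool queries ######
# If you know MySQL, you will see that a bool query is essentially and / or / not
# Combines two match queries and returns all accounts whose address contains both "mill" and "lane"
# The bool must clause lists queries that must all be true for a document to match
# Comparable to AND
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

# Combines two match queries and returns all accounts whose address contains "mill" or "lane"
# The bool should clause lists queries of which at least one must be true for a document to match
# Comparable to OR
GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

# Combines two match queries and returns all accounts whose address contains neither "mill" nor "lane"
# The bool must_not clause lists queries of which none may be true for a document to match
# Comparable to NOT
GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

# Return the accounts of everyone who is 40 years old but does not live in Idaho (state ID)
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": 40 } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

# bool + filter + range
# Use a bool query to return accounts with a balance greater than or equal to 20000 and less than or equal to 30000
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ],
      "filter": {
        "range": {
          "balance": { "gte": 20000, "lte": 30000 }
        }
      }
    }
  }
}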
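How must, should, must_not, and filter combine can be made concrete with a toy evaluator that runs over plain Python dicts. This is a deliberately simplified model, not how ElasticSearch executes queries: clause_matches and bool_matches are hypothetical helpers, analysis is reduced to lowercasing and splitting on whitespace, only match, match_all, and range with gte/lte are handled, and scoring is ignored.

```python
def as_list(clauses):
    """Accept either a single clause object or a list of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def clause_matches(doc, clause):
    """Evaluate one toy clause against a document (a plain dict)."""
    kind, body = next(iter(clause.items()))
    if kind == "match_all":
        return True
    field, cond = next(iter(body.items()))
    value = doc.get(field)
    if value is None:
        return False
    if kind == "match":
        if isinstance(value, str):
            # Crude stand-in for analysis: lowercase, split on whitespace.
            doc_terms = value.lower().split()
            return any(t in doc_terms for t in str(cond).lower().split())
        return value == cond
    if kind == "range":
        low = cond.get("gte", float("-inf"))
        high = cond.get("lte", float("inf"))
        return low <= value <= high
    raise ValueError(f"unsupported clause: {kind}")


def bool_matches(doc, bool_query):
    """must/filter: all true; must_not: none true; should: see comment."""
    required = as_list(bool_query.get("must")) + as_list(bool_query.get("filter"))
    should = as_list(bool_query.get("should"))
    must_not = as_list(bool_query.get("must_not"))
    if not all(clause_matches(doc, c) for c in required):
        return False
    if any(clause_matches(doc, c) for c in must_not):
        return False
    # With no must/filter clauses, at least one should clause has to match.
    if should and not required:
        return any(clause_matches(doc, c) for c in should)
    return True


accounts = [
    {"address": "990 Mill Road", "age": 40, "balance": 25000},
    {"address": "198 Mill Lane", "age": 30, "balance": 5000},
    {"address": "12 Oak Street", "age": 40, "balance": 41000},
]
both = {"must": [{"match": {"address": "mill"}}, {"match": {"address": "lane"}}]}
print([a["address"] for a in accounts if bool_matches(a, both)])
```

The sample accounts are made up for the illustration; swapping must for should or must_not in the both query reproduces the AND / OR / NOT behaviour described in the examples above.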

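6. Aggregations

Aggregations let you group data and extract statistics from it. The simplest way to think about aggregations is as a rough equivalent of SQL GROUP BY and the SQL aggregate functions.

# Group all accounts by state and return the top 20 states, sorted by count in descending order
GET /bank/_search
{
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 20
      }
    }
  },
  "size": 0
}

# Group all accounts by state and compute the average account balance per state, for the top 20 states by count
# Note how the average_balance aggregation is nested inside the group_by_state aggregation. This is a common pattern for all aggregations: you can nest aggregations inside aggregations arbitrarily to extract the results you need from your data
GET /bank/_search
{
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 20
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  },
  "size": 0
}

# Building on the previous aggregation, now sort by average balance in descending order
GET /bank/_search
{
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 20,
        "order": { "average_balance": "desc" }
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  },
  "size": 0
}

# Group by age bracket (20-29, 30-39, 40-49), then by gender, and get the average account balance for each age bracket and gender
# In a range aggregation, from is inclusive and to is exclusive, so the bracket 20-29 is written from 20 to 30
GET /bank/_search
{
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 30 },
          { "from": 30, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword",
            "size": 10
          },
          "aggs": {
            "average_balance": {
              "avg": { "field": "balance" }
            }
          }
        }
      }
    }
  },
  "size": 0
}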
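For readers who think in terms of GROUP BY, the third request above (terms on state, nested avg on balance, ordered by the average) does roughly what the following in-memory Python does. This is an emulation on made-up sample data, with a hypothetical function name; it says nothing about how ElasticSearch computes the buckets across shards.

```python
from collections import defaultdict


def average_balance_by_state(accounts, size=20):
    """Group by state, average the balance, sort by the average descending."""
    groups = defaultdict(list)
    for account in accounts:
        groups[account["state"]].append(account["balance"])
    buckets = [
        {
            "key": state,
            "doc_count": len(balances),
            "average_balance": sum(balances) / len(balances),
        }
        for state, balances in groups.items()
    ]
    buckets.sort(key=lambda bucket: bucket["average_balance"], reverse=True)
    return buckets[:size]  # like "size": 20 in the terms aggregation


sample = [
    {"state": "ID", "balance": 100},
    {"state": "ID", "balance": 300},
    {"state": "TX", "balance": 500},
]
for bucket in average_balance_by_state(sample):
    print(bucket)
```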

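7. Other queries

(1) Range query

range: runs a range query

Parameters: from, to, include_lower, include_upper, boost

include_lower: whether the lower bound of the range is included; default true

include_upper: whether the upper bound of the range is included; default true

The same range can also be written with gt, gte, lt, and lte, the form used in the bool example of section 5.2.

GET bank/_search
{
  "query": {
    "range": {
      "age": {
        "from": 30,
        "to": 50,
        "include_lower": false,
        "include_upper": false
      }
    }
  }
}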
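The relationship between the from/to/include_lower/include_upper form and the gt/gte/lt/lte form is mechanical, as this small sketch shows. to_modern_range is a hypothetical helper written for this article; it only rewrites a dict and does not contact a cluster.

```python
def to_modern_range(params):
    """Rewrite from/to/include_* range parameters as gt/gte/lt/lte."""
    include_lower = params.get("include_lower", True)  # default: inclusive
    include_upper = params.get("include_upper", True)  # default: inclusive
    modern = {}
    if params.get("from") is not None:
        modern["gte" if include_lower else "gt"] = params["from"]
    if params.get("to") is not None:
        modern["lte" if include_upper else "lt"] = params["to"]
    if "boost" in params:
        modern["boost"] = params["boost"]
    return modern


legacy = {"from": 30, "to": 50, "include_lower": False, "include_upper": False}
print({"range": {"age": to_modern_range(legacy)}})
```

Applied to the request above, the age clause becomes an exclusive range on both ends, that is, ages strictly between 30 and 50.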

(2) wildcard query

Lets you search with the wildcards * and ?

* stands for zero or more characters

? stands for exactly one character

GET /lib3/user/_search
{
  "query": {
    "wildcard": {
      "name": "zhao*"
    }
  }
}
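The meaning of the two wildcards can be checked locally by translating a pattern into a regular expression. This is an illustrative sketch, with hypothetical helper names, of the matching rule only; in ElasticSearch the pattern is compared against the indexed terms of the field, which for an analyzed field are the tokens produced by the analyzer.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (zero or more chars) and ? (exactly one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, term):
    """True when the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


for term in ["zhao", "zhaoliu", "lizhao"]:
    print(term, wildcard_match("zhao*", term))
```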

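(3) fuzzy query

value: the keyword to search for

boost: the weight of the query; default 1.0

min_similarity: the minimum similarity for a match; default 0.5. For strings the value is between 0 and 1 (inclusive); for numbers it can be greater than 1; for dates it takes values such as 1d or 1m, where 1d means one day. This is a legacy parameter; current versions use fuzziness (the maximum edit distance) instead.

prefix_length: the number of leading characters that must match exactly; default 0

max_expansions: the maximum number of terms the query can expand to; default 50

GET /lib3/user/_search
{
  "query": {
    "fuzzy": {
      "interests": "chagge"
    }
  }
}

GET /lib3/user/_search
{
  "query": {
    "fuzzy": {
      "interests": {
        "value": "chagge"
      }
    }
  }
}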
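Fuzzy matching is based on edit distance: the number of single-character changes needed to turn one term into another. The sketch below uses plain Levenshtein distance (insert, delete, substitute) as a simplified stand-in; the helper names and the max_edits default of 2 are assumptions for the example, and ElasticSearch's own matching additionally treats a transposition of two adjacent characters as one edit by default.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, term, max_edits=2):
    """True when term is within max_edits edits of query."""
    return edit_distance(query, term) <= max_edits


print(edit_distance("chagge", "changge"))  # 1
```

This is why the misspelled chagge in the requests above can still find documents whose interests contain changge: the two terms are one insertion apart.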

(4) Highlighting search results

GET /lib3/user/_search
{
  "query": {
    "match": {
      "interests": "changge"
    }
  },
  "highlight": {
    "fields": {
      "interests": {}
    }
  }
}
