Elasticsearch Notes
Open Source Search: The Makers of Elasticsearch, ELK Stack & Kibana | Elastic
Introduction to Elasticsearch
Full-text search is one of the most common requirements, and the open-source Elasticsearch is currently the first choice among full-text search engines. It can store, search, and analyze huge volumes of data quickly. Wikipedia, Stack Overflow, and GitHub all use it.
Under the hood, Elastic is built on the open-source library Lucene. You cannot use Lucene directly, though; you have to write code against its API yourself. Elastic wraps Lucene and exposes a REST API that works out of the box.
Basic Concepts
-
Index
As a verb, it corresponds to INSERT in MySQL.
As a noun, it corresponds to a database in MySQL.
-
Type
An index can define one or more types, similar to tables in MySQL; data of the same type is stored together.
-
Document
A document is one piece of data stored under a type of an index, in JSON format; documents are like the rows of a MySQL table.
-
Inverted index
Elasticsearch uses a data structure called an inverted index, designed to allow very fast full-text search. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
During indexing, Elasticsearch stores documents and builds the inverted index so that the documents become searchable in near real time. Indexing is initiated through the index API, which lets you add JSON documents to a specific index or update JSON documents in it.
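A tiny made-up illustration of the idea: suppose document 1 contains "quick brown fox" and document 2 contains "brown dog". The inverted index maps each term to the documents containing it:

quick → [1]
brown → [1, 2]
fox   → [1]
dog   → [2]

A search for "brown fox" looks up both terms, finds document 1 in both postings lists and document 2 in only one, so document 1 comes back with the higher relevance score.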
Installing with Docker
-
Pull the images
docker pull elasticsearch:7.4.2
docker pull kibana:7.4.2   # Kibana is the visual search/management UI
-
Elasticsearch
mkdir -p /mydata/elasticsearch/config
mkdir -p /mydata/elasticsearch/data
echo "http.host: 0.0.0.0" >> /mydata/elasticsearch/config/elasticsearch.yml
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx512m" \
  -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.4.2
Important notes:
-e ES_JAVA_OPTS="-Xms64m -Xmx512m" sets the initial and maximum JVM heap for a test environment; without it the default heap is too large and Elasticsearch may fail to start.
Grant permissions on the data, config, and plugins directories:
chmod -R 777 /mydata/elasticsearch/
-
Kibana
docker run --name kibana -e ELASTICSEARCH_HOSTS=http://192.168.56.10:9200 -p 5601:5601 \
  -d kibana:7.4.2
Basic Retrieval
-
_cat
GET /_cat/nodes — list all nodes
GET /_cat/health — show cluster health
GET /_cat/master — show the elected master node
GET /_cat/indices — list all indices
-
Index a document (save)
To save a document, choose the index and type it goes under and give it a unique id.
PUT customer/external/1 — save document 1 under the external type of the customer index:
http://192.168.56.10:9200/customer/external/1
{
  "name": "zhangsan"
}
Sending the same request again performs an update (the _version increases).
{ "_index": "customer", "_type": "external", "_id": "1", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "_seq_no": 0, "_primary_term": 1 }
POST http://192.168.56.10:9200/customer/external/ — without an id, every request creates a new document with an auto-generated unique id; if an id is specified, the request updates the existing document.
-
Query a document
GET customer/external/1
http://192.168.56.10:9200/customer/external/1
{
  "_index": "customer",   // which index
  "_type": "external",    // which type
  "_id": "1",             // document id
  "_version": 2,          // version number
  "_seq_no": 1,           // concurrency-control field, incremented on every update; used for optimistic locking
  "_primary_term": 1,     // similar; changes when the primary shard is reallocated, e.g. after a restart
  "found": true,          // whether the document was found
  "_source": {            // the actual content
    "name": "zhangsan"
  }
}
To update with optimistic locking, append ?if_seq_no=1&if_primary_term=1 to the request.
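A minimal sketch of the optimistic lock in practice (the seq_no/primary_term values are the ones from the GET response above; "wangwu" is just an example value):

PUT customer/external/1?if_seq_no=1&if_primary_term=1
{
  "name": "wangwu"
}

If another client has already updated the document (so its _seq_no is no longer 1), this request fails with 409 Conflict; re-read the document to get the current _seq_no/_primary_term and retry.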
-
Update a document
POST customer/external/1/_update
{
  "doc": {
    "name": "lisi"
  }
}
// compares with the existing source; if nothing changed, the call is a no-op ("result": "noop") and the version stays the same
POST customer/external/1
{
  "name": "lisi"
}
// does not compare with the existing source; always overwrites and bumps the version
PUT customer/external/1
{
  "name": "lisi"
}
// does not compare with the existing source; always overwrites and bumps the version
-
Delete a document & an index
DELETE customer/external/1
DELETE customer
Deleting a type is not supported.
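If you only want to clear out documents without dropping the whole index, a sketch using _delete_by_query (this removes the documents; it does not remove the type or its mapping):

POST customer/external/_delete_by_query
{
  "query": { "match_all": {} }
}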
-
bulk API (batch operations)
POST customer/external/_bulk
{"index":{"_id":"1"}}
{"name":"zhangsan"}
{"index":{"_id":"2"}}
{"name":"lisi"}
{"create":{"_id":"1"}}
{"name":"lisi"}

Syntax:
{"action":{metadata}}\n
{request body}\n
{"action":{metadata}}\n
{request body}\n
{ "took" : 350, "errors" : false, "items" : [ { "index" : { "_index" : "customer", "_type" : "external", "_id" : "1", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 0, "_primary_term" : 1, "status" : 201 } }, { "index" : { "_index" : "customer", "_type" : "external", "_id" : "2", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "_seq_no" : 1, "_primary_term" : 1, "status" : 201 } } ] }
Note: create never overwrites an existing document (it fails if the id already exists); index overwrites the document, i.e. performs an update.
-
Import test data
POST /bank/account/_bulk
{ .... }
Advanced Retrieval
-
Search API
-
ES supports two basic ways of searching:
-
One is to send the search parameters in the REST request URI (URI + query parameters).
Query-parameter search: GET bank/_search?q=*&sort=account_number:asc
Retrieve everything under bank, including type and docs: GET bank/_search
- took – how many milliseconds the search took
- timed_out – whether the search timed out
- _shards – how many shards were searched, and how many of them succeeded/failed
- max_score – the highest relevance score of any matching document
- hits.total.value – how many matching documents were found
- hits.sort – the sort key of the results; if absent, results are sorted by score
- hits._score – the relevance score (not applicable when using match_all)
q=* matches everything; sort names the sort field; asc means ascending.
GET bank/_search?q=*&sort=account_number:asc matches all 1000 documents, but only the first 10 hits are returned because size defaults to 10.
-
The other is to send them in a REST request body (URI + request body).
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" },
    { "balance": "desc" }
  ]
}
Postman cannot send a request body with GET; switching to POST behaves the same way: we POST a JSON-style query body to _search.
Note that once the search results have been returned, ES is completely done with the request; it does not and will not hold any server-side resources or a cursor over the results.
-
-
Query DSL
-
Basic syntax
Elasticsearch provides a JSON-style DSL (domain-specific language) for writing queries.
-
Typical structure of a query clause
QUERY_NAME: {
  ARGUMENT: VALUE,
  ARGUMENT: VALUE, ...
}

If the query targets a specific field, the structure becomes:

{
  QUERY_NAME: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE, ...
    }
  }
}

Example:
GET bank/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 5,
  "_source": ["balance"],
  "sort": [
    { "account_number": { "order": "desc" } }
  ]
}
_source lists the fields to return.
-
query defines how to search
- match_all is a query type that matches all documents; many query types can be combined inside query to build complex searches;
- besides query, other parameters such as sort and size can be passed to shape the result;
- from + size together implement pagination;
- sort accepts multiple fields: later fields break ties when earlier fields are equal, otherwise the earlier fields decide the order;
-
-
Return only some fields
GET bank/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 5,
  "sort": [
    { "account_number": { "order": "desc" } }
  ],
  "_source": ["balance", "firstname"]
}
Result:
{ "took" : 18, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "bank", "_type" : "account", "_id" : "999", "_score" : null, "_source" : { "firstname" : "Dorothy", "balance" : 6087 }, "sort" : [ 999 ] }, ... (remaining hits omitted)
-
match query
-
Non-string (exact-value) fields: exact matching
GET bank/_search
{
  "query": {
    "match": { "account_number": "20" }
  }
}
match returns the document whose account_number is 20.
-
String fields: full-text search
GET bank/_search
{
  "query": {
    "match": { "address": "kings" }
  }
}
Full-text search: the search string is analyzed into terms, matched against the index, and the results are ordered by relevance score.
-
-
match_phrase (phrase matching)
-
Treat the value to match as one whole phrase (it is not broken into independent terms).
GET bank/_search
{
  "query": {
    "match_phrase": { "address": "mill road" }
  }
}
The plain match query above returns documents containing either mill or road; here we only want documents that contain both, as a phrase.
This finds every record whose address contains the phrase "mill road" and returns it with a relevance score.
-
-
multi_match (multi-field matching)
-
Match documents where state or address contains mill; the query string is analyzed as part of the search.
GET bank/_search
{
  "query": {
    "multi_match": {
      "query": "mill",
      "fields": ["state", "address"]
    }
  }
}
-
-
bool: compound queries
Compound clauses can combine any other query clauses, including other compound clauses, which means bool queries can be nested to express very complex logic.
- must: the document must match all of the listed conditions
- must_not: the document must not match any of the listed conditions
- should: the document should match the listed conditions; matching is optional, but documents that do match score higher
Example: find documents where gender=M and address contains mill
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "gender": "M" } }
      ]
    }
  }
}
must_not: must not match the specified condition
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "gender": "M" } },
        { "match": { "address": "mill" } }
      ],
      "must_not": [
        { "match": { "age": "38" } }
      ]
    }
  }
}
should: the document should match the listed conditions; matching increases the relevance score but does not change which documents are returned. However, if the query contains only should clauses (no must/filter), at least one should clause has to match, so in that case it does affect the result set.
Example: documents whose lastname should be Wallace
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "gender": "M" } },
        { "match": { "address": "mill" } }
      ],
      "must_not": [
        { "match": { "age": "28" } }
      ],
      "should": [
        { "match": { "lastname": "Wallace" } }
      ]
    }
  }
}
You can see that the more relevant a document is, the higher its score.
-
filter (result filtering)
Not every query needs to produce a score, especially clauses that are only used to filter documents. To avoid computing scores, Elasticsearch automatically detects these situations and optimizes query execution.
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } }
      ],
      "filter": {            # query.bool.filter
        "range": {
          "age": { "gte": "18", "lte": "30" }
        }
      }
    }
  }
}
This first finds all documents matching address=mill, then filters those results down to 18 <= age <= 30.
In a bool query, the must, should and must_not elements are called query clauses. How well a document matches each must or should clause contributes to its relevance score; the higher the score, the better the document matches your search. By default, Elasticsearch returns documents ranked by these relevance scores.
The conditions in a must_not clause are treated as filters.
They affect whether a document is included in the results, but they do not affect how it is scored. You can also explicitly specify arbitrary filters to include or exclude documents based on structured data.
-
term
Works like match but against the exact value of a field. Use match for full-text (text) fields and term for non-text fields.
field.keyword: match the complete stored value exactly.
GET bank/_search
{
  "query": {
    "term": { "address": "mill Road" }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },   // no hits
    "max_score" : null,
    "hits" : [ ]
  }
}
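For comparison, a sketch of an exact match using the keyword sub-field (this relies on the default mapping created for the bank test data, where address also has an address.keyword sub-field; "990 Mill Road" is one of the sample addresses):

GET bank/_search
{
  "query": {
    "match": { "address.keyword": "990 Mill Road" }
  }
}

Only documents whose entire address equals "990 Mill Road" are returned; a term query on address.keyword behaves the same way.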
-
Aggregations
Aggregations let you group data and extract statistics from it. The simplest aggregations roughly correspond to SQL GROUP BY and SQL aggregate functions. In Elasticsearch, a search can return hits and aggregation results at the same time, with the aggregations kept separate from the hits in the response. This is very powerful and efficient: you can run a query together with multiple aggregations and get all of their results back in a single round trip, using one concise, simplified API.
aggs runs aggregations. The syntax is:
"aggs": {                                                   # aggregations
  "aggs_name (a name for this aggregation, shown in the result)": {
    "AGG_TYPE (the aggregation type: avg, terms, ...)": {}
  }
}
- terms: shows the distribution of distinct values
- avg: shows the average of the values
Example: for everyone whose address contains mill, show the age distribution and the average age, but do not return the matching documents themselves.
GET bank/_search
{
  "query": {             # match documents whose address contains mill
    "match": { "address": "mill" }
  },
  "aggs": {              # aggregations based on the query results
    "ageAgg": {          # aggregation name, anything you like
      "terms": {         # distribution of age values
        "field": "age",
        "size": 10
      }
    },
    "ageAvg": {
      "avg": {           # average age
        "field": "age"
      }
    },
    "balanceAvg": {
      "avg": {           # average balance
        "field": "balance"
      }
    }
  },
  "size": 0              # do not return the hits themselves
}
Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4, "relation" : "eq" },   // 4 documents matched
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ageAgg" : {                                    // result of the first aggregation
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : 38, "doc_count" : 2 },
        { "key" : 28, "doc_count" : 1 },
        { "key" : 32, "doc_count" : 1 }
      ]
    },
    "ageAvg" : {                                    // result of the second aggregation
      "value" : 34.0
    },
    "balanceAvg" : {
      "value" : 25208.0
    }
  }
}
Sub-aggregations
More complex: group by age and compute the average balance of the people in each age bucket.
Writing an aggs block inside another aggregation makes it a sub-aggregation that runs on that aggregation's buckets.
The query below computes the average balance for each age value:
GET bank/_search
{
  "query": { "match_all": {} },
  "aggs": {
    "ageAgg": {
      "terms": {          # distribution of age values
        "field": "age",
        "size": 100
      },
      "aggs": {           # sub-aggregation, sibling of terms
        "ageAvg": {       # average
          "avg": { "field": "balance" }
        }
      }
    }
  },
  "size": 0
}
Result:
{ "took" : 49, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "ageAgg" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 31, "doc_count" : 61, "ageAvg" : { "value" : 28312.918032786885 } }, { "key" : 39, "doc_count" : 60, "ageAvg" : { "value" : 25269.583333333332 } }, { "key" : 26, "doc_count" : 59, "ageAvg" : { "value" : 23194.813559322032 } }, { "key" : 32, "doc_count" : 52, "ageAvg" : { "value" : 23951.346153846152 } }, { "key" : 35, "doc_count" : 52, "ageAvg" : { "value" : 22136.69230769231 } }, { "key" : 36, "doc_count" : 52, "ageAvg" : { "value" : 22174.71153846154 } }, { "key" : 22, "doc_count" : 51, "ageAvg" : { "value" : 24731.07843137255 } }, { "key" : 28, "doc_count" : 51, "ageAvg" : { "value" : 28273.882352941175 } }, { "key" : 33, "doc_count" : 50, "ageAvg" : { "value" : 25093.94 } }, { "key" : 34, "doc_count" : 49, "ageAvg" : { "value" : 26809.95918367347 } }, { "key" : 30, "doc_count" : 47, "ageAvg" : { "value" : 22841.106382978724 } }, { "key" : 21, "doc_count" : 46, "ageAvg" : { "value" : 26981.434782608696 } }, { "key" : 40, "doc_count" : 45, "ageAvg" : { "value" : 27183.17777777778 } }, { "key" : 20, "doc_count" : 44, "ageAvg" : { "value" : 27741.227272727272 } }, { "key" : 23, "doc_count" : 42, "ageAvg" : { "value" : 27314.214285714286 } }, { "key" : 24, "doc_count" : 42, "ageAvg" : { "value" : 28519.04761904762 } }, { "key" : 25, "doc_count" : 42, "ageAvg" : { "value" : 27445.214285714286 } }, { "key" : 37, "doc_count" : 42, "ageAvg" : { "value" : 27022.261904761905 } }, { "key" : 27, "doc_count" : 39, "ageAvg" : { "value" : 21471.871794871793 } }, { "key" : 38, "doc_count" : 39, "ageAvg" : { "value" : 26187.17948717949 } }, { "key" : 29, "doc_count" : 35, "ageAvg" : { "value" : 29483.14285714286 } } ] } } }
A more complex sub-aggregation: show the full age distribution, and within each age bucket the average balance for gender M, for gender F, and for the bucket overall.
GET bank/_search
{
  "query": { "match_all": {} },
  "aggs": {
    "ageAgg": {
      "terms": {                        # age distribution
        "field": "age",
        "size": 100
      },
      "aggs": {                         # sub-aggregations
        "genderAgg": {
          "terms": {                    # gender distribution
            "field": "gender.keyword"   # note: aggregate text fields on their .keyword sub-field
          },
          "aggs": {                     # sub-aggregation
            "balanceAvg": {
              "avg": {                  # average balance per gender
                "field": "balance"
              }
            }
          }
        },
        "ageBalanceAvg": {
          "avg": {                      # overall average balance for this age bucket
            "field": "balance"
          }
        }
      }
    }
  },
  "size": 0
}
Result:
{ "took" : 119, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "ageAgg" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 31, "doc_count" : 61, "genderAgg" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "M", "doc_count" : 35, "balanceAvg" : { "value" : 29565.628571428573 } }, { "key" : "F", "doc_count" : 26, "balanceAvg" : { "value" : 26626.576923076922 } } ] }, "ageBalanceAvg" : { "value" : 28312.918032786885 } } ] ... (other buckets omitted) } } }
Aggregating on nested objects
GET articles/_search
{
  "size": 0,
  "aggs": {
    "nested": {
      "nested": { "path": "payment" },
      "aggs": {
        "amount_avg": {
          "avg": { "field": "payment.amount" }
        }
      }
    }
  }
}
-
-
Mapping
A mapping defines how a document and its fields are stored and indexed.
-
Field types
-
Core types
-
Strings
text
Used for full-text indexing; the value is run through an analyzer at index time, and search terms are analyzed and matched against the resulting tokens.
keyword
Not analyzed; a search has to match the complete stored value.
-
Numeric
- integer types: byte, short, integer, long
- floating-point types: float, half_float, scaled_float, double
-
Date: date
-
Range types
-
integer_range, long_range, float_range, double_range, date_range
gt means greater than, lt less than; the trailing e (gte/lte) means "or equal to".
Any document whose age_limit range contains the queried value counts as a match.
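A minimal sketch of a range field (the index and field names here are made up for illustration):

PUT range_demo
{
  "mappings": {
    "properties": {
      "age_limit": { "type": "integer_range" }
    }
  }
}

PUT range_demo/_doc/1
{
  "age_limit": { "gte": 18, "lte": 30 }
}

GET range_demo/_search
{
  "query": {
    "term": { "age_limit": 25 }
  }
}

The term query matches document 1 because 25 falls inside the stored 18–30 range.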
-
-
Boolean: boolean
-
Binary: binary treats the value as a base64-encoded string; by default it is not stored and is not searchable.
-
-
Composite types
- object: an object can contain nested objects
- array
- nested: used for arrays of JSON objects (see the sketch below)
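A minimal sketch of why nested matters (the index, field names and values are made up): with the default object type, the fields of the objects inside an array are flattened into parallel arrays, so a query can wrongly combine the name of one element with the address of another; declaring the field as nested keeps each object independent and is searched with a nested query:

PUT user_demo
{
  "mappings": {
    "properties": {
      "users": { "type": "nested" }
    }
  }
}

GET user_demo/_search
{
  "query": {
    "nested": {
      "path": "users",
      "query": {
        "bool": {
          "must": [
            { "match": { "users.name": "zhangsan" } },
            { "match": { "users.addr": "beijing" } }
          ]
        }
      }
    }
  }
}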
-
Geo types
- geo_point: stores latitude/longitude coordinates
- geo_shape: stores complex shapes such as polygons
-
Specialized types
-
-
Mapping
Mapping defines how a document and the fields it contains are stored and indexed. For example, a mapping specifies:
-
which string fields should be treated as full-text fields;
-
which fields contain numbers, dates, or geo locations;
-
whether all fields in the document should be indexed (the _all setting);
-
the format of dates;
-
custom mapping rules for dynamically added fields;
-
View a mapping: GET bank/_mapping
{
  "bank" : {
    "mappings" : {
      "properties" : {
        "account_number" : { "type" : "long" },    # long type
        "address" : {
          "type" : "text",                         # text type: full-text searchable, analyzed
          "fields" : {
            "keyword" : {                          # address.keyword
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : { "type" : "long" },
        "balance" : { "type" : "long" },
        "city" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "email" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "employer" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "firstname" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "gender" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "lastname" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } },
        "state" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }
      }
    }
  }
}
Create a mapping: PUT /my_index
When data is stored for the first time, ES guesses the mapping from the documents.
Alternatively, the mapping can be specified explicitly before the first document is stored.
Create an index and specify its mapping:
PUT /my_index
{
  "mappings": {
    "properties": {
      "age":   { "type": "integer" },
      "email": { "type": "keyword" },   # exact-value field
      "name":  { "type": "text" }       # full-text: analyzed when stored and when searched
    }
  }
}
Output:
{ "acknowledged" : true, "shards_acknowledged" : true, "index" : "my_index" }
View the mapping: GET /my_index
GET /my_index
Output:
{ "my_index" : { "aliases" : { }, "mappings" : { "properties" : { "age" : { "type" : "integer" }, "email" : { "type" : "keyword" }, "employee-id" : { "type" : "keyword", "index" : false }, "name" : { "type" : "text" } } }, "settings" : { "index" : { "creation_date" : "1588410780774", "number_of_shards" : "1", "number_of_replicas" : "1", "uuid" : "ua0lXhtkQCOmn7Kh3iUu0w", "version" : { "created" : "7060299" }, "provided_name" : "my_index" } } } }
Add a new field to a mapping: /my_index/_mapping
PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false   # the field is stored but cannot be searched
    }
  }
}
"index": false means the new field cannot be searched; it is just a redundant stored field.
Mappings cannot be updated
The mapping of an existing field cannot be changed. To change it, create a new index with the correct mapping and migrate the data.
Data migration
First create new_twitter with the correct mapping.
Then migrate the data like this:
# 6.0+ syntax
POST _reindex
{
  "source": { "index": "twitter" },
  "dest":   { "index": "new_twitters" }
}
# pre-6.0 syntax (when the source index still has types)
POST _reindex
{
  "source": { "index": "twitter", "type": "twitter" },
  "dest":   { "index": "new_twitters" }
}
-
-
Analysis (tokenization)
A tokenizer receives a stream of characters, breaks it into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text on whitespace characters: it turns "Quick brown fox!" into [Quick, brown, fox!].
The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word-proximity queries) and the start and end character offsets of the original word each term represents (used for highlighting search matches).
Elasticsearch ships with many built-in analyzers that can also be used to build custom analyzers.
About analyzers: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 Brown-Foxes bone."
}
Result:
{ "tokens" : [ { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "brown", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "foxes", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "bone", "start_offset" : 18, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 4 } ] }
For Chinese text we need to install an additional analyzer.
-
Install the ik analyzer
All languages are analyzed with the Standard Analyzer by default, which does not tokenize Chinese well, so we install a dedicated Chinese analyzer.
Note: it cannot be installed automatically with the usual elasticsearch-plugin install xxx.zip.
https://github.com/medcl/elasticsearch-analysis-ik/releases — when installing Elasticsearch we already mapped the container's /usr/share/elasticsearch/plugins directory to /mydata/elasticsearch/plugins on the host, so the easiest approach is to download elasticsearch-analysis-ik-7.4.2.zip and unzip it into that directory, then restart the elasticsearch container.
If you don't mind the extra steps, you can also do it like this:
-
Check the Elasticsearch version:
[vagrant@localhost ~]$ curl http://localhost:9200 { "name" : "66718a266132", "cluster_name" : "elasticsearch", "cluster_uuid" : "xhDnsLynQ3WyRdYmQk5xhQ", "version" : { "number" : "7.4.2", "build_flavor" : "default", "build_type" : "docker", "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96", "build_date" : "2019-10-28T20:40:44.881551Z", "build_snapshot" : false, "lucene_version" : "8.2.0", "minimum_wire_compatibility_version" : "6.8.0", "minimum_index_compatibility_version" : "6.0.0-beta1" }, "tagline" : "You Know, for Search" }
-
Enter the plugins directory inside the ES container
- docker exec -it <container id> /bin/bash
[vagrant@localhost ~]$ sudo docker exec -it elasticsearch /bin/bash
[root@66718a266132 elasticsearch]# pwd
/usr/share/elasticsearch
[root@66718a266132 elasticsearch]# yum install wget
# download ik 7.4.2
[root@66718a266132 elasticsearch]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
- unzip the downloaded file
yum install -y unzip zip
[root@66718a266132 elasticsearch]# unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
# move the ik directory into the plugins directory
[root@66718a266132 elasticsearch]# mv ik plugins/
[root@66718a266132 elasticsearch]# chmod -R 777 plugins/ik
docker restart elasticsearch
- rm -rf *.zip
[root@66718a266132 elasticsearch]# rm -rf elasticsearch-analysis-ik-7.4.2.zip
- Check that the plugin is installed
elasticsearch-plugin list
-
-
Test the ik analyzer
Using the default analyzer:
GET _analyze
{
  "text": "我是中国人"
}
Result:
{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 }, { "token" : "中", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 }, { "token" : "国", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 }, { "token" : "人", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 } ] }
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
Result:
{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 }, { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 } ] }
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
Result:
{ "tokens" : [ { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 }, { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 }, { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 }, { "token" : "中国", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 3 }, { "token" : "国人", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 } ] }
-
Custom dictionary
For example, to make ik treat 尚硅谷 as a single word:
- Edit IKAnalyzer.cfg.xml in /usr/share/elasticsearch/plugins/ik/config
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary -->
    <entry key="ext_dict"></entry>
    <!-- local extension stop-word dictionary -->
    <entry key="ext_stopwords"></entry>
    <!-- remote extension dictionary -->
    <entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry>
    <!-- remote extension stop-word dictionary -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
After editing, restart the elasticsearch container, otherwise the change does not take effect: docker restart elasticsearch
Once the dictionary is updated, ES only applies the new analysis to newly indexed data; historical data is not re-analyzed. To re-analyze existing documents, run:
POST my_index/_update_by_query?conflicts=proceed
-
Install nginx
-
Start any nginx instance, just to copy its configuration out:
docker run -p80:80 --name nginx -d nginx:1.10
-
Copy the configuration files out of the container into /mydata/nginx/conf/:
[root@10 mydata]# docker container cp nginx:/etc/nginx .
[root@10 mydata]# ls
elasticsearch mysql nginx redis
[root@10 mydata]# cd nginx/
[root@10 nginx]# ls
conf.d fastcgi_params koi-utf koi-win mime.types modules nginx.conf scgi_params uwsgi_params win-utf
[root@10 nginx]# cd ../
[root@10 mydata]# ls
elasticsearch mysql nginx redis
[root@10 mydata]# mv nginx conf
[root@10 mydata]# ls
conf elasticsearch mysql redis
[root@10 mydata]# mkdir nginx
[root@10 mydata]# mv conf nginx/
[root@10 mydata]# ls
elasticsearch mysql nginx redis
[root@10 mydata]# cd nginx/
[root@10 nginx]# ls
conf
-
Stop the original container:
docker stop nginx
-
Delete the original container:
docker rm nginx
-
Create the new nginx container:
mkdir -p /mydata/nginx/html
mkdir -p /mydata/nginx/logs
docker run -p 80:80 --name nginx \
  -v /mydata/nginx/html:/usr/share/nginx/html \
  -v /mydata/nginx/logs:/var/log/nginx \
  -v /mydata/nginx/conf/:/etc/nginx \
  -d nginx:1.10
-
Make nginx start on boot:
docker update nginx --restart=always
-
Create /mydata/nginx/html/index.html and check that it can be served:
echo '<h2>hello nginx!</h2>' >index.html
-
Visit: http://<nginx host IP>:80/index.html
-
-
After nginx is installed
mkdir /mydata/nginx/html/es
cd /mydata/nginx/html/es
vim fenci.txt      # put 乔碧萝 in this file
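Before restarting Elasticsearch you can check that nginx actually serves the dictionary file configured in remote_ext_dict (IP as in the setup above):

curl http://192.168.56.10/es/fenci.txt

It should return the words you added, one per line.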
Test the result:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "乔碧萝殿下"
}
Result:
{ "tokens" : [ { "token" : "乔碧萝", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 }, { "token" : "殿下", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 } ] }
-
-
-
Elasticsearch-Rest-Client
There are two ways to work with ES from Java:
-
9300: TCP
- spring-data-elasticsearch: transport-api.jar;
- the transport-api.jar differs between Spring Boot versions and cannot be matched to every ES version
- it is deprecated in ES 7.x and will be removed in 8
-
9200: HTTP
There are several client libraries:
- JestClient: unofficial, slow to update;
- RestTemplate: sends raw HTTP requests, so most ES operations have to be wrapped by hand; cumbersome;
- HttpClient: same as above;
- Elasticsearch-Rest-Client: the official RestClient; it wraps ES operations in a clearly layered API that is easy to pick up;
In the end we choose Elasticsearch-Rest-Client (elasticsearch-rest-high-level-client).
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high.html
Integrating Elasticsearch with Spring Boot
Create the mall-search project.
Select the web dependency, but do not select the Elasticsearch starter.
-
Add the dependency
The version here must match the installed Elasticsearch version.
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.4.2</version>
</dependency>
spring-boot-dependencies pins the ES version to 6.8.5, so it has to be overridden:
<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.4.2</elasticsearch.version>
</properties>
Request options: if, for example, ES has security rules enabled and every request needs an extra security header, it can be set through RequestOptions.
The official recommendation is to create the RequestOptions as a single shared instance.
@Configuration
public class GuliESConfig {

    public static final RequestOptions COMMON_OPTIONS;

    static {
        RequestOptions.Builder builder = RequestOptions.DEFAULT.toBuilder();
        // add common headers/options here if ES requires them
        COMMON_OPTIONS = builder.build();
    }

    // completed here for reference: build the high-level client against the ES instance from the Docker setup above
    @Bean
    public RestHighLevelClient esRestClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("192.168.56.10", 9200, "http")));
    }
}
-
Write test classes
-
Test saving data
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-document-index.html
Saving can be done synchronously or asynchronously; the asynchronous variant just takes an additional listener callback.
@Test
public void indexData() throws IOException {
    // the target index
    IndexRequest indexRequest = new IndexRequest("users");
    indexRequest.id("1");

    User user = new User();
    user.setUserName("张三");
    user.setAge(20);
    user.setGender("男");
    String jsonString = JSON.toJSONString(user);
    // the content to save, as JSON
    indexRequest.source(jsonString, XContentType.JSON);

    // create the index and save the document
    IndexResponse index = client.index(indexRequest, GulimallElasticSearchConfig.COMMON_OPTIONS);
    System.out.println(index);
}
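For reference, a sketch of the asynchronous variant mentioned above; it reuses the same request and adds an ActionListener whose callbacks here only print (real error handling is up to the caller):

client.indexAsync(indexRequest, GulimallElasticSearchConfig.COMMON_OPTIONS,
        new ActionListener<IndexResponse>() {
            @Override
            public void onResponse(IndexResponse indexResponse) {
                // called when indexing succeeds
                System.out.println(indexResponse);
            }

            @Override
            public void onFailure(Exception e) {
                // called when the request fails
                e.printStackTrace();
            }
        });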
-
Test retrieving data
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search.html
@Test
public void find() throws IOException {
    // 1. build the search request
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("bank");
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    // build the search conditions
    // sourceBuilder.query();
    // sourceBuilder.from();
    // sourceBuilder.size();
    // sourceBuilder.aggregation();
    sourceBuilder.query(QueryBuilders.matchQuery("address", "mill"));
    System.out.println(sourceBuilder.toString());
    searchRequest.source(sourceBuilder);

    // 2. run the search
    SearchResponse response = client.search(searchRequest, GuliESConfig.COMMON_OPTIONS);

    // 3. analyse the response
    System.out.println(response.toString());
}
{"took":198,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}, "hits":{ "total":{"value":4,"relation":"eq"}, "max_score":5.4032025,"hits":[{"_index":"bank","_type":"account","_id":"970","_score":5.4032025,"_source":{"account_number":970,"balance":19648,"firstname":"Forbes","lastname":"Wallace","age":28,"gender":"M","address":"990 Mill Road","employer":"Pheast","email":"forbeswallace@pheast.com","city":"Lopezo","state":"AK"}},{"_index":"bank","_type":"account","_id":"136","_score":5.4032025,"_source":{"account_number":136,"balance":45801,"firstname":"Winnie","lastname":"Holland","age":38,"gender":"M","address":"198 Mill Lane","employer":"Neteria","email":"winnieholland@neteria.com","city":"Urie","state":"IL"}},{"_index":"bank","_type":"account","_id":"345","_score":5.4032025,"_source":{"account_number":345,"balance":9812,"firstname":"Parker","lastname":"Hines","age":38,"gender":"M","address":"715 Mill Avenue","employer":"Baluba","email":"parkerhines@baluba.com","city":"Blackgum","state":"KY"}},{"_index":"bank","_type":"account","_id":"472","_score":5.4032025,"_source":{"account_number":472,"balance":25571,"firstname":"Lee","lastname":"Long","age":32,"gender":"F","address":"288 Mill Street","employer":"Comverges","email":"leelong@comverges.com","city":"Movico","state":"MT"}}]}}
@Test
public void find() throws IOException {
    // 1. build the search request
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("bank");
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(QueryBuilders.matchQuery("address", "mill"));

    // AggregationBuilders is the factory for AggregationBuilder instances
    // first aggregation: distribution of age values
    TermsAggregationBuilder agg1 = AggregationBuilders.terms("ageAgg").field("age").size(10);
    sourceBuilder.aggregation(agg1);
    // second aggregation: average balance
    AvgAggregationBuilder agg2 = AggregationBuilders.avg("balanceAvg").field("balance");
    sourceBuilder.aggregation(agg2);

    System.out.println("search conditions: " + sourceBuilder.toString());
    searchRequest.source(sourceBuilder);

    // 2. run the search
    SearchResponse response = client.search(searchRequest, GuliESConfig.COMMON_OPTIONS);

    // 3. analyse the response
    System.out.println(response.toString());
}
Convert the hits into Java beans:
// 3.1 map each hit's _source onto a Java bean
SearchHits hits = response.getHits();
SearchHit[] hits1 = hits.getHits();
for (SearchHit hit : hits1) {
    hit.getId();
    hit.getIndex();
    String sourceAsString = hit.getSourceAsString();
    Account account = JSON.parseObject(sourceAsString, Account.class);
    System.out.println(account);
}
Account(accountNumber=970, balance=19648, firstname=Forbes, lastname=Wallace, age=28, gender=M, address=990 Mill Road, employer=Pheast, email=forbeswallace@pheast.com, city=Lopezo, state=AK) Account(accountNumber=136, balance=45801, firstname=Winnie, lastname=Holland, age=38, gender=M, address=198 Mill Lane, employer=Neteria, email=winnieholland@neteria.com, city=Urie, state=IL) Account(accountNumber=345, balance=9812, firstname=Parker, lastname=Hines, age=38, gender=M, address=715 Mill Avenue, employer=Baluba, email=parkerhines@baluba.com, city=Blackgum, state=KY) Account(accountNumber=472, balance=25571, firstname=Lee, lastname=Long, age=32, gender=F, address=288 Mill Street, employer=Comverges, email=leelong@comverges.com, city=Movico, state=MT)
Read the aggregation results:
// read the aggregations
Aggregations aggregations = response.getAggregations();
Terms ageAggTerms = aggregations.get("ageAgg");
for (Terms.Bucket bucket : ageAggTerms.getBuckets()) {
    String keyAsString = bucket.getKeyAsString();
    System.out.println(keyAsString + ":" + bucket.getDocCount());
}
Avg balanceAvgAvg = aggregations.get("balanceAvg");
System.out.println(balanceAvgAvg.getValue());
For everyone whose address contains mill: the age distribution, the average age, and the average balance.
GET bank/_search
{
  "query": {
    "match": { "address": "Mill" }
  },
  "aggs": {
    "ageAgg": {
      "terms": { "field": "age", "size": 10 }
    },
    "ageAvg": {
      "avg": { "field": "age" }
    },
    "balanceAvg": {
      "avg": { "field": "balance" }
    }
  }
}
-