Catalog
- Term-based and full-text search
- Structured search
- Search relevance scoring
- query filter and multi-string queries
- Single-string multi-field queries
- Single-string multi-field queries: multi_match
- Hands-on practice
- search template and index alias
- Composite ranking: Function Score Query to tune scoring
- Term & Phrase Suggester
- Autocomplete and context-based suggestions
- Cross-cluster search
- Cluster distribution model, master election, and split-brain
- Shards and cluster failover
- Distributed document storage
- Shards and their lifecycle
- Anatomy of distributed search and relevance scoring
- Sorting and doc values & fielddata
- Pagination and traversal
- bucket & metric aggregations and nested aggregations
- pipeline aggregations
- Aggregation scope and ordering
- Aggregation internals and precision issues
- Objects and nested objects
- Parent-child document relations
- update_by_query and reindex APIs
- ingest pipeline and Painless script
- Data modeling
- Data modeling best practices
- Summary
Term-based and full-text search
Term queries
At index time the desc field goes through an analyzer, so "iPhone" is stored as the lowercase term iphone. A term query on desc is not analyzed: it searches for the literal term iPhone, so it finds nothing.
To match with a term query, either query the analyzed term (iphone) or query the field's keyword sub-field.
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }
POST products/_search
{
"query": {
"term": {
"desc": {
"value": "iPhone"
}
}
},"profile": "true"
}
Analysis result
A term query still computes a score, even on a keyword field. Wrapping it in a constant_score filter skips scoring, reduces overhead, and lets the filter be cached.
POST /products/_search
{
"explain": true,
"query": {
"constant_score": {
"filter": {
"term": {
"productID.keyword": "XHDK-A-1293-#fJ3"
}
}
}
}
}
Full-text search
A match query analyzes the input into terms, runs each term as a separate query, and merges the results.
match_phrase treats the words as a unit and requires them in order; slop allows positional deviation.
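The difference can be sketched with a toy analyzer (a simplified illustration of the idea, not the actual Lucene implementation):

```python
# Toy illustration: match = OR over analyzed terms; match_phrase = terms
# must appear in order, with `slop` allowing positional deviation.
def analyze(text):
    return text.lower().split()

def match(doc, query):
    # any query term present in the document matches
    doc_terms = set(analyze(doc))
    return any(t in doc_terms for t in analyze(query))

def match_phrase(doc, query, slop=0):
    # all terms present, in order, within `slop` positional gaps
    doc_terms = analyze(doc)
    positions = []
    for term in analyze(query):
        if term not in doc_terms:
            return False
        positions.append(doc_terms.index(term))
    gaps = sum(max(0, positions[i + 1] - positions[i] - 1)
               for i in range(len(positions) - 1))
    return positions == sorted(positions) and gaps <= slop

doc = "quick brown fox"
print(match(doc, "fox quick"))                # True: terms queried independently
print(match_phrase(doc, "quick fox"))         # False: one word in between
print(match_phrase(doc, "quick fox", slop=1)) # True: slop tolerates the gap
```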
Structured search
Searching on boolean values
POST products/_search
{
"query": {
"term": {
"avaliable": {
"value": "true"
}
}
}
}
POST products/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"avaliable": true
}
}
}
}
}
Numbers
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 10,
"lte": 20
}
}
}
}
}
Searching on dates
Now minus 4 years, i.e. match documents from the last 4 years.
"query": {
"constant_score": {
"filter": {
"range": {
"date": {
"gte" : "now-4y"
}
}
}
}
}
Querying for a non-null field
"query": {
"constant_score": {
"filter": {
"exists": {
"field": "date"
}
}
}
}
Querying multi-valued fields
This matches documents whose genres include Comedy, not only documents whose genre is exactly Comedy.
POST movies/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"genre.keyword": "Comedy"
}
}
}
}
}
Search relevance scoring
tf (term frequency): how often the term occurs in the document. E.g. in "我是中国人, 生在中国", "中国" occurs twice.
df (document frequency): how many documents contain the term; idf is its inverse. If "中国" appears in 200 of 1000 documents, idf = log(1000/200).
Lucene originally scored with TF-IDF (tf weighted by idf), later replaced by BM25, which fixes the problem of the score growing without bound as tf grows. ES lets you choose the similarity algorithm at index creation.
Use explain to inspect the scoring details.
Both documents below contain the target term, but doc 2 is shorter, so its tf contribution scores higher.
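The scoring intuition above can be computed numerically (simplified formulas for illustration; BM25 here shows only the tf-saturation part, ignoring length normalization):

```python
import math

def idf(total_docs, docs_with_term):
    # e.g. a term appearing in 200 of 1000 docs -> log(1000/200)
    return math.log(total_docs / docs_with_term)

def tfidf_score(tf, total_docs, docs_with_term):
    # classic TF-IDF: grows linearly and without bound as tf grows
    return tf * idf(total_docs, docs_with_term)

def bm25_tf(tf, k1=1.2):
    # BM25 term-frequency component: saturates toward (k1 + 1)
    return tf * (k1 + 1) / (tf + k1)

print(idf(1000, 200))               # ~1.609
print(tfidf_score(100, 1000, 200))  # keeps growing with tf
print(bm25_tf(1), bm25_tf(100))     # 1.0 vs ~2.17, bounded by k1 + 1 = 2.2
```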
PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }
POST testscore/_search
{
"query": {
"match": {
"content": "elasticsearch"
}
},"explain": true
}
Use a boosting query to shape the score; here negative down-weights documents containing "like".
POST testscore/_search
{
"query": {
"boosting" : {
"positive" : {
"term" : {
"content" : "elasticsearch"
}
},
"negative" : {
"term" : {
"content" : "like"
}
},
"negative_boost" : 0.2
}
}
}
query filter and multi-string queries
A bool query combines conditions on multiple fields.
must and should contribute to scoring; filter and must_not do not.
# Basic syntax
POST /products/_search
{
"query": {
"bool" : {
"must" : {
"term" : { "price" : "30" }
},
"filter": {
"term" : { "avaliable" : "true" }
},
"must_not" : {
"range" : {
"price" : { "lte" : 10 }
}
},
"should" : [
{ "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
{ "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
],
"minimum_should_match" :1
}
}
}
Single-string multi-field queries
Use dis_max (disjunction max) to query several fields, compare each field's score, and take the highest as the document score.
PUT /blogs/_doc/1
{
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
PUT /blogs/_doc/2
{
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{"match": {
"title": "Brown fox"
}},
{
"match": {
"body": "Brown fox"
}
}
]
}
},
"explain": true
}
In this example doc 1 contains brown in both fields, but only the single best field score is kept. Doc 2 adds up brown and fox: fox appears in only one document, so it is rarer and deserves a higher score. Doc 2 therefore matches the intent better, scores higher, and ranks first.
Searching "Quick pets" instead gives both documents the same score, because each document contains the same matching words.
tie_breaker balances the scores: instead of taking only the best field, the other fields' scores are multiplied by tie_breaker and added to the total.
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.2
}
}
}
If you want every field's score to count, I feel a bool should query can simply replace dis_max.
POST /blogs/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
]
}
},"explain": true
}
Single-string multi-field queries: multi_match
Example: querying only title for "barking dogs" scores doc 1 higher because the document is shorter, yet doc 2 matches the intent better. To boost doc 2, add a title.std sub-field and query both fields.
multi_match is more concise to write than dis_max; its default type is best_fields, i.e. disjunction max.
DELETE /titles
PUT /titles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english",
"fields": {"std": {"type": "text","analyzer": "standard"}}
}
}
}
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
GET titles/_search
{
"query": {
"match": {
"title": "barking dogs"
}
}
}
GET titles/_search
{
"query": {
"multi_match": {
"type": "most_fields",
//"type": "best_fields",
"query": "barking dogs",
"fields": ["title","title.std"]
}
}
}
Hands-on practice
Import TMDB into ES with the title field mapped to the english analyzer, then use multi_match with the query "basketball with cartoon aliens" to find Space Jam.
With the default standard analyzer at index time, this search returns nothing.
multi_match defaults to best_fields mode, using only the highest-scoring field's score.
"mappings": {
"properties": {
"overview": {
"type": "text",
"analyzer": "english",
"fields": {
"std": {
"type": "text",
"analyzer": "standard"
}
}
},
"popularity": {
"type": "float"
},
"title": {
"type": "text",
"analyzer": "english",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
"query": {
"multi_match": {
"query": "basketball with cartoon aliens",
"fields": ["title^10","overview"]
}
}
pip install pyenv-win --target %USERPROFILE%/.pyenv
How to install multiple Python versions on Windows 10 with pyenv
My PATH was broken, so I ran cmd directly inside the pyenv directory.
pyenv install 2.7.15
pyenv versions
python -V
pyenv global 2.7.15
pyenv global
Default mapping, default query
Custom mapping, english analyzer, most_fields mode
With the default mapping and default query, Space Jam only matches on basketball; the standard analyzer kept "aliens" as-is in the document, so the query token alien misses.
search template and index alias
A search template is a stored query script. Later searches reference the template and pass in variables; editing the template changes the query behavior without touching callers.
POST tmdb/_search
{
"_source": ["title","overview"],
"size":20,
"query": {
"multi_match": {
"type": "most_fields",
"query": "basketball with cartoon aliens",
"fields": ["title","overview"]
}
}
,"explain": true
}
POST _scripts/tmdb
{
"script": {
"lang": "mustache",
"source": {
"_source": [
"title","overview"
],
"size": 20,
"query": {
"multi_match": {
"query": "{{q}}",
"fields": ["title","overview"]
}
}
}
}
}
POST tmdb/_search/template
{
"id":"tmdb",
"params": {
"q": "basketball with cartoon aliens"
}
}
Index alias
Create an alias for an index, query through the alias instead of the index name, and optionally attach extra filter rules to the alias.
PUT movies-2019/_doc/1
{
"name":"the matrix",
"rating":5
}
PUT movies-2019/_doc/2
{
"name":"Speed",
"rating":3
}
//create the alias
POST _aliases
{
"actions": [
{
"add": {
"index": "movies-2019",
"alias": "movies-latest"
}
}
]
}
//query via the alias: two hits
POST movies-latest/_search
{
"query": {
"match_all": {}
}
}
//create an alias with a filter rule
POST _aliases
{
"actions": [
{
"add": {
"index": "movies-2019",
"alias": "movies-lastest-highrate",
"filter": {
"range": {
"rating": {
"gte": 4
}
}
}
}
}
]
}
//only one hit
POST movies-lastest-highrate/_search
{
"query": {
"match_all": {}
}
}
Composite ranking: Function Score Query to tune scoring
Query the documents below: their content is identical, and we weight the score by the number of votes, changing the scoring logic.
function_score can freely reshape the score after the query matches, e.g. with scripts or custom logic.
In this example, field_value_factor by default multiplies the query score by a value derived from the field.
DELETE blogs
PUT /blogs/_doc/1
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 0
}
PUT /blogs/_doc/2
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 100
}
PUT /blogs/_doc/3
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 1000000
}
POST /blogs/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query":"popularity",
"fields": ["title", "content"]
}
},
"field_value_factor": {
//the field used for scoring
"field": "votes",
//modifier function applied to the field value
"modifier": "log1p" ,
//multiplier applied to the field value
"factor": 0.1
},
//boost_mode defaults to multiply; sum switches to addition
"boost_mode": "sum",
//cap each document's function score at 3
"max_boost": 3
}
}
}
Random seed
Example: ensure a given user sees ads in a consistent order while browsing, by seeding random_score with the user's session or id.
Across different users the order varies, which improves ad exposure.
POST /blogs/_search
{
"query": {
"function_score": {
"random_score": {
"seed": 911119
,"field": "_seq_no"
}
}
}
}
Term & Phrase Suggester
Suggesters recommend terms: for a misspelled input, ES checks the index's term dictionary and, if the term is missing, returns similar terms.
Example: for the input "lucen rock", a term suggester on the same field returns the suggestions lucene and rocks.
The default suggest_mode is missing: suggest only when the term is absent from the index. Candidate scoring is based on the character-level difference between the input token and candidate tokens.
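The character-difference ranking can be sketched with a standard edit-distance function (an illustration of the idea; the real suggester uses Lucene's internals and more heuristics):

```python
def edit_distance(a, b):
    # classic Levenshtein dynamic programming
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[len(a)][len(b)]

# hypothetical term dictionary built from the indexed documents
vocabulary = ["lucene", "rocks", "rock", "elasticsearch"]

def suggest(term, max_edits=2):
    # rank candidates by edit distance; skip exact matches (missing mode)
    candidates = [(edit_distance(term, w), w) for w in vocabulary]
    return [w for d, w in sorted(candidates) if 0 < d <= max_edits]

print(suggest("lucen"))  # closest candidates first
```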
DELETE articles
POST articles/_bulk
{ "index" : { } }
{ "body": "lucene is very cool"}
{ "index" : { } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "body": "Elasticsearch rocks"}
{ "index" : { } }
{ "body": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "body": "Elk stack rocks"}
{ "index" : {} }
POST articles/_search
{
"size": 1,
"query": {
"match": {
"body": "lucen rock"
}
}
,
"suggest": {
"term-suggestion": {
"text": "lucen rock",
"term": {
"field": "body"
,"suggest_mode":"missing"
//prefix tolerance for candidate matching; with 0, input hock also suggests rock
,"prefix_length":0
}
}
}
}
The Phrase Suggester adds more parameters; confidence sets the threshold a candidate must exceed to be returned.
POST /articles/_search
{
"suggest": {
"my-suggestion": {
"text": "lucne and elasticsear rock hello world ",
"phrase": {
"field": "body",
"max_errors":2,
"confidence":2,
"direct_generator":[{
"field":"body",
"suggest_mode":"always"
}],
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}
Autocomplete and context-based suggestions
Type a prefix and ES returns completions for it.
ES implements this not with the inverted index but with an FST: a compact, map-like in-memory structure well suited to prefix matching.
The field to complete must be mapped with type completion.
Example: the query below returns completions for the prefix "elk".
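A minimal in-memory prefix structure conveys the lookup pattern (an FST is far more compact than this dict-based trie, but the prefix walk is the same idea):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, phrase):
        node = self.root
        for ch in phrase.lower():
            node = node.setdefault(ch, {})
        node["$"] = phrase  # mark a complete entry

    def complete(self, prefix):
        # walk down the prefix, then collect every entry below it
        node = self.root
        for ch in prefix.lower():
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [node]
        while stack:
            for k, v in stack.pop().items():
                if k == "$":
                    out.append(v)
                else:
                    stack.append(v)
        return sorted(out)

t = Trie()
for title in ["Elk stack rocks",
              "elastic is the company behind ELK stack",
              "Elasticsearch rocks"]:
    t.insert(title)
print(t.complete("elk"))  # entries whose lowercase form starts with "elk"
```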
DELETE articles
PUT articles
{
"mappings": {
"properties": {
"title_completion":{
"type": "completion"
}
}
}
}
POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}
{ "index" : {} }
POST articles/_search
{
"size": 0,
"suggest": {
"article-suggester": {
"prefix":"elk"
,"completion": {
"field": "title_completion"
}
}
}
}
Beyond plain completion there are context-based suggestions (suggest context).
The mapping adds contexts to the completion type; documents are indexed with a context value, and queries supply a context, which acts as an extra filter condition.
DELETE comments
PUT comments
PUT comments/_mapping
{
"properties":{
"comment_autocomplete":{
"type":"completion",
"contexts":[{
"type":"category",
"name":"comment_category"
}]
}
}
}
POST comments/_doc
{
"comment":"I love the star war movies",
"comment_autocomplete":{
"input":["star wars"],
"contexts":{
"comment_category":"movies"
}
}
}
POST comments/_doc
{
"comment":"Where can I find a Starbucks",
"comment_autocomplete":{
"input":["starbucks"],
"contexts":{
"comment_category":"coffee"
}
}
}
POST comments/_search
{
"suggest": {
"MY_SUGGESTION": {
"prefix": "sta",
"completion":{
"field":"comment_autocomplete",
"contexts":{
"comment_category":"movies"
}
}
}
}
}
Which query fits which scenario
Cross-cluster search
In a single cluster the master carries heavy load and becomes a bottleneck; nodes cannot be scaled out indefinitely.
Early ES supported cross-cluster queries with tribe node: it had to join the clusters, queries went through it, it restarted slowly, and duplicate index names across clusters caused problems.
Since 5.3, cross cluster search no longer requires joining a client node to the cluster; any node can serve as the coordinating node for the request.
A demo on Windows
Start three clusters
bin/elasticsearch -E node.name=cluster0node -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
bin/elasticsearch -E node.name=cluster1node -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
bin/elasticsearch -E node.name=cluster2node -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302
Send the requests with Postman
curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
curl -XPUT "http://localhost:9201/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
curl -XPUT "http://localhost:9202/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
#create test data, different in each cluster
curl -XPOST "http://localhost:9200/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user1","age":10}'
curl -XPOST "http://localhost:9201/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user2","age":20}'
curl -XPOST "http://localhost:9202/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user3","age":30}'
Query all three clusters at once
GET /users,cluster1:users,cluster2:users/_search
{
"query": {
"range": {
"age": {
"gte": 10,
"lte": 40
}
}
}
}
Cluster distribution model, master election, and split-brain
Each ES node is a Java process; the cluster name and node names are configurable.
Node types:
- Coordinating node: every node by default; handles requests. Fix node roles explicitly in production.
- Data node: every node by default; holds shards; scales data.
- Master node: maintains indices, cluster state, and shard locations.
- Master-eligible node: every node by default; participates in the election when the master fails.
Split-brain
When the network partitions the cluster into two regions, the region without a master elects a new one and keeps serving. After the network heals, the master that loses the re-election discards the data it accepted during the partition.
Older versions avoided split-brain with an election quorum setting: elections required more than that many nodes. From 7.0 the setting is removed and ES manages it itself.
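The pre-7.0 quorum rule of thumb can be sketched as a quick calculation:

```python
def minimum_master_nodes(master_eligible):
    # quorum: strictly more than half of the master-eligible nodes
    return master_eligible // 2 + 1

# With 3 master-eligible nodes the quorum is 2: a partitioned-off single
# node cannot elect a master (1 < 2), so only the majority side keeps
# serving writes and no split-brain occurs.
print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
```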
Shards and cluster failover
A shard is a Lucene index; the primary shard count cannot change after index creation.
Replica shards provide high availability and can be adjusted at runtime; replicas also serve queries, so adding replicas adds read throughput.
Choosing shard counts
Too few shards make it hard to scale out; too many hurt performance.
More replicas mean more synchronization work, which slows writes.
Example: a cluster with 3 primaries and 1 replica.
After the master node fails, a new master is elected first, then the data is redistributed to the remaining nodes.
Distributed document storage
To spread data evenly and use capacity well, a document's shard is by default computed from the document id modulo the primary shard count, which is why the primary count cannot change. You can also route on a chosen value so related documents land on one shard.
Updating a document
The coordinating node hashes out the document's shard; the update deletes then re-creates the document, then responds to the client.
Deleting a document
The coordinating node routes to the document's shard, deletes it on the primary, then on the replicas, then responds.
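The routing formula above can be sketched directly; real ES uses murmur3 rather than Python's built-in hash, which stands in here for illustration:

```python
def shard_for(routing_value, number_of_primary_shards):
    # ES: shard = hash(_routing) % number_of_primary_shards
    # _routing defaults to the document _id.
    return hash(routing_value) % number_of_primary_shards

# The same document maps to a different shard if the primary count
# changes, which would strand existing documents -- hence the primary
# shard count is fixed at index creation.
doc_id = "my-doc-1"
print(shard_for(doc_id, 3))
print(shard_for(doc_id, 5))
```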
Shards and their lifecycle
Immutability of the inverted index
Benefits:
no concurrent-write concerns, no locking;
once read from the filesystem into the cache, reads stay in memory (given enough RAM);
caches are easy to build and maintain, and the data compresses well.
Drawback:
making a new document searchable requires rebuilding the entire index.
Lucene index
One Lucene inverted index is a segment; a commit point tracks the segments; deletions are recorded in .del files. Queries walk all segments and filter out deleted documents.
refresh
Indexed documents first go into the index buffer; after 1 second (configurable) the buffered documents become a segment. The segment is not yet on disk, it lives in the filesystem cache. Turning the index buffer into a segment is called refresh.
The index buffer defaults to 10% of the heap; filling it also triggers a refresh.
refresh does not fsync.
transaction log
So that data is searchable right after buffering without risk of loss, each write also appends to an on-disk transaction log; generating a segment does not clear the log.
Each shard has its own transaction log.
The transaction log flush threshold defaults to 512 MB.
flush
Persists everything in the caches: it runs a refresh, fsyncs the cached segments to disk, then truncates the transaction log.
Happens every 30 minutes by default, or when the transaction log fills up.
merge
As segments keep accumulating on disk, merge combines the small scattered segments and purges the .del entries.
Merging is managed automatically by ES and can also be triggered via the API.
Anatomy of distributed search and relevance scoring
Distributed search runs as query then fetch.
Query phase: the coordinating node picks one copy (primary or replica) of each shard at random and runs the query on it.
Each shard returns from + size entries, containing only the document ids and sort values.
The coordinating node re-sorts everything, keeps the top from + size, then fetches the full documents from the owning shards with a multi-get.
Finally it responds to the client.
The downside: the coordinating node receives shard count * (from + size) entries, and scoring happens independently on each shard, so uneven data distribution skews the scores.
How to avoid it
With little data, use a single shard: no distributed search, so no coordinating-node merge.
Spread data evenly to keep scores accurate, or use dfs_query_then_fetch, which ships detailed term statistics back to the coordinating node for global scoring, at a noticeable performance cost.
Example: 20 shards storing 3 documents
"good"
"good morning"
"good morning everyone"
Under a plain query, each shard computes idf from its own documents only.
dfs_query_then_fetch aggregates the statistics of all three shards before scoring.
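The example above can be computed directly (simplified idf = log(N/df)):

```python
import math

def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)

# The three documents land on three different shards, one per shard.
docs = [["good"], ["good", "morning"], ["good", "morning", "everyone"]]

# Plain query-then-fetch: each shard only sees its own document, so for
# "morning" every shard computes idf with N=1, df=1 -> log(1) = 0.
local = idf(1, 1)

# dfs_query_then_fetch: term statistics are gathered globally first,
# so "morning" gets idf = log(3/2) from N=3, df=2.
global_idf = idf(3, sum("morning" in d for d in docs))

print(local, global_idf)
```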
Sorting and doc values & fielddata
Sorting on a field skips relevance scoring; _score is null.
#sorting on multiple fields
POST /kibana_sample_data_ecommerce/_search
{
"size": 5,
"query": {
"match_all": {
}
},
"sort": [
{"order_date": {"order": "desc"}},
{"_doc":{"order": "asc"}},
{"_score":{ "order": "desc"}}
]
}
By default you cannot sort on a text field; you must enable the fielddata setting first.
PUT kibana_sample_data_ecommerce/_mapping
{
"properties": {
"customer_full_name" : {
"type" : "text",
"fielddata": true,
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
fielddata is a forward index mapping document ids to field values, which makes sorting on full-text text fields possible.
The default, doc values, is columnar storage built alongside the index; it saves heap at the cost of extra index maintenance.
fielddata can be toggled at any time, but changing doc_values requires rebuilding the index.
Pagination and traversal
Because ES shards documents across nodes, a query with from=100 and size=10 must fetch 110 entries from every shard and merge 110 * shard count entries on the coordinating node for re-sorting: the deep pagination problem.
ES rejects queries reaching past 10,000 documents by default.
search after
Specify a sort field plus a unique tiebreaker field (usually _id); each request returns only the documents after the given position, so each shard ships just size * shard count documents in total.
The next page passes the last hit's sort values and document id.
You cannot jump to a page; you can only keep paging forward.
DELETE users
POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":11}
POST users/_doc
{"name":"user2","age":12}
POST users/_doc
{"name":"user2","age":13}
POST users/_search
{
"query": {
"match_all": {}
}
,"size": 2
,"sort": [
{
"age": "desc"
},
{
"_id": "asc"
}
]
}
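The cost difference can be sketched with a toy model of how many entries the shards ship to the coordinating node (assumes uniform data distribution):

```python
def from_size_cost(page, size, shards):
    # deep pagination: every shard returns (from + size) entries
    return (page * size + size) * shards

def search_after_cost(size, shards):
    # search_after: every shard returns only `size` entries per request
    return size * shards

print(from_size_cost(page=100, size=10, shards=5))  # 5050 entries shipped
print(search_after_cost(size=10, shards=5))         # 50 entries shipped
```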
scroll api
The first call with scroll creates a snapshot of the current results; later reads come from the snapshot.
Each subsequent call passes the scroll_id from the first call and pages forward.
Drawback: documents added after the snapshot are invisible to it.
The "scroll": "1m" in follow-up calls extends the snapshot's lifetime.
Calling the snapshot-creating API again produces a new scroll_id and a new snapshot.
#Scroll API
DELETE users
POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":20}
POST users/_doc
{"name":"user3","age":30}
POST users/_doc
{"name":"user4","age":40}
POST users/_search?scroll=3m
{
"size":2,
"query": {
"match_all": {}
}
}
POST users/_doc
{"name":"user5","age":50}
POST users/_doc
{"name":"user7","age":70}
POST _search/scroll
{
"scroll":"1m",
"scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmgzaWpSLWdfVFpLRktpRXRPdjdNRkEAAAAAAAAKhhZYcWtYWU92LVNTdXpmVjQtRjFnSDJn"
}
When to use what
For ordinary queries over fresh data, a plain search is fine.
For exporting or batch-processing all documents, use the scroll API: there is no realtime requirement, and it saves resources.
For pagination use from/size; for deep pagination use search_after, which saves resources while keeping results realtime.
bucket & metric aggregations and nested aggregations
ES aggregations resemble SQL's count and group by.
Prepare sample data
DELETE /employees
PUT /employees/
{
"mappings" : {
"properties" : {
"age" : {
"type" : "integer"
},
"gender" : {
"type" : "keyword"
},
"job" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 50
}
}
},
"name" : {
"type" : "keyword"
},
"salary" : {
"type" : "integer"
}
}
}
}
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
Metric aggregations return only the metric, not the hit list, saving resources.
Inside aggs you name each aggregation and give the function and target field.
POST employees/_search
{
"size": 0
,"aggs": {
"max_salary": {
"max": {"field": "salary"}
}
,"min_salary":{
"min": {
"field": "salary"
}
}
}
}
Returning several statistics at once
POST employees/_search
{
"size": 0
,"aggs": {
"stats_salary": {
"stats": {
"field": "salary"
}
}
}
}
Bucket aggregations group documents for further processing. The terms aggregation buckets by value, including numeric and date fields; size controls how many terms come back.
On a keyword field it deduplicates values and shows each bucket's document count.
A text field requires fielddata to be enabled, and the values get analyzed into tokens first, which is rarely useful.
To speed up terms aggregations, enable eager_global_ordinals in the mapping.
POST employees/_search
{
"size": 0
,"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword"
}
}
}
}
The cardinality aggregation resembles count(distinct)
POST employees/_search
{
"size": 0
,"aggs": {
"cardinate": {
"cardinality": {
"field": "job.keyword"
}
}
}
}
Nested aggregations re-aggregate inside each bucket.
Example: bucket by job first, then use top_hits to fetch the 3 oldest employees per bucket.
POST employees/_search
{
"size": 0
, "aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"age_agg": {
"top_hits": {
"sort": [{"age":{"order":"desc"}}],
"size": 3
}
}
}
}
}
}
eager_global_ordinals
eager_global_ordinals consumes heap and speeds up terms aggregations at the cost of slower document indexing; it can be toggled at any time.
PUT my-index-000001/_mapping
{
"properties": {
"tags": {
"type": "keyword",
"eager_global_ordinals": true
}
}
}
The range aggregation
POST employees/_search
{
"size": 0
,"aggs": {
"salary_agg": {
"range": {
"field": "salary",
"ranges": [
{
"to": 10000
}
,{
"from": 10000,
"to": 20000
}
,{
"key": "大于10000",
"from": 20000
}
]
}
}
}
}
The histogram aggregation: 5000-wide intervals counting employees per salary band
POST employees/_search
{
"size": 0
,"aggs": {
"salary_agg_histogram": {
"histogram": {
"field": "salary",
"interval": 5000,
"extended_bounds": {
"min": 0,
"max": 100000
}
}
}
}
}
Deeper nesting: salary statistics per gender per job
POST employees/_search
{
"size": 0
,"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword"
}
, "aggs": {
"gender_agg": {
"terms": {
"field": "gender"
}
,"aggs": {
"salary_agg": {
"stats": {
"field": "salary"
}
}
}
}
}
}
}
}
pipeline aggregations
ES pipeline
A pipeline aggregation consumes the output of another aggregation; there are sibling pipelines and parent pipelines.
Example: which job has the lowest average salary
A sibling pipeline: the min_salary_by_job result sits at the same level as job_agg.
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 20
},
"aggs": {
"salary_avg": {
"avg": {
"field": "salary"
}
}
}
}
,"min_salary_by_job":{
"min_bucket": {
"buckets_path": "job_agg>salary_avg"
}
}
}
}
Result
Average salary per age, plus the difference between adjacent ages
A parent pipeline: the result is embedded inside each avg bucket.
POST employees/_search
{
"size": 0
,"aggs": {
"age_agg": {
"histogram": {
"min_doc_count": 1,
"field": "age",
"interval": 1
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
},
"derivative_avg_salary":{
"derivative": {
"buckets_path": "avg_salary"
}
}
}
}
}
}
Result screenshot
Aggregation scope and ordering
query and filter restrict the aggregation scope
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
}
POST employees/_search
{
"size": 0,
"aggs": {
"age_agg": {
"filter": {
"range": {
"age": {
"gte": 35
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
},
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
}
post_filter filters the hits without affecting the aggregations
POST employees/_search
{
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
},
"post_filter": {
"term": {
"job.keyword": "Dev Manager"
}
}
}
global widens the aggregation scope back to all documents, here for the overall salary average
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
},
"avg_salary":{
"global": {},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
}
}
}
Ordering
Specify the ordering of aggregation buckets
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": [
{"_count": "asc"},
{"_key": "desc"}
]
}
}
}
}
Order by a separately computed metric
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": [
{"job_avg_agg": "asc"}
]
},
"aggs": {
"job_avg_agg": {
"avg": {
"field": "salary"
}
}
}
}
}
}
Order by a metric inside a sub-aggregation
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": {
"stats_agg.min":"desc"
}
},
"aggs": {
"stats_agg": {
"stats": {
"field": "salary"
}
}
}
}
}
}
Aggregation internals and precision issues
With distributed storage, data skews across shards, so bucket counts in a terms aggregation can be inaccurate.
Here bucket C is unevenly spread across shards, which pushes bucket D out of the result.
One fix is to use a single shard;
another is to raise shard_size, the number of top terms each shard computes independently (default: size * 1.5 + 10).
Enable show_term_doc_count_error to learn whether your result is exact.
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"weather": {
"terms": {
"field":"OriginWeather",
"size":5,
"show_term_doc_count_error":true
}
}
}
}
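The default shard_size mentioned above, as a quick calculation:

```python
def default_shard_size(size):
    # ES default: shard_size = size * 1.5 + 10. Each shard returns more
    # top terms than requested, improving terms-aggregation accuracy.
    return int(size * 1.5) + 10

print(default_shard_size(5))   # 17
print(default_shard_size(10))  # 25
```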
Objects and nested objects
Unlike relational databases, ES flattens and denormalizes data into a single document: queries are fast with no table joins, but frequently updated documents suffer.
DELETE blog
# set the blog mapping
PUT /blog
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"time": {
"type": "date"
},
"user": {
"properties": {
"city": {
"type": "text"
},
"userid": {
"type": "long"
},
"username": {
"type": "keyword"
}
}
}
}
}
}
# insert one blog document
PUT blog/_doc/1
{
"content":"I like Elasticsearch",
"time":"2019-01-01T00:00:00",
"user":{
"userid":1,
"username":"Jack",
"city":"Shanghai"
}
}
#query a sub-object property
POST blog/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "Elasticsearch"
}
},
{
"match": {
"user.username": "Jack"
}
}
]
}
}
}
With actors mapped as type object, the properties are flattened into parallel arrays, so the query below finds a match even though no such person exists.
DELETE my_movies
# the movie mapping
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"properties" : {
"first_name" : {
"type" : "keyword"
},
"last_name" : {
"type" : "keyword"
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
# index one movie
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
POST my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {
"actors.first_name": "Keanu"
}},
{"match": {
"actors.last_name": "Hopper"
}}
]
}
}
}
Remapping with nested
Each actors object gets its own Lucene document, joined at query time, so the mixed-up name below no longer matches.
DELETE my_movies
# create the nested object mapping
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"type": "nested",
"properties" : {
"first_name" : {"type" : "keyword"},
"last_name" : {"type" : "keyword"}
}},
"title" : {
"type" : "text",
"fields" : {"keyword":{"type":"keyword","ignore_above":256}}
}
}
}
}
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
POST my_movies/_search
{
"query": {
"nested": {
"path": "actors",
"query": {
"bool": {
"must": [
{
"match": {
"actors.first_name": "Keanu"
}
},
{
"match": {
"actors.last_name": "Hopper"
}
}
]
}
}
}
}
}
Aggregating on a nested type
POST my_movies/_search
{
"size":0,
"aggs": {
"actors": {
"nested": {
"path": "actors"
},
"aggs": {
"name_agg": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}
}
}
Parent-child document relations
nested treats parent and children as one unit: changing a child re-indexes the whole document, which suits children that are read often and written rarely.
ES parent-child documents let parents and children be maintained independently, with no full re-index, which suits children that are written often.
Join-based querying
The mapping declares the parent-child relation. When indexing, declare each document's type; a child declares its parent id (the id must be unique) and passes a routing parameter so parent and child land on the same shard.
DELETE my_blogs
# set the parent/child mapping
PUT my_blogs
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"blog_comments_relation": {
"type": "join",
"relations": {
"blog": "comment"
}
},
"content": {
"type": "text"
},
"title": {
"type": "keyword"
}
}
}
}
#index a parent document
PUT my_blogs/_doc/blog1
{
"title":"Learning Elasticsearch",
"content":"learning ELK @ geektime",
"blog_comments_relation":{
"name":"blog"
}
}
#index a parent document
PUT my_blogs/_doc/blog2
{
"title":"Learning Hadoop",
"content":"learning Hadoop",
"blog_comments_relation":{
"name":"blog"
}
}
#index a child document
PUT my_blogs/_doc/comment1?routing=blog1
{
"comment":"I am learning ELK",
"username":"Jack",
"blog_comments_relation":{
"name":"comment",
"parent":"blog1"
}
}
#index a child document
PUT my_blogs/_doc/comment2?routing=blog2
{
"comment":"I like Hadoop!!!!!",
"username":"Jack",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
#index a child document
PUT my_blogs/_doc/comment3?routing=blog2
{
"comment":"Hello Hadoop",
"username":"Bob",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
Fetching the parent does not include the child documents; use a parent_id query to get them.
#fetch by parent document id
GET my_blogs/_doc/blog2
# parent_id query
POST my_blogs/_search
{
"query": {
"parent_id": {
"type": "comment",
"id": "blog2"
}
}
}
#fetch the child document (requires routing)
GET my_blogs/_doc/comment3?routing=blog2
Update the child while specifying its parent id; querying afterwards shows the parent's version unchanged while the child's has changed.
PUT my_blogs/_doc/comment3?routing=blog2
{
"comment":"Hello Hadoop??",
"username":"Bob",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
update_by_query and reindex APIs
Changing a mapping does not affect existing documents, only newly indexed ones. To fix existing documents, use update_by_query or the reindex API.
update_by_query: limited to newly added fields; re-indexes the index's existing documents in place.
reindex: migrates an index to a new one; supports changed field types, more shards, and cross-cluster copies.
It requires _source on the source index and can migrate a subset of documents via a query.
op_type can be set so only non-conflicting ids are copied to the target index.
Cross-cluster reindex requires a whitelist in the ES config file.
It can run asynchronously; check progress with GET _tasks?detailed=true&actions=*reindex.
updateByQuery
Example: after indexing a document, add a sub-field that uses the english analyzer.
DELETE blogs/
# index a document
PUT blogs/_doc/1
{
"content":"Hadoop is cool",
"keyword":"hadoop"
}
# check the mapping
GET blogs/_mapping
# change the mapping: add a sub-field with the english analyzer
PUT blogs/_mapping
{
"properties": {
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
# index another document
PUT blogs/_doc/2
{
"content":"Elasticsearch rocks",
"keyword":"elasticsearch"
}
Querying via the english sub-field finds doc 2. Doc 1 was indexed with the default analyzer, whose stored token is Hadoop, so the english-analyzed query misses it.
explain shows the query uses the english analyzer's lowercased token hadoop.
After re-indexing with update_by_query, the document matches.
# query the newly written document
POST blogs/_search
{
"query": {
"match": {
"content.english": "Elasticsearch"
}
},"explain": true
}
# query the document written before the mapping change
POST blogs/_search
{
"query": {
"match": {
"content.english": "Hadoop"
}
},"explain": true
}
POST blogs/_update_by_query
reindex
Create a new index whose keyword field is mapped as keyword instead of text, reindex the data into it, and the new index can aggregate on that field.
DELETE blogs_fix
# create the new index with the new mapping
PUT blogs_fix/
{
"mappings": {
"properties" : {
"content" : {
"type" : "text",
"fields" : {
"english" : {
"type" : "text",
"analyzer" : "english"
}
}
},
"keyword" : {
"type" : "keyword"
}
}
}
}
# Reindex API
POST _reindex
{
"source": {
"index": "blogs"
},
"dest": {
"index": "blogs_fix"
}
}
POST blogs_fix/_search
{
"size": 0,
"aggs":{
"keyword_agg":{
"terms": {
"field": "keyword",
"size": 10
}
}
}
}
ingest pipeline and Painless script
An ingest pipeline processes documents before they are indexed, similar to Logstash: it can split values, add fields, format dates, change case, and so on, removing the need for a separate Logstash tier.
ingest pipeline
DELETE tech_blogs
#blog data: 3 fields, tags separated by commas
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
# test splitting tags
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"content": "You konw, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
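The split and set processors used above can be simulated in a few lines (a toy re-implementation for illustration, not the actual ingest node code):

```python
def run_pipeline(doc, processors):
    # apply each processor to the document in order
    for proc in processors:
        if "split" in proc:
            p = proc["split"]
            doc[p["field"]] = doc[p["field"]].split(p["separator"])
        elif "set" in proc:
            p = proc["set"]
            doc[p["field"]] = p["value"]
    return doc

pipeline = [
    {"split": {"field": "tags", "separator": ","}},
    {"set": {"field": "views", "value": 0}},
]
doc = {"title": "Introducing big data......",
       "tags": "hadoop,elasticsearch,spark"}
print(run_pipeline(doc, pipeline))
# tags becomes ['hadoop', 'elasticsearch', 'spark'], views becomes 0
```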
# test adding a field too
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split and set",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
},
"set": {
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"content": "You konw, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
#register a pipeline
PUT _ingest/pipeline/blog_pipeline
{
"description": "a blog pipeline",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set":{
"field": "views",
"value": 0
}
}
]
}
#view the pipeline
GET _ingest/pipeline/blog_pipeline
#test it
POST _ingest/pipeline/blog_pipeline/_simulate
{
"docs": [
{
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
#index one document without the pipeline
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
#index one document with the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
#search
POST tech_blogs/_search
{}
Because indexing doc 2 through the pipeline turned tags into an array while doc 1 still holds a plain string, the update_by_query must skip the already-processed documents (those that have a views field).
#add a condition to update_by_query
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "views"
}
}
}
}
}
painless script
使用脚本来处理文档字段的值, 语法类似 Java, 支持大部分 Java API, 例如 string.contains(); ES 6.0 之后 painless 是唯一支持的脚本语言;
脚本编译开销大, 编译结果会被缓存 (默认缓存 100 个脚本), 重复执行时性能较高;
DELETE tech_blogs
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data",
"views":0
}
#更新时使用脚本
POST tech_blogs/_update/1
{
"script": {
"source": "ctx._source.views += params.new_views",
"params": {
"new_views":100
}
}
}
# 查看views计数
POST tech_blogs/_search
{
}
#保存脚本在 Cluster State
POST _scripts/update_views
{
"script":{
"lang": "painless",
"source": "ctx._source.views += params.new_views"
}
}
#使用保存的脚本
POST tech_blogs/_update/1
{
"script": {
"id": "update_views",
"params": {
"new_views":1000
}
}
}
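更新脚本 ctx._source.views += params.new_views 做的事情, 等价于下面这段 Python 伪实现 (仅为示意, 字段名与参数沿用上文例子):

```python
def update_views(source, params):
    """模拟 painless 脚本: ctx._source.views += params.new_views"""
    source["views"] += params["new_views"]
    return source

doc = {"title": "Introducing big data......", "views": 0}
update_views(doc, {"new_views": 100})   # 对应第一次 _update
update_views(doc, {"new_views": 1000})  # 对应使用保存脚本的第二次 _update
print(doc["views"])  # 1100
```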
获取字段值并加上随机数
GET tech_blogs/_search
{
"script_fields": {
"rnd_views": {
"script": {
"lang": "painless",
"source": """
java.util.Random rnd = new Random();
doc['views'].value+rnd.nextInt(1000);
"""
}
}
},
"query": {
"match_all": {}
}
}
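script_fields 中的随机数逻辑, 用 Python 示意如下 (randrange 对应 painless 里的 rnd.nextInt, 仅为演示, 不是 ES 内部实现):

```python
import random

def rnd_views(views, upper=1000):
    """模拟 doc['views'].value + rnd.nextInt(1000): 在字段值上加一个随机数"""
    return views + random.randrange(upper)

v = rnd_views(1100)
# 结果落在 [1100, 2100) 区间内, 每次查询返回的 rnd_views 都不同
```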
数据建模
对一个字段来说, 设置 mapping 时需要考虑以下几点
es mapping参数
字段类型
需要全文搜索的使用 text;
需要聚合、排序、filter 精确查询的使用 keyword;
需要使用不同分词器检索的, 增加子字段;
结构化数据
数值类型尽量贴合实际取值范围, 能用 byte 就不要用 long;
枚举类型即使是数字, 也建议用 keyword, term 查询性能更好;
检索
不需要检索、排序、聚合的字段, enabled 设置为 false;
不需要检索的字段, index 设置为 false;
需要检索但不需要算分的字段, norms 设置为 false, 减少磁盘使用 (norms 用于提高算分精确性, 会占用额外空间);
聚合排序
不需要排序和聚合的字段, 即便是 keyword 也把 doc_values / fielddata 设置为 false;
更新频繁且经常用于聚合的 keyword 字段, eager_global_ordinals 设置为 true (利用缓存, 提高 terms 聚合性能);
额外存储
关闭 _source 时, 可对需要的字段开启 store 单独保存, 节省空间、降低 io; 但关闭 _source 后无法 reindex 和 update, 因此一般优先考虑提高 _source 的压缩比;
案例
封面url不需要被搜索, 设置为index:false, 仍然可以聚合
# Index 一本书的信息
PUT books/_doc/1
{
"title":"Mastering ElasticSearch 5.0",
"description":"Master the searching, indexing, and aggregation features in ElasticSearch Improve users’ search experience with Elasticsearch’s functionalities and develop your own Elasticsearch plugins",
"author":"Bharvi Dixit",
"public_date":"2017",
"cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}
#优化字段类型
"cover_url": {
"type": "keyword",
"index": false
}
在存储大文本字段时, 关闭source , 打开其余字段的store
PUT books
{
"mappings" : {
"_source": {"enabled": false},
"properties" : {
"author" : {"type" : "keyword","store": true},
...
}
}
}
}
因为关闭了 _source, 直接搜索时不会返回文档字段内容, 需要用 stored_fields 指明要返回的 store 字段;
#搜索,通过store 字段显示数据,同时高亮显示 conent的内容
POST books/_search
{
"stored_fields": ["title","author","public_date"],
"query": {
"match": {
"content": "searching"
}
},
"highlight": {
"fields": {
"content":{}
}
}
}
相关api
index template & dynamic template 帮助快速创建索引
index alias 将索引名指向另一个索引, 做到写时替换
update by query / reindex
数据建模最佳实践
object: 优先考虑反范式化 (denormalization);
nested: 字段值存在一对多关系, 且查询频繁;
parent child: 字段值存在一对多关系, 且更新多于查询的场景, 例如文章和评论;
7.0.1 kibana对nested 和child 可视化支持不好
避免大量字段
字段mapping维护在集群cluster state 中 , 对性能有影响, 需要所有节点同步这个信息 ;
默认最大字段数是1000;
大量字段原因可能是开启自动映射dynamic mapping
nested & key value
使用 nested + key/value 的方式, 解决以下场景中不断动态新增字段的问题
解决了字段过多的问题, 但 kibana 对 nested 可视化展示不好, 也增加了查询复杂度
"person":{
"name":"张三",
"age":15,
"id":123,...
}
#改变成
"person":[
{"keyName":"name","value":"张三"},
{"keyName":"age","value":15},
...
]
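从动态字段的 object 转成固定 schema 的 key/value 数组, 转换逻辑可以用 Python 示意 (函数名为假设, 仅演示思路):

```python
def to_key_value(obj):
    """把字段名不固定的 object, 转成固定 schema 的 key/value 数组,
    避免每个新字段都进入索引 mapping"""
    return [{"keyName": k, "value": v} for k, v in obj.items()]

person = {"name": "张三", "age": 15, "id": 123}
print(to_key_value(person))
# [{'keyName': 'name', 'value': '张三'}, {'keyName': 'age', 'value': 15}, ...]
```

实际写入 ES 时, 还需要像下面的 cookie_service 例子那样, 按 value 的类型选择 keywordValue / intValue / dateValue 等字段。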
DELETE cookie_service
#使用 Nested 对象,增加key/value
PUT cookie_service
{
"mappings": {
"properties": {
"cookies": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"dateValue": {
"type": "date"
},
"keywordValue": {
"type": "keyword"
},
"intValue": {
"type": "integer"
}
}
},
"url": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
##写入数据,使用key和合适类型的value字段
PUT cookie_service/_doc/1
{
"url": "www.google.com",
"cookies": [
{
"name": "username",
"keywordValue": "tom"
},
{
"name": "age",
"intValue": 32
}
]
}
PUT cookie_service/_doc/2
{
"url": "www.amazon.com",
"cookies": [
{
"name": "login",
"dateValue": "2019-01-01"
},
{
"name": "email",
"intValue": 32
}
]
}
POST cookie_service/_search
{
"query": {
"nested": {
"path": "cookies",
"query": {
"bool": {
"filter": [
{
"term": {
"cookies.name": "age"
}
},
{
"range": {
"cookies.intValue": {
"gte": 30
}
}
}
]
}
}
}
}
}
避免通配符查询
将针对一个字段的 wildcard 模糊查询, 改为对多个字段的精确查询
案例: 针对版本号的查询
原本对 software_version 的查询需要用 7.1.* 这样的通配符, 改为使用 inner object, 把版本号的每一位存成单独的数值字段
PUT softwares/_doc/1
{
"software_version":"7.1.0"
}
DELETE softwares
# 优化,使用inner object
PUT softwares/
{
"mappings": {
"_meta": {
"software_version_mapping": "1.1"
},
"properties": {
"version": {
"properties": {
"display_name": {
"type": "keyword"
},
"hot_fix": {
"type": "byte"
},
"major": {
"type": "byte"
},
"minor": {
"type": "byte"
}
}
}
}
}
}
PUT softwares/_doc/1
{
"version":{
"display_name":"7.1.0",
"major":7,
"minor":1,
"hot_fix":0
}
}
PUT softwares/_doc/2
{
"version":{
"display_name":"7.2.0",
"major":7,
"minor":2,
"hot_fix":0
}
}
PUT softwares/_doc/3
{
"version":{
"display_name":"7.2.1",
"major":7,
"minor":2,
"hot_fix":1
}
}
POST softwares/_search
{
"query": {
"bool": {
"filter": [
{
"match": {
"version.major": 7
}
},
{
"match": {
"version.minor": 2
}
}
]
}
}
}
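把版本号字符串拆成数值字段再做精确过滤的思路, 用 Python 示意如下 (函数名与字段名仅为示意):

```python
def parse_version(display_name):
    """把 '7.2.1' 这样的版本号拆成可精确过滤的数值字段"""
    major, minor, hot_fix = (int(x) for x in display_name.split("."))
    return {"display_name": display_name,
            "major": major, "minor": minor, "hot_fix": hot_fix}

docs = [parse_version(v) for v in ("7.1.0", "7.2.0", "7.2.1")]
# 等价于上文的 bool filter: major == 7 且 minor == 2, 替代 7.2.* 通配符
hits = [d["display_name"] for d in docs if d["major"] == 7 and d["minor"] == 2]
print(hits)  # ['7.2.0', '7.2.1']
```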
避免空值引发的聚合不准
给空值赋予默认值
下面案例中 avg 聚合的结果是 5.0 (null 被跳过, 只有 5/1 参与计算), 而期望的是把两条评分都算进去, 即 5/2 = 2.5, 空值影响了结果
PUT ratings/_doc/1
{
"rating":5
}
PUT ratings/_doc/2
{
"rating":null
}
POST ratings/_search
{
"size": 0,
"aggs": {
"avg": {
"avg": {
"field": "rating"
}
}
}
}
mapping 中为 rating 设置 null_value 后, null 在索引时被当作 1.0 参与聚合, avg 结果为 (5+1)/2 = 3.0; 文档 2 的 _source 中记录的仍然是 null
# Not Null 解决聚合的问题
DELETE ratings
PUT ratings
{
"mappings": {
"properties": {
"rating": {
"type": "float",
"null_value": 1.0
}
}
}
}
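两种聚合结果的差异可以用 Python 直接验证 (模拟 ES 聚合跳过缺失值, 以及 null_value 在索引时的替换效果):

```python
ratings = [5, None]

# 不设置 null_value 时: ES 聚合直接跳过缺失值, avg = 5/1 = 5.0
valid = [r for r in ratings if r is not None]
avg_skip_null = sum(valid) / len(valid)

# 设置 "null_value": 1.0 后: null 在索引时被替换成 1.0, avg = (5+1)/2 = 3.0
indexed = [r if r is not None else 1.0 for r in ratings]
avg_with_default = sum(indexed) / len(indexed)

print(avg_skip_null, avg_with_default)  # 5.0 3.0
```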
将mapping信息额外管理
mapping信息是不断迭代的 , 记录映射文件版本号
# 在Mapping中加入元信息,便于管理
PUT softwares/
{
"mappings": {
"_meta": {
"software_version_mapping": "1.0"
}
}
}
总结
结构化搜索
term查询不做分词, 对keyword字段做精确匹配
match为全文搜索,进行分词
query context vs filter context
filter不进行算分 , 利用缓存
bool查询filter 和must not 都是filter
搜索算分
tf / idf ; 通过字段 boosting 控制算分结果, 例如 boosting query 的 negative 部分对包含 like 的文档降权
单字符串多字段查询
best_fields 取单个字段中的最高分
most_fields 合并多个字段的分值
cross_fields 将多个字段当作一个字段来匹配词项
搜索相关性
多语言 : 设置多个子字段使用不同的分词器
search template 分离代码逻辑和搜索dsl, 不需要改动客户端查询代码, 完成查询逻辑的替换
聚合
bucket / metric / pipeline
分页
from size
导出使用 scroll api ,避免深分页
分布式存储
文档id hash路由 , 主分片不能修改;
分片内部原理
segment / transaction log / refresh / merge
分布式查询和聚合分析的内部机制
query then fetch : 算分用的 idf 基于单个分片而非全局, 数据量少时打分会有偏差, 可以通过指定 shard_size 让更多的分片数据参与计算
数据建模
es处理管理关系; 数据建模常见步骤 ; 建模实践;
建模相关的工具
index template / dynamic template / ingest node / update by query / reindex / index alias
最佳实践
避免过多的字段 , 避免 wildcard 模糊查询
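上面提到的文档 id hash 路由规则, 可以用一小段 Python 示意 (ES 实际对 _routing 默认取 _id 并使用 murmur3 哈希, 这里用 crc32 代替, 仅作演示):

```python
import zlib

def route(routing, number_of_primary_shards):
    """示意文档路由: shard = hash(_routing) % 主分片数.
    ES 实际使用 murmur3, 此处用 crc32 代替仅为演示."""
    return zlib.crc32(routing.encode()) % number_of_primary_shards

shard = route("doc-1", 3)
assert 0 <= shard < 3
# 正因为路由公式依赖主分片数, 主分片数一旦确定就不能修改,
# 否则同一个 id 会被路由到不同的分片, 旧文档将无法被找到
```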
term 和 match的理解
以下搜索的情况及原因
DELETE test
#默认分词器, 存储成 hello 和 world
PUT test/_doc/1
{
"content":"Hello World"
}
#standard
GET _analyze
{
"analyzer": "standard",
"text": "Hello World"
}
#1使用分词器 有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content": "Hello World"
}
}
}
#2使用分词器 有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content": "hello world"
}
}
}
#3不用分词器 ,搜索keyword子字段,有, 因为keyword存储的是原文本
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content.keyword": "Hello World"
}
}
}
#4不用分词器 , 精确匹配 没有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content.keyword": "hello world"
}
}
}
#5不用分词器 ,没有, 存储的文本为小写的, 用大写的搜索无法匹配
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content": "Hello World"
}
}
}
#6 不用分词器 , 没有; 倒排索引中只有 hello 和 world 两个独立的term, 不存在 hello world 这个完整的term
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content": "hello world"
}
}
}
#7 term查询keyword子字段, 有, keyword保存的是原始文本
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content.keyword": "Hello World"
}
}
}
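上面几组查询的差异, 可以用一个极简的分词/倒排示意来理解 (纯演示, 不代表 ES 内部实现):

```python
text = "Hello World"

# content 字段: standard 分词器 -> 小写的 term 列表
analyzed_terms = text.lower().split()  # ['hello', 'world']
# content.keyword 子字段: 保存原始文本, 不分词
keyword_value = text

def match(terms, query):
    """match: 查询串也会被分词, 任一 term 命中即可"""
    return any(t in terms for t in query.lower().split())

def term(terms, query):
    """term: 查询串不分词, 作为一个完整 term 精确比较"""
    return query in terms

print(match(analyzed_terms, "Hello World"))  # True  (对应 #1)
print(term(analyzed_terms, "Hello World"))   # False (对应 #5, 大写查不到)
print(term(analyzed_terms, "hello world"))   # False (对应 #6, 没有完整 term)
print(term([keyword_value], "Hello World"))  # True  (对应 #7)
```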
#standard
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
测试
生产环境中使用 alias, 便于无缝切换索引;
分片数大于 1 时, 指定 shard_size 提升 terms 聚合的精准度;
cardinality 聚合用于求去重后的数量, 例如求出有多少个分类