Catalog
- Term-based and full-text search
- Structured search
- Search relevance scoring
- query filter and multi-string queries
- Single-string multi-field queries
- Single-string multi-field queries: multi_match
- Hands-on practice
- search template and index alias
- Composite ranking: Function Score Query to tune scoring
- Term & Phrase Suggester
- Autocomplete and context-based suggestions
- Cross-cluster search
- Cluster distribution model, master election, and split-brain
- Shards and cluster failover
- Distributed document storage
- Shards and their lifecycle
- Anatomy of distributed search and relevance scoring
- Sorting and doc values & fielddata
- Pagination and traversal
- bucket & metric aggregations and nested aggregations
- pipeline aggregations
- Aggregation scope and ordering
- Aggregation internals and precision issues
- Objects and nested objects
- Parent-child document relations
- update_by_query and reindex APIs
- ingest pipeline and Painless script
- Data modeling
- Data modeling best practices
- Summary
Term-based and full-text search
Term queries
At index time the desc field goes through an analyzer, so "iPhone" is stored as the lowercase term iphone. A term query on desc is not analyzed: it searches for the literal term iPhone, so it finds nothing.
To match with a term query, either query the analyzed term (iphone) or query the field's keyword sub-field.
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }
POST products/_search
{
"query": {
"term": {
"desc": {
"value": "iPhone"
}
}
},"profile": "true"
}
Analysis result
A term query still computes a score, even on a keyword field. Wrapping it in a constant_score filter skips scoring, reduces overhead, and lets the filter be cached.
POST /products/_search
{
"explain": true,
"query": {
"constant_score": {
"filter": {
"term": {
"productID.keyword": "XHDK-A-1293-#fJ3"
}
}
}
}
}
Full-text search
A match query analyzes the input into terms, runs each term as a separate query, and merges the results.
match_phrase treats the words as a unit and requires them in order; slop allows positional deviation.
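The difference can be sketched with a toy analyzer (a simplified illustration of the idea, not the actual Lucene implementation):

```python
# Toy illustration: match = OR over analyzed terms; match_phrase = terms
# must appear in order, with `slop` allowing positional deviation.
def analyze(text):
    return text.lower().split()

def match(doc, query):
    # any query term present in the document matches
    doc_terms = set(analyze(doc))
    return any(t in doc_terms for t in analyze(query))

def match_phrase(doc, query, slop=0):
    # all terms present, in order, within `slop` positional gaps
    doc_terms = analyze(doc)
    positions = []
    for term in analyze(query):
        if term not in doc_terms:
            return False
        positions.append(doc_terms.index(term))
    gaps = sum(max(0, positions[i + 1] - positions[i] - 1)
               for i in range(len(positions) - 1))
    return positions == sorted(positions) and gaps <= slop

doc = "quick brown fox"
print(match(doc, "fox quick"))                # True: terms queried independently
print(match_phrase(doc, "quick fox"))         # False: one word in between
print(match_phrase(doc, "quick fox", slop=1)) # True: slop tolerates the gap
```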
Structured search
Searching on boolean values
POST products/_search
{
"query": {
"term": {
"avaliable": {
"value": "true"
}
}
}
}
POST products/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"avaliable": true
}
}
}
}
}
Numbers
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 10,
"lte": 20
}
}
}
}
}
Searching on dates
Now minus 4 years, i.e. match documents from the last 4 years.
"query": {
"constant_score": {
"filter": {
"range": {
"date": {
"gte" : "now-4y"
}
}
}
}
}
Querying for a non-null field
"query": {
"constant_score": {
"filter": {
"exists": {
"field": "date"
}
}
}
}
Querying multi-valued fields
This matches documents whose genres include Comedy, not only documents whose genre is exactly Comedy.
POST movies/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"genre.keyword": "Comedy"
}
}
}
}
}
Search relevance scoring
tf (term frequency): how often the term occurs in the document. E.g. in "我是中国人, 生在中国", "中国" occurs twice.
df (document frequency): how many documents contain the term; idf is its inverse. If "中国" appears in 200 of 1000 documents, idf = log(1000/200).
Lucene originally scored with TF-IDF (tf weighted by idf), later replaced by BM25, which fixes the problem of the score growing without bound as tf grows. ES lets you choose the similarity algorithm at index creation.
Use explain to inspect the scoring details.
Both documents below contain the target term, but doc 2 is shorter, so its tf contribution scores higher.
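The scoring intuition above can be computed numerically (simplified formulas for illustration; BM25 here shows only the tf-saturation part, ignoring length normalization):

```python
import math

def idf(total_docs, docs_with_term):
    # e.g. a term appearing in 200 of 1000 docs -> log(1000/200)
    return math.log(total_docs / docs_with_term)

def tfidf_score(tf, total_docs, docs_with_term):
    # classic TF-IDF: grows linearly and without bound as tf grows
    return tf * idf(total_docs, docs_with_term)

def bm25_tf(tf, k1=1.2):
    # BM25 term-frequency component: saturates toward (k1 + 1)
    return tf * (k1 + 1) / (tf + k1)

print(idf(1000, 200))               # ~1.609
print(tfidf_score(100, 1000, 200))  # keeps growing with tf
print(bm25_tf(1), bm25_tf(100))     # 1.0 vs ~2.17, bounded by k1 + 1 = 2.2
```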
PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }
POST testscore/_search
{
"query": {
"match": {
"content": "elasticsearch"
}
},"explain": true
}
Use a boosting query to shape the score; here negative down-weights documents containing "like".
POST testscore/_search
{
"query": {
"boosting" : {
"positive" : {
"term" : {
"content" : "elasticsearch"
}
},
"negative" : {
"term" : {
"content" : "like"
}
},
"negative_boost" : 0.2
}
}
}
query filter and multi-string queries
A bool query combines conditions on multiple fields.
must and should contribute to scoring; filter and must_not do not.
# Basic syntax
POST /products/_search
{
"query": {
"bool" : {
"must" : {
"term" : { "price" : "30" }
},
"filter": {
"term" : { "avaliable" : "true" }
},
"must_not" : {
"range" : {
"price" : { "lte" : 10 }
}
},
"should" : [
{ "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
{ "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
],
"minimum_should_match" :1
}
}
}
Single-string multi-field queries
Use dis_max (disjunction max) to query several fields, compare each field's score, and take the highest as the document score.
PUT /blogs/_doc/1
{
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
PUT /blogs/_doc/2
{
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{"match": {
"title": "Brown fox"
}},
{
"match": {
"body": "Brown fox"
}
}
]
}
},
"explain": true
}
In this example doc 1 contains brown in both fields, but only the single best field score is kept. Doc 2 adds up brown and fox: fox appears in only one document, so it is rarer and deserves a higher score. Doc 2 therefore matches the intent better, scores higher, and ranks first.
Searching "Quick pets" instead gives both documents the same score, because each document contains the same matching words.
tie_breaker balances the scores: instead of taking only the best field, the other fields' scores are multiplied by tie_breaker and added to the total.
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.2
}
}
}
If you want every field's score to count, I feel a bool should query can simply replace dis_max.
POST /blogs/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
]
}
},"explain": true
}
Single-string multi-field queries: multi_match
Example: querying only title for "barking dogs" scores doc 1 higher because the document is shorter, yet doc 2 matches the intent better. To boost doc 2, add a title.std sub-field and query both fields.
multi_match is more concise to write than dis_max; its default type is best_fields, i.e. disjunction max.
DELETE /titles
PUT /titles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english",
"fields": {"std": {"type": "text","analyzer": "standard"}}
}
}
}
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
GET titles/_search
{
"query": {
"match": {
"title": "barking dogs"
}
}
}
GET titles/_search
{
"query": {
"multi_match": {
"type": "most_fields",
//"type": "best_fields",
"query": "barking dogs",
"fields": ["title","title.std"]
}
}
}
Hands-on practice
Import TMDB into ES with the title field mapped to the english analyzer, then use multi_match with the query "basketball with cartoon aliens" to find Space Jam.
With the default standard analyzer at index time, this search returns nothing.
multi_match defaults to best_fields mode, using only the highest-scoring field's score.
"mappings": {
"properties": {
"overview": {
"type": "text",
"analyzer": "english",
"fields": {
"std": {
"type": "text",
"analyzer": "standard"
}
}
},
"popularity": {
"type": "float"
},
"title": {
"type": "text",
"analyzer": "english",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
"query": {
"multi_match": {
"query": "basketball with cartoon aliens",
"fields": ["title^10","overview"]
}
}
pip install pyenv-win --target %USERPROFILE%/.pyenv
How to install multiple Python versions on Windows 10 with pyenv
My PATH was broken, so I ran cmd directly inside the pyenv directory.
pyenv install 2.7.15
pyenv versions
python -V
pyenv global 2.7.15
pyenv global
Default mapping, default query
Custom mapping, english analyzer, most_fields mode
With the default mapping and default query, Space Jam only matches on basketball; the standard analyzer kept "aliens" as-is in the document, so the query token alien misses.
search template and index alias
A search template is a stored query script. Later searches reference the template and pass in variables; editing the template changes the query behavior without touching callers.
POST tmdb/_search
{
"_source": ["title","overview"],
"size":20,
"query": {
"multi_match": {
"type": "most_fields",
"query": "basketball with cartoon aliens",
"fields": ["title","overview"]
}
}
,"explain": true
}
POST _scripts/tmdb
{
"script": {
"lang": "mustache",
"source": {
"_source": [
"title","overview"
],
"size": 20,
"query": {
"multi_match": {
"query": "{{q}}",
"fields": ["title","overview"]
}
}
}
}
}
POST tmdb/_search/template
{
"id":"tmdb",
"params": {
"q": "basketball with cartoon aliens"
}
}
Index alias
Create an alias for an index, query through the alias instead of the index name, and optionally attach extra filter rules to the alias.
PUT movies-2019/_doc/1
{
"name":"the matrix",
"rating":5
}
PUT movies-2019/_doc/2
{
"name":"Speed",
"rating":3
}
//create the alias
POST _aliases
{
"actions": [
{
"add": {
"index": "movies-2019",
"alias": "movies-latest"
}
}
]
}
//query via the alias: two hits
POST movies-latest/_search
{
"query": {
"match_all": {}
}
}
//create an alias with a filter rule
POST _aliases
{
"actions": [
{
"add": {
"index": "movies-2019",
"alias": "movies-lastest-highrate",
"filter": {
"range": {
"rating": {
"gte": 4
}
}
}
}
}
]
}
//only one hit
POST movies-lastest-highrate/_search
{
"query": {
"match_all": {}
}
}
Composite ranking: Function Score Query to tune scoring
Query the documents below: their content is identical, and we weight the score by the number of votes, changing the scoring logic.
function_score can freely reshape the score after the query matches, e.g. with scripts or custom logic.
In this example, field_value_factor by default multiplies the query score by a value derived from the field.
DELETE blogs
PUT /blogs/_doc/1
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 0
}
PUT /blogs/_doc/2
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 100
}
PUT /blogs/_doc/3
{
"title": "About popularity",
"content": "In this post we will talk about...",
"votes": 1000000
}
POST /blogs/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query":"popularity",
"fields": ["title", "content"]
}
},
"field_value_factor": {
//the field used for scoring
"field": "votes",
//modifier function applied to the field value
"modifier": "log1p" ,
//multiplier applied to the field value
"factor": 0.1
},
//boost_mode defaults to multiply; sum switches to addition
"boost_mode": "sum",
//cap each document's function score at 3
"max_boost": 3
}
}
}
Random seed
Example: ensure a given user sees ads in a consistent order while browsing, by seeding random_score with the user's session or id.
Across different users the order varies, which improves ad exposure.
POST /blogs/_search
{
"query": {
"function_score": {
"random_score": {
"seed": 911119
,"field": "_seq_no"
}
}
}
}
Term & Phrase Suggester
Suggesters recommend terms: for a misspelled input, ES checks the index's term dictionary and, if the term is missing, returns similar terms.
Example: for the input "lucen rock", a term suggester on the same field returns the suggestions lucene and rocks.
The default suggest_mode is missing: suggest only when the term is absent from the index. Candidate scoring is based on the character-level difference between the input token and candidate tokens.
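The character-difference ranking can be sketched with a standard edit-distance function (an illustration of the idea; the real suggester uses Lucene's internals and more heuristics):

```python
def edit_distance(a, b):
    # classic Levenshtein dynamic programming
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[len(a)][len(b)]

# hypothetical term dictionary built from the indexed documents
vocabulary = ["lucene", "rocks", "rock", "elasticsearch"]

def suggest(term, max_edits=2):
    # rank candidates by edit distance; skip exact matches (missing mode)
    candidates = [(edit_distance(term, w), w) for w in vocabulary]
    return [w for d, w in sorted(candidates) if 0 < d <= max_edits]

print(suggest("lucen"))  # closest candidates first
```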
DELETE articles
POST articles/_bulk
{ "index" : { } }
{ "body": "lucene is very cool"}
{ "index" : { } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "body": "Elasticsearch rocks"}
{ "index" : { } }
{ "body": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "body": "Elk stack rocks"}
{ "index" : {} }
POST articles/_search
{
"size": 1,
"query": {
"match": {
"body": "lucen rock"
}
}
,
"suggest": {
"term-suggestion": {
"text": "lucen rock",
"term": {
"field": "body"
,"suggest_mode":"missing"
//prefix tolerance for candidate matching; with 0, input hock also suggests rock
,"prefix_length":0
}
}
}
}
The Phrase Suggester adds more parameters; confidence sets the threshold a candidate must exceed to be returned.
POST /articles/_search
{
"suggest": {
"my-suggestion": {
"text": "lucne and elasticsear rock hello world ",
"phrase": {
"field": "body",
"max_errors":2,
"confidence":2,
"direct_generator":[{
"field":"body",
"suggest_mode":"always"
}],
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}
Autocomplete and context-based suggestions
Type a prefix and ES returns completions for it.
ES implements this not with the inverted index but with an FST: a compact, map-like in-memory structure well suited to prefix matching.
The field to complete must be mapped with type completion.
Example: the query below returns completions for the prefix "elk".
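A minimal in-memory prefix structure conveys the lookup pattern (an FST is far more compact than this dict-based trie, but the prefix walk is the same idea):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, phrase):
        node = self.root
        for ch in phrase.lower():
            node = node.setdefault(ch, {})
        node["$"] = phrase  # mark a complete entry

    def complete(self, prefix):
        # walk down the prefix, then collect every entry below it
        node = self.root
        for ch in prefix.lower():
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [node]
        while stack:
            for k, v in stack.pop().items():
                if k == "$":
                    out.append(v)
                else:
                    stack.append(v)
        return sorted(out)

t = Trie()
for title in ["Elk stack rocks",
              "elastic is the company behind ELK stack",
              "Elasticsearch rocks"]:
    t.insert(title)
print(t.complete("elk"))  # entries whose lowercase form starts with "elk"
```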
DELETE articles
PUT articles
{
"mappings": {
"properties": {
"title_completion":{
"type": "completion"
}
}
}
}
POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}
{ "index" : {} }
POST articles/_search
{
"size": 0,
"suggest": {
"article-suggester": {
"prefix":"elk"
,"completion": {
"field": "title_completion"
}
}
}
}
Beyond plain completion there are context-based suggestions (suggest context).
The mapping adds contexts to the completion type; documents are indexed with a context value, and queries supply a context, which acts as an extra filter condition.
DELETE comments
PUT comments
PUT comments/_mapping
{
"properties":{
"comment_autocomplete":{
"type":"completion",
"contexts":[{
"type":"category",
"name":"comment_category"
}]
}
}
}
POST comments/_doc
{
"comment":"I love the star war movies",
"comment_autocomplete":{
"input":["star wars"],
"contexts":{
"comment_category":"movies"
}
}
}
POST comments/_doc
{
"comment":"Where can I find a Starbucks",
"comment_autocomplete":{
"input":["starbucks"],
"contexts":{
"comment_category":"coffee"
}
}
}
POST comments/_search
{
"suggest": {
"MY_SUGGESTION": {
"prefix": "sta",
"completion":{
"field":"comment_autocomplete",
"contexts":{
"comment_category":"movies"
}
}
}
}
}
Which query fits which scenario
Cross-cluster search
In a single cluster the master carries heavy load and becomes a bottleneck; nodes cannot be scaled out indefinitely.
Early ES supported cross-cluster queries with tribe node: it had to join the clusters, queries went through it, it restarted slowly, and duplicate index names across clusters caused problems.
Since 5.3, cross cluster search no longer requires joining a client node to the cluster; any node can serve as the coordinating node for the request.
A demo on Windows
Start three clusters
bin/elasticsearch -E node.name=cluster0node -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
bin/elasticsearch -E node.name=cluster1node -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
bin/elasticsearch -E node.name=cluster2node -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302
Send the requests with Postman
curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
curl -XPUT "http://localhost:9201/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
curl -XPUT "http://localhost:9202/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'
#create test data, different in each cluster
curl -XPOST "http://localhost:9200/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user1","age":10}'
curl -XPOST "http://localhost:9201/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user2","age":20}'
curl -XPOST "http://localhost:9202/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user3","age":30}'
Query all three clusters at once
GET /users,cluster1:users,cluster2:users/_search
{
"query": {
"range": {
"age": {
"gte": 10,
"lte": 40
}
}
}
}
Cluster distribution model, master election, and split-brain
Each ES node is a Java process; the cluster name and node names are configurable.
Node types:
- Coordinating node: every node by default; handles requests. Fix node roles explicitly in production.
- Data node: every node by default; holds shards; scales data.
- Master node: maintains indices, cluster state, and shard locations.
- Master-eligible node: every node by default; participates in the election when the master fails.
Split-brain
When the network partitions the cluster into two regions, the region without a master elects a new one and keeps serving. After the network heals, the master that loses the re-election discards the data it accepted during the partition.
Older versions avoided split-brain with an election quorum setting: elections required more than that many nodes. From 7.0 the setting is removed and ES manages it itself.
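The pre-7.0 quorum rule of thumb can be sketched as a quick calculation:

```python
def minimum_master_nodes(master_eligible):
    # quorum: strictly more than half of the master-eligible nodes
    return master_eligible // 2 + 1

# With 3 master-eligible nodes the quorum is 2: a partitioned-off single
# node cannot elect a master (1 < 2), so only the majority side keeps
# serving writes and no split-brain occurs.
print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
```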
Shards and cluster failover
A shard is a Lucene index; the primary shard count cannot change after index creation.
Replica shards provide high availability and can be adjusted at runtime; replicas also serve queries, so adding replicas adds read throughput.
Choosing shard counts
Too few shards make it hard to scale out; too many hurt performance.
More replicas mean more synchronization work, which slows writes.
Example: a cluster with 3 primaries and 1 replica.
After the master node fails, a new master is elected first, then the data is redistributed to the remaining nodes.
Distributed document storage
To spread data evenly and use capacity well, a document's shard is by default computed from the document id modulo the primary shard count, which is why the primary count cannot change. You can also route on a chosen value so related documents land on one shard.
Updating a document
The coordinating node hashes out the document's shard; the update deletes then re-creates the document, then responds to the client.
Deleting a document
The coordinating node routes to the document's shard, deletes it on the primary, then on the replicas, then responds.
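The routing formula above can be sketched directly; real ES uses murmur3 rather than Python's built-in hash, which stands in here for illustration:

```python
def shard_for(routing_value, number_of_primary_shards):
    # ES: shard = hash(_routing) % number_of_primary_shards
    # _routing defaults to the document _id.
    return hash(routing_value) % number_of_primary_shards

# The same document maps to a different shard if the primary count
# changes, which would strand existing documents -- hence the primary
# shard count is fixed at index creation.
doc_id = "my-doc-1"
print(shard_for(doc_id, 3))
print(shard_for(doc_id, 5))
```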
Shards and their lifecycle
Immutability of the inverted index
Benefits:
no concurrent-write concerns, no locking;
once read from the filesystem into the cache, reads stay in memory (given enough RAM);
caches are easy to build and maintain, and the data compresses well.
Drawback:
making a new document searchable requires rebuilding the entire index.
Lucene index
One Lucene inverted index is a segment; a commit point tracks the segments; deletions are recorded in .del files. Queries walk all segments and filter out deleted documents.
refresh
Indexed documents first go into the index buffer; after 1 second (configurable) the buffered documents become a segment. The segment is not yet on disk, it lives in the filesystem cache. Turning the index buffer into a segment is called refresh.
The index buffer defaults to 10% of the heap; filling it also triggers a refresh.
refresh does not fsync.
transaction log
So that data is searchable right after buffering without risk of loss, each write also appends to an on-disk transaction log; generating a segment does not clear the log.
Each shard has its own transaction log.
The transaction log flush threshold defaults to 512 MB.
flush
Persists everything in the caches: it runs a refresh, fsyncs the cached segments to disk, then truncates the transaction log.
Happens every 30 minutes by default, or when the transaction log fills up.
merge
As segments keep accumulating on disk, merge combines the small scattered segments and purges the .del entries.
Merging is managed automatically by ES and can also be triggered via the API.
Anatomy of distributed search and relevance scoring
Distributed search runs as query then fetch.
Query phase: the coordinating node picks one copy (primary or replica) of each shard at random and runs the query on it.
Each shard returns from + size entries, containing only the document ids and sort values.
The coordinating node re-sorts everything, keeps the top from + size, then fetches the full documents from the owning shards with a multi-get.
Finally it responds to the client.
The downside: the coordinating node receives shard count * (from + size) entries, and scoring happens independently on each shard, so uneven data distribution skews the scores.
How to avoid it
With little data, use a single shard: no distributed search, so no coordinating-node merge.
Spread data evenly to keep scores accurate, or use dfs_query_then_fetch, which ships detailed term statistics back to the coordinating node for global scoring, at a noticeable performance cost.
Example: 20 shards storing 3 documents
"good"
"good morning"
"good morning everyone"
Under a plain query, each shard computes idf from its own documents only.
dfs_query_then_fetch aggregates the statistics of all three shards before scoring.
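The example above can be computed directly (simplified idf = log(N/df)):

```python
import math

def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)

# The three documents land on three different shards, one per shard.
docs = [["good"], ["good", "morning"], ["good", "morning", "everyone"]]

# Plain query-then-fetch: each shard only sees its own document, so for
# "morning" every shard computes idf with N=1, df=1 -> log(1) = 0.
local = idf(1, 1)

# dfs_query_then_fetch: term statistics are gathered globally first,
# so "morning" gets idf = log(3/2) from N=3, df=2.
global_idf = idf(3, sum("morning" in d for d in docs))

print(local, global_idf)
```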
Sorting and doc values & fielddata
Sorting on a field skips relevance scoring; _score is null.
#sorting on multiple fields
POST /kibana_sample_data_ecommerce/_search
{
"size": 5,
"query": {
"match_all": {
}
},
"sort": [
{"order_date": {"order": "desc"}},
{"_doc":{"order": "asc"}},
{"_score":{ "order": "desc"}}
]
}
By default you cannot sort on a text field; you must enable the fielddata setting first.
PUT kibana_sample_data_ecommerce/_mapping
{
"properties": {
"customer_full_name" : {
"type" : "text",
"fielddata": true,
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
fielddata is a forward index mapping document ids to field values, which makes sorting on full-text text fields possible.
The default, doc values, is columnar storage built alongside the index; it saves heap at the cost of extra index maintenance.
fielddata can be toggled at any time, but changing doc_values requires rebuilding the index.
Pagination and traversal
Because ES shards documents across nodes, a query with from=100 and size=10 must fetch 110 entries from every shard and merge 110 * shard count entries on the coordinating node for re-sorting: the deep pagination problem.
ES rejects queries reaching past 10,000 documents by default.
search after
Specify a sort field plus a unique tiebreaker field (usually _id); each request returns only the documents after the given position, so each shard ships just size * shard count documents in total.
The next page passes the last hit's sort values and document id.
You cannot jump to a page; you can only keep paging forward.
DELETE users
POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":11}
POST users/_doc
{"name":"user2","age":12}
POST users/_doc
{"name":"user2","age":13}
POST users/_search
{
"query": {
"match_all": {}
}
,"size": 2
,"sort": [
{
"age": "desc"
},
{
"_id": "asc"
}
]
}
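The cost difference can be sketched with a toy model of how many entries the shards ship to the coordinating node (assumes uniform data distribution):

```python
def from_size_cost(page, size, shards):
    # deep pagination: every shard returns (from + size) entries
    return (page * size + size) * shards

def search_after_cost(size, shards):
    # search_after: every shard returns only `size` entries per request
    return size * shards

print(from_size_cost(page=100, size=10, shards=5))  # 5050 entries shipped
print(search_after_cost(size=10, shards=5))         # 50 entries shipped
```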
scroll api
The first call with scroll creates a snapshot of the current results; later reads come from the snapshot.
Each subsequent call passes the scroll_id from the first call and pages forward.
Drawback: documents added after the snapshot are invisible to it.
The "scroll": "1m" in follow-up calls extends the snapshot's lifetime.
Calling the snapshot-creating API again produces a new scroll_id and a new snapshot.
#Scroll API
DELETE users
POST users/_doc
{"name":"user1","age":10}
POST users/_doc
{"name":"user2","age":20}
POST users/_doc
{"name":"user3","age":30}
POST users/_doc
{"name":"user4","age":40}
POST users/_search?scroll=3m
{
"size":2,
"query": {
"match_all": {}
}
}
POST users/_doc
{"name":"user5","age":50}
POST users/_doc
{"name":"user7","age":70}
POST _search/scroll
{
"scroll":"1m",
"scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmgzaWpSLWdfVFpLRktpRXRPdjdNRkEAAAAAAAAKhhZYcWtYWU92LVNTdXpmVjQtRjFnSDJn"
}
When to use what
For ordinary queries over fresh data, a plain search is fine.
For exporting or batch-processing all documents, use the scroll API: there is no realtime requirement, and it saves resources.
For pagination use from/size; for deep pagination use search_after, which saves resources while keeping results realtime.
bucket & metric aggregations and nested aggregations
ES aggregations resemble SQL's count and group by.
Prepare sample data
DELETE /employees
PUT /employees/
{
"mappings" : {
"properties" : {
"age" : {
"type" : "integer"
},
"gender" : {
"type" : "keyword"
},
"job" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 50
}
}
},
"name" : {
"type" : "keyword"
},
"salary" : {
"type" : "integer"
}
}
}
}
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
Metric aggregations return only the metric, not the hit list, saving resources.
Inside aggs you name each aggregation and give the function and target field.
POST employees/_search
{
"size": 0
,"aggs": {
"max_salary": {
"max": {"field": "salary"}
}
,"min_salary":{
"min": {
"field": "salary"
}
}
}
}
Returning several statistics at once
POST employees/_search
{
"size": 0
,"aggs": {
"stats_salary": {
"stats": {
"field": "salary"
}
}
}
}
Bucket aggregations group documents for further processing. The terms aggregation buckets by value, including numeric and date fields; size controls how many terms come back.
On a keyword field it deduplicates values and shows each bucket's document count.
A text field requires fielddata to be enabled, and the values get analyzed into tokens first, which is rarely useful.
To speed up terms aggregations, enable eager_global_ordinals in the mapping.
POST employees/_search
{
"size": 0
,"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword"
}
}
}
}
The cardinality aggregation resembles count(distinct)
POST employees/_search
{
"size": 0
,"aggs": {
"cardinate": {
"cardinality": {
"field": "job.keyword"
}
}
}
}
Nested aggregations re-aggregate inside each bucket.
Example: bucket by job first, then use top_hits to fetch the 3 oldest employees per bucket.
POST employees/_search
{
"size": 0
, "aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"age_agg": {
"top_hits": {
"sort": [{"age":{"order":"desc"}}],
"size": 3
}
}
}
}
}
}
eager_global_ordinals
eager_global_ordinals consumes heap and speeds up terms aggregations at the cost of slower document indexing; it can be toggled at any time.
PUT my-index-000001/_mapping
{
"properties": {
"tags": {
"type": "keyword",
"eager_global_ordinals": true
}
}
}
The range aggregation
POST employees/_search
{
"size": 0
,"aggs": {
"salary_agg": {
"range": {
"field": "salary",
"ranges": [
{
"to": 10000
}
,{
"from": 10000,
"to": 20000
}
,{
"key": "大于10000",
"from": 20000
}
]
}
}
}
}
The histogram aggregation: 5000-wide intervals counting employees per salary band
POST employees/_search
{
"size": 0
,"aggs": {
"salary_agg_histogram": {
"histogram": {
"field": "salary",
"interval": 5000,
"extended_bounds": {
"min": 0,
"max": 100000
}
}
}
}
}
Deeper nesting: salary statistics per gender per job
POST employees/_search
{
"size": 0
,"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword"
}
, "aggs": {
"gender_agg": {
"terms": {
"field": "gender"
}
,"aggs": {
"salary_agg": {
"stats": {
"field": "salary"
}
}
}
}
}
}
}
}
pipeline aggregations
ES pipeline
A pipeline aggregation consumes the output of another aggregation; there are sibling pipelines and parent pipelines.
Example: which job has the lowest average salary
A sibling pipeline: the min_salary_by_job result sits at the same level as job_agg.
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 20
},
"aggs": {
"salary_avg": {
"avg": {
"field": "salary"
}
}
}
}
,"min_salary_by_job":{
"min_bucket": {
"buckets_path": "job_agg>salary_avg"
}
}
}
}
Result
Average salary per age, plus the difference between adjacent ages
A parent pipeline: the result is embedded inside each avg bucket.
POST employees/_search
{
"size": 0
,"aggs": {
"age_agg": {
"histogram": {
"min_doc_count": 1,
"field": "age",
"interval": 1
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
},
"derivative_avg_salary":{
"derivative": {
"buckets_path": "avg_salary"
}
}
}
}
}
}
Result screenshot
Aggregation scope and ordering
query and filter restrict the aggregation scope
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
}
POST employees/_search
{
"size": 0,
"aggs": {
"age_agg": {
"filter": {
"range": {
"age": {
"gte": 35
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
},
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
}
}
post_filter filters the hits without affecting the aggregations
POST employees/_search
{
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
}
},
"post_filter": {
"term": {
"job.keyword": "Dev Manager"
}
}
}
global widens the aggregation scope back to all documents, here for the overall salary average
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"size": 10
}
},
"avg_salary":{
"global": {},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
}
}
}
Ordering
Specify the ordering of aggregation buckets
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": [
{"_count": "asc"},
{"_key": "desc"}
]
}
}
}
}
Order by a separately computed metric
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": [
{"job_avg_agg": "asc"}
]
},
"aggs": {
"job_avg_agg": {
"avg": {
"field": "salary"
}
}
}
}
}
}
Order by a metric inside a sub-aggregation
POST employees/_search
{
"size": 0,
"aggs": {
"job_agg": {
"terms": {
"field": "job.keyword",
"order": {
"stats_agg.min":"desc"
}
},
"aggs": {
"stats_agg": {
"stats": {
"field": "salary"
}
}
}
}
}
}
Aggregation internals and precision issues
With distributed storage, data skews across shards, so bucket counts in a terms aggregation can be inaccurate.
Here bucket C is unevenly spread across shards, which pushes bucket D out of the result.
One fix is to use a single shard;
another is to raise shard_size, the number of top terms each shard computes independently (default: size * 1.5 + 10).
Enable show_term_doc_count_error to learn whether your result is exact.
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"weather": {
"terms": {
"field":"OriginWeather",
"size":5,
"show_term_doc_count_error":true
}
}
}
}
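The default shard_size mentioned above, as a quick calculation:

```python
def default_shard_size(size):
    # ES default: shard_size = size * 1.5 + 10. Each shard returns more
    # top terms than requested, improving terms-aggregation accuracy.
    return int(size * 1.5) + 10

print(default_shard_size(5))   # 17
print(default_shard_size(10))  # 25
```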
Objects and nested objects
Unlike relational databases, ES flattens and denormalizes data into a single document: queries are fast with no table joins, but frequently updated documents suffer.
DELETE blog
# set the blog mapping
PUT /blog
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"time": {
"type": "date"
},
"user": {
"properties": {
"city": {
"type": "text"
},
"userid": {
"type": "long"
},
"username": {
"type": "keyword"
}
}
}
}
}
}
# insert one blog document
PUT blog/_doc/1
{
"content":"I like Elasticsearch",
"time":"2019-01-01T00:00:00",
"user":{
"userid":1,
"username":"Jack",
"city":"Shanghai"
}
}
#query a sub-object property
POST blog/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "Elasticsearch"
}
},
{
"match": {
"user.username": "Jack"
}
}
]
}
}
}
With actors mapped as type object, the properties are flattened into parallel arrays, so the query below finds a match even though no such person exists.
DELETE my_movies
# the movie mapping
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"properties" : {
"first_name" : {
"type" : "keyword"
},
"last_name" : {
"type" : "keyword"
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
# index one movie
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
POST my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {
"actors.first_name": "Keanu"
}},
{"match": {
"actors.last_name": "Hopper"
}}
]
}
}
}
Remapping with nested
Each actors object gets its own Lucene document, joined at query time, so the mixed-up name below no longer matches.
DELETE my_movies
# create the nested object mapping
PUT my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"type": "nested",
"properties" : {
"first_name" : {"type" : "keyword"},
"last_name" : {"type" : "keyword"}
}},
"title" : {
"type" : "text",
"fields" : {"keyword":{"type":"keyword","ignore_above":256}}
}
}
}
}
POST my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
POST my_movies/_search
{
"query": {
"nested": {
"path": "actors",
"query": {
"bool": {
"must": [
{
"match": {
"actors.first_name": "Keanu"
}
},
{
"match": {
"actors.last_name": "Hopper"
}
}
]
}
}
}
}
}
Aggregating on a nested type
POST my_movies/_search
{
"size":0,
"aggs": {
"actors": {
"nested": {
"path": "actors"
},
"aggs": {
"name_agg": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}
}
}
Parent-child document relations
nested treats parent and children as one unit: changing a child re-indexes the whole document, which suits children that are read often and written rarely.
ES parent-child documents let parents and children be maintained independently, with no full re-index, which suits children that are written often.
Join-based querying
The mapping declares the parent-child relation. When indexing, declare each document's type; a child declares its parent id (the id must be unique) and passes a routing parameter so parent and child land on the same shard.
DELETE my_blogs
# set the parent/child mapping
PUT my_blogs
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"blog_comments_relation": {
"type": "join",
"relations": {
"blog": "comment"
}
},
"content": {
"type": "text"
},
"title": {
"type": "keyword"
}
}
}
}
#index a parent document
PUT my_blogs/_doc/blog1
{
"title":"Learning Elasticsearch",
"content":"learning ELK @ geektime",
"blog_comments_relation":{
"name":"blog"
}
}
#index a parent document
PUT my_blogs/_doc/blog2
{
"title":"Learning Hadoop",
"content":"learning Hadoop",
"blog_comments_relation":{
"name":"blog"
}
}
#index a child document
PUT my_blogs/_doc/comment1?routing=blog1
{
"comment":"I am learning ELK",
"username":"Jack",
"blog_comments_relation":{
"name":"comment",
"parent":"blog1"
}
}
#index a child document
PUT my_blogs/_doc/comment2?routing=blog2
{
"comment":"I like Hadoop!!!!!",
"username":"Jack",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
#index a child document
PUT my_blogs/_doc/comment3?routing=blog2
{
"comment":"Hello Hadoop",
"username":"Bob",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
Fetching the parent does not include the child documents; use a parent_id query to get them.
#fetch by parent document id
GET my_blogs/_doc/blog2
# parent_id query
POST my_blogs/_search
{
"query": {
"parent_id": {
"type": "comment",
"id": "blog2"
}
}
}
#fetch the child document (requires routing)
GET my_blogs/_doc/comment3?routing=blog2
Update the child while specifying its parent id; querying afterwards shows the parent's version unchanged while the child's has changed.
PUT my_blogs/_doc/comment3?routing=blog2
{
"comment":"Hello Hadoop??",
"username":"Bob",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
update_by_query and reindex APIs
Changing a mapping does not affect existing documents, only newly indexed ones. To fix existing documents, use update_by_query or the reindex API.
update_by_query: limited to newly added fields; re-indexes the index's existing documents in place.
reindex: migrates an index to a new one; supports changed field types, more shards, and cross-cluster copies.
It requires _source on the source index and can migrate a subset of documents via a query.
op_type can be set so only non-conflicting ids are copied to the target index.
Cross-cluster reindex requires a whitelist in the ES config file.
It can run asynchronously; check progress with GET _tasks?detailed=true&actions=*reindex.
updateByQuery
Example: after indexing a document, add a sub-field that uses the english analyzer.
DELETE blogs/
# index a document
PUT blogs/_doc/1
{
"content":"Hadoop is cool",
"keyword":"hadoop"
}
# check the mapping
GET blogs/_mapping
# change the mapping: add a sub-field with the english analyzer
PUT blogs/_mapping
{
"properties": {
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
# index another document
PUT blogs/_doc/2
{
"content":"Elasticsearch rocks",
"keyword":"elasticsearch"
}
Querying via the english sub-field finds doc 2. Doc 1 was indexed with the default analyzer, whose stored token is Hadoop, so the english-analyzed query misses it.
explain shows the query uses the english analyzer's lowercased token hadoop.
After re-indexing with update_by_query, the document matches.
# query the newly written document
POST blogs/_search
{
"query": {
"match": {
"content.english": "Elasticsearch"
}
},"explain": true
}
# query the document written before the mapping change
POST blogs/_search
{
"query": {
"match": {
"content.english": "Hadoop"
}
},"explain": true
}
POST blogs/_update_by_query
reindex
Create a new index whose keyword field is mapped as keyword instead of text, reindex the data into it, and the new index can aggregate on that field.
DELETE blogs_fix
# create the new index with the new mapping
PUT blogs_fix/
{
"mappings": {
"properties" : {
"content" : {
"type" : "text",
"fields" : {
"english" : {
"type" : "text",
"analyzer" : "english"
}
}
},
"keyword" : {
"type" : "keyword"
}
}
}
}
# Reindex API
POST _reindex
{
"source": {
"index": "blogs"
},
"dest": {
"index": "blogs_fix"
}
}
POST blogs_fix/_search
{
"size": 0,
"aggs":{
"keyword_agg":{
"terms": {
"field": "keyword",
"size": 10
}
}
}
}
ingest pipeline and Painless script
An ingest pipeline processes documents before they are indexed, similar to Logstash: it can split values, add fields, format dates, change case, and so on, removing the need for a separate Logstash tier.
ingest pipeline
DELETE tech_blogs
#blog data: 3 fields, tags separated by commas
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
# test splitting tags
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"content": "You konw, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
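The split and set processors used above can be simulated in a few lines (a toy re-implementation for illustration, not the actual ingest node code):

```python
def run_pipeline(doc, processors):
    # apply each processor to the document in order
    for proc in processors:
        if "split" in proc:
            p = proc["split"]
            doc[p["field"]] = doc[p["field"]].split(p["separator"])
        elif "set" in proc:
            p = proc["set"]
            doc[p["field"]] = p["value"]
    return doc

pipeline = [
    {"split": {"field": "tags", "separator": ","}},
    {"set": {"field": "views", "value": 0}},
]
doc = {"title": "Introducing big data......",
       "tags": "hadoop,elasticsearch,spark"}
print(run_pipeline(doc, pipeline))
# tags becomes ['hadoop', 'elasticsearch', 'spark'], views becomes 0
```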
# test adding a field too
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split and set",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
},
"set": {
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"content": "You konw, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
#register a pipeline
PUT _ingest/pipeline/blog_pipeline
{
"description": "a blog pipeline",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set":{
"field": "views",
"value": 0
}
}
]
}
#view the pipeline
GET _ingest/pipeline/blog_pipeline
#test it
POST _ingest/pipeline/blog_pipeline/_simulate
{
"docs": [
{
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
#index one document without the pipeline
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
#index one document with the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
#search
POST tech_blogs/_search
{}
Because indexing doc 2 through the pipeline turned tags into an array while doc 1 still holds a plain string, the update_by_query must skip the already-processed documents (those that have a views field).
#add a condition to update_by_query
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "views"
}
}
}
}
}
painless script
使用脚本来处理文档字段的值, 语法类似 Java, 支持大部分 Java API, 例如 string.contains(); ES 6.0 之后 painless 是唯一支持的脚本语言;
脚本编译开销大, 编译结果会被缓存 (默认缓存 100 个脚本), 重复执行时性能较高;
DELETE tech_blogs
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data",
"views":0
}
#更新时使用脚本
POST tech_blogs/_update/1
{
"script": {
"source": "ctx._source.views += params.new_views",
"params": {
"new_views":100
}
}
}
# 查看views计数
POST tech_blogs/_search
{
}
#保存脚本在 Cluster State
POST _scripts/update_views
{
"script":{
"lang": "painless",
"source": "ctx._source.views += params.new_views"
}
}
#使用保存的脚本
POST tech_blogs/_update/1
{
"script": {
"id": "update_views",
"params": {
"new_views":1000
}
}
}
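更新脚本 ctx._source.views += params.new_views 做的事情, 等价于下面这段 Python 伪实现 (仅为示意, 字段名与参数沿用上文例子):

```python
def update_views(source, params):
    """模拟 painless 脚本: ctx._source.views += params.new_views"""
    source["views"] += params["new_views"]
    return source

doc = {"title": "Introducing big data......", "views": 0}
update_views(doc, {"new_views": 100})   # 对应第一次 _update
update_views(doc, {"new_views": 1000})  # 对应使用保存脚本的第二次 _update
print(doc["views"])  # 1100
```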
获取字段值并加上随机数
GET tech_blogs/_search
{
"script_fields": {
"rnd_views": {
"script": {
"lang": "painless",
"source": """
java.util.Random rnd = new Random();
doc['views'].value+rnd.nextInt(1000);
"""
}
}
},
"query": {
"match_all": {}
}
}
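script_fields 中的随机数逻辑, 用 Python 示意如下 (randrange 对应 painless 里的 rnd.nextInt, 仅为演示, 不是 ES 内部实现):

```python
import random

def rnd_views(views, upper=1000):
    """模拟 doc['views'].value + rnd.nextInt(1000): 在字段值上加一个随机数"""
    return views + random.randrange(upper)

v = rnd_views(1100)
# 结果落在 [1100, 2100) 区间内, 每次查询返回的 rnd_views 都不同
```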
数据建模
对一个字段来说, 设置 mapping 时需要考虑以下几点
es mapping参数
字段类型
需要全文搜索的使用 text;
需要聚合、排序、filter 精确查询的使用 keyword;
需要使用不同分词器检索的, 增加子字段;
结构化数据
数值类型尽量贴合实际取值范围, 能用 byte 就不要用 long;
枚举类型即使是数字, 也建议用 keyword, term 查询性能更好;
检索
不需要检索、排序、聚合的字段, enabled 设置为 false;
不需要检索的字段, index 设置为 false;
需要检索但不需要算分的字段, norms 设置为 false, 减少磁盘使用 (norms 用于提高算分精确性, 会占用额外空间);
聚合排序
不需要排序和聚合的字段, 即便是 keyword 也把 doc_values / fielddata 设置为 false;
更新频繁且经常用于聚合的 keyword 字段, eager_global_ordinals 设置为 true (利用缓存, 提高 terms 聚合性能);
额外存储
关闭 _source 时, 可对需要的字段开启 store 单独保存, 节省空间、降低 io; 但关闭 _source 后无法 reindex 和 update, 因此一般优先考虑提高 _source 的压缩比;
案例
封面url不需要被搜索, 设置为index:false, 仍然可以聚合
# Index 一本书的信息
PUT books/_doc/1
{
"title":"Mastering ElasticSearch 5.0",
"description":"Master the searching, indexing, and aggregation features in ElasticSearch Improve users’ search experience with Elasticsearch’s functionalities and develop your own Elasticsearch plugins",
"author":"Bharvi Dixit",
"public_date":"2017",
"cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}
#优化字段类型
"cover_url": {
"type": "keyword",
"index": false
}
在存储大文本字段时, 关闭source , 打开其余字段的store
PUT books
{
"mappings" : {
"_source": {"enabled": false},
"properties" : {
"author" : {"type" : "keyword","store": true},
...
}
}
}
}
因为关闭了 _source, 直接搜索时不会返回文档字段内容, 需要用 stored_fields 指明要返回的 store 字段;
#搜索,通过store 字段显示数据,同时高亮显示 conent的内容
POST books/_search
{
"stored_fields": ["title","author","public_date"],
"query": {
"match": {
"content": "searching"
}
},
"highlight": {
"fields": {
"content":{}
}
}
}
相关api
index template & dynamic template 帮助快速创建索引
index alias 将索引名指向另一个索引, 做到写时替换
update by query / reindex
数据建模最佳实践
object: 优先考虑反范式化 (denormalization);
nested: 字段值存在一对多关系, 且查询频繁;
parent child: 字段值存在一对多关系, 且更新多于查询的场景, 例如文章和评论;
7.0.1 kibana对nested 和child 可视化支持不好
避免大量字段
字段mapping维护在集群cluster state 中 , 对性能有影响, 需要所有节点同步这个信息 ;
默认最大字段数是1000;
大量字段原因可能是开启自动映射dynamic mapping
nested & key value
使用 nested + key/value 的方式, 解决以下场景中不断动态新增字段的问题
解决了字段过多的问题, 但 kibana 对 nested 可视化展示不好, 也增加了查询复杂度
"person":{
"name":"张三",
"age":15,
"id":123,...
}
#改变成
"person":[
{"keyName":"name","value":"张三"},
{"keyName":"age","value":15},
...
]
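从动态字段的 object 转成固定 schema 的 key/value 数组, 转换逻辑可以用 Python 示意 (函数名为假设, 仅演示思路):

```python
def to_key_value(obj):
    """把字段名不固定的 object, 转成固定 schema 的 key/value 数组,
    避免每个新字段都进入索引 mapping"""
    return [{"keyName": k, "value": v} for k, v in obj.items()]

person = {"name": "张三", "age": 15, "id": 123}
print(to_key_value(person))
# [{'keyName': 'name', 'value': '张三'}, {'keyName': 'age', 'value': 15}, ...]
```

实际写入 ES 时, 还需要像下面的 cookie_service 例子那样, 按 value 的类型选择 keywordValue / intValue / dateValue 等字段。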
DELETE cookie_service
#使用 Nested 对象,增加key/value
PUT cookie_service
{
"mappings": {
"properties": {
"cookies": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"dateValue": {
"type": "date"
},
"keywordValue": {
"type": "keyword"
},
"intValue": {
"type": "integer"
}
}
},
"url": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
##写入数据,使用key和合适类型的value字段
PUT cookie_service/_doc/1
{
"url": "www.google.com",
"cookies": [
{
"name": "username",
"keywordValue": "tom"
},
{
"name": "age",
"intValue": 32
}
]
}
PUT cookie_service/_doc/2
{
"url": "www.amazon.com",
"cookies": [
{
"name": "login",
"dateValue": "2019-01-01"
},
{
"name": "email",
"intValue": 32
}
]
}
POST cookie_service/_search
{
"query": {
"nested": {
"path": "cookies",
"query": {
"bool": {
"filter": [
{
"term": {
"cookies.name": "age"
}
},
{
"range": {
"cookies.intValue": {
"gte": 30
}
}
}
]
}
}
}
}
}
避免通配符查询
将针对一个字段的 wildcard 模糊查询, 改为对多个字段的精确查询
案例: 针对版本号的查询
原本对 software_version 的查询需要用 7.1.* 这样的通配符, 改为使用 inner object, 把版本号的每一位存成单独的数值字段
PUT softwares/_doc/1
{
"software_version":"7.1.0"
}
DELETE softwares
# 优化,使用inner object
PUT softwares/
{
"mappings": {
"_meta": {
"software_version_mapping": "1.1"
},
"properties": {
"version": {
"properties": {
"display_name": {
"type": "keyword"
},
"hot_fix": {
"type": "byte"
},
"major": {
"type": "byte"
},
"minor": {
"type": "byte"
}
}
}
}
}
}
PUT softwares/_doc/1
{
"version":{
"display_name":"7.1.0",
"major":7,
"minor":1,
"hot_fix":0
}
}
PUT softwares/_doc/2
{
"version":{
"display_name":"7.2.0",
"major":7,
"minor":2,
"hot_fix":0
}
}
PUT softwares/_doc/3
{
"version":{
"display_name":"7.2.1",
"major":7,
"minor":2,
"hot_fix":1
}
}
POST softwares/_search
{
"query": {
"bool": {
"filter": [
{
"match": {
"version.major": 7
}
},
{
"match": {
"version.minor": 2
}
}
]
}
}
}
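把版本号字符串拆成数值字段再做精确过滤的思路, 用 Python 示意如下 (函数名与字段名仅为示意):

```python
def parse_version(display_name):
    """把 '7.2.1' 这样的版本号拆成可精确过滤的数值字段"""
    major, minor, hot_fix = (int(x) for x in display_name.split("."))
    return {"display_name": display_name,
            "major": major, "minor": minor, "hot_fix": hot_fix}

docs = [parse_version(v) for v in ("7.1.0", "7.2.0", "7.2.1")]
# 等价于上文的 bool filter: major == 7 且 minor == 2, 替代 7.2.* 通配符
hits = [d["display_name"] for d in docs if d["major"] == 7 and d["minor"] == 2]
print(hits)  # ['7.2.0', '7.2.1']
```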
避免空值引发的聚合不准
给空值赋予默认值
下面案例中 avg 聚合的结果是 5.0 (null 被跳过, 只有 5/1 参与计算), 而期望的是把两条评分都算进去, 即 5/2 = 2.5, 空值影响了结果
PUT ratings/_doc/1
{
"rating":5
}
PUT ratings/_doc/2
{
"rating":null
}
POST ratings/_search
{
"size": 0,
"aggs": {
"avg": {
"avg": {
"field": "rating"
}
}
}
}
mapping 中为 rating 设置 null_value 后, null 在索引时被当作 1.0 参与聚合, avg 结果为 (5+1)/2 = 3.0; 文档 2 的 _source 中记录的仍然是 null
# Not Null 解决聚合的问题
DELETE ratings
PUT ratings
{
"mappings": {
"properties": {
"rating": {
"type": "float",
"null_value": 1.0
}
}
}
}
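两种聚合结果的差异可以用 Python 直接验证 (模拟 ES 聚合跳过缺失值, 以及 null_value 在索引时的替换效果):

```python
ratings = [5, None]

# 不设置 null_value 时: ES 聚合直接跳过缺失值, avg = 5/1 = 5.0
valid = [r for r in ratings if r is not None]
avg_skip_null = sum(valid) / len(valid)

# 设置 "null_value": 1.0 后: null 在索引时被替换成 1.0, avg = (5+1)/2 = 3.0
indexed = [r if r is not None else 1.0 for r in ratings]
avg_with_default = sum(indexed) / len(indexed)

print(avg_skip_null, avg_with_default)  # 5.0 3.0
```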
将mapping信息额外管理
mapping信息是不断迭代的 , 记录映射文件版本号
# 在Mapping中加入元信息,便于管理
PUT softwares/
{
"mappings": {
"_meta": {
"software_version_mapping": "1.0"
}
}
}
总结
结构化搜索
term查询不做分词, 对keyword字段做精确匹配
match为全文搜索,进行分词
query context vs filter context
filter不进行算分 , 利用缓存
bool查询filter 和must not 都是filter
搜索算分
tf / idf ; 通过字段 boosting 控制算分结果, 例如 boosting query 的 negative 部分对包含 like 的文档降权
单字符串多字段查询
best_fields 取单个字段中的最高分
most_fields 合并多个字段的分值
cross_fields 将多个字段当作一个字段来匹配词项
搜索相关性
多语言 : 设置多个子字段使用不同的分词器
search template 分离代码逻辑和搜索dsl, 不需要改动客户端查询代码, 完成查询逻辑的替换
聚合
bucket / metric / pipeline
分页
from size
导出使用 scroll api ,避免深分页
分布式存储
文档id hash路由 , 主分片不能修改;
分片内部原理
segment / transaction log / refresh / merge
分布式查询和聚合分析的内部机制
query then fetch : 算分用的 idf 基于单个分片而非全局, 数据量少时打分会有偏差, 可以通过指定 shard_size 让更多的分片数据参与计算
数据建模
es处理管理关系; 数据建模常见步骤 ; 建模实践;
建模相关的工具
index template / dynamic template / ingest node / update by query / reindex / index alias
最佳实践
避免过多的字段 , 避免 wildcard 模糊查询
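上面提到的文档 id hash 路由规则, 可以用一小段 Python 示意 (ES 实际对 _routing 默认取 _id 并使用 murmur3 哈希, 这里用 crc32 代替, 仅作演示):

```python
import zlib

def route(routing, number_of_primary_shards):
    """示意文档路由: shard = hash(_routing) % 主分片数.
    ES 实际使用 murmur3, 此处用 crc32 代替仅为演示."""
    return zlib.crc32(routing.encode()) % number_of_primary_shards

shard = route("doc-1", 3)
assert 0 <= shard < 3
# 正因为路由公式依赖主分片数, 主分片数一旦确定就不能修改,
# 否则同一个 id 会被路由到不同的分片, 旧文档将无法被找到
```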
term 和 match的理解
以下搜索的情况及原因
DELETE test
#默认分词器, 存储成 hello 和 world
PUT test/_doc/1
{
"content":"Hello World"
}
#standard
GET _analyze
{
"analyzer": "standard",
"text": "Hello World"
}
#1使用分词器 有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content": "Hello World"
}
}
}
#2使用分词器 有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content": "hello world"
}
}
}
#3不用分词器 ,搜索keyword子字段,有, 因为keyword存储的是原文本
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content.keyword": "Hello World"
}
}
}
#4不用分词器 , 精确匹配 没有
POST test/_search
{
"profile": "true",
"query": {
"match": {
"content.keyword": "hello world"
}
}
}
#5不用分词器 ,没有, 存储的文本为小写的, 用大写的搜索无法匹配
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content": "Hello World"
}
}
}
#6 不用分词器 , 没有; 倒排索引中只有 hello 和 world 两个独立的term, 不存在 hello world 这个完整的term
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content": "hello world"
}
}
}
#7 term查询keyword子字段, 有, keyword保存的是原始文本
POST test/_search
{
"profile": "true",
"query": {
"term": {
"content.keyword": "Hello World"
}
}
}
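上面几组查询的差异, 可以用一个极简的分词/倒排示意来理解 (纯演示, 不代表 ES 内部实现):

```python
text = "Hello World"

# content 字段: standard 分词器 -> 小写的 term 列表
analyzed_terms = text.lower().split()  # ['hello', 'world']
# content.keyword 子字段: 保存原始文本, 不分词
keyword_value = text

def match(terms, query):
    """match: 查询串也会被分词, 任一 term 命中即可"""
    return any(t in terms for t in query.lower().split())

def term(terms, query):
    """term: 查询串不分词, 作为一个完整 term 精确比较"""
    return query in terms

print(match(analyzed_terms, "Hello World"))  # True  (对应 #1)
print(term(analyzed_terms, "Hello World"))   # False (对应 #5, 大写查不到)
print(term(analyzed_terms, "hello world"))   # False (对应 #6, 没有完整 term)
print(term([keyword_value], "Hello World"))  # True  (对应 #7)
```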
#standard
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
测试
生产环境中使用 alias, 便于无缝切换索引;
分片数大于 1 时, 指定 shard_size 提升 terms 聚合的精准度;
cardinality 聚合用于求去重后的数量, 例如求出有多少个分类