[ES] Exploring Your Data (Part 4)

Latest recommended article published 2023-11-24 11:05:53

lihuapiao

Based on the Getting Started chapter of the Elasticsearch Reference.

1. Sample data

Each document is a customer bank account record of the following form:

{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO"
}

Sample data download: GitHub

2. Loading the sample data

Bulk-load the data with the following command (run it from the directory that contains accounts.json):

E:\study\ES\accounts {git}
 $  curl -XPOST "http://localhost:9200/bank/account/_bulk?pretty" --data-binary "@accounts.json"
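
The _bulk endpoint reads a newline-delimited body in which each document is preceded by an action line. Below is a minimal sketch of how such a body could be assembled, assuming accounts.json follows the usual action-line-plus-source-line layout. The helper name build_bulk_body and the choice of account_number as the _id are illustrative assumptions, not something the sample file requires.

```python
import json


def build_bulk_body(docs, id_field="account_number"):
    """Serialize documents into a newline-delimited _bulk body (hypothetical helper)."""
    lines = []
    for doc in docs:
        # Action line: index the next document under this _id.
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # A bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# Two trimmed-down records taken from the examples in this article.
docs = [
    {"account_number": 0, "balance": 16623, "firstname": "Bradshaw"},
    {"account_number": 25, "balance": 40540, "firstname": "Virginia"},
]
body = build_bulk_body(docs)
print(body)
```

Each document contributes exactly two lines, which is why the file has to be sent with --data-binary so that curl keeps the newlines.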

List all indices in the current cluster:

 $  curl -XGET "localhost:9200/_cat/indices?v"
health status index    pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer   5   1          2            0      6.5kb          6.5kb
yellow open   bank       5   1       1000            0    447.9kb        447.9kb

3. The Search API

There are two basic ways to run a search:

★ Send the search parameters through the REST request URI

Retrieve all documents:

curl -XGET "http://localhost:9200/bank/_search?pretty&q=*"

Response (partial):

{                                                
  "took" : 184,                                  
  "timed_out" : false,                           
  "_shards" : {                                  
    "total" : 5,                                 
    "successful" : 5,                            
    "failed" : 0                                 
  },                                             
  "hits" : {                                     
    "total" : 1000,                              
    "max_score" : 1.0,                           
    "hits" : [ {                                 
      "_index" : "bank",                         
      "_type" : "account",                       
      "_id" : "25",                              
      "_score" : 1.0,                            
      "_source" : {                              
        "account_number" : 25,                   
        "balance" : 40540,                       
        "firstname" : "Virginia",                
        "lastname" : "Ayala",                    
        "age" : 39,                              
        "gender" : "F",                          
        "address" : "171 Putnam Avenue",         
        "employer" : "Filodyne",                 
        "email" : "virginiaayala@filodyne.com",  
        "city" : "Nicholson",                    
        "state" : "PA"                           
      }                                          
    }
.....

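Where:

took – time in milliseconds that ES took to execute the search
timed_out – whether the search timed out
_shards – how many shards were searched, along with the number that succeeded and failed
hits – the search results
hits.total – total number of documents matching the search criteria
hits.hits – array of the actual results (the first 10 documents by default)
_score and max_score – relevance scores of the matches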
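
The fields above can be pulled out of a response with a few dictionary lookups. This is only a sketch against the response shape shown in this article, where hits.total is a plain number (newer ES versions wrap it in an object). The function name summarize_response is made up for illustration.

```python
def summarize_response(response):
    """Pick the commonly inspected fields out of a search response (hypothetical helper)."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        # Assumes hits.total is a plain integer, as in the response shown above.
        "total": hits["total"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


# A shortened copy of the response from this section.
response = {
    "took": 184,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1000,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "bank",
                "_type": "account",
                "_id": "25",
                "_score": 1.0,
                "_source": {"account_number": 25, "balance": 40540},
            }
        ],
    },
}
summary = summarize_response(response)
print(summary)
```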

★ Send the search parameters in the REST request body, which lets you write the query as more readable JSON

Retrieve all documents (same result as above):

GET /bank/_search?pretty
{
    "query": {
        "match_all":{}
    }
}
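
For readers who script their requests, here is a rough sketch of the same request-body search built with Python's standard library. The host http://localhost:9200 is the one used in the curl examples. The helper name build_search_request is an assumption, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(index, body, host="http://localhost:9200"):
    """Build (but do not send) a request-body search; hypothetical helper."""
    return urllib.request.Request(
        url=f"{host}/{index}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("bank", {"query": {"match_all": {}}})
print(request.get_method(), request.full_url)

# Sending it requires a running node:
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```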

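4. Introducing the query language

ES provides a JSON-style domain-specific language for executing queries, known as the Query DSL.

Besides the query parameter mentioned earlier, other parameters can be passed to shape the results:

★ query : the query definition

★ size : number of hits to return, default 10

★ from : zero-based offset of the first hit to return, i.e. the number of hits to skip; combine it with size for pagination. Default 0.

★ sort : sort order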

e.g. Sort all documents by the balance field in descending order and return the fifth and sixth (from: 4 skips the first four hits):

GET /bank/_search?pretty
{
    "query": {
        "match_all":{}
    },
    "sort":{
        "balance":{
            "order":"desc"
        }
    },
    "size": 2,
    "from": 4
}
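
The interplay of sort, from and size is easy to get wrong, so here is a small local simulation in plain Python. It does not talk to ES; the function sort_and_page and the balances are invented for illustration. The point is that from counts the hits to skip, so from: 4 with size: 2 yields the fifth and sixth hits.

```python
def sort_and_page(docs, sort_field, order="desc", from_=0, size=10):
    """Mimic sort + from + size on an in-memory list (illustrative only)."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=(order == "desc"))
    # from_ is the number of hits skipped; size is how many are returned.
    return ordered[from_:from_ + size]


# Made-up balances, one per account.
balances = [100, 600, 300, 500, 200, 400, 700]
docs = [{"account_number": n, "balance": b} for n, b in enumerate(balances)]

page = sort_and_page(docs, "balance", order="desc", from_=4, size=2)
print([d["balance"] for d in page])
```

Sorted in descending order the balances are 700, 600, 500, 400, 300, 200, 100, so skipping four leaves 300 and 200 as the returned page.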

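5. Executing searches

★ Returning only some fields : _source

By default, the full JSON document is returned with every hit. This is called the source and appears in the _source field of each hit. If you do not want the whole source document back, you can request only a few of its fields:

GET /bank/_search?pretty
{
    "query": {
        "match_all":{}
    },
    "_source":["city","balance"],    // given as an array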
    "size": 1
}
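
The effect of the _source parameter can be pictured as a simple projection over the stored document. The sketch below only models top-level field names (the real parameter also accepts patterns), and filter_source is a hypothetical helper.

```python
def filter_source(source, fields):
    """Keep only the requested top-level fields of a source document (illustrative)."""
    wanted = set(fields)
    return {name: value for name, value in source.items() if name in wanted}


# The sample document from section 1.
source = {
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO",
}
trimmed = filter_source(source, ["city", "balance"])
print(trimmed)
```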

★ Match query : match

Return all documents whose address field contains mill or lane (matching is case-insensitive because ES lowercases the terms during analysis):

POST /bank/_search?pretty
{
  "query": { "match": { "address": "mill lane" } }
}
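
To make the "mill or lane" behaviour concrete, here is a toy model in Python. The analyze function is a crude stand-in for the default analyzer (lowercase, split on non-alphanumerics), and the addresses other than "171 Putnam Avenue" and "244 Columbus Place" are invented for illustration.

```python
import re


def analyze(text):
    """Crude stand-in for the default analyzer: lowercase, split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(field_value, query_text):
    """True if the field shares at least one term with the query (OR semantics)."""
    return bool(set(analyze(field_value)) & set(analyze(query_text)))


addresses = [
    "990 Mill Road",       # invented
    "198 Mill Lane",       # invented
    "171 Putnam Avenue",
    "244 Columbus Place",
]
hits = [a for a in addresses if match(a, "mill lane")]
print(hits)
```

Either term is enough for a hit, which is why "990 Mill Road" matches even though it contains no "lane".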

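★ Match phrase query : match_phrase

Return all documents whose address field contains the phrase "mill lane" (the mixed case in the query below is lowercased before matching):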

POST /bank/_search?pretty
{
  "query": { "match_phrase": { "address": "mILL lane" } }
}
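
A phrase match additionally requires the terms to appear next to each other and in the same order. The following sketch models that with a sliding window over the analyzed terms. It is a simplification (no slop, crude analyzer), and the addresses are invented for illustration.

```python
import re


def analyze(text):
    """Crude stand-in for the default analyzer: lowercase, split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_phrase(field_value, phrase):
    """True if the phrase terms occur consecutively and in order in the field."""
    terms, wanted = analyze(field_value), analyze(phrase)
    n = len(wanted)
    return n > 0 and any(
        terms[i:i + n] == wanted for i in range(len(terms) - n + 1)
    )


print(match_phrase("198 Mill Lane", "mILL lane"))   # terms adjacent, in order
print(match_phrase("990 Mill Road", "mILL lane"))   # only one of the terms
print(match_phrase("5 Lane Mill Way", "mILL lane")) # both terms, wrong order
```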

★ Match phrase prefix query : match_phrase_prefix

Return all documents whose address field contains a term beginning with "19":

POST /bank/_search?pretty
{
    "query": {
        "match_phrase_prefix": {
           "address": "19"
        }
    }
}
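
In a phrase prefix match, the last term of the query is treated as a prefix while any earlier terms must match exactly and in order. The sketch below models just that; it ignores details such as the limit on prefix expansions, uses a crude analyzer, and the address "Unit 1950 Main" is invented to show that the prefix applies to a term, not to the whole field.

```python
import re


def analyze(text):
    """Crude stand-in for the default analyzer: lowercase, split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_phrase_prefix(field_value, query_text):
    """Phrase match where the final query term only needs to be a prefix."""
    terms, wanted = analyze(field_value), analyze(query_text)
    if not wanted:
        return False
    *head, last = wanted
    n = len(wanted)
    for i in range(len(terms) - n + 1):
        window = terms[i:i + n]
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False


print(match_phrase_prefix("198 Mill Lane", "19"))
print(match_phrase_prefix("Unit 1950 Main", "19"))
print(match_phrase_prefix("171 Putnam Avenue", "19"))
```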

★ Boolean queries : bool with must (and) | must_not (not) | should (or)

must, must_not and should can be combined freely.

Return all documents whose address field contains both "mill" and "lane":

POST /bank/_search?pretty
{
    "query": {
        "bool": {
            "must": [
               {"match": {
                  "address": "mill"
               }},
               {"match": {
                  "address": "lane"
               }}
            ]
        }
    }
}
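
When queries are assembled in code, the bool structure is just nested dictionaries. Here is a small sketch of builders for it; bool_query and match_clause are hypothetical helper names, not part of any client library.

```python
import json


def match_clause(field, text):
    """Build a single match clause."""
    return {"match": {field: text}}


def bool_query(must=(), must_not=(), should=()):
    """Combine clauses into a bool query, omitting empty sections."""
    sections = {
        "must": list(must),
        "must_not": list(must_not),
        "should": list(should),
    }
    return {"bool": {name: clauses for name, clauses in sections.items() if clauses}}


# Same query as above: address must contain both "mill" and "lane".
body = {
    "query": bool_query(
        must=[match_clause("address", "mill"), match_clause("address", "lane")]
    )
}
print(json.dumps(body, indent=2))
```

Swapping must for should in the call would turn the condition into "mill or lane", and must_not would exclude documents containing either term.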

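6. Executing filters

How a filter differs from a query:

A query computes a score for every document; the higher the score, the better the match.

A filter computes no score; documents that fail the condition are dropped and those that satisfy it are kept.

For example, return the accounts whose balance is between 20000 and 30000 (inclusive):

POST /bank/_search?pretty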
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
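
A range filter is a plain yes/no test per document, with gte and lte both inclusive. The following local sketch shows the idea; in_range is an illustrative helper, and the balances other than 16623 and 40540 are invented.

```python
def in_range(value, gte=None, lte=None):
    """Inclusive range test mirroring the gte / lte bounds of a range filter."""
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True


docs = [
    {"account_number": 0, "balance": 16623},
    {"account_number": 1, "balance": 20000},   # invented, sits on the lower bound
    {"account_number": 2, "balance": 25000},   # invented
    {"account_number": 3, "balance": 30000},   # invented, sits on the upper bound
    {"account_number": 25, "balance": 40540},
]
kept = [d["account_number"] for d in docs if in_range(d["balance"], gte=20000, lte=30000)]
print(kept)
```

No score is involved: each document is either kept or dropped.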

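7. Executing aggregations

Aggregations group data and compute statistics over it, comparable to GROUP BY in SQL.

ES can run a query and return both the hits and the aggregation results in one response; to skip the hits and get only the aggregation results, set size to 0.

Aggregations can be nested inside other aggregations (sub-aggregations).

Group by the state field:

POST /bank/_search?pretty
{
    "size": 0,                     // do not return the matching documents
    "aggregations":{                // "aggs" is an accepted abbreviation
        "stat_by_state":{          // a name of your choosing for this aggregation
            "terms":{
                "field":"state"       // field to group by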
            }
        }
    }
}

Partial response:

{
   "took": 898,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1000,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "stat_by_state": {
         "doc_count_error_upper_bound": 4,
         "sum_other_doc_count": 743,
         "buckets": [
            {
               "key": "tx",
               "doc_count": 30
            },
            {
               "key": "md",
               "doc_count": 28
            }
...
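
The buckets and sum_other_doc_count in this response can be reproduced on a small scale with collections.Counter. This is a local simulation rather than how ES computes it across shards, and the state values below are invented. By default a terms aggregation returns the top 10 buckets; the sketch uses size=2 to keep the output short.

```python
from collections import Counter


def terms_aggregation(docs, field, size=10):
    """Group documents by a field and keep the `size` most frequent values."""
    counts = Counter(doc[field] for doc in docs)
    top = counts.most_common(size)
    return {
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
        # Documents that fell outside the returned buckets.
        "sum_other_doc_count": sum(counts.values()) - sum(count for _, count in top),
    }


# Invented data: three accounts in tx, two in md, one in co.
docs = [{"state": s} for s in ["tx", "md", "tx", "co", "tx", "md"]]
result = terms_aggregation(docs, "state", size=2)
print(result)
```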

Group by the state field and sort the buckets by average balance in descending order:

POST /bank/_search?pretty
{
    "size": 0,
    "aggs":{
        "group_by_state":{
            "terms":{
                "field":"state",
                "order":{
                    "avg_balance":"desc"
                }
            },
            "aggs":{
                "avg_balance":{
                    "avg":{
                        "field":"balance"
                    }
                }
            }
        }
    }
}

Partial response:

{
   "took": 112,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1000,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "group_by_state": {
         "doc_count_error_upper_bound": -1,
         "sum_other_doc_count": 827,
         "buckets": [
            {
               "key": "co",
               "doc_count": 14,
               "avg_balance": {
                  "value": 32460.35714285714
               }
            },
            {
               "key": "ne",
               "doc_count": 16,
               "avg_balance": {
                  "value": 32041.5625
               }
            },
            {
               "key": "az",
               "doc_count": 14,
               "avg_balance": {
                  "value": 31634.785714285714
               }
            }
.......
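
The nested aggregation above amounts to grouping by state, averaging balance within each group, and sorting the groups by that average. Here is a local sketch of the same steps; avg_balance_by_state is a hypothetical helper, and the balances are invented rather than taken from the sample data.

```python
from collections import defaultdict


def avg_balance_by_state(docs):
    """Group by state, average the balance per group, sort by the average (desc)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["state"]].append(doc["balance"])
    buckets = [
        {
            "key": state,
            "doc_count": len(values),
            "avg_balance": {"value": sum(values) / len(values)},
        }
        for state, values in groups.items()
    ]
    buckets.sort(key=lambda bucket: bucket["avg_balance"]["value"], reverse=True)
    return buckets


# Invented balances for three states.
docs = [
    {"state": "az", "balance": 31000},
    {"state": "co", "balance": 30000},
    {"state": "ne", "balance": 32000},
    {"state": "co", "balance": 35000},
    {"state": "az", "balance": 32000},
]
buckets = avg_balance_by_state(docs)
print([(b["key"], b["avg_balance"]["value"]) for b in buckets])
```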